– Okay, so I suppose you can start. Core scheduling seems to be all the rage these days. I think this is the third talk on it. Who doesn't know what it is? Where do I start? (laughter) – [Man] You might ask who
doesn't want to know what it is. – I don't want to know. (laughs) Okay, so core scheduling is basically, again, scheduling, but with synchronized task selection across the SMT threads within one core. This all got really
important when L1TF happened and then MDS happened and it fell apart. But since we're playing with
the core scheduling idea for other reasons, people have come up and said, "We can use this for other purposes." Some of them are creative, some of them actually make sense. The real-time people have latched onto the idea because the problem is that hyper-threads are not really all that deterministic — or rather, they are deterministically awful (laughs) — for real-time interference, for latencies and all those things, so a number of people have simply disabled SMT on their systems to gain better latency.
Clark waves, yay. So a number of people have played with hot-plug and CPU sets to partially disable SMT so the non-real-time tasks can still get some throughput benefit, and there are all sorts
of intermediate options. Core scheduling can give you another mix, where you allow a real-time task to force-idle the SMT siblings and thereby avoid the interference. Of course this brings up a whole bunch of interesting questions, I mean it wouldn't be
real-time if there weren't any. Admission control will be impacted, yeah, things like that. Daniel was studying some of that. – [Man] Regarding the admission control, what we were talking about was that when we disable SMT we lose some CPUs. Let's say that we have two
cores with four CPU threads.
When we disable HT, we end up with only two
CPUs running, right? But in the current SCHED_DEADLINE code, and Juri might help me here as well, the admission control assumes that we have these four CPUs when HT is enabled, and the idea is
that we can use at most, by default, 95% of CPU time, but it's 95% of all CPUs. If we turn two of the CPUs off, we will still be able to use 190% of each remaining CPU.
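To make the arithmetic behind that complaint concrete, here is a toy sketch; the 95% default and the CPU counts come from the discussion above, and the function name is mine, not the kernel's:

```python
# Sketch of the global SCHED_DEADLINE admission cap being discussed.
# Deadline bandwidth is admitted up to 95% of the time of *all* CPUs
# the admission test knew about.

DEFAULT_CAP = 0.95  # the default runtime/period cap mentioned above

def admitted_bandwidth(nr_cpus, cap=DEFAULT_CAP):
    """Total utilization the admission test will accept, in 'CPUs'."""
    return cap * nr_cpus

# Two cores, four HT threads: admission control sees 4 CPUs...
total = admitted_bandwidth(4)      # 3.8 CPUs' worth of runtime admitted
# ...but if two sibling threads are then turned off, that admitted
# bandwidth has to fit on only 2 CPUs:
per_remaining_cpu = total / 2      # 1.9 -> the "190% of CPU" problem
print(total, per_remaining_cpu)
```

The point of the sketch is only that the cap is computed once, globally, against the CPU count at admission time, so offlining siblings afterwards strands the guarantee.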
– So the trivial hack
would of course be to then account double the bandwidth, or however many siblings there are on the system, in admission control, but then we need per-CPU
admission control, which we were already looking into, but which is slightly non-trivial. So this has some very nice upsides, but it's also difficult again. – The one point that
we need to consider is that in SCHED_DEADLINE we
have the computation time — the maximum execution time — that we're expecting from one task. And if you run with SMT disabled, we have better precision on that number, right? – [Man] Correct. – And when we have core scheduling, is it good to have hyper-threading at all in SCHED_DEADLINE? What do you think, what do people think? – [Presenter] So, for me,
I guess it's good for the non-real-time part of the tasks in the system, because if you disable hyper-threading outright, then you only have the physical cores available. If you leave it on, you can use part of the system for, for example, deadline tasks and force-idle
the other siblings, and maybe on other cores that don't really interact with those
you can run more stuff.
– Right, Daniel asks, "Does it ever make sense to not force-idle siblings for deadline tasks?" and
I think some of the soft real-time people might actually want SMT on. I think people actually leave SMT on. – Yeah, I guess if
you're not into hard real-time kinds of guarantees, I guess, yeah. – So yeah, it should be policy, and not, I think, strictly enforced. – So I wanted to try and
understand something here, which I know is gonna be difficult for me this early in the morning. Are you saying that if we've got core scheduling and SMT enabled, we can take a real-time task that we feel has very hard deadlines or something and tag it so that it has no siblings? – [Presenter] Correct. – So that when we schedule it, it force-idles the other thread, and that gives us a — is that the mechanism? – So you avoid SMT interference from some random other thread that just happens to be co-scheduled on that core. – You just add the overhead of bringing the other thread into idle.
– [Presenter] There is that. It doesn't come for free. There's no free lunch ever. Sadly. – [Man] No ponies? – No ponies. – [Man] Unicorns? – Talk to Case. (laughter) He has unicorns for you. – So I just wanna understand this. So basically it's
whenever the RT task is scheduled onto the
core that you then– – Yeah, you do it at scheduling time. – Okay so, would it be possible, would it be a flag, I mean,
one way I could see is, you know, an isolcpus type of thing, or a CPU set — just shut down one of the siblings– – So that's something
that people currently do.
They do CPU sets, and they do the partial
offline of siblings, but it's for the mixed workload where there is a lot
of non-real-time stuff with some real-time stuff mixed in, and you still want to get some of that benefit, so fully disabling the hyper-threads would be a bigger performance impact than necessary, and in these cases core scheduling might help with that. – Okay, so basically it sounds like something for people who don't really care about true determinism. I mean — or well, it could be deterministic, but the latency is like, okay, we'll take the hit for the overhead 'cause we need the throughput anyway. So it's for that workload where you have — actually there's one
right next to you too. – No, actually the thing is that you take the overhead of bringing the other sibling into idle, but you get the advantage of having all the CPU's shared resources to yourself while you're computing.
– Yeah, there's no contention on instruction issue and there's no contention on your L1 cache, things like that, so
you get the entire core, just for your real-time workload, without the typical SMT interference. – So if you can stand the cost and delay it takes to get the other
sibling idle then… – Yeah, so that selection
gets a little more expensive, because we need to do that selection for two CPUs, and you need to send an IPI, which is not entirely free. – [Man] Right. – Well, then it's also not true that you don't have contention on your L1 cache, if it's empty of anything you're interested in when you get the CPU.
So that's a setup cost, too. – Correct, but that's something your real-time tasks already have to live with anyway. – [Man] So it's a little bit more. – No, what I was saying, my idea was, if you're expecting a quick wake-up, this could cause at least a little bit of overhead on the wake-up,
and so that's where the jitter— – The thing is, you don't care about the quick wake-up. – Right, that's what I'm saying– – You care about the end. – [Man] Yeah, once you start
running, you want to make sure that you are running smoothly and– – So you'd rather take the, let's quantify it, two-microsecond hit on wake-up than get runtime jitter which is non-deterministic. – And I guess this also
probably would be useful for nohz_full, so if you have an RT task that's by itself, kicks off on the CPU, runs, and doesn't want any interference from the kernel– – The HPC people switch off SMT anyway.
– They turn off, oh. – They switch off SMT;
HPC people turn off SMT. – [Man] Are they the only
ones using nohz_full, or? – Yeah, those are the guys
who use nohz_full. – Yeah, so I'm trying — do you know, are people asking for this type of feature then? Or is it just something that we're helping with, or who brought up the problem? – No, I mean, the core scheduling question comes obviously from the whole speculation crap, but people want to utilize SMT for several reasons, and then, but we have
the same thing where, I was talking to RT people
who would love to keep SMT on if they could isolate the core for the time the
real-time task is running, and then give it back to random background workload.
Because the problem they see is — and you can easily figure this out by just offlining the sibling CPU or taking it out of the scheduling domain — that your execution time
becomes deterministic and– – [Presenter] Or more deterministic. – Yeah, it's still a random number generator, but it becomes pretty well deterministic, while when you have random other workloads on the other thread, it depends on what they are doing. – [Presenter] It's absolute chaos. – Completely out of the window, and that's especially true if you need to get at the resources which
are only available once. A lot of the issue ports exist multiple times, but we have things like AVX.
– [Presenter] I mean, yeah,
there's other ports and things like that. – Is this going to be a config feature? So all RT tasks? Or do you have to actually do
something to tag this task as wanting this ability? Does
it have to be an RT task? – No, no, it doesn't. So, so, the current implementation
that is posted — so I posted it once, and then I think the guys from (mutters) took it over. Thank you! Task selection is outside of the scheduling classes; it works across all
the scheduling classes. It works for deadline, FIFO, round-robin, and the others. Currently the only
interface that is there is a cgroup interface, and this is because I was absolutely lazy. It was the easiest one to hack in, and a cgroup you can operate with Bash; I can echo into it. – So I can't complain about other people then. Who was it who came up with
the cgroup thing first, instead of thinking
about a proper interface? I blame it all on you. – You're welcome. (laughter) The patch says, "This is a hack." (laughter) Anyway, but the thinking
is that we'll do a prctl() for tasks.
– [Man] So why can't we have a solution like: a few cores offline one
HT sibling, and then the scheduler only schedules real-time tasks on the cores that have HT disabled? Like, that can be our
solution for this, right? – Yeah, and that's something that some people actually do, and I already mentioned that. That's the CPU sets, and then
you partially disable HT. – [Man] Is that not good enough? – There are always complaining people. I mean, if most of your workload is non-real-time but you do care about those few real-time workloads, this might be a better solution than– Yeah, in terms of utilization, than always just killing some of the HT.
– [Man] Yes, so on the interface for that, then, I guess we can have a flag also, because we already have a flag to, for example, enable the runtime sharing. – Yeah, so we still need
to have this bike-shed. APIs and ABIs, yay! The thing I focused on while doing that patch set was just getting the mechanism there and the general structure of it set up. The way it fundamentally works is: every task gets an unsigned long cookie, and when you schedule, if the cookie is nonzero, all the siblings need to have a task of that same cookie
or get forced idle. I don't care how the cookie gets set.
So for the cgroup we can tag a whole cgroup and say all these tasks
get the same cookie, and this is what I did: I basically stuck
in the cgroup pointer. You can stick in a task
pointer of the task itself, and then it will only ever be the one, which is what you want for the real-time case, or you can have it inherited, and then it becomes difficult what you point it to, because the leader task can die and get reused. Fun questions like that
still need to be answered. So yeah, we'll have to bike-shed on the specific APIs later.
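As a toy model of the cookie rule just described — nonzero cookies must match across SMT siblings, or the sibling is forced idle — here is an illustrative Python sketch; the helper names are invented, not the kernel's:

```python
# Toy model of core-scheduling task selection with cookies.
# A cookie of 0 means "unconstrained"; if the picked task has a
# nonzero cookie, a sibling may only run a task with the same
# cookie, otherwise it is forced idle.

IDLE = "idle"

def pick_core(core_pick, sibling_runqueue):
    """Given the task picked on one SMT thread, pick for the sibling."""
    cookie = core_pick.get("cookie", 0)
    if cookie == 0:
        # No constraint: sibling picks whatever it likes.
        return sibling_runqueue[0] if sibling_runqueue else IDLE
    # Constrained: sibling may only run a task with the same cookie.
    for task in sibling_runqueue:
        if task.get("cookie", 0) == cookie:
            return task
    return IDLE  # force-idle the sibling

rt = {"name": "rt", "cookie": 42}
batch = {"name": "batch", "cookie": 0}
helper = {"name": "rt-helper", "cookie": 42}

print(pick_core(rt, [batch]))          # no matching cookie: sibling idles
print(pick_core(rt, [batch, helper]))  # the matching-cookie task runs
print(pick_core(batch, [batch]))       # zero cookie: unconstrained
```

The model only captures the rule stated in the talk; how the cookie gets set (cgroup pointer, task pointer, inheritance) is deliberately left out, just as the speaker says.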
I am afraid we'll have to merge something like this, because there are so many people that actually want it, even though– – [Man] For the wrong reasons. – For various reasons. Currently it is not a complete solution for any of the side-channel stuff, but since we were working on it, enough people have shown interest in the feature for other reasons that I think there's a fair case to be made to merge some of this. – [Man] Going back to the admission control, maybe the easiest thing to do would be to use just half of the CPUs, just one per core, for the admission control, to avoid the — yeah, but there is one
question: when a task runs, what do we do? So we have the admission
control for one CPU in the core and none for the sibling, let's say, and should we then discount the
runtime of the one running on the sibling or not? And how will this interfere with group scheduling? – So for admission control,
there's no runtime accounting, or at least only the worst case is used.
– No, yeah, let's say that we decrease the admission control– – [Presenter] Oh, so, admission control. Let's get that straight. The idea is to not allow more tasks on the system than there is time to run them. Morten had a slot yesterday that brought up DVFS, and then it gets really complicated, but basically, what SCHED_DEADLINE tries to guarantee is a timely execution of your workload. Ideally that is before the deadline expires; with global EDF like we have
now, this is not a given. There are a number of
workloads that do provide that, but what it does provide is that the missing of deadlines is bounded. You will not get infinitely long misses, and we provide this
guarantee by ensuring that we can never commit more compute than
there is compute time for. And this is what admission control does. When we admit a task, it tells us: this task wants this much runtime over this period — this is the bandwidth, or utilization — and we add it up until some certain cap, currently 95% of wall time.
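That accept/reject test can be sketched like this — a simplified model where `request()` stands in for what the sched_setattr() path decides; the class name is mine:

```python
# Minimal sketch of deadline admission control: each request declares
# (runtime, period); admit only while total utilization stays at or
# under the cap, otherwise reject.

class AdmissionControl:
    def __init__(self, nr_cpus, cap=0.95):
        self.capacity = cap * nr_cpus  # total utilization allowed
        self.used = 0.0

    def request(self, runtime, period):
        """True (admitted) or False (rejected), like sched_setattr()."""
        util = runtime / period
        if self.used + util > self.capacity:
            return False  # the system cannot support the request
        self.used += util
        return True

ac = AdmissionControl(nr_cpus=1)
print(ac.request(5, 10))   # 50% total: admitted
print(ac.request(4, 10))   # 90% total: admitted
print(ac.request(2, 10))   # would be 110%: rejected
```

Only the cap check is modeled here; real admission control also deals with periods, deadlines, and the root-domain topology.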
If you were to exceed this, everything goes out the window and becomes an absolute mess. So at some point we need to say: this much and no more. This is what admission control does. So deadline tasks, you cannot fork them; fork will always fail. You can only use sched_setattr(), which is the newish system call, on an existing task to request it be placed in the deadline class, and this will either succeed, after which you can run
the thing, or it will fail, if we say the system cannot
support your request. So there's a whole debate on how
good admission control is and how many guarantees it provides — currently not many — but it does guarantee that
the system doesn't run away. It doesn't melt down, which is, I think, a very
good guarantee to have. Not that, not that. (laughter) – Thinking on this, the
safe thing to do in this case would be disabling the, or reducing the 95% to only those CPUs that we actually have online.
– So currently we have
global admission control. So we can double — or quadruple, or multiply by eight for some of the POWER chips — the requested bandwidth and leave it at that. I was thinking that should be enough, because we've basically consumed the resources as if we were to run this task on however many SMT threads there are, concurrently. – When running the admission
control, just multiply it by how many threads there are. – Just consider that the task will take the whole core.
– So in this case you are
expanding the computation time that the task can use,
right, did I get it? Okay, but now think of it this way. The idea of core scheduling is trying to make more
use of the available CPUs when it's safe to use those HT siblings, right? But with the admission control,
I'd say the current one that we have now, if you have hyper-threading — say we're on a system with eight CPUs, four cores and eight threads — the safest thing to do now
would be to use only 95% of the four CPUs, not the eight, right? – But that is only if you
do this for every deadline task, but it's still configurable.
– I like your idea, because then you can inflate the bandwidth only for the tasks you want to force the other siblings idle for. But maybe don't do that for the soft ones. – But I'm kind of lost,
can you repeat, like, what was the idea? You said you liked the idea? I'm kind of– – So this will give
an interaction between the two, but basically, for admission control we currently only consider the utilization of the one request. We add it to what we've already given out, see if it goes over the threshold; if it goes over, we reject,
and if it fits, we accept. But because with core scheduling it would consume the entire core — however many threads there are, instead of just the one — we simply multiply the utilization by however many threads
we will now consume and use that value instead of the single value.
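The multiply-by-siblings idea just described can be sketched as follows; the names are illustrative, not kernel identifiers:

```python
# Sketch of the proposed fix: a core-scheduled deadline task consumes
# the whole core, so charge its bandwidth once per SMT sibling.

def charged_utilization(runtime, period, smt_threads, force_idles_siblings):
    """Utilization to charge against the global admission cap."""
    util = runtime / period
    if force_idles_siblings:
        # The task occupies every hardware thread of the core
        # while it runs, so account for all of them.
        return util * smt_threads
    return util

# A 30% task on a 2-way SMT system:
print(charged_utilization(3, 10, 2, False))  # charged as 0.3
print(charged_utilization(3, 10, 2, True))   # charged as 0.6
```

Note this keeps the accounting global, as in the discussion: it charges more, but it does not say *where* the bandwidth lives, which is exactly the fragmentation concern raised a little later.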
– So basically if it's two threads, then it would be– – Then it would be times two. – So it looks like you could pass that to sched_setattr(). – This is one of the things, Juri, but it also becomes interesting if we do the prctl() after we've already been granted deadline bandwidth, so we would probably want to deny the prctl() to
deadline tasks in that case. Maybe do it for deadline, but then there are two
interfaces to set it on tasks. – I thought it was already extended. – It is extensible. – [Man] So let's use
it for the others as well. It's not–
– For everything? – Yes. Make it one interface: if you configure deadline, then you configure all the
other nonsense with it, but you say, hey, I also need one core, and then you get it naturally for everything else, which then just says, "I need one core." – So yeah, we need to– – [Man] Go away with prctls now.
– But yeah, we need to bike-shed the specifics, 'cause you might for example want to say core-schedule this process, or core-schedule this task. So the request for core scheduling might be like some of the — the nice system call has something weird; there are a number of controls that iterate the task list and do weird stuff, and some are inherited on fork, some are not. So we need to find an interface that works for all the glorious use cases.
– [Man] (inaudible) – (inaudible) – I think when you just
multiply the requested time by the number of cores, you can run into fragmentation problems. Like one process requesting a full core at 50% and another process requesting a single CPU at 50%. – So currently we do
global admission control, and we do global EDF, and then it will magically work. (laughter) Insofar as global EDF works. Yeah, so it provides a weak guarantee: it provides bounded tardiness. Once we want to support
affinities for SCHED_DEADLINE, then it becomes really interesting, and then your point is absolutely valid. – But can't we consider the usage of the sibling as some optimistic utilization of that core-wide group? And not account it — maybe we can divide the admission control by the number of cores, and then let the other
sibling's deadline tasks run without accounting the runtime; we could throttle it, but we could use a
greedy approach for it.
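The fragmentation concern raised above can be made concrete with a small worked example; the numbers come from the discussion, and the accounting here is deliberately simplified:

```python
# Worked example of the fragmentation concern, on one 2-way SMT core
# (2 hardware threads; the 95% cap is ignored for simplicity).
#
# Request A: 50%, and it force-idles its sibling -> charged 0.5 * 2
# Request B: 50% on a single hardware thread     -> charged 0.5
#
# Global accounting says 1.5 <= 2.0 threads, so both are admitted:
charged = 0.5 * 2 + 0.5
print(charged, charged <= 2.0)
```

The catch the speaker points at: whenever A runs, the sibling is forced idle, so B only ever sees the remaining 50% of the time. Global admission plus global EDF papers over this; per-CPU (partitioned) admission control would instead have to bin-pack whole-core and single-thread requests, which is where it gets hard.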
– So for admission control
we do not account runtime. We only use the worst case. – Yeah, but while running, as we are using a resource that wasn't actually accounted in that admission control, we don't necessarily need to reduce the amount of execution time of the task that's running. – [Man] You mean let the task
run for, say, more runtime just because it was
kind of consuming the– – Spare runtime, yeah.
– Runtime on the other CPUs? – That's another way to
try to see the problem. – I think the other tricky bit would actually be load balancing, because if you have a mix of tasks that can take full cores and others that can run basically everywhere else — currently we don't distinguish, and currently we don't consider the, let's say, spare bandwidth on the CPUs while deciding where to put a task; we
only consider deadlines. So that's the other thing
that we'll have to fix, right? (mumbles) I think in this respect, look at the capacity-awareness patches,
since they start considering the remaining capacity, or
bandwidth, on the CPUs; maybe that might help, because then you'll have a notion of where the bandwidth is actually allocated.
And then you can say, okay, on
this core there is a task that wants bandwidth for the whole core, so I won't schedule anything
there, because I already know it's full, but then I can use the other cores that are basically free, but yeah. – Yeah, and then we do. – We learned it last time,
but just to recall it: it's not only deadline that's the problem; we also have the RT runtime-sharing stuff in the FIFO scheduler. – I mean, we already
killed that for RT, and I thought I killed it in mainline, but apparently I didn't, so let's just disable that.
– Okay. The RT runtime-sharing option that you– – That was a gross hack when
I did it, and it still is. (stammers) – No, that's for throttling, so– – Yeah, yeah, we'll also kill that eventually when we do the server thing. – Yeah, you'll get it back
in a different flavor. – Okay, unfortunately we're done. – Well, I really don't want to have to go outside of cores. I mean, it is expensive, and doing it within a
core is relatively cheap, because it shares the L1 and it shares the instruction
pipelines and all that. But if you cross that, it just gets horribly more expensive, so it's fundamentally not that much harder, but oh my god! (laughter) – This is kind of embarrassing, because I'm one of the organizers. You can see my green lanyard, and I'm at a microconference, and I've got slides. On the other hand, there
was an alternative, and I chose not to take the alternative. I'll run the alternative by you, and you can let me know
if I made a mistake. The alternative was: all of you study Linux-kernel RCU
and get up to speed on it, so that I don't have to provide a little background. So, as advice for next time: how many people would like me to not have slides? (laughter) All right, yeah! But it was more fun this way. But sure, how many people would rather I have slides next time? Oh, okay. Sorry, Thomas, you're outvoted. (laughs) Okay, let's see here. Let's try that. Oh look, something's working, must be some mistake. Okay, so these are the topics I have; we'll go through what we go through. As usual, yell something out if you see something. I'm going to go through some background and ask advice — that's kind of the way this works, that's what I'm trying to do here. I'm denying my citizenship by trying to make an informed choice. (laughter) Okay, anyway: what we have is, the default setup for callbacks is set up to be kind of self-throttling — and there's a caveat here, which is the fixes going into 5.4 — but generally the idea is, if you have some CPU here that
decides it wants to sit in a loop — in current mainline, doing access() on a file with an ACL, which means that every
pass through the loop generates an RCU callback — it can do this really quickly.
Well over a million callbacks per second per CPU, and you can do that on all the CPUs. Well, just hold on here a minute, Thomas, and we'll see who gets to admit what. But anyway, what happens in mainline right now is, if this CPU's doing it and the other ones aren't — let's say on CPU 0 we've got some guy that wants to just hammer RCU, so he does, right? What will happen is that the RCU infrastructure, with the scheduling-clock tick, softirq, and other things, will make grace periods
happen, and eventually it'll do a softirq and say, hey, we've got a bunch of callbacks to invoke. And that invocation, by default, with a normal configuration —
if you aren't doing hotplug and blah blah blah, a whole bunch of caveats,
most of which hold in the common case — the same CPU that posted the callbacks is going to invoke them. While it's invoking
callbacks it can't post more, and so the admission control — if I wanted to say you're right, which I wouldn't —
(laughs) Sorry, the admission control
happens during the invocation. While it's invoking, it
can't post more callbacks. Peter, throw him a box. Throw the man a box! Or a bike or something, yeah. – [Man] So your callback can
actually do call_rcu() again. – It could! – I've been there, done that. – Yeah, it's fun, you can do that. So if somebody — and that's an important thing for people to understand, and that's one of the caveats — if somebody makes a system call that, under user control, can create a callback, and the callback creates another callback each time it's invoked, and they can invoke
this over and over again and keep creating callbacks: please NAK the patch, or at least send it to
me or something, okay? So yeah, one of the caveats is that when user space posts a callback, it does a finite number of callbacks, preferably one, and then gets on with life.
So yeah, it's a good point. Please don't do that. And you could also make one that did two callbacks — do a memory allocation, do callbacks from each of those two callbacks — that would be a callback bomb, I guess; please don't do that either, that's another one of the caveats. Anyway, enough caveats — there is one more, which is that the fix is on its way in.
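The implied admission control, and how a callback bomb defeats it, can be modeled with a toy simulation; this is purely illustrative, with the rates and batching simplified:

```python
# Toy model of the "implied admission control": a CPU alternates
# between posting callbacks and invoking the batch a grace period
# hands back. While invoking, it cannot post, which bounds the
# backlog -- unless a callback posts more callbacks itself
# (a "callback bomb").

def simulate(rounds, posts_per_round, children_per_callback):
    """Return the callbacks still pending after the last round."""
    pending = 0
    for _ in range(rounds):
        pending += posts_per_round         # CPU madly posts callbacks
        invoked = pending                  # grace period: invoke them all
        # Callbacks posted *by* callbacks escape the throttle:
        pending = invoked * children_per_callback
    return pending

print(simulate(10, 1000, 0))  # well-behaved: backlog drains each round
print(simulate(10, 1, 2))     # each callback posts two: exponential bomb
```

With zero children the backlog is fully drained every round; with two children per callback it grows exponentially, which is why the speaker asks for such patches to be rejected.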
Well, anyway, this is just a graphical explanation of what I said: he madly posts callbacks, but while he's invoking them he can't post them; only when he's done invoking can he start posting again. So we have kind of an implied admission control. This is mainline, non-RT, but– – [Man] Steve. – My question was, wait, so you said you can't — what if they try
to do an RCU callback or something like that, what happens? – [Man] If the callback does a callback? – Yeah.
– Oh, it just does another callback,
and that's, go ahead. – No, I'm saying, what's the —
it doesn't get invoked; we mean, by definition,
it doesn't get invoked anymore? – Oh, okay, so yeah, let me — I probably didn't say that very well. So we've got the CPU madly posting callbacks, doing call_rcu(), call_rcu(), call_rcu() in a tight loop. At this point, after a grace period has happened, kind of behind the scenes, it says, hey, we've got these five million callbacks; RCU, invoke them for me. And so it's in a loop; if it wants, it'll kind of meter them out for a little bit, and at some point it'll say, okay, fine, I'm just going to invoke callbacks and forget everybody else, right? So I've got five million of them, and at that point it's invoking callbacks — as long as those callbacks, to Peter's point, don't contain call_rcu() themselves, and please keep that true for user-space-posted callbacks — then it just runs through them while user space isn't running.
– Okay, so basically
you're saying it just goes through and just– – Yeah, eventually. I mean, it tries to be nice for a while, then says, no more Mr. Nice Guy, I'm
getting this done, right? Of course that does lead into the 5.4 issue; this of course has just terrible real-time properties, because you're grabbing the CPU by the throat in softirq. In real time, of course, you could set up a priority for the softirq and make your critical stuff happen at a higher priority; you could do that. – Could we try to prevent
somebody from stupidly calling call_rcu() inside the — while we're invoking? Just say, sorry, you can't do that.
So, it's something that's necessary for some — it's a valid use case in some cases. So an interesting question is, can we somehow figure it out, or mark it,
or do something — that's a great question; I don't
know an answer to make it happen, but it would be valuable. If we had some way of knowing this callback was posted in a system call that can be done at a high rate, it'd be cool to be able to say, no, you don't get to do call_rcu(), and I'm going to error out or something. I don't know; somebody throw something out. – There are valid — you said there are valid ones. – Yes, there are valid ones.
– You seem — you looked at it and said, okay, that's valid,
instead of saying, can you do this somehow differently? – So it's kind of a —
one of the things, one of the use cases involves
reference counting, where you have somebody dropping the last reference while there's some other stuff going on, and you could use timers, I suppose, but if you do a call_rcu(), you just do the call_rcu(), and the callback checks
some condition and says, no, I can't do this right now, so it does another call_rcu().
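That self-reposting pattern can be sketched as follows; this is a toy model where `call_rcu` is a plain Python list, not the kernel API:

```python
# Sketch of the legitimate self-reposting use case: a callback checks
# a condition and, if it can't act yet, re-queues itself -- like a
# timer re-arming. Crucially, it posts at most ONE callback per
# invocation, so the backlog never grows.

pending = []          # stand-in for the CPU's callback list

def call_rcu(cb):
    pending.append(cb)

def try_release(state):
    def callback():
        if state["refcount"] > 0:     # can't free the object yet
            state["refcount"] -= 1    # pretend a reference drops over time
            call_rcu(callback)        # repost, exactly once
        else:
            state["freed"] = True     # safe to free now
    return callback

state = {"refcount": 3, "freed": False}
call_rcu(try_release(state))

grace_periods = 0
while pending:
    batch, pending[:] = pending[:], []   # grace period elapses; invoke batch
    for cb in batch:
        cb()
    grace_periods += 1

print(state["freed"], grace_periods)
```

The backlog here stays at one callback per grace period, which is exactly the "finite number, preferably one" discipline asked for earlier.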
Just like timers repost themselves, okay? Same class of use case. – If it's a softirq, then
what happens if an interrupt comes in during the call_rcu()? – It gets posted, and then
sometime later a grace period happens — although
if you've got five million of them, it might actually
get the grace period done and queue it before you get done with them. But yeah. – So are these cases somewhat countably limited? – Say again? – Are they countably limited? As in, you can count them. – Yeah, I heard what you
said, I was just recovering. (laughter) I don't know; I think
it's a small percentage, but there's a fair number of call_rcu()s in the kernel, and I haven't tried counting them. – What I'm thinking about– – Please, feel free to. – What I'm thinking about is, you know, just slap a warn-on in there if you call_rcu(), and at that point in time, if
you have a case where, hey, this is actually valid, you annotate that and don't trigger the warn-on.
– Okay but yeah that's a good way to do it what I what I need to do is come up with a good way of somehow annotating it. So one thing is that anyway there's some ways of doing that I'm Clark's idea is a good one you you've got some good points let me take that offline. Because I've got some other questions I need to ask you guys. Okay anyway so the
problem that had is that and I said this before if it had five million callbacks eventually say okay guys out of the pool I'm doing all five million at once and it's kind of overkill because you do a hundred thousand at once that's with an epsilon as efficient as just doing it continuously and that lets go to the cpu every few milliseconds unless k free is causing you trouble like it was with the ACL access thing I was seeing two millisecond average K frees.
I don't know why. But anyway so what Erik's patch does is it takes a more graduated thing instead of just doing them in small chunks and then saying and then panicking all at once it kind of slowly increases the number it does and then eventually it gets up to limit says I'm going to look at this many at once and the limit is fairly large so anyway with that with main line with non real time this works nicely okay and but with offloading
if you're doing real time you're offloading the callbacks the assumption when I wrote this up when was that 2012, 2010, I can't remember. Anyway the idea was is this a real time things are embedded and we have a tightly controlled software stack so people are you know limiting themselves
and being disciplined about how they use the software and how they make how these system two calls and life should be good and that means that RCU can safely assume the saintly low rate of callback queuing in that configuration so RCU does rely on this assumption currently in mainline and what happens and this is a kind of a cartoony diagram we are CPU's again this think of these is the user space thing is running along and we have a set of kernel threads and these are grouped for
four CPUs you'd actually it be more complicated but let's keep it simple you have one of the K threads so these the RCO and in current main lighted bp4 preemptable and normal real-time usage and cpu zero has one I've put it in two spaces it has two rules it takes on and CPU one and two and three all would be if they're all offloaded they would all have this kernel thread this kernel threads job is to invoke callbacks the roll down here is just invoking callbacks the role here and CPU zeros is gonna play that dual role is to wait for grace periods to get these guys set up you know collect the callbacks up wait for grace period when the grace period lapses hand the callbacks down to the kernel threads including itself and then wake them up and they go go at getting callbacks for as long it takes their use they're just a k thread so all this stuff about holding out of the CPU too long doesn't apply Peter will preempt me if he wants
to, and that's fine, okay. And then the administrator's
job is to figure out where these things run, okay?
And again, this is tightly controlled; the administrator,
if the administrator messes up, they're the ones that have to fix it. I've given them some rope and they can, you know, use it however they want. I would advise they choose wisely, but that's their choice at the end of the day, okay. It is with this assumption, that it's a controlled software stack, that all this works nicely. But with that assumption
violated, what can happen is that CPU 0 might just dump a whole pile of callbacks, because my assumption is violated and the guy is just being an idiot, or has a really bizarre workload I haven't thought of, or however you want to put it, okay.
And then so this guy says,
great, grace period, and then he hands the callbacks off to himself, and there's millions of them by this time, and so he's stuck in a loop invoking callbacks. Meanwhile CPUs one and two and
three decide, hey, it's my turn to stuff callbacks in
there, and they're being studiously ignored by CPU 0, who is stuck invoking callbacks, and they just keep piling up and piling up and piling up. Eventually CPU zero gets done with the current set of callbacks and says, okay, time to go look, but by that time it's game over: before long you run out of memory
and life is hard, okay. My observation over the past
several years is that my assumption might not be valid anymore, and even if it is valid right
now, it won't be for very long, because people are
using this. NO_HZ_FULL's users are usually HPC applications
that run just in user space, but you know, people are making
less disciplined use of this, and it's gotten to the point
where, in my own self-defense, I need RCU to defend
itself, all right? And so what that means
is: forget that assumption, and that's one recent reason
why I think I need to forget the assumption. And what's not quite yet in mainline, but getting there: because I consolidated the RCU flavors, I only have one of these things per CPU, period. I used to have three of them in an -rt kernel, three per CPU, so I've cut them
down by a factor of three, so you guys should be plenty happy with me adding a square root of N worth of them, which I've done, okay? You don't like it, come talk to me and we'll have a discussion, all right? So what happens is that CPU zero has two of these guys now, okay, and one of them just does grace periods, that's all he does, and the other one is just devoted to callbacks, that's all he does.
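The flood scenario just described can be put in toy numbers: if the aggregate enqueue rate of the offloaded CPUs exceeds the rate at which the single callback kthread can invoke callbacks, the backlog grows without bound. A minimal sketch (the rates and the function name are hypothetical, this is not RCU code):

```python
# Toy model of the callback flood described above (not RCU code).
# Three producer CPUs each enqueue `enq` callbacks per tick while a single
# invoker kthread retires at most `cap` callbacks per tick.

def backlog_over_time(ticks, producers=3, enq=1000, cap=2000):
    """Return the callback backlog after each tick."""
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += producers * enq    # CPUs 1-3 keep queueing
        backlog -= min(backlog, cap)  # invoker retires a bounded batch
        history.append(backlog)
    return history

# Aggregate enqueue (3000/tick) exceeds the invocation cap (2000/tick),
# so the backlog grows by 1000 every tick: the out-of-memory scenario.
```

With these numbers the backlog after ten ticks is already 10,000 callbacks and nothing in the model ever shrinks it; that is the "game over" curve from the slide.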
– [Man] And the GP one has a slightly higher priority than the other one. – That may be necessary. It
hasn't been yet in my testing; so far it's worked
fine with these guys doing whatever. – Yeah, I would give it slightly higher priority so that it can still detect grace periods and still hand that work to the other threads even though your callback thread is busy processing the millions and millions and millions. – Okay, so that's an interesting point. My model of the world was that this guy wouldn't be affined as tightly as these guys were, and so it would go wherever, but you're right, somebody might just slam them all onto CPU zero. But we'll get to
that in a later slide.
Yeah. – I mean, but you're saying,
by changing the global one, basically I call
it the manager, or whatever you want to call it, what you call the grace-period handler thread, you're saying that you could change the affinity of that guy as well, or? – Well, what I was assuming: I think Peter gave me some good input that I need to think about, and it's quite possible he's right, okay, which I hate to admit. So the input was that I need to make the priority of this guy slightly higher than these guys', in case they're sharing a CPU, so that I keep up with the grace periods even if one of these guys is tearing away at the CPU. Because what Peter's
telling me is he has enough information to know which
one's more important.
– But I would also say we might assume that the grace-period
guy, the GP zero guy, would be pinned to the CPU; there's no reason that the man– – [Man] That's up to the administrator. – Why would you want to,
because if you have CPU zero handling that,
you just pass it off to that thread and don't
worry about it, so it's– – That was my mental model,
you've got my mental model, which is that there's– – The CPU-zero ones, I'm saying, should
be per-CPU threads, just bound to the CPU; the
other ones could be moved around, because that's where
the offloading is. But the managing one, because you're going to say, okay, CPU zero is my mate, it's what's going to manage my callbacks, so I call the GP zero thread knowing that it will be on CPU zero.
– Except that some people might want to have a set of housekeeping CPUs and put all of these threads on those housekeeping CPUs, because they don't want any interference or– – The GP ones, I can see: the GP would just be a thread that sits idle, you know, I mean, it's the manager, it's going to be doing managing; you don't need
two of those on one CPU. You just need one of them on one CPU. Oh, you only have one in the system? (laughter) – All right, he's got aim.
(laughter) – [Man] Who was that? (laughter) – Who do you think? I couldn't hear you. (laughter) – [Man] So what there is is,
there's square root of N of these, where N is the number of CPUs,
and the reason there's square root of N is because if I only have one of them, then the wakeup delays start killing me. If I have hundreds of CPUs, which is not an uncommon case these days, then this guy ends up having to wake the whole pile of people up, and while he's doing that he's not checking for the rest of the stuff. Now that may be a wrong optimization, and that's a good point, I need to look at that. Hopefully you're getting these in notes, because I'm not gonna remember them all.
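The fan-out argument can be sketched with a toy cost model (my own simplification, not kernel code): with one grace-period kthread, ending a grace period means one thread issues N wakeups; with sqrt(N) leaders, no single thread has to wake more than about sqrt(N) others.

```python
import math

# Toy cost model for the wakeup fan-out argument above (not kernel code):
# the grace-period kthread wakes the leaders, then each leader wakes its
# group of follower callback kthreads.

def max_fanout(nr_cpus, nr_leaders):
    """Largest number of wakeups any single kthread must issue."""
    group = math.ceil(nr_cpus / nr_leaders)  # followers per leader
    return max(nr_leaders, group)

# One leader on a 256-CPU box: a single thread issues 256 wakeups.
# sqrt(256) = 16 leaders: no thread wakes more than 16 others.
```

This is exactly the trade-off being described: one kthread minimizes thread count but maximizes the wakeup delay on big systems; sqrt(N) balances the two levels.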
– [Man] I wasn't talking,
I'm hoping someone else does the writing. – Okay, well, that's a
good point. Right now my assumption is that I can't just have one of them on a really big system, just because the fan-out on the wakeups would start killing me. – My question was, why not just have one per CPU and just have them pinned, and they'd just be idle, they'd do nothing basically if they're not being used. That was my question. – The thing is, you want to be able to move RCU callback processing completely away from– – Right, but my point, what I'm trying to say is, okay, if it's GP zero,
then CPU one is going to talk directly to GP zero knowing that
it's on CPU zero. But you could have a thread where you have one of those for each, but only if you're saying that this is going to be the housekeeping one. Otherwise the ones that are not, the ones being offloaded, they're just gonna be, I mean, unless you do a kernel–
– Let me clarify something. Let's say that I've got a system where I've got a big workload that's half offloaded and half not, and it's like 100 CPUs, so 50/50, right? What's gonna happen is that for the non-offloaded ones, the ones that are running softirq, there aren't going to be any GP kthreads. It only makes them if at least one of the CPUs in its domain, and that's square root of N CPUs, is offloaded. – GP zero is only made when– – [Man] Well, forget the zero, the GP. – When there's actually some
CPU which is offloaded, like– – [Man] Within its domain. – Well, then you could just, I mean, my point is, you just create that guy for the CPU that's going to clean it up.
– [Man] Then people yell at
me for having lots of kthreads again. – Well, so you're basically saying– – [Man] And the other thing
is that if they're all active, then I have
that big wakeup delay on large systems in the grace-period kthread, 'cause when the grace period ends
I've got to wake them all up, and that's what got me in
trouble back in 2010.
basically what you're saying let's do a little more complex
example because it's modeling why I'm getting confused because
it's kind of too much of a simple.
(laughs) is like so let's say if
we have 100 CPUs and if 50 of them are offloaded
in 50 of them are going to do the work for it so how many of these GP threads would you create then? – [Man] 50 or five excuse me five but it would be it would be 10 if they
were all offloaded but only half of them assuming
that the ones that are offloaded are dense in
the cpu number space. – And five and these
five would then just be– – [Man] The square root of 100 is 10 and 50 that so there's
100 CPUs the squared of 100 is 10 but only 50 of them have are being offloaded and they're dense and therefore five of those ten are created. That would be a valid choice
it's just that I didn't make that choice.
– Oh, I see, so you just look at how many available CPUs there are and then you say, my max is total CPUs, and square root of total CPUs is the max limit. – [Man] Yes. – Someone probably should
add that in the notes. Anyway, so basically you make
10 and then you only use half, because there's only half,
and then you do a percentage, I guess, a percentage
of how many are doing it, so you have five. I guess
the affinities are these– – [Man] Well, it depends how
you did it. I'm assuming that, say, zero through 49 are not offloaded and 50 through 99 are, in which case that's what would happen. If you offloaded every other one, then you'd get the full set of 10,
because each of the chunks would have at
least one offloaded CPU.
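The creation rule from this exchange can be sketched directly (a simplified flat, non-NUMA model of my own, not the kernel's actual code): CPUs are grouped into sqrt(N)-sized chunks, and a grace-period kthread is created for a chunk only if at least one CPU in it is offloaded.

```python
import math

# Sketch of the creation rule described above (illustrative model only):
# group CPUs into ceil(sqrt(N))-sized chunks and create one GP kthread
# per chunk that contains at least one offloaded CPU.

def gp_kthreads(nr_cpus, offloaded):
    """Count grace-period kthreads created for a given offload set."""
    chunk = math.ceil(math.sqrt(nr_cpus))
    nchunks = math.ceil(nr_cpus / chunk)
    created = 0
    for i in range(nchunks):
        cpus = range(i * chunk, min((i + 1) * chunk, nr_cpus))
        if any(c in offloaded for c in cpus):
            created += 1
    return created
```

This reproduces the 100-CPU example from the discussion: 50 densely offloaded CPUs (50-99) give 5 of the 10 possible kthreads, while offloading every other CPU gives the full set of 10.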
– Of course, it's an example of a simple
topology, where it's a single flat domain, there's no NUMA, 100 CPUs. (laughter) Okay, my point is, so
basically these five, seven or whatever, the affinity of them now is for all the 50 that– – [Man] That's the administrator's
choice, that's not my choice. – So what do you put, if you just turned it on and they didn't do anything, what's the default? – [Man] No affinity at all,
they're just– – So they can actually run on
the CPUs that actually want the offloading. – [Man] They might, or they might
run wherever; that's Peter's job, not mine. (laughs) I'm assuming that if people are using this, they would control that. Where did the box go? Give him the box, give the man the box. – I was just trying to get back to the housekeeping question. So if I have a four-core system and I designate one core for housekeeping, I only want to– – [Man] Hold that thought, we're
gonna have a slide on that in a little bit. Thank you,
though, thank you for the leading question, because we need to move ahead, 'cause we're almost out of time.
Okay, I'm not gonna go
through this in detail, but this is some extra stuff, and
after doing that I got a two or three X reduction in the
number of outstanding callbacks in the rcutorture test case,
so this did help, is all that really means. Of course, nothing
is ever perfect. This is something you might do, and in
fact it's your example, right? All right, good, I managed
to anticipate one question. (laughs) Okay, so this is something
you really want to do, or some people really want to do, except that if we have this thing where people are doing ugly things, CPUs one, two, three can just easily bury CPU zero. I mean, CPU zero just does not have a chance; they're fairly evenly matched when it comes to posting callbacks and
doing a kfree, so you know, this is a flaming disaster, all right? – [Man] Why are there four rcuop threads? Why wouldn't there just be one? – [Man] Because then I'd
have to respond to somebody doing a sched_setaffinity by rewiring my internals to make only one kthread get it, and then they do another setaffinity and I have to rewire again. So no, I'm not doing that, I'm sorry.
– And also, what would be the
point of having the GP in that case anyway, because there's
only one place to send it, right? It doesn't matter, right? – Look, you have to tell me
why it hurts to have this. Is there a significant
loss? No, there isn't. – Yeah, okay.
– Okay, you're right, it is kind of pointless,
but if I do what you're suggesting, Peter
and I have to have a lot of back and forth every time
somebody does a sched_setaffinity, and that's gonna be complicated. – If you merge them, there
is no point to get 'em– – [Man] Taking them apart would be really hard. – Because then there is only the one PID, so who do you tell to move somewhere else? – Yeah, thank you, that's a better answer. – [Man] This scenario is useful for the case where this is not always where you're going to carve it out; you may move them around elsewhere, is what you're saying.
– Yeah. Okay, and of course, what do we do about
this? I was going to poll
the audience, but we're just doing this
one thing and that's it. What we currently do is number one. Let it be, let it OOM. Okay. How many people think
that's the optimal approach? All right, yeah. (laughs) And Linus Torvalds agrees with you, so there's some indication
that number two might be favored. Yeah, I think I know a
way to do that, but it's hard to make it splat when you want it to, when it's the problem, and not splat when it isn't;
life is just hard, right? To Thomas's point about admission
control: I could detect an overload and tell call_rcu. You know, I can't make it sleep, but I could check to see whether it's legal to sleep and just sleep, you know, or I could just do a spin delay,
right? I could check one way or the other; I mean, that's something I could do.
Go ahead Thomas, give him a box, he has a loud voice
but it's not that loud. – The problem with delaying is you're just moving the problem to a different place anyway. – Well, yeah, the hope is that the guy that does the call_rcu is also the guy that does the kmalloc, but you're right, that might not be the case, yeah. – [Man] I think having it
go OOM or something like that is really useful for just
tuning, right? I mean, you just get an idea that maybe one housekeeping core's not
the right thing, you know. – Although if we can do this, we can give you a little better than that. The other thing, the thing that is actually my favorite from a theoretical standpoint, and I think I might be able to do it but I don't know yet, is: if there's a CPU that's doing massive piles of call_rcu and it's causing this problem, I say, you know, buddy, you're not offloaded anymore, you're doing your own stuff with softirq, get over it.
– I disagree, I like number two as well: the administrator OOMed my machine by doing stupid crap, and that's an administration thing, I don't think this is something bad. – Yeah, I mean, if you do stupid– – Sebastian wants to talk back there, let's give him a chance. – I remember a while ago there
was something in documentation saying that you're not supposed
to allocate memory unboundedly, you have to limit yourself at some point. – [Man] Yeah. – So why are we doing this here? – Because, okay, so the mental
model a lot of people have, and it's a nice mental model to have and I'd like to preserve it if we can, is that as soon as you do kfree_rcu you freed it, okay? So the mental model would be: okay, I kmalloc'ed
it, okay, I kfree_rcu'ed it, so life is good. If I
have to make them account for the fact that they can't see the grace period, they can't see when the callbacks are invoked, they can't see when the kfree happens, it's kind of ugly.
– But they can have some kind of accounting themselves, and at some point it would be synchronized with this if things are getting out of hand.
– [Man] Well, synchronize: right now they can do rcu_barrier; synchronize_rcu would just wait for another grace period, and there might be millions and millions of callbacks still– – [Man] Exactly, but they
are the ones filling up the pile, so that's your limitation: throttle them enough that they have to wait, and then life is good again. – [Man] Okay, all right,
so certainly in the short term that's the choice, because I'm not sure I can make this work right, or at least not without causing some other horrible problem, but okay. But yeah, you're right, that is the current stuff; you can look at it in Documentation/RCU and it says, by the way.
– Yeah. Anyway, okay, I think I've got a lot of good advice, I thank you very much. I think it's also 11 o'clock and somebody else's turn, okay. (laughter) Okay, so I guess I'll,
you're gonna have to, so how much time do I have? Because, you know, okay, all right, good. – Paul.
– Yeah? – [Man] Can I get back to you
on number four, the option there? – Number four, door number four. – [Man] Stepping back:
are you not punting on the original problem that caused you to do this in the first place? The original problem was, oh, a
CPU is doing too much, I have to offload it, and then
you say, oh, when it's doing too much, I'll stop offloading. Why do we do all this? – [Man] So the reason
for offloading is because somebody wants a really clean
execution environment on that CPU. – [Man] And when they screw up, you give it back to them anyway.
– If they screw up, then I say,
sorry, you're out of the pool, you screwed up. Rather
than, oh, I'm the machine, I'm gonna shoot your real-time response in the foot. And it may be that you're in something where you'd rather just explode, so if I do that, I'd probably have to have a boot parameter that says, you know, if you detect that, just blow the machine up so that the secondary can take over and reboot and fix it, or something. I don't know what the
best form for that is, but I'm sure that if I manage to get this working and I send patches out, I'll get plenty of advice.
– [Man] Yeah, I definitely
like number two, because the splat is like, hey, this is going on, and you know, my bad, let's clean it up. – Okay, all right, cool, so Linus will be happy. Okay, I'm gonna skip this one,
because it's kind of just there, and I don't have any real question at this point. This one I do have a question on. Somebody had an interesting
method of doing a bug report: they posted a blog, and I happened to come across it a few years later. (laughter) I don't have any way of contacting them, so I can't suggest that they improve their bug reporting, but if you know who they are, that'd be good advice for them. Anyway, the thing was, it used to be that there was a Kconfig option that said which CPUs were gonna be offloaded, okay, and you could say offload all of them with just a single configuration. And I got a lot of heat for having too many Kconfig options a few years ago, so I said, fine, I'll turn this into a boot parameter. What that means is you just give the boot parameter a CPU list, and if you say, I want CPUs one through seven offloaded, for an 8-CPU system, and then you plop in a 16-CPU system, you have to change your boot arguments, okay. Well, people
were complaining about this, because they used to be able to just say offload everything with a Kconfig option, and they had no way at that point to offload everything anymore.
They had to, you know,
change the boot arguments now, where before they didn't. Many of you might sympathize with their viewpoint on the matter. Anyway, so what I did is I made it so that it just checks if there's a non-numeric, and if there is, it sees, is it "all", and if it's "all" it does it. But that kind of raises a question: there's a generic thing I call that parses a CPU list, okay, you know, 1-5,3, whatever it is, and it can be a complicated CPU list, but whatever. Should we allow a trailing dash with nothing after it to say "the rest of the CPUs"? I don't know, because I was
lazy, okay, all right, okay, all right, cool. So that's that. And then I have a thing: if you get an RCU CPU stall warning, there's a kernel parameter you can set so it'll automatically do an ftrace dump and pop it out on the console. It's been really useful to me; maybe it's useful to you guys too, I don't know, okay.
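Returning to the cpulist question above, the trailing-dash idea could be parsed along these lines (a hypothetical sketch, not the kernel's actual cpulist parser): "all" selects everything, "a-b" is a range, and "a-" with nothing after it means "through the last CPU".

```python
# Hypothetical sketch of the cpulist extension discussed above: accept
# "all", plain "a-b" ranges, single CPUs, and a trailing dash ("4-")
# meaning "through the last CPU".  Not the kernel's actual parser.

def parse_cpulist(spec, nr_cpus):
    """Return the set of CPU numbers selected by `spec`."""
    if spec == "all":
        return set(range(nr_cpus))
    cpus = set()
    for part in spec.split(","):
        if part.endswith("-"):                    # "4-": the rest of the CPUs
            cpus.update(range(int(part[:-1]), nr_cpus))
        elif "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus
```

The attraction is that "1-" keeps working when the same boot arguments are moved from an 8-CPU box to a 16-CPU box, which was exactly the complaint about the fixed list.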
Give the man a cube, he's got to yell. (laughter) – Too late. No, the number of times I've actually had to go in and modify my kernel and put an ftrace dump into the stall code, this would have been so much more useful. – Okay, well, it's there now.
– So yes, please, thank you. – All right, there's also a
kick-kthreads thing. What can happen is that if you configure your system in funny ways, you can prevent the grace-period kthread from running at all, and eventually you OOM. There is this guy here, rcutree.rcu_kick_kthreads: if you set that on boot, what will happen is, when it detects that the grace-period kthread isn't getting any CPU, it'll
splat and then kick it, okay. – What do you mean by kick? – I do an unsolicited wakeup, which helps in some weird cases.
It isn't quite, I didn't say it quite right just now,
but it means that if you've done something funny and messed up the waking up, then you
can give it another chance.
Yeah, and then let's see, this thing here was a request. Sysrq-y already has a definition, but it seemed to me to be the most specific one, so I have a kernel parameter for this one that says, well, since sysrq-y doesn't do what it used to do for those people that used it for that, I can't remember what it was, sorry, instead what it does is it dumps the hierarchy of the rcu_node tree, which can be useful in some cases. My thought is to expand that to include the offloading topology as well, if I haven't already; I can't remember whether I did or not, but I'll check, assuming somebody takes notes. And then, as a complement, Sebastian: if you say rcutree.use_softirq=0, then it'll use kthreads instead of softirq, and at that point RCU doesn't use softirq at all. I think that's hit mainline already; if it's not, it's coming up in the next merge window. Anyway, at this point I've used my time and a little bit. Thank you all very much for your time and attention, and maybe next time I'll take a few of these topics up and not use slides.
(laughter) (applause) – Oh, just one question:
when you say that you have in your mind the model
of RCU, do you see it as state machines? (laughter) – That depends what state I'm in. (laughter) – He's my manager, I
will not interrupt him. (laughter) So that's me, Daniel, and the idea here is to talk a little bit more about latency, which is still the main metric; even though we would like to have more deadline-oriented ones, we're still in love with latency. When we run cyclictest we usually get this kind of distribution: we have almost all the latencies here in the low and average range, and then we have a tail that goes very long. Two years ago Julia and I started talking about using probabilistic methods to try to figure out what the worst-case latency would be, for example using extreme value analysis, and well, latency is a good metric, it helped us, but in the end it's an opaque value, because it's composed of many things. Okay, I know that Thomas knows, I know that Peter knows, I myself know more or less how it works, but still it's not clear to everybody what composes the latency and what the worst-case scenarios could be that give us a worst-case latency, and this makes it hard, if not impossible, to use probabilistic worst-case execution time analysis, right? But as we know, inside our minds, it's composed of many code paths that are somehow independent, right? How can we improve this? We could break the latency into small pieces that are independent. I know that this piece of code is independent of the scheduler, and the IRQs are independent of the previous task that was running and disabled interrupts. So one good thing would be to break the latency into independent variables; then it would be easier for us to apply those probabilistic methods to figure out the worst case of these small pieces, because they would give us distribution curves that, how can I say, make more sense, rather than having a composition where we could not take any correlation into account. So we break it up, measure these values, try to observe their worst-case values, try to use a probabilistic method, and then
we sum them back together, trying to find a possible worst-case scenario that we somehow observed, but not necessarily one piece right after the other. When we sum all these independent variables, we can get a possible worst-case latency that could happen theoretically but that we might not have seen while using cyclictest, because the pieces took place at different times. To clarify it a little bit: things get better when we draw them.
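The composition argument just made can be shown with toy numbers (the components and values below are mine, purely illustrative): the per-component worst cases are observed in different runs, so their sum is a bound that can exceed anything cyclictest ever measured end to end.

```python
# Sketch of the composition argument above (toy numbers, not measurements):
# each observation is (irq_disabled, scheduler, irq_interference) in us.

samples = [
    (5, 12, 3),
    (9,  4, 2),
    (2,  6, 8),
]

observed_end_to_end = max(sum(s) for s in samples)       # worst seen together
composed_bound = sum(max(col) for col in zip(*samples))  # sum of per-piece maxima

# The per-piece maxima (9, 12, 8) never occurred in the same run, so the
# composed bound of 29 exceeds the observed end-to-end worst case of 20.
assert composed_bound >= observed_end_to_end
```

This is the whole point of decomposing: a measurement-based tool only sees the combinations that actually happened, while the sum of independent worst cases covers combinations that merely could happen.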
So let's try to observe
the latency, thinking of it this way: the latency is the delay that the highest-priority thread may suffer in its activation. So we can say that, in the end, we want to see when the context switch happens, from the task that is running to the task that actually has the highest priority now. This would be the end of the story. The context switch always takes place inside the scheduler call, and it always takes place with IRQs disabled, right, and preemption disabled; the scheduler is always called with preemption disabled but IRQs enabled. And then the thing that
notifies us that we have a new highest-priority thread, independent of any scheduler, is setting NEED_RESCHED, right. And setting NEED_RESCHED, one of the restrictions in the kernel is that it always happens with interrupts and preemption disabled, necessarily. So in the best case it happens here, when another thread already called the scheduler and we were on the way to call the scheduler; it's the best case, everything was ready when it arrived. We can have it here, before the final decision of the scheduling, but in preparation for a scheduling, like we have in the code in a read-write semaphore, or in the code that flushes data to the disk; I think there is some code that's actually around here.
Okay, and in this context we can also have IRQ handlers and we can also have NMIs, right. Oh no, no, no, these are the NEED_RESCHED cases, and these are, it's not setting NEED_RESCHED, right. It's adding up latency, yeah. – You're just enumerating
the components that make up this big latency that
we're dealing with, okay. – Yeah, that's the final, ha ha, but we were right there. So, step by step: okay, that was the optimistic case, where another task has already called the scheduler for its own reasons. But in the regular case we will have another thread running, and then we can have NEED_RESCHED set in any context, right, because of a low-priority task. But the good thing about
PREEMPT_RT, and the property that it provides, that gives us the determinism that we see, is that if we have a NEED_RESCHED
and it takes place while our code has both IRQs and preemption enabled, we will not have any other kind of code running here; we will directly call the scheduler, right. And that's PREEMPT_RT: if we have code running here, we have a bug, and I know a way to catch it with state machines, but I will not talk about that now.
Not today. Oh yeah. Okay, so also in this part of the code we can have interrupts and NMIs trying to push further away the thing that we want,
the context switch. The IRQ-disabled sections
might push IRQs to later execution, right, but as we know that the scheduler is always called with IRQs enabled, even though we can push them, they will always execute; an interrupt that took place in this time window will always execute before the context switch, so it will end up adding to the overhead here, right, until this point where we disable interrupts and we have the scheduler. Okay. – You have the same
problem after the context switch, because if there
are interrupts piled up, they will fire away. – Yeah, and that's
one question I have. He's raising the point that we can have interrupts after the context switch that may add to the latency we observe with cyclictest, right? – Right.
– Yeah, but how should I, or should we, model the latency: is the latency until
the context switch, or until the task starts to run its code and returns to user space?
– Yeah, actually, what I think is that
cyclictest is doing the right thing, because that's what the application cares about. – Okay, so I should observe
until it really returns– – It doesn't help you if
your context switch is on time and then you get
delayed by 500 interrupts before you reach user space– – Perfect, that's something
we need to clearly define: what do we consider as latency, and– – I actually agree with
Thomas that what we declare as latency is exactly that, but what I like about what you're doing, it sounds like, is when you have an anomaly, it may help
you see where it happened.
– Of course, I mean, decomposing
the elements which can contribute to the latency, in order to do some better estimation of what it could be in the worst case, that's okay. But for the application, of course, you can just say, this is my scheduling latency; from the application's view, the scheduling latency is: what do I care about? – No, actually, a (mumbling) real-time task is
worried about, you know, more like how often;
okay, I could handle a few worst-case scenarios, so if you know what it is that causes them, you can say, okay, this happens periodically, it's deterministic that I might have an outlier, so this might help with deterministic latency that– – Wait, no, yeah, but we can continue. – I don't think your final
answer will actually change depending on where we put the points.
I think it's moot.
I think it's moot. – [Man] I'm sorry I didn't get that. – I think it's irrelevant
where exactly you placed a boundary I think the
math will in the end be the exact same anyway. – Yeah, yeah that's the
point I just need to know to clearly define until where I need to try to measure things because I can pile up things here right? I can say that after
the context switch I can still have this time here which is with interrupts disabled
then I will necessarily enable the interrupt before enabling the before return from the scheduler before enabling interrupt preemption back again so yeah it I will just
pile something in after it – Just watching so the one thing you seem to be missing there is the time between well for return to user
mode, if you add that– – Yeah that's the that's the agreement yeah and but still these parties is still valid right it will only pile in the end but good so trying to decompose the
things that are somehow independent right? Are you India distractions our sewer just a task a thread right the interrupts our tread so they are included here in the priority because we don't have this software EQ context.
So dealing now with this case, right? I will add the other contexts to the pile later, but dealing with this case here: we will have this part here, which would be the worst-case interrupts-or-preemption-disabled section that we could observe, right? Then we have this part of the code here, the scheduler code that is called before we actually disable interrupts to finally do the context switch, and inside it we can still have some interrupt enable and disable and other stuff, right? But as we know, until here, until this last IRQ disable, we can have IRQs enabled. We were already here, but that's just the preceding part, so at this point any IRQ that happened here will be delayed for later. For the case you mentioned it would count, but assuming this previous step it would not count, right? So this part is somehow independent, that part is somehow independent, and it varies whether we are preempting or not, because the flush of the block device, I think,
it happens here, right? So these parts are easy to decompose, and the problem is when we arrive at these two guys. I know that I will have to make a mathematical function that gives me the amount of time that IRQs could consume in this time window, the preemption-disabled window, because it's the window in which interrupts could postpone us. So I need to derive a function that explains this, and for NMIs it will take the whole window, because we cannot postpone an NMI.
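The interference function described here is, in classical response-time analysis, usually written as follows. This is a standard-notation sketch, not a formula from the talk: \(T_i\) and \(C_i\) stand for the minimal inter-arrival time and worst-case handler cost of IRQ line \(i\).

```latex
% Interference from maskable IRQs in a window of length \Delta,
% under the purely periodic (sporadic) model:
I_{\mathrm{irq}}(\Delta) = \sum_{i \in \mathrm{IRQs}} \left\lceil \frac{\Delta}{T_i} \right\rceil C_i
% NMIs cannot be postponed, so they must be charged over every
% window, including the IRQ-disabled ones:
I_{\mathrm{nmi}}(\Delta) = \left\lceil \frac{\Delta}{T_{\mathrm{nmi}}} \right\rceil C_{\mathrm{nmi}}
```

Taking \(T_i\) as the smallest inter-arrival time ever observed makes this bound safe but, as discussed below, pessimistic.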
The problem is, when we are in the real-time world and we try to make these models, we need to characterize how IRQs take place in the system, right? In theory we can just say: okay, each IRQ has a period and they take place periodically, all right, and then– Sorry? Yeah, and I agree with you, so let's say– – And the crypto people would also disagree. – Yeah, yeah, so what is the point? If we try to measure the minimal inter-arrival time of an IRQ, we will have pessimism, right? So this method– okay, it's safe in real-time theory to add pessimism, but it's bad because the result gets less realistic. But one other approach, known in the academic work, is that I can still use the periodic approach, but bounded within a given window, right? So I could say: okay, in my real-time system I observed one burst of IRQs, but they don't always take place; let's say that this time window here would be at most 100 microseconds, right?
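The bounded-burst refinement he sketches here could look like the following. This is an illustrative sketch, not code from the talk; all names and the microsecond units are invented for the example.

```c
/* Interference bound for one IRQ line in a window of length window_us:
 * either the purely periodic model (an arrival every period_us), or a
 * bursty model allowing at most `burst` arrivals per burst_window_us,
 * each arrival costing cost_us of handler time. */

/* ceil(a / b) for positive integers */
static unsigned long div_ceil(unsigned long a, unsigned long b)
{
	return (a + b - 1) / b;
}

/* Purely periodic (sporadic) bound: pessimistic if arrivals are bursty. */
static unsigned long periodic_bound(unsigned long window_us,
				    unsigned long period_us,
				    unsigned long cost_us)
{
	return div_ceil(window_us, period_us) * cost_us;
}

/* Bursty bound: at most `burst` arrivals in any burst_window_us window. */
static unsigned long bursty_bound(unsigned long window_us,
				  unsigned long burst_window_us,
				  unsigned long burst,
				  unsigned long cost_us)
{
	return div_ceil(window_us, burst_window_us) * burst * cost_us;
}
```

For example, with a 100 us window, a 2 us minimal inter-arrival and 1 us handler cost, the periodic model charges 50 us of interference, while "at most five arrivals per 100 us" charges only 5 us: same safety argument, far less pessimism.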
No, just finishing this: then we could try to assume that, okay, I can have this burst, but in this time window I have at most five. – Have you been talking with Wolfram, and about Daniel's talk? Because it seems like some of the stuff that they're doing with the analysis– – On the arrival of the IRQs, right? – No, they're doing the jitter analysis– the jitterdebugger doing statistical analysis over a long time, to be able to figure out, through statistics, what the maximum is. It sounds very, very similar. – Yeah, it's similar; it can be similar to work that many people are doing, but I'm trying to put it into the real-time academic framework, because they complain that for them the latency is a black box.
– [Man] Do you model your interrupts per interrupt line, or is everything just an IRQ? – I think it will be easier to have it per interrupt line, it's a– – [Man] Because then you could measure your minimum period for each interrupt line and consider that your worst case. – Yeah, yeah, and that was the original idea, to use per-IRQ-line modeling, because you can also do the schedulability analysis for your IRQs– because, for example, on Intel there are fixed priorities. – On Intel there are also shared interrupts. – Yeah, but we do interrupt– – [Man] Interrupt statistics for idle prediction, isn't that the same thing, actually? – Yeah, but then we get to the difference between this idea and that idea. As far as I know, for the interrupt prediction you consider all the interrupts together, not each one separately; am I wrong? So it's good.
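The per-line minimum-period idea just described could be sketched like this; the function and its inputs (trace-derived timestamps for one interrupt line) are illustrative, not from the talk.

```c
/* Derive the pessimistic "period" T_i for one interrupt line as the
 * minimal inter-arrival time observed in a sorted timestamp trace. */
static unsigned long min_interarrival_us(const unsigned long *ts_us, int n)
{
	unsigned long min = (unsigned long)-1;
	int i;

	for (i = 1; i < n; i++) {
		unsigned long delta = ts_us[i] - ts_us[i - 1];

		if (delta < min)
			min = delta;
	}
	return min; /* safe to use as a period, but pessimistic */
}
```

Using this minimum as the period of a periodic model is exactly the kind of safe-but-pessimistic assumption mentioned above: one close pair of interrupts in the trace dominates the whole model.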
– [Man] Actually, it looks like this is the same thing, except that we treat timers in a special way in that code, but– – Yeah, that's the kind of information– The idea here is not to implement it, the idea is to describe it.
– It's just one thing, for that and for the idle prediction; it looks like the same problem. – And how do you guys measure the– okay, you guys measure the occurrence of the next interrupt, right? – [Man] No, we keep track of when they happen. – Okay. – And then we try to predict when the next one is going to happen. – So basically you log every interrupt, and then when idle– when you need to predict the idle duration– you just apply statistics to the data you have. – And that's good; just
to clarify, that's good, because the more things we get ready, the easier it will be for me to turn this into a paper and get rid of the idea that latency is something the Linux guys never cared about, in the real-time field. – [Man] Were you at the tracing talk? – No, I was finishing the slides for the– – [Man] Of course, but like I
said, the histogram synthetic events give you– you can actually say: trigger events only when you're where you want them to be triggered. – Yeah, yeah, then that's how we will measure, right? And these are the tools for doing the measurement. – It's already in mainline. – But that's not relevant. He's doing the abstract description; you're talking about actually doing stuff, that's completely different.
(laughter) – I was just talking about the implementation: like, you only care about information in a specific window, and you can do that today– I think: only watch this trigger while I'm within this window. – So he's talking real versus theoretical. – [Man] No, yeah, yeah, I know we have all these tools. What I want is to find an agreement between us on what I should use, so that, okay, we think this would be the most realistic thing that we can do. – [Man] Well, so the IRQ statistics for
idle are about exactly the same thing: what you want is to figure out when the next interrupt could possibly arrive, from the start of the idle period. – [Man] But are you guys
trying to be as pessimistic as possible, or trying to find an average value? – [Man] Now, being theoretical– – It's a weighted average. – Yeah, in the real-time field, I know, I would add pessimism, but that's the way they think and the way they like it. – [Man] So we are pessimistic, because we want the worst case, basically. – Perfect, even better. – But the other thing I
wanna say is, you know, this is a mathematical theory. The whole idea is: your job is to go do the papers, do the research, and come back and tell us, oh, this is what I found, and you guys are all wrong anyway. So this, theoretically, is what we want.
– No, I believe, I believe, and I'm sure that Linux works, and we know how it works, but we don't have a clear description of how things work. – Why it works. – And why it works. – We know it works, but not why. – Yeah, that's the– – Part of the job is to get
the theorists to a more refined level of confusion about what the Linux kernel actually does. – Yes, but it will not be an opaque confusion; it will be a clear confusion, or a clearer confusion. (laughter) – I'm not willing to accept that correction; it'll be more refined in some way, whether it's clear or not, maybe. I would hope so, but–
– Refined confusion? – Yeah, refined confusion. – The more problems I can address with this, the better, because the more we show that we actually have some reasons why Linux works very well as a real-time system– we know it works empirically, you see many people here talking about RT– so we would like to try to clarify this, and so we could– Throw the black box away.
– Who needs it? – [Man] I wanted to say, one use case for having a model that describes operating-system quantities such as latency and so on, that I find myself needing as a performance analyst, is when I have lab results. I do a lot of experiments, and I have to decide if my result is statistically significant, so I have to use formulas to compute the p-value, for example, and then you need to know the variance of your random variable. What I do is assume it's a Gaussian bell– I assume it's a normal distribution– but obviously it's not, and an accurate description of this stuff could give you the formula that you plug into your t-test to decide: yes, this is an anomaly– I mean, whether there is a real change, you know. So models help in doing statistics when you measure stuff, to know whether you are observing a difference or not. So it's not completely– I mean, having a model is helpful, but I think something you need to ask yourself before you go down and model something is: what do I need this model for? Like, what is the question you're trying to answer? 'Cause obviously– – [Man] What is the question
you're trying to answer here? – The paper will have this part, explaining this thing, and then we'll go to the experimental part, doing the measurements and trying to figure out a high latency that could actually happen but does not necessarily happen while running cyclictest. So it's good for us to define what would be the most– okay– Okay, so we need to clarify what
our metrics are, because in scheduling theory they assume that the event of waking up– the activation of a task– is an atomic thing with no delays, and they don't consider these delays in the scheduling analysis; that's why, when we try to fit any scheduling model onto Linux, it doesn't work perfectly. Does this answer your question? – [Man] Yeah, sure, sure.
– So you could do one of two things: you could come up with a closed-form function for what you have there, and then show that the measured numbers that people have fit that equation. – Yeah. – Or you could use the empirical data that we have and come up with a curve-fitting strategy, to come up with a function that actually reflects the measurements. So what's the approach you're going to take? You're going to come up with an equation first and then show that it actually holds? – And that's incremental work, yeah, you're reaching the next step, so– – You could first come up with where you expect it to be, do the analysis, and then you have to go back anyway and say: wait, why was I wrong? – Yeah, no, there are things here that are more or less deterministic in the end, right? You have the deterministic part here, which is the easy part; now we have to model these other things. But with the
current latency metric we have one number that we cannot fit to anything with any level of assurance, but by breaking the pieces down we can probably– and I hope– find better ways to use statistical methods to define what the worst case would be.
– Right, but the problem is this is so much different from the workloads that you're going to be running. – No, sure, this is obviously dependent on the workload. Sure, but in the theory, that's why we have variables here: we don't know the numbers, the numbers will depend on the workload. In real-time theory we try to come up with the model– give the model and the variables– and then each system fills in the variables, to say: okay, in my system I observed these things, and it depends on this and that, and then we use probabilistic methods. Yeah, the model will be the same; the numbers, the variables, will be system-dependent. Is that a good answer? – One thing you could show
with this as well– I mean, one of the key things you can do there– is to reduce the amount of CPU time it takes to reproduce a worst case, so things that could take years to reproduce, you could perhaps reproduce in seconds or minutes. – Yeah, that's another point, because we don't necessarily– – Yeah, 'cause, I guess, this brings up looking at different types of interrupts, and if you see a period, you might be able to say: wait a minute, if certain things line up, then maybe in 10 years there's a possibility that these will all line up, and then you have this huge latency.
– Yeah, that's one of the missions– one of the aims for the performance measurements– because with cyclictest we observe the worst case of the things that happened during my measurement. But it could be the case that this latency here was caused by this part being very long while having very short interrupt interference, and this one here could be this part being very short and that part being very long. So we might as well take the worst-case IRQ interference and sum it with the worst-case preemption-disabled section that doesn't include the IRQs, right? It could happen that this worst-case preemption-disabled section takes place at the same moment as this worst-case IRQ, and they sum up; and it's correct to assume this because these are independent variables. That's the kind of thing the model tries to clarify, to decompose. I agree that we have very different ways to model these things, and different ways to apply, later, extreme value theory to find these values, but first we need to decompose and make things clear, for people to be able to apply these methods, because with the current metric they would find a value that doesn't make sense and doesn't adhere to the theory.
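The decomposition argument just made can be shown numerically. An end-to-end trace only ever sees max(a[i] + b[i]); with independent components, the bound max(a) + max(b) is valid and possibly pessimistic. The arrays and names below are illustrative, not measured data.

```c
/* Largest single value in a trace of one latency component. */
static unsigned long max_of(const unsigned long *v, int n)
{
	unsigned long m = v[0];
	int i;

	for (i = 1; i < n; i++)
		if (v[i] > m)
			m = v[i];
	return m;
}

/* Worst case actually observed end to end (the cyclictest-style view):
 * the components are only ever seen in the combinations that happened. */
static unsigned long observed_worst(const unsigned long *a,
				    const unsigned long *b, int n)
{
	unsigned long m = a[0] + b[0];
	int i;

	for (i = 1; i < n; i++)
		if (a[i] + b[i] > m)
			m = a[i] + b[i];
	return m;
}
```

With a = {5, 1, 2} (say, preemption-disabled time) and b = {1, 9, 2} (IRQ interference), the observed worst case is 10, while the decomposed bound is 5 + 9 = 14: the bound also covers the combination the measurement never happened to see.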
– I have a question about validating some of this– not really about your work– but is there a way to find out when an IRQ is actually posted? Is there a hardware mechanism? Does the hardware keep track of when the IRQ is actually posted, in a modern system? – That is a good question. – We can't hook up scopes, because it's all messages across buses. – The device fires off the message and it's gone, so there's no timestamp anywhere. – Yeah. – There's some hardware which actually carries further information about when it fired: if we have hardware timestamps on the packets, we can infer the time when the interrupt was actually fired from the timestamp on the packet. But other than that, no. – Yeah, but that's the good thing about decomposing things, because this prediction is the hardest part, right? We don't have a clear description of it, but we can deal with the pessimism here while having this other part clearly more defined. – Right. – And actually, as kernel developers, we care about this part, and we will try to reduce it as much as possible, because that other part is out of our control. – Right. – It's out– it's in the sysadmin's control; for example, he could move some IRQs to another core to try to reduce the latency here. – So you need to remember,
IRQ latency can vary depending on the source, on energy-efficiency features which can be enabled, like ASPM in PCI, which, if you enable it, can delay things– yeah, it can extend the worst case by quite a bit if it needs to. So essentially, yeah, you are right: whatever the admin does will influence the interrupt latency. – [Man] Yeah, yeah, and that's good, because in the end we could end up helping sysadmins to reduce the latency, based on more precise information about what is contributing to it, without having to add a lot of extra instrumentation to observe something that you may not observe again. – Be careful here, because the sum of the worst cases is not the worst case of the sum.
– No, no, I mean, if you really want to be rigorous, that's what you get, we– – [Man] I'm putting my shepherd hat on: we're 10 minutes into the break; if we want food and coffee, we'd better leave right now. (laughter) – This talk, I think, was
supposed to be about mainlining softirqs for RT, but it's actually an entirely different– it's actually, (laughter) it's actually some kind of pet project, so– what was that?
– A bait and switch? – No, no. Okay, so it's about the patch set I posted several months ago, but before I get into the details I'm going to first introduce the current state of the softirq code. I actually only have guesses here, because I believe the design has been the same for decades. Yes, because it has that taste of ancient code– which actually tastes nice, I like it. So yeah, it's very
straightforward code. It's essentially an all-or-nothing switch: when you disable softirqs, for example with the local_bh_disable() function, you disable every vector. So imagine that you want your code to be safe against the networking RX softirq vector: you're going to also disable every other vector– high-resolution timer vectors, block, timers, RCU, everything. And it's the same for vector execution: when one networking vector is executing, the other vectors cannot execute at the same time– on a single CPU at least, because of course softirqs can execute concurrently across CPUs, but not on a single CPU.
– So, what I think RT wants– I'm not going to take much risk guessing, but I think– so softirqs are annoying, just like any interrupt, for latency-sensitive tasks, because they are in the way: when a task wakes up and has critical code to execute, with deterministic latency expectations, interrupts are in the way, and softirqs behave just like hard interrupts in this regard. It's also the same with softirq-disabled sections; they are annoying in the same way. So RT wants, I believe, to preempt softirqs– you do, but more on that later. They want to preempt or interrupt softirqs in order to execute more important, higher-priority code, and they also want to make use of priority inheritance, such that when a task depends on a lock held by a softirq, they want that softirq to complete fast, so the high-priority task can get the CPU. Also, we want softirqs to interrupt other softirqs, for the same reason of priority inheritance. – That's also wanted for mainline, yeah. But you don't have priority inheritance there. No, but the thing is that we only have this all-or-nothing on/off mechanism. The thing people were complaining about: the network guys made sure that, once they switched into softirq thread mode, the next interrupt wouldn't start it all over again in the return-from-interrupt path– but that also affected every other softirq vector, which means tasklets got delayed as well, and whatever the hell else got delayed. So there are nasty hacks in there to make this work, which is– yeah, that softirq mask hack is horrible, and it needs to die.
He has a slide on that. – I'm going to; it's just a teaser. Okay, so now, what I believe mainline wants: that's roughly what Thomas just explained. Some softirq vectors can really eat a lot of CPU, and it's mostly about networking softirqs, but it could also be the case with tasklets, or with block when you have a large stream of block requests arriving. But there is a balance to find here, because you don't want to starve the softirqs: you want the softirq to handle all the packets arriving, but at the same time you also want the user tasks that depend on those packets to actually process them. So you have to find something in the middle, and it's very hard to achieve. Mainline uses some sort of balance in interrupt processing: softirqs are usually executed at the end of hard IRQs, and we try to switch to threaded processing when the load becomes too heavy. But of course that has a drawback, because if you offload to threaded processing you might also suffer from some delays– you might lose some networking packets, so– – [Man] The networking case is not the problematic case; the networking case forces it into the thread, and then it starves everybody else.
So that's what people were complaining about, and that's why we have that nasty thing there. – Yeah, so when you defer the softirq processing into threaded mode, every softirq is going to be deferred there. For example, if you have tons of networking packets arriving, we offload the computation to ksoftirqd, and then every subsequently raised vector is going to execute in that threaded mode. And that's a problem, because when ksoftirqd runs– well, it's the scheduler that decides– and since we still need quick handling for many of these softirqs, it's a big issue. So this is the hack that tries to work around that: tasklet softirqs often need very quick processing, so when every vector is being deferred to ksoftirqd, but an interrupt still fires and enqueues a new softirq– a tasklet here, for example– and we want to execute it right now and not wait for ksoftirqd, we have that hack that ensures it is done right away. And it's not very pretty. So, the RT solution, to cope with, I guess, most RT needs– I guess most of you know this– is the threaded softirq implementation: most of what was executed at the hard-interrupt tail is now executed in threaded mode. We also have per-vector granularity, so we have, I believe, one ksoftirqd per vector, right? No, no? Oh, you had. Okay.
– [Man] We had, and that had its own set of problems, exactly because we could not separate them. – So how do you– so you have only one– yeah. – Oh, okay, so– – We tried to split them apart, but because we have no real rules about which ones can run concurrently– because we do not have what you want to do– we ran into trouble. – [Man] Yeah, so that's on the next slide, yeah, okay. – Do we still have the thing where,
like, whoever raised it– when we do the re-enable, perhaps we execute it at that moment? – It's basically the same as what we do in mainline: if you look at the local_bh_enable() part, if you're the one who is re-enabling it, then we still handle it. But on RT it's slightly different: we only handle those we raised ourselves in that section.
– Right, and we do it in that thread: whoever raised it executes it, it's not done in any different context. – If the raiser– basically, let's say you have a thread submitting a network packet and it raises NET_TX, then it executes it immediately when it drops the BH re-enable. – It's an inline execution. – Yeah, that's what mainline is doing as well, but the difference is that in mainline, if– – They switch contexts– – A lot of stuff is added during that time while you have BH disabled, and you do all of it, so you also do the unrelated ones, not only those which you kicked. So the thing is, yes.
– So that's right, we're thinking about getting this in, 'cause I thought this was an idea worth putting in– I mean, this isn't even RT-related, it sounds like– actually, to me it seems like a better solution than what's currently in mainline. – Yeah, but he comes to that on his next slide. – So yeah, that was maybe– – We would love to go back to
that model, but we can't right now, until you finally– – I guess I'm here to try. – Do your job. (laughter) – Do my job. – Get your act together. (laughter) – So yeah, that was actually my worry. I was sitting yesterday evening in the dark and I was thinking about (laughter) what happens if a softirq accesses per-CPU values that are really only accessed locally and assumes that– I mean, there is no concurrency: if the per-CPU value is only accessed by one vector, or whatever softirq, there is no concurrency and we don't need any locking, at least in mainline. But if we were to have concurrent vectors, that would be a problem. I don't know if– – That's why we gave up on it.
– Yeah, exactly, so that was my guess. – But with the fine-grained control we can say: hey, I only care about that particular one, and I know nobody else touches data which is protected by this. But let's talk about it once you've shown what you want to do. – Okay, okay. – He doesn't want to go to that next slide. (laughter) – Everything is on the next slide. (laughs) – So here is the proposed
solution. It's not meant to solve everything, but to help both RT and mainline. The idea– in case you missed it, it's softirq per-vector masking; you can even find an article on LWN about it. The goal is to allow softirq-disabled sections to be soft-interruptible, which means: right now, when you disable softirqs, you disable all of them; the point is to be able to disable just one vector, or one set of vectors, and let the other vectors interrupt this softirq-disabled section. Of course this comes with a new set of APIs: essentially extensions on top of the existing local_bh_disable() and spin_lock_bh(), write_lock_bh(), read_lock_bh(), and all these APIs, with an _mask suffix, and you can pass the vectors you want to disable. It returns the previous mask, which will be restored upon re-enablement, so these disablements can stack, they can nest. It's not pretty, but it's more granular, right?
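The save-and-restore nesting just described can be modeled in a few lines. This is a userspace toy sketch of the proposed API's semantics, not the kernel code: the function names approximate the patch set's naming, the state is a plain global rather than per-CPU, and there is no interaction with preempt counting.

```c
enum { TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ, BLOCK_SOFTIRQ };
#define BIT(n) (1u << (n))

static unsigned int bh_disabled_mask;	/* per-CPU in the real kernel */

/* Disable the vectors in `mask`; return the previous mask so that
 * nested sections restore exactly the state they found. */
static unsigned int local_bh_disable_mask(unsigned int mask)
{
	unsigned int prev = bh_disabled_mask;

	bh_disabled_mask |= mask;
	return prev;
}

static void local_bh_enable_mask(unsigned int prev)
{
	bh_disabled_mask = prev;
}

/* A vector may run only if it is not in the disabled mask. */
static int vector_can_run(int vec)
{
	return !(bh_disabled_mask & BIT(vec));
}
```

A section that only needs protection against NET_RX would disable just that bit, and a nested section disabling TIMER on top would, on exit, restore a state in which NET_RX is still masked but TIMER is runnable again.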
– Yeah, it makes a lot of sense even outside of RT, because why do I care about some random tasklet if I'm doing networking, and the other way around? I mean, this– – It's a big kernel. – Yeah, you care if there's shared data, but then you should know that they share it. – It's equivalent to the big kernel lock. – Should we know that they do? I mean, there are so many vectors– – I mean, right now you can't just tell, but people should go and– softirq disable, like preempt_disable, is– – A big hammer? – The big kernel lock at CPU level. And we all know how well semantically defined the big kernel lock was: not at all. Nobody knew what it was protecting,
but if you ripped it out, things fell apart. – Yeah, yeah. And we had to convert it side by side. – It's our own version of the BKL. – [Man] And getting rid of the BKL, that took what, 10 years? – Nobody knows what it protects. (laughter) – Like, that's this lock. Sorry, no, that was a bad joke. Yeah. (laughs) But the problem is, it might be a bit different from the big kernel lock, because maybe mainline doesn't suffer that much from these issues. I mean, do we have no symptoms– – Talk to the networking people. – Yeah, oh yeah. They actually like that patch. – Well, I mean, right
now, with networking, you're getting networking packets coming in faster than the CPU that's processing them, so that's exactly where they're going to hit all the issues. That's exactly when it triggers: when the networking comes in so fast that it just stalls the whole CPU and the whole thing basically live-locks. – Yeah, or it's coming in so fast that the current implementation of softirqs cannot keep up. – Well, no, no– – The problem is that once
they– I mean, if they do it on return from interrupt, you never get anything else done, because you're in return from interrupt. If they push it out to the thread, it's perfectly fine for networking, but then it breaks the other pieces. – Yeah. – That's why you have this make-my-eyes-bleed hack there. – But yeah, for mainline there would be a solution, actually a straightforward solution: to have a ksoftirqd pending mask which only takes the vectors that need to be deferred, and the other ones can execute before you reach ksoftirqd. – Yes, but then what people would like
to have is actually a way to go back to that model which we had earlier in RT, where we had threads per vector.
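The "ksoftirqd pending mask" idea just proposed can be sketched as a toy model: only the vectors in a deferred mask wait for ksoftirqd, while the rest still run from the hard-IRQ tail. All names are invented for illustration; the real softirq pending state lives in per-CPU kernel data.

```c
#define BIT(n) (1u << (n))
enum { TIMER_SOFTIRQ, NET_RX_SOFTIRQ, BLOCK_SOFTIRQ };

static unsigned int pending_mask;	/* vectors currently raised */
static unsigned int deferred_mask;	/* vectors left to ksoftirqd */

/* Vectors still executed immediately, at the end of the hard IRQ. */
static unsigned int run_now(void)
{
	return pending_mask & ~deferred_mask;
}

/* Vectors that wait until the ksoftirqd thread gets scheduled. */
static unsigned int run_in_ksoftirqd(void)
{
	return pending_mask & deferred_mask;
}
```

So if a networking storm pushes NET_RX into the deferred mask, a concurrently raised timer softirq would still run from the IRQ tail instead of waiting behind ksoftirqd.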
– Yeah. – Because that would help other use cases in mainline as well: if you have block and net on the same CPU, they get in each other's way. – Yeah.
– No matter what you do. – Yeah, because I guess most of the time the timers, and maybe RCU– – Yeah, RCU should go out of softirq anyway. – But anyway, most timer softirqs don't need to be offloaded, I guess, among other vectors. But yeah, that only partially solves the thing for RT, because with this model of fine-grained masking, vectors can be interrupted, but they are not preempted. I mean, a task cannot preempt a softirq vector, right? – On RT, yes. – On RT, yes. Yeah, we would still need the hybrid solution with your softirq threads.
– Yeah, but that's– I mean, we just offload, fully offload into threads and take the penalty for it, but that's the price we pay for having deterministic behavior. We won't change that, and it doesn't matter. But if we break things up into more granular entities to protect, then it's generally good, not only for RT; it's a general improvement, and that's why I like that thing. – But I need a good selling argument for mainline for that, really, because Linus doesn't seem convinced about it. – I think the interface
– we need to make a cleaner interface somehow, 'cause I think that might be part of it. – Yeah, and we need to– – He actually suggested that interface. – Yeah, the interface is not the problem; we need to demonstrate that it actually solves a real-world problem.
– Yeah, exactly. – Can we talk to the networking folks about it? – Yeah, we talked to the networking folks. – Also, they seem to like the patch sets, so yeah. – Because they have use cases where they interfere with block, and they can probably come up with a demonstration that it actually solves something, and then it should be a no-brainer. – Thomas, do you know who to talk to, Eric or David? So talk to Eric, he's here, so make sure– see if we can come up with a use case. – Okay. But yeah, we also need to note
that it's a lot of long-term work. But why not, indeed, if we have a good, compelling argument to integrate it into mainline. We have lots of APIs to convert, side by side, just like we did for the big kernel lock.
– At some point you have to bite the bullet. – Sure. – It's not like we're not used to long-term work; I mean, how old is the real-time patch? – Yeah. (laughs) But fortunately, lockdep
provides all the information about which lock is taken, or which vectors it disables, and the lockdep support is quite a significant chunk of this patch set. But really, this only provides runtime information; we don't have static information about where our locks– – Yeah, because you can't– – You can't really follow the whole path. – Follow, and figure out from static analysis whether some data is shared between two vectors or not. – Exactly. – But I mean, we basically had the same
problem with the BKL. – Sure, yeah. – And we went there and looked at it on a case-by-case basis, and had the respective experts involved and had them clean it up. If the network people find something which makes their life easier with those patches, they are going to clean up most of it, because they are using it most. – I guess many– – If you find people who have a vested interest in it, then it's just going to happen.
– Yeah, right. I guess there are still some drivers that will never really be converted, but that's not really important, I guess. – No. – Yeah. – And those are probably pretty easy; you can understand what they're trying to do. – Yeah, one big drawback of
that patch set is that it only makes softirq-disabled sections interruptible– soft-interruptible– but it doesn't make the vector execution itself soft-interruptible. For that we need some more work, because I guess we don't use spin_lock_bh(), for example, in a vector handler. We use just spin_lock(), right?
– Right. – And we would need something like spin_lock_bh() just to know that we only need to– – In the vector handler you already– no, you already have that vector protected. – Well. – Because you're executing the vector, so it's masked. – You're executing the vector, yeah, but very often you cannot enable every other vector, because many, many handlers– many, many locks– are shared among many other vectors. I've seen some cases in networking where some locks– – Yeah, let the network people fix that. – Yeah, but there are many cases to handle. – You have the thing where you run and you take the lock in NET_RX, and at some point you run into a timeout, at which point you run the timer softirq and then you grab the lock– but not very often.
– Yeah, but those are only a few cases. So I talked to Eric about that earlier, and he said that should be fixable. So right now it's a tangled mess, but the points where it's tangled are well defined, so you can actually rip it apart and figure out how to solve it. So it's not that there are too many cases where the timer actually interferes with networking data; they don't have so many places where they have the timers. – In the socket code, a lot. – Pardon? – In the socket code. – Yeah, but they have– – But it's probably very localized. – It's very localized and very well structured, so they can probably fix it in one spot. – I mean, we can check that with lockdep: check all the locks that are shared among many vectors. I need to check that; maybe that would be interesting. Maybe we only find a handful of locks after all, yeah. Okay, so I think we answered some questions here. (laughs) So yeah, that's all. Unless anyone has a question? – When do you think you'll have this done? – What? – When do you think
you'll have this finished? – I mean it's mostly actually pretty much finished. I mean just the bulk core code of it but if we want to convert every side for the API, that's going to take years.
– As I said before, talk to Eric about it and find a use case which actually has benefits, and then. – That's for selling the core part, yeah. – Yeah, and then work from there. – Yeah. – Yeah, if you could show an example with and without it, and you could see a clear advantage with it, Linus would take it. – Yeah, yeah, I guess so. – If the networking people love it, then he'll take it. – Yeah. – Twice. – Well, David Miller already acked it, so. – Yeah, yeah, sure, but now the only thing is, you need their help to come up with a use case where it actually shows a benefit. – Sure. Because that's quite a piece of core code modification. Yeah, yeah, yeah.
– Yeah, sure. All right. (applause) (laughter) It's going to be short. But this one is going to be very short, just like two slides, actually three with this one. Yeah, about full dynticks isolation, just a very small roadmap about the things I need to do there. There is the cpustat freeze, so that when you isolate a CPU and disable the tick on it, the CPU stat is not going to move forward. I mean, it's going to move forward for the task statistics, but not for the per-CPU lines in /proc/stat, so the user and system and guest fields are not evolving. I just need to fix that. I had a patch set, I just need to rethink it a little bit more, and especially now that tasks can have proper RCU life cycles thanks to some recent patches, I can resurrect that patch set. I also need to clean up the code in tick-sched a bit more, because I think it's not that great. I mean, we essentially patched the whole full dynticks code on top of the idle dynticks code, and it's not that pretty right now, so I guess I just need to revisit that a bit. And rethink context tracking, because it's based on that TIF_NOHZ flag, which is kind of weird. I guess we need something like a per-CPU switch, maybe reuse some slow path, the syscall slow path thing.
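For reference on the /proc/stat point above: each per-CPU line in /proc/stat is a row of cumulative counters in USER_HZ ticks, and on a fully tick-less CPU the affected columns simply stop advancing. The layout, with made-up values:

```
cpu2  4705 150 1120 16250000 520 0 300 0 0 0
#     user nice system idle  iowait irq softirq steal guest guest_nice
```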
Anyway, that's lots of technical things, boring. And of course, make nohz_full mutable through cpusets, which I have said every year for at least six years now, and I still haven't done it, but yeah, I keep being sidetracked. It's going to happen one day, eventually. But what I wanted to know is, what do you need for full dynticks and isolation on RT? Because I actually don't know anything about what you're doing there. I know you have some specific code for that, I don't know if you have specific needs.
No? – I mean, I know that there are people out there using full dynticks isolation in order to stay in a polling loop on their PCI device in user space forever. – That's real isolation, yeah. – So they just waste CPU cycles, because they claim that they can't afford taking the interrupt. – [Man] Yeah, and there are people using virtual machines. – Virtual machines, oh yeah. – [Man] They have the real-time kernel, isolated CPU. – Watch the speaker. – Yeah, they have the real-time, a company has the real-time kernel and then they have a virtual machine, and they pin one CPU to the VCPU and try to run a polling thing inside, and they want to get as little interference as possible, and that's one case for full dynticks. – [Man] They want the host not to tick, essentially. – The host and the virtual machine. – [Man] And the virtual machine, yeah, both of them. – Is there any special
thing to do for getting, I thought, like, you still have that four-second tick now, or did you get rid of all of it? – It's actually offloaded to the housekeeping set of CPUs, mostly CPU zero, but– – I've been running this to show it, and I still use it as my example in KernelShark to run tracing on it. I put in the user spin, have all the isolation, maybe I'm doing something, I'm doing everything isolation, and you see a little tick every four seconds.
– Watchdog, I think it's the timer watchdog. – That might be the timer watchdog. – Oh, okay, so I have to turn off the watchdog then. – The TSC watchdog. – [Man] There's a kernel command line option that disables it. – tsc=reliable, you're lying, but. – tsc=reliable. – I think it's black magic, but if you write tsc=reliable on the– – Yeah, we also added the disable, I mean tsc=no– – [Man] Yeah, there is something else, it's not just tsc=reliable. – You can just disable the– – [Man] Stable, just disable? – Yes, stable something, yeah.
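The options the room is reaching for here are kernel command-line parameters; from memory of Documentation/admin-guide/kernel-parameters.txt (exact availability depends on kernel version, so worth double-checking):

```
# Trust the TSC unconditionally; skips the clocksource-watchdog checks:
tsc=reliable

# On newer kernels: keep normal TSC handling but disable only the
# clocksource watchdog -- the thing generating the periodic tick here:
tsc=nowatchdog
```

Both stop the kernel from periodically cross-checking the TSC against another clocksource, at the price of not detecting a genuinely unstable TSC.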
– There's some option where you can disable that, you still lie, but. – Of course it's unsafe, because you don't have any TSC guarantee anymore. – No, actually, to be honest, the TSC is much better now, it only took 20 years of pitching, but there are still cases where we actually can't trust it. If you have a single-socket system and your CPU has the TSC_ADJUST register, you're pretty much good. If you have two sockets it kind of works most of the time, only if the board manufacturer and the BIOS writers didn't screw up completely, and above two sockets all bets are off. So that's the state, so on your machine you can do it. – Oh, okay. – Basically, if you boot your machine without these options and run your workload and you don't get TSC warnings, it's a fair bet to say that you can use this option without too much headache. – Even on virtual machines? (laughter) – The best option is to just
never use a virtual machine. – I mean, if the host is reliable, then the guest is reliable as well. – [Man] You see, there is a way. (laughter) – If the hypervisor doesn't screw up the guest TSC. – [Man] So relying on the host is enough to rely on the virtual machine. – If you trust it. – [Man] Oh good, no, that's already good. – But it could feed that information to the guest, through a CPUID bit. You could actually tell the guest that you're doing great. – Okay, no, perfect, that works for us. Thanks. – This is actually one of the user spins that are using, our user spin with nohz_full.
CPU two and three are isolated, CPU two runs the user spin, and these are all the interrupts that I have, so, but this one, I think I had IRQs running on this one. I forgot to isolate the IRQs, there's one, I think they ran, but there was another one. – You see the context tracking exit on every– – Oh, here it is. – Are we secure, oh yeah, we still have that. Yeah, we had that project from some guy who wanted to kill it when we have any single disturbance, still no news about that. – It looks quite pretty now, this is with, I moved IRQs off, so I probably have to take a look at what was going on. It, like, has a little tick and then boom, tick. CPU two is the one with the user spin running and that's– – Yeah, and what's the tick doing? – Yeah.
– That's the more interesting question. – If you have, like, timers, events. – vtime user exit. – Yeah, but it happens on every… Later, it's a high resolution timer. Go below. – Which timer does it expire and which function does it call? – Below, below, below, below. – Further down. – [Man] We need a pointy-haired boss thing. – Click it, click it. – Below, below, below. – Yeah. – Wait, that's CPU zero. I'm looking for two, let me filter. – Oh, it's– – Yeah, but it's to run and– – Yeah, that probably then raises the timer softirq, which then does the other thing, yeah. – And if you go later to the softirq processing, maybe you will see the– – Yeah, you should see it. – Softirq raise here, action TIMER. – Yeah, but– – Yeah, but which handler? The softirq processing is later. – So we're blaming softirqs, is what you're saying. It's his fault. – Yeah, it's my fault again. – Let's see, I'm trying to find the, I should probably just filter on CPU two. Refilter. – Yeah, it's right after IRQ exit. – So now ksoftirqd runs. – Yeah, okay. Actually it's going to be delayed. – No, no, no, no, no, but you should see it inside of ksoftirqd. – Yeah. – No, it actually was a stray tick. – You need to go to the ksoftirqd right after. – Just try again. (laughter) It's all fixed now. – ksoftirqd stuff, you're the… Run ksoftirqd. – There's a little tick here, it's always like a little tick here. – Yeah, yeah, but it's because there is a timer enqueued. And we need to know which one, and it's in the ksoftirqd processing. – You need to have the callback function. – There's a lot of things going on here. – There is a lot of things going on.
– I think there's nothing special about RT in that report. – No, no, I think it's– – I mean– – I'm pretty sure– – It's still the same thing, you have to find the stupid timer which fires and figure out why. – If it's every five seconds, I'm pretty sure this is the watchdog, it could be the hardlockup watchdog, or it could be that you see, yeah, but four seconds, it's always the same thing, I remember that. Anyway. – Yeah, four seconds exactly. – Yeah. – Yeah, then it's one of the watchdogs. – One of the watchdogs. – You can disable that. – Yeah. That's it.
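For anyone reconstructing a setup like the one being traced here, the usual boot-parameter combination for isolating CPUs 2-3 (the CPU list is illustrative) looks something like:

```
isolcpus=2-3 nohz_full=2-3 rcu_nocbs=2-3 irqaffinity=0-1
```

nohz_full stops the tick when a single task runs on the CPU, rcu_nocbs offloads RCU callbacks to the housekeeping CPUs, and irqaffinity steers device interrupts away from the isolated set — plus, per the discussion above, one of the watchdog/TSC options to silence the remaining four-second tick.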
Yeah? – On your previous slide, the fourth thing, where you want to, yeah, make the isolated CPUs dynamic through cpusets. I remember there is also an RCU relation with the callbacks, is that true? – The RCU, the callback offloading? – Yeah. – I think it's pretty well handled right now, yeah. – So it can be dynamic already, or? – It's dynamic, yeah, yeah. I think it follows the nohz_full set of CPUs. So it's completely transparent. – Okay. – Yeah, yeah. Yeah, Paul solves things much faster than me, yeah. – [Man] No more questions? (applause) – So we have 15 minutes more. Well, 15 minutes between
here and the next talk. But it might be the case that the next talk takes more than 30 minutes. – I don't think so. – No, so what do you guys think, should we wait for these 15 minutes to– – No, we just throw it into the random discussion session at the end. – Yeah, and then anyone that expected to see this on time, to come in, too bad, they should have been here for Frederic's. – Okay. (laughs) – It's recorded, they can go watch it later. – Yeah, yeah. – Just a few words about where we are. Most of you might have noticed that there was finally a decision made by the emperor penguin. So he pulled the bits and pieces which actually bring in the CONFIG_PREEMPT_RT switch into the mainline, which is not functional yet, but it allows us to add all the dependent code which makes use of that config switch. So he basically finally declared that RT should be a first-class mainline citizen. I'm pretty happy about that state.
(applause) It only took 15 years. Almost exactly. 15 years ago, in September 2004, the big debate started on LKML, one of the greatest flame wars ever, and I personally had been looking into that for 20 years, so, in 1999 I started looking at the options and I hated most of them. Yeah, well, what's still missing, what we need to do: we made big progress on the last outstanding large cleanup thing we had to do, which is printk. It's one of these pieces of code which are held together by duct tape, and once you throw RT at it, the duct tape comes apart. So we had a really productive ad hoc BoF yesterday with a couple of people in the room, including Linus, and we agreed on how this should look. Basically, those who have seen John's talk: roughly that idea is what printk will be in the future. It should be usable from any context, and it should be less annoying for other reasons. There are still a few details to be hashed out on the implementation level, but the conceptual level is agreed on, which is progress.
So other than that, we have a few bits and pieces here, a few bits and pieces there. We have some discussions to resolve where people don't like us to change their precious code because it's optimized for performance, which doesn't even matter, because that code is only called in the slow path once a week, so, but I mean, you know, people care about their sandpits. We will eventually get that solved, I don't think there's any fundamental issue anymore.
But definitely, having the config switch in Linus's tree is helpful, because there was a row of arguments starting where people said, I don't want to change this because I'm not sure whether RT actually goes mainline or not, and if it doesn't then I have changed it for nothing. Okay, we can settle that discussion right now. So what else, yeah, we're going to bring in the bits and pieces as we make progress, and I expect a big chunk as early as, well, quite some stuff is queued for 5.4 already, which takes out a significant amount of stuff from the RT patch set. I expect a really big chunk to land with the printk stuff, which we hopefully resolve for 5.5. I'd rather throw this one. – I'm not catching that one, I'm duckin'. Especially if you throw it. So, a miniature shark mic. When would you expect us to be able to build a usable RT from a mainline clone? – If everything goes well, 5.5. – 5.5, okay. – But you know.
– Yeah. – It's not– (laughter) – I hate to say it, but I think they were gated on us. – A discussion yesterday brought up something interesting which will actually make the year of the Linux desktop happen, because Daniel Vetter came up with a brilliant idea how to get a blue screen of death. – All right. – And then we are finally up to the task to be on the desktop. – Yeah, we weren't usable like that other operating system. – Right. – 'Cause we didn't have a blue screen. – No. – That other operating system now wants to use us, and now we can, because we've hit the one requirement that they have. – You can use us because we now have a BSOD. – Yeah. So we just have to have that two-year-long discussion about the colors.
(laughter) – Is it gonna be red or blue? – Blue on blue. – I have a weird question: what do you expect the new mainline kernel with the RT bits in it to be when compiled without the RT bits? Should it be exactly like upstream right now, or should it– – It won't be any different. I mean, it has to work the same way as it worked before. I mean, except for the stuff we changed over time anyway, but everybody is happy with that. – We've been doing this for years, and the kernel, actually I would say the kernel would be better. – Yes, I have this feeling, but my question is a little bit broader, maybe weirder, because right now if you add the preempt RT patch on top of the kernel and you build it without preempt RT, you have some things changing, small things or a slightly different behavior, so– – No, it's not slightly different behavior. What you have is that the code we had to restructure in order to accommodate RT stays restructured, but it's going to behave the same way. – [Man] Okay, it's going to be the official way. – The official way, so the restructuring of the code is going to happen, I mean, in a lot of places it already happened. We just sold it for different reasons. – But Thomas, every time we've done
this, the changes have, like, gone through, like, four modifications before they ever went upstream. – Right. – 'Cause we get the input from the people that we touch. – I mean, we had cruel hacks to get CPU hotplug working in RT. Those were really even worse than that. I mean, the softirq hack is just golden compared to what we had to do for hotplug, and it never worked reliably, up to the point where we actually ripped out or refactored hotplug in the mainline kernel and unbroke everything that was broken there in the first place.
– I have a question which is sort of following on from Lewis's question, and that is, how many people do you expect to actually be building a preempt RT kernel, and as a result of that, you know, how often do you expect preempt RT stuff to break in mainline? – Not at all, because we're going to make sure that we run preempt RT tests even on linux-next once we are in. – And once it's a configurable option, 0-day will also hit it. So as soon as you compile-break an RT config option, 0-day will yell at you.
– Yeah, what about behavior, though? – No, that's what we are going to do, we have the test infrastructure already and we'll just point it at linux-next at some point. – Lockdep could be modified to be able to catch things like spinlocks. – We'll be adding, you know, once it's in mainline and Thomas, you know, blesses us to turn it on, so to speak, we'll be running tests on it in 0-day, right, so it'll be a part of it, just like, and my goal with 0-day is, when you submit your patches, I'm also looking at other ways to look at those patches as they're coming in and catch those gotchas, like, you know, local IRQ disables or things like that, and send a soft warning. – There are some patches coming in so that we can be somewhat aggressive to preempt, so to speak, the potential breakages. – Yes, yeah, that's definitely part of the plan. So it's going to be the year end, it's going to be used, I mean, I expect that the actual users which put our stuff in products, they're either using totally broken franken kernels anyway. I don't know. (laughter) Is there anybody in the room
who is related to those things? No, I don't care. No, but I mean, normal and reasonable people just use LTS kernels. I didn't even try to say that you're close to reasonable. So throw a mic around. Just throw it into the crowd, the guy who is hit has to ask a question. (laughter) And if you're not throwing, I'm going to throw.
(laughter) See, everybody's happy already, no. – [Man] That's the beer. – [Man] I'm not reasonable. – So, after we get to the point where we've got a mainline kernel that will build a functional, one we like, preempt RT kernel, what do you see the effort as beyond that? Do you have areas where you say, this is not the way we want it, so what are we gonna be lookin' at later? – There's a couple of things. One, we have a not very well specified but rather long list of functionality we disabled with RT where people actually want to have it. One of them. Oh yes.
– eBPF will fix it all, right? – Yeah, yeah, we– eBPF is disabled because eBPF wants to run with preemption disabled forever. Usually not. Most of the time it's not a problem. I mean, Daniel will put it in his formula and it just works. (laughter) Yeah, no, there are other things which are currently disabled on RT and– No, that's more in the functional areas, like transparent huge pages. He was asking about admission control, but that should just work. At least nobody complained about it not working. RT group scheduling should go away. We should just rip it out of mainline completely. It's been broken forever, not only on RT, but on RT it just falls completely apart.
Mark. – And so, I think it was just mentioned in passing that we're now gonna have yet another kernel configuration that fundamentally changes the behavior of certain things, like locks, whatever. There's been lots of work that's gone on historically in the RT project which joined the mainline, like lockdep and that kind of thing, to make it easier to analyze and identify– – We have a lockdep patch for that which is coming up. – Sure, I was gonna ask, is there anything more coming up beyond that to help with actually identifying issues, is there anything planned? – So, except for the lockdep part, a lot of the missing bits, a lot of the infrastructure for us to see abuse of interfaces, is already in place, so we've reused the existing stuff.
We might add some extra bits to scan for new freestanding local-IRQ-disable or preemption-disable sections and things like that, because that's where a lot of the trouble comes from. Because, as I said before, all of these mechanisms, local_irq_disable(), preempt_disable(), bottom-halves-disable and whatever-disable, they are scopeless. They do not tell you what they protect, so you just turn it on and it protects the world. Great, but you can't tell what it actually does. So we have one mechanism in RT, which I wanted to avoid, but I'm probably not going to avoid it completely, and it actually might help with this particular part of the problem pretty well. It's a mechanism called local lock, and it's strictly per-CPU. If you define a local lock on a non-RT kernel, it allocates zero space for it and compiles into a preempt_disable(), so it behaves semantically like preempt_disable(). On an RT kernel it actually allocates a lock, and it becomes a lock. But then you have scope, and you know what you're protecting. That's one of the big problems we always had with scopeless protection. And it probably has no advantage for mainline, but I think it has one as well.
No obvious one, but there is one as well, because the moment we have scope, lockdep will be able to whack you on the head. And we found actual problems like that: people did a preempt_disable() and expected it to protect per-CPU variable access, and then they actually took an interrupt and accessed exactly the same per-CPU variable. We have no debugging in mainline which can catch that case, because preempt_disable() is scopeless. But if we can actually put a scope on those sections, then lockdep can see the lock and say, hey, you took it with spin_lock() here and then you touched it in the interrupt. That gives you scope again, so that works. – Okay, yeah, I think it could also be useful in mainline, like with the BPF exit for example. A lot of people have been asking why do we disable preemption, is it because of RCU or is it for some other reason, and so– – No, it's just because it's BPF. (laughter) – BPF hard-relies on preemption disabled because of the spin locks.
– Okay, so why couldn't it just use the spin lock API instead? – No, no, the spin lock is encoded in the bytecode. – [Man] Oh, I see, okay. – So it's magic. (laughter) No, the JIT can't, actually. The JIT just translates the instructions, but it doesn't see that it is a spin lock, and they do spin locks on that stuff to protect their own data. They have this requirement, I don't know how to solve that. – I guess you could still have recursion, where you have an interrupt come in in the middle of the preempt-disabled section in BPF, because they don't disable IRQs, right? – [Man] No. It's so that one BPF thing can't be preempted by another BPF thing taking a whack at the same lock. – [Man] BPF is a giant pile of nasty. – [Man] Where I was going was, you can have BPF programs run in the context of interrupts. – You can, you can even have BPF in NMI, but it's a giant pile of nasty. It's truly disgusting, but they fixed this particular issue by having different BPF program types, and the spin lock is limited to a specific program type. – [Man] Which is only run in thread context. – Yeah, or softirq, I think. – [Man] Yeah, whatever. – Yeah, so they're limited to a single program type. – [Man] Now if the issue is specifically the BPF program protecting its own data structures, could we solve that by simply making an exclusion on each individual BPF program? – [Man] No, no, so this is a lock that is shared with user space. – Oh, okay. – [Man] They have a u32 in the user map and they use spin locks on that. – It's magic. Don't ask why it works. That's one of those things which just work and nobody understands why. – [Man] Yeah. – No, the user map spin lock is, I mean, if the user space gets scheduled out in the wrong place then you spin forever in that thing. No, it becomes your problem. (laughs) – [Man] Hi, I have to deal with a lot
of other real-time operating systems, and everyone I work with, what we really want is to use Linux, and it's quite nice to hear that there will be a release in 5.5. Well, sorry, that's the goal, right, the goal? – There might. – There might be. – 5.x. – 5.x. – 5.x. (laughter) – So I guess, we've got automotive, probably military avionics, that sort of thing. What are the other major blocks you see for industry adopting Linux? Because they're so used to proprietary closed software. – Linux is widely used in industry, and the preempt RT patch as it is, as an out-of-tree thing, is used in, widely used in, products.
It's all over the place, it's in automation. Pardon? Is there an RT kernel inside, no. I mean, there's a lot of– It flies in rockets– – We did use RT, but never in aviation or anything like that, yeah. Right, in one of our automotive products. Portable, yeah. – I mean, the whole thing, getting the idea of having Linux preempt RT in the control box of your Boeing. – I guess, so I have to deal with things like DO-178, which is the aircraft software standard, and then of course you have MISRA, so I mean, one time– – No, no. – So I mean, if somebody were to certify one machine with Linux on it, that kind of opens the door. – Yeah, there's effort underway, at least in the automotive space, for certified things. – Yeah, you should have attended Lukas's talk on safety-critical Linux if you're interested in the messiness of that, so. – There's a project at the Linux Foundation, ELISA, which actually deals with that. So people are looking into that. – That's the use case for the, the ELISA stuff is clarifying the dynamics to one day try to certify Linux, but I think this will take some time. No, no, no, no. We are far from something. Look at Michael Furman, but we still– Yeah, it would take a long time, but. – You know, it only took me 15 years. But you're way younger than me. (laughter) So you might get it
done before you retire. – [Man] Job security. – I think you had the idea about completely removing softirqs– – [Man] Pardon? – Completely removing softirqs and everything inside– – Well, we were discussing that at some point, but I didn't come up with a better solution for softirqs completely. What I would love to get rid of is tasklets, because they are ill-defined, or not defined at all. They're a random pain, so they should really die, and there are not that many users anymore, but we still grow new ones, so. A lot of the tasklet usage is gone, because they were very, very widely used in things like I2C and SPI and whatever, but those people took the opportunity and switched over to threaded interrupt handlers, which solved the problem properly. So that killed most of the tasklet users. There are still quite some in old drivers which nobody cares about, so we need to find some of them, I think, and still whack them, and then maybe kill them all. Input is using tasklets for no reason. – So just back to the user map spin locks, I'm just curious, do they
deal with page faults at all? – I mean, with preemption disabled, what happens? – I don't know. – [Man] Is it just broken? – [Man] And for the spinlock thing, actually, user space cannot take the spin lock at the same time as the kernel. The reason is we don't share the memory. In the kernel, it's kernel memory, and the user will get a copy of the data, and the spin lock will be zeroed in the copy handed to user space, so there will never be a case where user space has it locked and the kernel has the same spin lock, that's not the case.
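To make the object under discussion concrete, here is a sketch — with a hypothetical payload field — of a map value embedding struct bpf_spin_lock, and of the userspace lookup path just described, where the copy comes back with the lock field zeroed:

```
/* Map value as a BPF program sees it: bpf_spin_lock comes from
 * include/uapi/linux/bpf.h; "counter" is a made-up payload field. */
struct map_value {
        struct bpf_spin_lock lock;
        __u32 counter;
};

/* Userspace side (libbpf): with BPF_F_LOCK the kernel takes the lock,
 * copies the value out, and zeroes the lock field in the copy --
 * userspace never spins on the kernel's lock. */
__u32 key = 0;
struct map_value val;
bpf_map_lookup_elem_flags(map_fd, &key, &val, BPF_F_LOCK);
```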
– [Man] Okay. – Yeah, the kernel does the kernel thing, and user space, they just get it zeroed. – Okay, so what you're saying is the BPF spin lock is only protecting–
– For kernel only. For BPF different– – BPF against BPF. – Yes, that's true. – Why is that user, and can the user space modify it? – Why is it in the user map, then? – You use a map because it's inside the BPF program, and some maps need a special flag to indicate there's a spin lock there, so they can use the spin lock. – Okay. – Otherwise they cannot use it, if they do not have this flag– – [Man] Okay, and that's how they share the data between the two BPF programs? – Yes, you have the same map, you have a different– – Okay.
But that makes the problem much better, because then we don't have to disable preemption at all. We just have to make sure that the BPF programs can't preempt each other. – [Man] Something like that. – Yeah. – But it's different CPUs, yeah, well. – Yeah, but then the CPU– – That's okay. – No, I mean, we deal with a lot of that on RT already, where we say, okay, we have mechanisms which keep you on the same CPU, otherwise the whole per-CPU mess wouldn't work. Yeah, no, but if I keep it on the CPU and make sure that no other BPF thing can preempt that BPF thing until it has run out, then I still don't have to fully disable preemption, I just have to protect BPF against BPF on that thing. – But if you did a local lock for BPF, a BPF local lock, so it just, basically, if they only care about BPF against BPF, I mean, if it's something else you just, boom. – I mean, then you get local preemption, so the one CPU that holds the lock will be preempted by some other random crap, and then the other CPU, because it is a spin lock, will just sit there spinning. – Yeah, but it spins with preemption enabled, I don't care. Yeah, and it's a virtual machine. We just have to look at it differently. Yeah, I know. You can put it into the same bucket you have it anyway. In the buckets you love, Bert. NMI. Futexes are great. Everybody loves systemd. (laughter) We tried hating it, it doesn't work. (laughter) Didn't make it go away. Okay, any other questions? No? Yeah, sure. – Okay, so yeah, I was just curious how much RCU torture testing has preempt RT got? I know Sebastian runs
them quite a bit, but– – [Man] Yeah, we run it on a regular basis. – Okay, so you don't expect much breakage, or any at all, with– – [Man] No, I mean, depends what Paul is up to at the moment. (laughter) He might break it again, no, no. It hasn't been broken for a while. There are a few things we break occasionally, but once it stabilizes, it shouldn't. – [Man] I mean, the magic moment was 2006 or seven or thereabouts, which is when preemptible RCU came about, but ever since, it's been okay for RT. I mean, like everything, occasionally it breaks by accident, but then it gets fixed. It's really not an issue. – Okay. – No, we have pretty good test coverage right now for the interesting parts. – [Man] Are we keeping the
config breakdown between preempt RT base and– – No. – Okay, so it's just goin' straight to full. – Now it's gone. – Okay. 'Cause I never really understood what the– – [Man] Oh, that was a debugging and development thing which I used when I remodeled the whole tree back in the 3.0 timeframe, because I could test some of the nasty things without a full RT kernel. Because I had to make sure that those mechanisms work in order to get a working RT kernel, which made it into a circular bootstrapping problem. – [Man] I'm pretty sure I've changed everything that ifdef'd on your franken kernel from base to full, but I guess I'd better go make sure of that, huh? – Yeah, you might check this, I think full includes base fully, except for an older kernel version, one of the really old ones, where base had some other side effects which full then turned off. So you might have done something wrong there, but you know, it was the franken kernel. It made you do that. – [Man] Everybody loves the franken kernel. – Yeah, well, I don't care. Yeah, next year's t-shirt is: I'm not a member of the kernel necrophiliac cult. (laughter) I see quite some members of that cult here. (laughter) So, anything else? No? Sounds good.