[HN Gopher] Writing a scheduler for Linux in Rust that runs in u...
___________________________________________________________________
Writing a scheduler for Linux in Rust that runs in user-space
Author : todsacerdoti
Score : 120 points
Date : 2024-02-20 15:26 UTC (7 hours ago)
(HTM) web link (arighi.blogspot.com)
(TXT) w3m dump (arighi.blogspot.com)
| MuffinFlavored wrote:
| > the communication between kernel and user-space is going to add
| some overhead,
|
| What does this look like specifically? Is it that the kernel
| operation needs to copy memory that is in "kernel space" to/from
| "user space" for security reasons? Is that all?
| Analemma_ wrote:
| Making scheduling decisions requires information that only the
| kernel has (for starters, what processes are even running). Since
| userspace programs can't read kernel memory, you have to copy
| this information into their process before they can work.
| cloudwalk9 wrote:
| Would be interesting to see how minimal they could make the
| overhead, since that's one of their goals. I wonder what the
| lower limit might be. That would have broad implications for
| basically everything else in userspace.
| vacuity wrote:
| Ah, the dream of fully-featured, performant M:N threading.
| Spawn tasks left and right, optimize with a small
| declarative config, and sit back and relax. I'm very
| interested to see how close we can get.
| PaulDavisThe1st wrote:
| Making scheduling decisions optimally requires information
| only held on each side of the kernel/user space divide. An
| application knows things about its own operation that the
| kernel cannot. The kernel knows things about the state of the
| system that user space cannot.
|
| Hence the "scheduler activations" concept from the early
| 1990s, in which the kernel calls into user space with
| sufficient information for a user space scheduler to make the
| best possible decision (in theory).
| vacuity wrote:
| I believe I saw a presentation on OKL4 describing this sort
| of abstraction as virtual CPU capabilities. There is a
| similar mechanism in Barrelfish and Psyche. Last I heard,
| the seL4 team was pursuing a different approach using
| scheduling capabilities.
|
| I think we still haven't realized the optimal design, but
| these are probably good local maxima and points for further
| exploration.
| PaulDavisThe1st wrote:
| My own suspicion is that we have realized the optimal
| design already, but it turns out that it just doesn't
| matter very much.
| vacuity wrote:
| How so?
| PaulDavisThe1st wrote:
| In the 90s, the cost of choosing the wrong thread to run
| when another thread blocked was significant _in specific
| kinds of applications_.
|
| Since then, the cost has (in absolute terms) gone down, and
| the number of applications for which this is true has not
| really increased.
|
| In addition, there has been some expansion in the number
| of applications for which SCHED_RT and SCHED_FIFO are
| appropriate for at least some threads (financial services
| high speed trading, real time audio), which has also
| reduced the pressure for a user-space scheduler that
| "gets it right". This is also true for designs which
| require guaranteed scheduling slots.
|
| So, SA might still bring some benefits to large, highly-
| threaded monolithic apps such as RDBMS, but it doesn't
| really provide much to the majority of contemporary
| applications. It certainly doesn't do much for mobile
| environments, which tend not to run applications where
| "the application knows best".
| vacuity wrote:
| Makes sense. Ideally the OS should provide the capability
| to do SA (or similar), but I don't know how the overhead
| would be for programs that opt out of using it.
| topspin wrote:
| That seems unlikely. The dimensions of the problem space
| are huge and getting larger. Although the next Linux
| kernel scheduler I write will be my first, it's not hard
| to infer the increase in complexity due to the evolution
| of CPUs that Linux runs on: "efficiency" cores (now on
| AMD devices as well), multi-chip modules creating cache
| locality problems, critical power and thermal management
| requirements on high power/current devices -- all
| changing concurrently.
| vacuity wrote:
| There certainly isn't one concrete scheduler that works
| for all workloads. It probably doesn't matter for server
| or supercomputer environments; as long as some scheduler
| works, it doesn't have to change. For other use cases,
| the scheduler should be flexible about throughput and
| fairness, handling priority, resource usage, and realtime
| needs in a more cohesive way. Tuning the scheduler might be
| harder in such a design, but it should let users avoid
| switching to a different scheduler base entirely.
|
| It'd be more of a scheduler framework that can be used to
| support various workloads dynamically.
| PaulDavisThe1st wrote:
| Implicit in all this is the idea that if the scheduler does
| not behave optimally, it will make a difference sufficient
| that we can call it a "failure".
|
| That's what I'm questioning. Precisely because the
| dimensions of the problem space have increased, and the
| raw performance of everything except register
| save/restore and TLB invalidation has increased, there's
| wide latitude for scheduling algorithms that are "not
| quite right".
| vlovich123 wrote:
| If I'm reading the page correctly, at the end is this:
|
| > while the communication of tasks happen using the bpf()
| syscall, accessing the queued and dispatched maps.
|
| > NOTE: we could make this part more efficient by using eBPF
| ring buffers, this would allow direct access to the maps
| without using a syscall (there's an ongoing work on this -
| patches are welcome if you want to contribute).
|
| The kernel piece needs to produce information that user
| space consumes, and it doesn't yet use a zero-copy ring
| buffer.
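|
| For the curious, the consumer side of a libbpf ring buffer
| looks roughly like this (the pin path and the idea of a
| "dispatched" buffer are made up for illustration, not the
| project's actual layout). Records are read from memory
| shared with the kernel, so there is no syscall per record,
| only an epoll wait when the buffer is empty:
|
|   #include <bpf/bpf.h>
|   #include <bpf/libbpf.h>
|
|   static int on_event(void *ctx, void *data, size_t len)
|   {
|       /* 'data' points into the mmap'ed ring buffer */
|       return 0;
|   }
|
|   int main(void)
|   {
|       /* hypothetical pin path */
|       int fd = bpf_obj_get("/sys/fs/bpf/dispatched");
|       if (fd < 0)
|           return 1;
|       struct ring_buffer *rb =
|           ring_buffer__new(fd, on_event, NULL, NULL);
|       if (!rb)
|           return 1;
|       /* only the empty-buffer wait costs a syscall */
|       while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
|           ;
|       ring_buffer__free(rb);
|       return 0;
|   }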
| cgh wrote:
| He's referring to the syscalls necessary to share data back and
| forth with eBPF. These are wrapped in userspace libraries, e.g.
| libbpf. Data is shared with the kernel in buffers which in
| eBPF parlance are referred to as "maps", which are basically
| key/value pairs accessed via a file descriptor. There are lots
| of docs on eBPF if you are curious:
| https://www.kernel.org/doc/html/latest/bpf/index.html
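|
| To make that concrete: each of the wrapper calls below is one
| bpf(2) syscall, which is exactly the per-operation overhead
| being discussed (the map contents and types here are invented
| for illustration):
|
|   #include <bpf/bpf.h>
|
|   /* map_fd: fd of a BPF map, e.g. from bpf_obj_get() */
|   long bump_task_slot(int map_fd, int pid)
|   {
|       long value = 0;
|       /* wraps bpf(BPF_MAP_LOOKUP_ELEM, ...) */
|       bpf_map_lookup_elem(map_fd, &pid, &value);
|       value += 1;
|       /* wraps bpf(BPF_MAP_UPDATE_ELEM, ...) */
|       bpf_map_update_elem(map_fd, &pid, &value, BPF_ANY);
|       return value;
|   }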
| jandrese wrote:
| Of all the things you can run in userspace, the scheduler is
| the one that I never thought would happen. I guess the scheduler
| has to give itself time to run in userspace? It seems like there
| would be a chicken and egg problem at boot time before userspace
| is brought up. Would be pretty embarrassing if your scheduler
| got killed by the OOM killer or something.
| anthk wrote:
| Well, GNU Hurd would survive that, right?
| kccqzy wrote:
| You could make the kernel treat the scheduler specially (just
| like today it already treats PID 1 specially) by always making
| the scheduler run whenever the kernel doesn't know what to run
| next.
| wyldfire wrote:
| > Would be pretty embarrassing if your scheduler got killed
| by the OOM killer or something.
|
| Or if you starved your scheduler process because it wasn't high
| enough priority :/
| thewakalix wrote:
| > in case of bugs the user-space scheduler will just crash and
| sched-ext will transparently restore the default Linux
| scheduler.
| lstodd wrote:
| Happens all the time with the userspace async stuff:
| swapcontext(3), FreeBSD's N:M threads, green threads, Stackless
| Python. It's all ancient, and they're all userspace schedulers.
|
| I think SA takes it a bit too far though. Even if it looks like
| a 'logical progression of those older ideas.'
| PaulDavisThe1st wrote:
| We've been here before, (way) back in the 90s:
|
| "Scheduler activations: effective kernel support for the user-
| level management of parallelism"
| https://dl.acm.org/doi/10.1145/121132.121151
|
| and also:
|
| "Adding Scheduler Activations to Mach 3.0"
| https://dada.cs.washington.edu/research/tr/1992/08/UW-CSE-92...
|
| People put a lot of time into thinking and researching this sort
| of thing. It even made it into Solaris. For one reason or another
| (possibly a lack of sufficient hardware parallelism) it never
| really gained much traction.
|
| Another critical issue to solve is that to do this "properly" (at
| least based on our model of this from the 90s), you have to have
| a re-entrant context switch routine, which is an interesting
| challenge.
| convolvatron wrote:
| my understanding was that scheduler activations are really an
| upcall mechanism that allows a userspace scheduler to manage
| its own threads (green) more effectively and provide more
| robust support for asynchrony across the user/kernel boundary.
|
| isn't this more of an extreme micro/exo thing where the
| monolithic scheduler is herniated out into userspace?
| PaulDavisThe1st wrote:
| Not "green" threads. Full kernel threads (otherwise it
| doesn't make much sense).
|
| And yes, this is a more extreme thing, but what's the point
| unless there's some capability in user space that can't exist
| in the kernel? And what would that thing be? Application-
| specific knowledge allowing better scheduling decisions to be
| made.
| vacuity wrote:
| Can you elaborate on what it means for them to be kernel
| threads as opposed to green threads (no userspace control
| over when it blocks and yields?) and reentrant context
| switch routines? I read the original paper but I fear I
| don't have enough background to fully appreciate it.
| PaulDavisThe1st wrote:
| Scheduler activations is a design that involves calling
| back up into the user space scheduler whenever the kernel
| decides to:
| 1. block a thread, AND
| 2. allow the task that owns the thread to retain use of the
| CPU
|
| So when a thread (in user space) calls write(), it ends up
| in the kernel; in a traditional design the kernel would put
| the thread to sleep waiting on the write and invoke the
| scheduler to decide what should run next.
|
| With SA, the kernel first assesses whether the task can
| continue to use the CPU, and if so, calls the task's
| user-space scheduler so that it can decide which thread
| to run (since the user-space scheduler has better
| information about this).
|
| Green threads are unknown to the kernel; they play no
| role in such a design.
|
| Reentrant context switching is required because you can
| be in the middle of a context switch invoked from the
| user space scheduler when a h/w interrupt (incl. timers)
| causes the kernel to rethink what is going on and rip the
| CPU away from the task.
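|
| In user-space terms the upcall handler can be very small; a
| purely hypothetical sketch (runq_pop, idle_ctx and the
| handler name are invented here, and the real SA interface
| delivers the activation with more state than this):
|
|   #include <ucontext.h>
|
|   /* hypothetical: pop the next ready user-level thread */
|   extern ucontext_t *runq_pop(void);
|   /* hypothetical: context that parks until the next upcall */
|   extern ucontext_t idle_ctx;
|
|   /* Run on the activation the kernel hands us after one of
|      our threads blocked but the task kept the CPU. */
|   void upcall_thread_blocked(void)
|   {
|       ucontext_t *next = runq_pop();
|       setcontext(next ? next : &idle_ctx); /* never returns */
|   }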
| vacuity wrote:
| Reentrancy is a nuisance. seL4 allows for the kernel to
| be preempted only at specific points of long running
| operations, which massively reduces concurrent spaghetti
| but also sacrifices some latency. Would the userspace
| program have some way of knowing about global resource
| contention to avoid choosing a thread suboptimally? As I
| understand it, SA by itself only allows userspace
| programs to manage their own resources efficiently.
| PaulDavisThe1st wrote:
| No, the kernel gets to do its job: decide which _task_
| ("process" in classical Unix speak) should get the core
| based on kernel-side scheduling policies, then the task
| itself gets to decide which thread should run on the
| core.
|
| The amount of information required to correctly decide
| which task should get the core is too large to sensibly
| move across the kernel boundary, and providing read-
| access to it in a safe way from user space is
| problematic.
| vacuity wrote:
| Perhaps not just any program, but ideally a user could
| tell the OS that a certain program requires minimal
| latency across some cores for some duration, or that
| hardware TCP offload should be prioritized for a server
| process. Like madvise or fadvise to give the kernel
| suggestions for optimization, but broader.
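|
| For comparison, the existing "advice" interfaces look like
| this (both are hints the kernel may ignore); a broader
| scheduling-advice API would presumably take a similar shape:
|
|   #include <fcntl.h>
|   #include <sys/mman.h>
|
|   void hint(int fd, void *buf, size_t len)
|   {
|       /* "I'll read this file sequentially" */
|       posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
|       /* "I'll touch this mapping soon, prefetch it" */
|       madvise(buf, len, MADV_WILLNEED);
|   }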
| PaulDavisThe1st wrote:
| For some *nixes that already exists. But it is
| insufficient to provide the OS with information like "if
| thread N blocks for I/O, run thread T instead",
| particularly because in the next quantum, it might be
| "run thread Y instead".
| xoranth wrote:
| I believe something like the mechanism you are describing has
| been in production at Google for at least a decade. See this
| talk [1].
|
| The kernel interface that the article uses (called sched_ext)
| is the result of the attempts to upstream the Google thing.
|
| [1]: https://www.youtube.com/watch?v=KXuZi9aeGTw
| PaulDavisThe1st wrote:
| Yeah, I'm loosely familiar with sched_ext already. However,
| it is actually the opposite of scheduler activations.
|
| SA means upcalling to user-space when thread scheduling is
| required.
|
| sched_ext means loading a scheduler (coded in BPF) into the
| kernel.
|
| Not dissimilar goals/purposes but wildly different systems.
| xoranth wrote:
| Thanks for the clarification!
| sppfly wrote:
| I'm wondering if this scheduler is for something like user-space
| threads. And what is the relationship between such a scheduler
| and the Go runtime's goroutines, or the JVM's virtual threads?
| matthews2 wrote:
| In this example, the kernel is doing the task switching. They
| are "real" threads. The userspace component is informing the
| kernel which tasks should be run.
|
| Goroutines and Java virtual threads are a separate idea. The
| application saves its state and then yields back to a scheduler
| in the application.
| legulere wrote:
| Why not compile directly to eBPF and run your code without
| context switches?
| NegativeK wrote:
| eBPF has strict limitations, like a 512-byte stack and a limit
| of 1M instructions, that would make it inapplicable to a lot
| of programs. It's intentionally not Turing complete.
|
| Also, because context switches are good?
| BigParm wrote:
| If we schedule in user space does that make it possible to share
| compute workloads between arbitrary computers on a network?
|
| Because kernel scheduling has context and different architectures
| must be considered etc.
| alexchamberlain wrote:
| Are we getting to the point where parts of the kernel could
| benefit from running on a low power, cheap, less featureful
| CPU/co-processor, so that the beefy, souped-up, SIMD, floating
| point instructioned super cores can be dedicated to rendering cat
| videos and sending IMs?
___________________________________________________________________
(page generated 2024-02-20 23:01 UTC)