[HN Gopher] Writing a scheduler for Linux in Rust that runs in user-space
       ___________________________________________________________________
        
       Writing a scheduler for Linux in Rust that runs in user-space
        
       Author : todsacerdoti
       Score  : 120 points
       Date   : 2024-02-20 15:26 UTC (7 hours ago)
        
 (HTM) web link (arighi.blogspot.com)
 (TXT) w3m dump (arighi.blogspot.com)
        
       | MuffinFlavored wrote:
       | > the communication between kernel and user-space is going to add
       | some overhead,
       | 
       | What does this look like specifically? Is it that the kernel
       | operation needs to copy memory that is in "kernel space" to/from
       | "user space" for security reasons? Is that all?
        
         | Analemma_ wrote:
         | Making scheduling decisions requires information that only the
          | kernel has (for starters, what processes are even running). Since
         | userspace programs can't read kernel memory, you have to copy
         | this information into their process before they can work.
        
           | cloudwalk9 wrote:
           | Would be interesting to see how minimal they could make the
           | overhead, since that's one of their goals. I wonder what the
           | lower limit might be. That would have broad implications for
           | basically everything else in userspace.
        
             | vacuity wrote:
             | Ah, the dream of fully-featured, performant M:N threading.
             | Spawn tasks left and right, optimize with a small
             | declarative config, and sit back and relax. I'm very
             | interested to see how close we can get.
        
           | PaulDavisThe1st wrote:
           | Making scheduling decisions optimally requires information
           | only held on each side of the kernel/user space divide. An
           | application knows things about its own operation that the
           | kernel cannot. The kernel knows things about the state of the
           | system that user space cannot.
           | 
           | Hence the "scheduler activations" concept from the early
           | 1990s, in which the kernel calls into user space with
           | sufficient information for a user space scheduler to make the
           | best possible decision (in theory).
        
             | vacuity wrote:
             | I believe I saw a presentation on OKL4 describing this sort
              | of abstraction as virtual CPU capabilities. There is a
             | similar mechanism in Barrelfish and Psyche. Last I heard,
             | the seL4 team was pursuing a different approach of
             | scheduling capabilities.
             | 
             | I think we still haven't realized the optimal design, but
             | these are probably good local maxima and points for further
             | exploration.
        
               | PaulDavisThe1st wrote:
               | My own suspicion is that we have realized the optimal
               | design already, but it turns out that it just doesn't
               | matter very much.
        
               | vacuity wrote:
               | How so?
        
               | PaulDavisThe1st wrote:
               | In the 90s, the cost of choosing the wrong thread to run
               | when another thread blocked was significant _in specific
               | kinds of applications_.
               | 
                | Since then, the cost has (in absolute terms) gone down,
                | while the number of applications for which this is true
                | has not really increased.
               | 
                | In addition, there has been some expansion in the number
                | of applications for which SCHED_RR and SCHED_FIFO are
                | appropriate for at least some threads (high-speed trading
                | in financial services, real-time audio), which has also
                | reduced the pressure for a user-space scheduler that
                | "gets it right". This is also true for designs which
                | require guaranteed scheduling slots. A concrete sketch of
                | the SCHED_FIFO case follows below.
               | 
               | So, SA might still bring some benefits to large, highly-
               | threaded monolithic apps such as RDBMS, but it doesn't
               | really provide much to the majority of contemporary
               | applications. It certainly doesn't do much for mobile
               | environments, which tend not to run applications where
               | "the application knows best".
        
               | vacuity wrote:
               | Makes sense. Ideally the OS should provide the capability
                | to do SA (or similar), but I don't know what the overhead
               | would be for programs that opt out of using it.
        
               | topspin wrote:
               | That seems unlikely. The dimensions of the problem space
               | are huge and getting larger. Although the next Linux
               | kernel scheduler I write will be my first, it's not hard
               | to infer the increase in complexity due to the evolution
               | of CPUs that Linux runs on: "efficiency" cores (now on
               | AMD devices as well), multi-chip modules creating cache
               | locality problems, critical power and thermal management
               | requirements on high power/current devices -- all
               | changing concurrently.
        
               | vacuity wrote:
                | There certainly isn't one concrete scheduler that works
                | for all workloads. It probably doesn't matter for server
                | or supercomputer environments; as long as some scheduler
                | works, it doesn't have to change. For other use cases,
                | the scheduler should be flexible about throughput and
                | fairness, handling priority, resource usage, and realtime
                | requirements, but more cohesively. Tuning the scheduler
                | might be harder in such a design, but it should spare
                | users from having to switch to a different base entirely.
               | 
               | It'd be more of a scheduler framework that can be used to
               | support various workloads dynamically.
        
               | PaulDavisThe1st wrote:
                | Implicit in all this is the idea that if the scheduler
                | does not behave optimally, it will make a difference
                | large enough that we can call it a "failure".
               | 
               | That's what I'm questioning. Precisely because the
               | dimensions of the problem space have increased, and the
               | raw performance of everything except register
               | save/restore and TLB invalidation has increased, there's
               | wide latitude for scheduling algorithms that are "not
               | quite right".
        
         | vlovich123 wrote:
         | If I'm reading the page correctly, at the end is this:
         | 
         | > while the communication of tasks happen using the bpf()
         | syscall, accessing the queued and dispatched maps.
         | 
         | > NOTE: we could make this part more efficient by using eBPF
         | ring buffers, this would allow direct access to the maps
         | without using a syscall (there's an ongoing work on this -
         | patches are welcome if you want to contribute).
         | 
          | The kernel piece needs to produce information that user space
          | consumes, and it doesn't use a zero-copy ring buffer yet.
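          | 
          | As a hedged sketch of the ring-buffer variant, consuming
          | records from user space with the libbpf-rs crate looks
          | roughly like this (exact signatures vary by version, and
          | the `events` map would come from a loaded BPF skeleton):
          | 
          |   use std::time::Duration;
          |   use libbpf_rs::{Map, RingBufferBuilder};
          |   
          |   // The ring buffer is memory-mapped into this process,
          |   // so draining records needs no per-record syscall;
          |   // poll() only enters the kernel to sleep when empty.
          |   fn consume(events: &Map) -> libbpf_rs::Result<()> {
          |       let mut b = RingBufferBuilder::new();
          |       b.add(events, |rec: &[u8]| {
          |           // `rec` points straight into the shared buffer.
          |           println!("got {} bytes", rec.len());
          |           0 // 0 = keep consuming
          |       })?;
          |       let rb = b.build()?;
          |       loop {
          |           rb.poll(Duration::from_millis(100))?;
          |       }
          |   }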
        
         | cgh wrote:
          | He's referring to the syscalls necessary to share data back and
          | forth with eBPF. These are wrapped in userspace libraries, e.g.
          | libbpf. Data is shared with the kernel in buffers which in eBPF
          | parlance are referred to as "maps", basically key/value stores
          | accessed via a file descriptor. There are extensive docs on
          | eBPF if you are curious:
          | https://www.kernel.org/doc/html/latest/bpf/index.html
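          | 
          | For a sense of the per-access cost being discussed, a map
          | lookup through the raw bpf(2) syscall looks roughly like
          | this in Rust (the attr layout mirrors <linux/bpf.h>; the
          | map fd is assumed to come from elsewhere, e.g. a pinned
          | map):
          | 
          |   use std::io;
          |   
          |   // Command number from <linux/bpf.h>.
          |   const BPF_MAP_LOOKUP_ELEM: libc::c_int = 1;
          |   
          |   // Mirrors the anonymous map-element struct inside
          |   // union bpf_attr (key/value are __aligned_u64).
          |   #[repr(C, align(8))]
          |   struct MapLookupAttr {
          |       map_fd: u32,
          |       _pad: u32,
          |       key: u64,   // user-space address of the key
          |       value: u64, // user-space address of the value
          |       flags: u64,
          |   }
          |   
          |   // One lookup = one syscall: the kernel copies the
          |   // value out of kernel memory into `val`.
          |   fn map_lookup(fd: i32, key: &u32, val: &mut u64)
          |       -> io::Result<()>
          |   {
          |       let attr = MapLookupAttr {
          |           map_fd: fd as u32,
          |           _pad: 0,
          |           key: key as *const u32 as u64,
          |           value: val as *mut u64 as u64,
          |           flags: 0,
          |       };
          |       let rc = unsafe {
          |           libc::syscall(
          |               libc::SYS_bpf,
          |               BPF_MAP_LOOKUP_ELEM,
          |               &attr as *const MapLookupAttr,
          |               std::mem::size_of::<MapLookupAttr>(),
          |           )
          |       };
          |       if rc == 0 {
          |           Ok(())
          |       } else {
          |           Err(io::Error::last_os_error())
          |       }
          |   }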
        
       | jandrese wrote:
        | Of all the things you can run in userspace, the scheduler is
        | the one that I never thought would happen. I guess the scheduler
        | has to give itself time to run in userspace? It seems like there
        | would be a chicken-and-egg problem at boot time before userspace
        | is brought up. Would be pretty embarrassing if your scheduler got
        | killed by the OOM killer or something.
        
         | anthk wrote:
          | Well, GNU Hurd would survive that, right?
        
         | kccqzy wrote:
         | You could make the kernel treat the scheduler specially (just
         | like today it already treats PID 1 specially) by always making
         | the scheduler run whenever the kernel doesn't know what to run
         | next.
        
         | wyldfire wrote:
         | > Would be pretty embarrassing if your scheduler got killed by
         | a OOM or something.
         | 
         | Or if you starved your scheduler process because it wasn't high
         | enough priority :/
        
         | thewakalix wrote:
         | > in case of bugs the user-space scheduler will just crash and
         | sched-ext will transparently restore the default Linux
         | scheduler.
        
         | lstodd wrote:
          | Happens all the time with userspace async stuff:
          | swapcontext(3), FreeBSD N:M threads, green threads, Stackless
          | Python. It's all ancient, and they're all userspace schedulers.
          | 
          | I think SA takes it a bit too far, though, even if it looks
          | like a logical progression of those older ideas.
        
       | PaulDavisThe1st wrote:
       | We've been here before, (way) back in the 90s:
       | 
       | "Scheduler activations: effective kernel support for the user-
       | level management of parallelism"
       | https://dl.acm.org/doi/10.1145/121132.121151
       | 
       | and also:
       | 
       | "Adding Scheduler Activations to Mach 3.0"
       | https://dada.cs.washington.edu/research/tr/1992/08/UW-CSE-92...
       | 
       | People put a lot of time into thinking and researching this sort
       | of thing. It even made it into Solaris. For one reason or another
       | (possibly a lack of sufficient hardware parallelism) it never
       | really gained much traction.
       | 
       | Another critical issue to solve is that to do this "properly" (at
       | least based on our model of this from the 90s), you have to have
       | a re-entrant context switch routine, which is an interesting
       | challenge.
        
         | convolvatron wrote:
          | My understanding was that scheduler activations are really an
          | upcall mechanism that allows a userspace scheduler to manage
          | its own (green) threads more effectively and provide more
          | robust support for asynchrony across the user/kernel boundary.
          | 
          | Isn't this more of an extreme micro/exo thing where the
          | monolithic scheduler is herniated out into userspace?
        
           | PaulDavisThe1st wrote:
           | Not "green" threads. Full kernel threads (otherwise it
           | doesn't make much sense).
           | 
           | And yes, this is a more extreme thing, but what's the point
           | unless there's some capability in user space that can't exist
           | in the kernel? And what would that thing be? Application-
           | specific knowledge allowing better scheduling decisions to be
           | made.
        
             | vacuity wrote:
              | Can you elaborate on what it means for them to be kernel
              | threads as opposed to green threads (no userspace control
              | over when they block and yield?), and on reentrant context
              | switch routines? I read the original paper but I fear I
              | don't have enough background to fully appreciate it.
        
               | PaulDavisThe1st wrote:
                | Scheduler activations is a design that involves calling
                | back up into the user space scheduler whenever the kernel
                | decides to:
                | 
                |   1. block a thread, AND
                |   2. allow the task that owns the thread to retain use
                |      of the CPU.
               | 
                | So when a thread (in user space) calls write, it ends up
                | in the kernel, and the kernel in a traditional design
                | would put the thread to sleep waiting on the write and
                | invoke the scheduler to decide what should run next.
               | 
               | With SA, the kernel first assesses whether the task can
               | continue to use the CPU, and if so, calls the task's
               | user-space scheduler so that it can decide which thread
               | to run (since the user-space scheduler has better
               | information about this).
               | 
                | Green threads are unknown to the kernel; they play no
                | role in such a design.
               | 
               | Reentrant context switching is required because you can
               | be in the middle of a context switch invoked from the
               | user space scheduler when a h/w interrupt (incl. timers)
                | causes the kernel to rethink what is going on and rip
                | the CPU away from the task.
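                | 
                | A purely illustrative sketch of that upcall shape
                | (Linux has no SA API, so every name here is
                | hypothetical):
                | 
                |   #[derive(Clone, Copy)]
                |   struct ThreadId(u64);
                |   
                |   enum Upcall {
                |       // A thread blocked in the kernel (e.g. in
                |       // write()), but our task keeps the CPU.
                |       Blocked(ThreadId),
                |       // A blocked thread's I/O completed.
                |       Unblocked(ThreadId),
                |   }
                |   
                |   // User-space scheduler's upcall entry point:
                |   // pick which of our threads should now run on
                |   // this CPU, if any.
                |   fn on_upcall(
                |       ready: &mut Vec<ThreadId>,
                |       ev: Upcall,
                |   ) -> Option<ThreadId> {
                |       if let Upcall::Unblocked(t) = ev {
                |           ready.push(t); // runnable again
                |       }
                |       ready.pop()
                |   }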
        
               | vacuity wrote:
                | Reentrancy is a nuisance. seL4 allows the kernel to be
                | preempted only at specific points in long-running
                | operations, which massively reduces concurrency
                | spaghetti but also sacrifices some latency. Would the
                | userspace
               | program have some way of knowing about global resource
               | contention to avoid choosing a thread suboptimally? As I
               | understand it, SA by itself only allows userspace
               | programs to manage their own resources efficiently.
        
               | PaulDavisThe1st wrote:
                | No, the kernel gets to do its job: decide which _task_
                | ("process" in classical Unix speak) should get the core
                | based on kernel-side scheduling policies, then the task
                | itself gets to decide which thread should run on the
                | core.
               | 
               | The amount of information required to correctly decide
               | which task should get the core is too large to sensibly
               | move across the kernel boundary, and providing read-
               | access to it in a safe way from user space is
               | problematic.
        
               | vacuity wrote:
               | Perhaps not just any program, but ideally a user could
               | tell the OS that a certain program requires minimal
               | latency across some cores for some duration, or that
               | hardware TCP offload should be prioritized for a server
               | process. Like madvise or fadvise to give the kernel
               | suggestions for optimization, but broader.
        
               | PaulDavisThe1st wrote:
               | For some *nixes that already exists. But it is
                | insufficient to provide the OS with information like "if
               | thread N blocks for I/O, run thread T instead",
               | particularly because in the next quantum, it might be
               | "run thread Y instead".
        
         | xoranth wrote:
         | I believe something like the mechanism you are describing has
         | been in production at Google for at least a decade. See this
         | talk [1].
         | 
         | The kernel interface that the article uses (called sched_ext)
          | is the result of attempts to upstream the Google thing.
         | 
         | [1]: https://www.youtube.com/watch?v=KXuZi9aeGTw
        
           | PaulDavisThe1st wrote:
           | Yeah, I'm loosely familiar with sched_ext already. However,
           | it is actually the opposite of scheduler activations.
           | 
           | SA means upcalling to user-space when thread scheduling is
           | required.
           | 
           | sched_ext means loading a scheduler (coded in BPF) into the
           | kernel.
           | 
           | Not dissimilar goals/purposes but wildly different systems.
        
             | xoranth wrote:
             | Thanks for the clarification!
        
       | sppfly wrote:
        | I'm wondering if this scheduler is for something like user-space
        | threads. And what is the relationship between such a scheduler
        | and the Go runtime's goroutines or the JVM's virtual threads?
        
         | matthews2 wrote:
         | In this example, the kernel is doing the task switching. They
         | are "real" threads. The userspace component is informing the
         | kernel which tasks should be run.
         | 
          | Goroutines and Java virtual threads are a separate idea: the
          | application saves its state and then yields back to a
          | scheduler inside the application.
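          | 
          | A minimal sketch of that division of labor (the names are
          | made up; in the article the two queues are eBPF maps
          | shared with the in-kernel sched_ext component):
          | 
          |   use std::collections::VecDeque;
          |   
          |   struct Task { pid: i32, vruntime: u64 }
          |   
          |   // Stand-ins for the BPF-map-backed queues.
          |   struct UserSched {
          |       queued: VecDeque<Task>,    // filled by the kernel
          |       dispatched: VecDeque<i32>, // drained by the kernel
          |   }
          |   
          |   impl UserSched {
          |       // Decide ordering in user space; the kernel still
          |       // performs the actual context switches.
          |       fn schedule_once(&mut self) {
          |           let mut runnable: Vec<Task> =
          |               self.queued.drain(..).collect();
          |           runnable.sort_by_key(|t| t.vruntime);
          |           for t in runnable {
          |               self.dispatched.push_back(t.pid);
          |           }
          |       }
          |   }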
        
       | legulere wrote:
        | Why not compile directly to eBPF and run your code without
        | context switches?
        
         | NegativeK wrote:
          | eBPF has strict limitations, like a 512-byte stack and a
          | 1M-instruction limit, that would make it inapplicable to a
          | lot of programs. It's intentionally not Turing complete.
         | 
         | Also, because context switches are good?
        
       | BigParm wrote:
        | If we schedule in user space, does that make it possible to
        | share compute workloads between arbitrary computers on a
        | network?
        | 
        | Because kernel scheduling has context, and different
        | architectures must be considered, etc.
        
       | alexchamberlain wrote:
        | Are we getting to the point where parts of the kernel could
        | benefit from running on a low-power, cheap, less featureful
        | CPU/co-processor, so that the beefy, souped-up super cores
        | with SIMD and floating-point instructions can be dedicated to
        | rendering cat videos and sending IMs?
        
       ___________________________________________________________________
       (page generated 2024-02-20 23:01 UTC)