[HN Gopher] The Linux Scheduler: A Decade of Wasted Cores (2016)...
___________________________________________________________________
The Linux Scheduler: A Decade of Wasted Cores (2016) [pdf]
Author : PaulHoule
Score : 113 points
Date : 2023-12-13 14:31 UTC (8 hours ago)
(HTM) web link (people.ece.ubc.ca)
(TXT) w3m dump (people.ece.ubc.ca)
| westurner wrote:
| > Abstract: _As a central part of resource management, the OS
| thread scheduler must maintain the following, simple, invariant:
| make sure that ready threads are scheduled on available cores. As
| simple as it may seem, we found that this invariant is often
| broken in Linux. Cores may stay idle for seconds while ready
| threads are waiting in runqueues. In our experiments, these
| performance bugs caused many-fold performance degradation for
| synchronization-heavy scientific applications, 13% higher latency
| for kernel make, and a 14-23% decrease in TPC-H throughput for a
| widely used commercial database. The main contribution of this
| work is the discovery and analysis of these bugs and providing
| the fixes. Conventional testing techniques and debugging tools
| are ineffective at confirming or understanding this kind of
| bugs, because their symptoms are often evasive. To drive our
| investigation, we built new tools that check for violation of the
| invariant online and visualize scheduling activity. They are
| simple, easily portable across kernel versions, and run with a
| negligible overhead. We believe that making these tools part of
| the kernel developers' tool belt can help keep this type of bug
| at bay._
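|
| (A rough Python sketch of the kind of online check the abstract
| describes, not the authors' actual tool. It assumes per-CPU
| runqueue lengths can be read from /proc/sched_debug; on newer
| kernels this lives in /sys/kernel/debug/sched/debug, needs root,
| and the exact field layout varies across kernel versions.)
|
|     import re
|
|     def runqueue_lengths(path="/proc/sched_debug"):
|         """Take the first .nr_running value under each cpu#N block."""
|         lengths, cpu = {}, None
|         for line in open(path):
|             m = re.match(r"cpu#(\d+)", line)
|             if m:
|                 cpu = int(m.group(1))
|             elif cpu is not None and ".nr_running" in line:
|                 lengths[cpu] = int(line.split(":")[1])
|                 cpu = None  # ignore the later cfs_rq blocks
|         return lengths
|
|     lengths = runqueue_lengths()
|     idle = [c for c, n in lengths.items() if n == 0]
|     busy = [c for c, n in lengths.items() if n > 1]
|     if idle and busy:
|         print("invariant violated: CPUs", idle,
|               "idle while", busy, "have waiting threads")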
| megaloblasto wrote:
| This paper is from 2016. Anyone know the status of this now?
| pokler wrote:
| The Linux kernel received a new scheduler not too long ago
| (https://lwn.net/Articles/925371/), so I'm not sure how
| relevant the critiques about CFS are.
| whoisthemachine wrote:
| Looks great! According to one article, those changes just landed
| last month in kernel 6.6, so they're likely not in most OS
| distributions yet.
|
| https://tuxcare.com/blog/linux-kernel-6-6-is-here-find-
| out-w....
| nickelpro wrote:
| It was discussed at the time, and the consensus was that the
| patches were unusable, the testing methodology dubious, and the
| machine configuration rare.
|
| https://lkml.org/lkml/2016/4/23/135
| Agingcoder wrote:
| This paper caused me to start doubting the Linux scheduler on
| medium-sized boxes (36 physical cores in 2016) and helped me
| find an annoying bug in the RHEL 7 kernel which prevented some
| processes from being scheduled at all in some cases (it was an
| issue with NUMA load balancing, if I remember correctly). Note
| that my bug was most likely unrelated to the paper, but it
| marked in my mind a specific class of bugs as possible.
|
| So it helped me at least :-)
| Aerbil313 wrote:
| Didn't read the paper, but the Linux kernel and userspace have
| accumulated a lot of tech debt while becoming the most common OS.
|
| I'd love to see something new built with modern practices and
| ideas come and dethrone it. Maybe Theseus OS (not affiliated).
| entropicdrifter wrote:
| I've been keeping an eye on Redox for several years now, as
| well. It'll be interesting to see what floats and what sinks in
| the decades to come.
| adql wrote:
| I hesitate to call "the shit that keeps stuff older than a year
| working on that kernel" tech debt. It's a necessity in any
| general-use OS.
|
| The stuff that can be changed without breaking compatibility is,
| well, whatever was the developer's best idea at the time; some of
| those ideas turned out to be good, some turned out to be bad.
|
| Going "but on paper that idea would be better, let's make OS
| around it" rarely ends up being plain better. It's not one
| dimensional.
|
| For example if your CPU scheduler makes all jobs run faster
| (less cache thrashing while keeping CPUs busy etc.) but is
| unfair, that might be great for people running batch jobs, but
| shit for desktop users.
|
| A scheduler that wastes cycles but makes sure interactive tasks
| get the right level of priority might be an improvement for
| desktop users but not for server use, etc.
|
| Or you figure out that the reason something was "unnecessarily
| complex", was actual necessary complexity that wasn't obvious
| at first.
|
| Also, Linux is no stranger to taking a whole bunch of code and
| replacing it with something better; I think we're on our third
| firewall stack now (ipchains -> iptables -> nftables).
| gerdesj wrote:
| There was another one before ipchains - ipfw.
| throwaway914 wrote:
| Something I would love to find is a practical/succinct guide
| on Linux kernel performance tuning. I'd love it to give
| examples of sysctl's you'd adjust for specific use cases
| like:
|
| - a real-time kernel
|
| - a single user mode kernel
|
| - an embedded device kernel (like an ESP32 or RPi)
|
| - general purpose desktop use (surely there are gems in here)
|
| - to use the Linux kernel as a hypervisor hosting others
|
| - tuning the Linux kernel inside a VM (as a guest to another
| hypervisor)
|
| - tuning for gaming performance
|
| I myself do not know enough about sysctls, and I'm sure it's
| a goldmine.
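|
| (As an illustration only, not a tuning recommendation: a small
| Python sketch that reads a few real vm.* sysctls through
| /proc/sys; the commented-out write shows the mechanism, and any
| values set this way are hypothetical and require root.)
|
|     from pathlib import Path
|
|     PROC_SYS = Path("/proc/sys")
|
|     def read_sysctl(name):
|         # "vm.swappiness" maps to /proc/sys/vm/swappiness
|         return (PROC_SYS / name.replace(".", "/")).read_text().strip()
|
|     def write_sysctl(name, value):
|         (PROC_SYS / name.replace(".", "/")).write_text(str(value) + "\n")
|
|     for knob in ("vm.swappiness", "vm.dirty_background_ratio",
|                  "vm.dirty_ratio"):
|         print(knob, "=", read_sysctl(knob))
|
|     # Hypothetical desktop-ish tweak; persist real changes via
|     # /etc/sysctl.d/*.conf rather than setting them at runtime.
|     # write_sysctl("vm.swappiness", 10)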
| dang wrote:
| Related. Others?
|
| _The Linux scheduler: A decade of wasted cores (2016)_ -
| https://news.ycombinator.com/item?id=33462345 - Nov 2022 (26
| comments)
|
| _The Linux Scheduler: A Decade of Wasted Cores (2016)_ -
| https://news.ycombinator.com/item?id=15531332 - Oct 2017 (35
| comments)
|
| _The Linux Scheduler: A Decade of Wasted Cores_ -
| https://news.ycombinator.com/item?id=11570606 - April 2016 (38
| comments)
|
| _The Linux Scheduler: A Decade of Wasted Cores [pdf]_ -
| https://news.ycombinator.com/item?id=11501493 - April 2016 (142
| comments)
| mpixel wrote:
| Ultimately -- the thing is, if anyone is both capable and
| willing, they can, and sometimes do, fix it.
|
| Granted, this combination is rather rare. Most people aren't
| capable, and those who are have better things to do; they
| probably have very well-paying jobs they could be focusing on
| instead.
|
| With that being said, Linux is _still_ more efficient than
| Windows.
|
| I don't want to say Linux is free; in practice it's not: those
| running the big, powerful machines are using RHEL and paying
| hefty licenses.
|
| Which are still better than any other alternative.
| adql wrote:
| > I don't want to say Linux is free, in practice it's not,
| those who are running the big powerful machines are using RHEL
| and paying hefty licenses.
|
| Google/Amazon/MS aren't paying for RHEL licenses.
|
| The reason for using RHEL is basically "we don't want to hire
| experts", which makes a lot of sense in a small/midsized company,
| but in a big one it's probably mostly the ability to blame
| someone else if something is fucked up, plus the fact that some
| of the software basically says "run it on this distro or it is
| unsupported".
| deafpolygon wrote:
| That's basically it. As a sysadmin, I supported hundreds of
| Linux servers (mostly virtual) pretty much by myself. We paid
| the license so that if the shit hit the fan (and we really had
| no way of fixing it) we could call on Red Hat. At least I would
| be able to tell my bosses that things were being handled.
|
| This never happened, of course. But it's CYA.
| ElijahLynn wrote:
| I've used RHEL support too, believe it or not, while
| working as a DevOps engineer at Red Hat on www.redhat.com.
| And the support specialist assigned to my issue was very
| sharp and solved the issue quickly. Easy in hindsight but I
| was stuck so I reached out, and I was glad I did. It
| involved a segfault and grabbing a core dump and analyzing
| it. ++Red Hat Support
| saagarjha wrote:
| There are well-paying jobs that allow smart people to focus on
| exactly this.
| commandlinefan wrote:
| > capable and willing
|
| This is the sort of research that scientific grants are
| _supposed_ to be targeting for the public good. Supposed to be.
| AaronFriel wrote:
| The Linux 6.6 kernel ships with a new default scheduler[1]. Is
| the testing methodology in this paper still relevant, and has it
| been used to assess the current (EEVDF) scheduler?
|
| [1] - https://lwn.net/Articles/925371/
| tremon wrote:
| Same goes for I/O scheduling. The virtual memory subsystem seems
| to only have two modes: buffering and write-out. The system will
| buffer writes until some threshold is reached, then it switches
| to write-out mode and flushes all dirty pages regardless of their
| age. The problem is that new pages are locked during write-out,
| so pending writes will stall, regardless of the memory pressure
| on the rest of the system.
|
| A better design would be to only lock a subset of dirty pages and
| let new writes to virtual memory continue while the write-out is
| happening, but it doesn't seem like the system can accommodate
| that.
|
| A simple test to see this in action is to use netcat to transfer
| a large file across the network, and monitor the device activity
| with sar or atop (or just check the disk/network activity
| lights). What you'll see is that while the disk is writing,
| network activity drops to zero, and when network activity
| resumes, the disk goes idle again for seconds. It doesn't
| matter how much smaller you make
| vm.dirty_background_{bytes,ratio} with respect to
| vm.dirty_{bytes,ratio}, the network traffic will block as soon as
| the "background" disk write starts. The only effect a low value
| for vm.dirty_background_{bytes,ratio} has is to increase the
| frequency at which the network and disk activity will alternate.
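|
| (A small Python sketch for watching that alternation from
| userspace; it simply samples the Dirty and Writeback counters
| that /proc/meminfo exposes, in kB, once per second while you run
| the netcat transfer.)
|
|     import time
|
|     def meminfo(fields=("Dirty", "Writeback")):
|         out = {}
|         for line in open("/proc/meminfo"):
|             key, rest = line.split(":", 1)
|             if key in fields:
|                 out[key] = int(rest.split()[0])  # value in kB
|         return out
|
|     while True:
|         s = meminfo()
|         print("Dirty: %8d kB   Writeback: %8d kB"
|               % (s["Dirty"], s["Writeback"]))
|         time.sleep(1)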
| freedomben wrote:
| Why do you think it's still this way? Why hasn't somebody
| implemented something better? Is it such a core component that
| we can't afford any instability?
|
| Edit: apparently it has been done, but only very recently and
| hasn't had a major distro ship it yet
| ahartmetz wrote:
| Oh? That sounds important and interesting. Do you have a URL
| or commit hash with more information?
| freedomben wrote:
| I'm very confused now so I have no idea what to think. I
| need to take some time to read through everything, but in the
| meantime here are some relevant links I've seen:
|
| https://lwn.net/Articles/925371/
|
| https://lkml.org/lkml/2016/4/23/135
| ahartmetz wrote:
| Hm yes, maybe you are confused. The articles you posted
| seem to be about CPU, not I/O.
| alyandon wrote:
| Ran into similar pathological behavior on servers with huge
| amounts of RAM (>1 TiB). Negative dentries build up until the
| kernel decides it is time to reclaim and locks up the server
| for >60 seconds scanning pages. :-/
| tremon wrote:
| Yes, indeed. The more RAM available, the more pages are
| available as buffer cache, and the more pronounced the system
| stalls are. Though in my example, the determining factor is
| available_memory/disk_write_throughput, while in your example
| it is available_memory/single_core_throughput (assuming the
| page scan is a single-threaded process).
| anonacct37 wrote:
| I also ran into spooky negative dentry issues. On a fleet of
| seemingly identical physical machines one machine would
| constantly check for files that didn't exist (iirc it was
| somehow part of NSS and a quick Google search seems to
| confirm https://lwn.net/Articles/894098/ ).
|
| I'm okayish at systems-level debugging but I never figured
| that one out. It caused the kernel to use a lot of memory.
| Arguable whether or not it impacted performance since it was
| a cache.
| alexey-salmin wrote:
| A complete nightmare. In the past I had to work on an IO-heavy
| system with 1-5ms read-latency guarantees and the only working
| solution was to completely manage all writes in the userspace.
| Manual buffering, manual throttling, manual fallocate, then
| either DIRECT_IO or mmap+sync_file_range depending on whether
| you need the written data to be readily available for reads or
| not.
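|
| (Python does not expose sync_file_range directly, so here is a
| sketch of the same write-behind idea using fdatasync plus
| posix_fadvise(POSIX_FADV_DONTNEED) to flush and drop each chunk
| as it is written; the chunk size and file names are made up for
| illustration.)
|
|     import os
|
|     CHUNK = 8 * 1024 * 1024  # arbitrary 8 MiB between flushes
|
|     def copy_with_writebehind(src_path, dst_path):
|         # Flush and drop pages chunk by chunk so the kernel never
|         # accumulates a large dirty backlog to flush all at once.
|         src = os.open(src_path, os.O_RDONLY)
|         dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
|         offset = 0
|         try:
|             while True:
|                 buf = os.read(src, CHUNK)
|                 if not buf:
|                     break
|                 os.write(dst, buf)
|                 os.fdatasync(dst)  # push this chunk to disk now
|                 os.posix_fadvise(dst, offset, len(buf),
|                                  os.POSIX_FADV_DONTNEED)  # drop from page cache
|                 offset += len(buf)
|         finally:
|             os.close(src)
|             os.close(dst)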
|
| There was a kernel patch though that solved around 95% of
| issues with no userspace changes: vm.dirty_write_behind.
| Unfortunately it never made it into the mainline kernel. [1]
| For strong latency guarantees it was insufficient, but it greatly
| improved the typical network/IO spike alternations described
| above.
|
| I'm surprised it was never fixed upstream despite the fact that
| even the most basic and simple scenarios like "nginx writes a
| log file to the disk" sometimes explode with seconds-long
| lockups on memory-fat machines.
|
| [1] https://lore.kernel.org/linux-
| fsdevel/156896493723.4334.1334...
| bee_rider wrote:
| Is their LU benchmark just LU factorize and possibly solve? That
| seems like an odd candidate for this kind of problem, because of
| course anybody running that kind of problem that cares about
| performance just doesn't over-subscribe their system... the
| scheduler has a much easier job when there's one big chunky
| thread per core.
|
| I mean, I get that they are testing a _general_ fix for the sort
| of problems that, like, non-scientific-computing users get. So it
| isn't to say the work is useless. It just seems like a funny
| example.
|
| I think coming up with benchmarks for schedulers must actually be
| very difficult, because you have to figure out exactly how silly of
| a thing you want the imagined user to do.
| semi-extrinsic wrote:
| Further to this, if you are running a program where compute is
| mainly a sparse linear solve, you are memory bandwidth limited
| and on modern systems like AMD Epyc you will very very likely
| see improved performance if you undersubscribe cores by 2x -
| e.g. using max 64 threads per 128 cores.
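|
| (A sketch of how one might try that undersubscription, assuming a
| hypothetical OpenMP-based solver binary that honours the standard
| OpenMP environment variables; the binary name, input file and
| thread count are illustrative.)
|
|     import os
|     import subprocess
|
|     env = dict(os.environ,
|                OMP_NUM_THREADS="64",    # half of the 128 cores in the example
|                OMP_PROC_BIND="spread",  # spread threads rather than packing them
|                OMP_PLACES="cores")
|     subprocess.run(["./sparse_solver", "input.mtx"], env=env, check=True)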
| bee_rider wrote:
| That's interesting. I think it must be the case but just to
| triple check my intuition, I guess you want the threads
| spread out, so you can get every chiplet in on the action?
| csdvrx wrote:
| No, I think it's more about how cores may share
| infrastructure (like M cache etc)
| swatson741 wrote:
| And yet Linux has consistently outperformed macOS in terms of
| throughput.
| jerf wrote:
| Yes. It turns out one thing being bad does not make a
| completely different thing good.
|
| (You'd think this was obvious, but I've been on the Internet a
| while and it sure isn't.)
| haberman wrote:
| > As a central part of resource management, the OS thread
| scheduler must maintain the following, simple, invariant: make
| sure that ready threads are scheduled on available cores. As
| simple as it may seem, we found that this invariant is often
| broken in Linux. Cores may stay idle for seconds while ready
| threads are waiting in runqueues.
|
| This oddly lines up with an observation I've made about traffic
| lights. It is very surprising how often you'll be stopped at an
| intersection where everybody is waiting and nobody is allowed to
| go, sometimes for what feels like 10-30 seconds.
|
| It seems like the lowest-hanging fruit for improving traffic is
| to aim never to have degenerate intersections. If someone is
| waiting and nobody is going, the light should change quickly.
| zukzuk wrote:
| Isn't that usually to give pedestrians a head start on
| crossing?
| wmf wrote:
| That sounds like a protected left turn but no one is turning so
| it looks stalled.
| nextaccountic wrote:
| In some places the lights work like this, but you need to
| install relatively expensive sensors.
|
| The sensors help gather high quality data for traffic models
| though. That should help traffic engineers do their job.
___________________________________________________________________
(page generated 2023-12-13 23:00 UTC)