[HN Gopher] Linux kernel cgroups writeback high CPU troubleshooting
       ___________________________________________________________________
        
       Linux kernel cgroups writeback high CPU troubleshooting
        
       Author : mesto1
       Score  : 152 points
       Date   : 2025-02-14 08:30 UTC (14 hours ago)
        
 (HTM) web link (dasl.cc)
 (TXT) w3m dump (dasl.cc)
        
       | manuc66 wrote:
       | "This investigation was a collaboration between myself and my
       | colleagues."
        
         | speerer wrote:
          | I think this is a good thing to write on a blog that's named
          | after the author himself. The author doesn't want to
         | give the impression that he's taking all the credit, but he
         | also doesn't want to mix up his own blog with a corporate
         | voice. The team he's talking about is easily identified
         | elsewhere, at the bug tracker comment he links to:
         | https://bugs.launchpad.net/ubuntu/+source/linux-oem-6.5/+bug...
        
       | steelbrain wrote:
        | The skill required to cut through so many layers, going from the
        | PHP application down to the Linux kernel. Impressed and jealous!
        
         | slicktux wrote:
         | Think about the dopamine release and persistence of it too!
        
       | boomskats wrote:
       | This is an interesting problem. The OP should have a look into
       | how the vm.dirty_ratio, vm.dirty_background_ratio,
       | vm.dirty_bytes, and vm.dirty_background_bytes (and other
       | similarly prefixed) sysctl parameters control when the kernel
       | starts flushing dirty pages to disk. Last time I checked,
       | different distros defaulted things like dirty_ratio to somewhere
       | between 10 and 50, mostly for legacy reasons.
       | 
       | This is really not great in situations where you're bootstrapping
       | a fresh server. Here's what happens:
       | 
       | - you boot up a server with, say, 1tb RAM
       | 
       | - your default dirty ratio is 10 (best case)
       | 
       | - you quickly write 90gb of files to your server (images,
       | whatever)
       | 
       | - you get mad unblocked throughput as the page cache fills up &
       | the kernel hasn't even tried flushing anything to disk yet
       | 
       | - your application starts, takes up 9gb memory
       | 
       | - starts to serve requests, writes another 1gb of mem mapped
       | cache
       | 
       | - the kernel starts to flush, realises disk is slower than it
       | thought, starts to trip over itself and aggressively throttle IO
       | until it can catch up
       | 
       | - your app is now IO bound while the kernel thrashes around for a
       | bit
       | 
       | This can be tuned by adjusting the vm.dirty_* defaults, and is
       | well worth doing IMO. The defaults that kernels still ship with
       | are from a long time ago when we didn't have this much memory
       | available.
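        | 
        | As a rough sketch (not from the article), this is what those
        | knobs work out to in absolute terms, approximating the kernel's
        | "dirtyable" memory as MemTotal:

```python
# Sketch: what the vm.dirty_* sysctls work out to in bytes on this
# box. Approximation: the kernel computes thresholds against
# "dirtyable" memory, which is somewhat less than MemTotal.

def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

def mem_total_bytes() -> int:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024  # reported in kB
    raise RuntimeError("MemTotal not found")

def dirty_threshold_bytes(kind: str = "dirty") -> int:
    """vm.<kind>_bytes wins if nonzero, else vm.<kind>_ratio applies."""
    absolute = read_int(f"/proc/sys/vm/{kind}_bytes")
    if absolute:
        return absolute
    return mem_total_bytes() * read_int(f"/proc/sys/vm/{kind}_ratio") // 100

if __name__ == "__main__":
    # On a 1tb box with dirty_ratio=10 this prints roughly 100,000 MiB:
    # the overhang described in the bullets above.
    for kind in ("dirty", "dirty_background"):
        print(f"vm.{kind}: {dirty_threshold_bytes(kind) / 2**20:,.0f} MiB")
```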
       | 
       | My memory of this next bit is flaky at best, so happy to be
       | corrected here, but I remember this also being a big problem with
       | k8s. With cgroups v1, a node would get added to your cluster and
       | a pod would get scheduled there. The pod would be limited to,
       | say, 4gb memory - way more memory than it actually uses - but it
        | would have a lot of IO operations. Because the node still had a
        | ton of free memory, dirty pages stayed way below the default
        | writeback ratio/bytes thresholds, so none of the IO would get
        | flushed to disk for ages. But the dirty pages in the page cache
        | would still be
       | counted towards that pod's memory usage even though they weren't
       | 'real' memory, but something completely out of the control of the
       | pod (or kubernetes, really). Before you knew it, bOOM. Pod
       | oomkilled for seemingly no reason, and no way to do anything
       | about it. I remember some issues where people skirted around it
       | by looking off into middle distance and saying the usual things
       | about k8s not being for stateful workloads, but it was really
       | lame and really not talked about enough.
       | 
       | This might seem unrelated, but you guessed it, it was fixed in
       | cgroups v2, and I imagine that the fix for that problem either
       | directly or indirectly explains why OP saw a difference in
       | behaviour between cgroups v1 and v2.
       | 
       | Also, slightly related, I remember discovering a while back that
       | for workloads like this where you've got a high turnover of files
       | & processes, having the `discard` (trim) flag set on your ssd
       | mount could really mess you up (definitely in ext4, not sure
       | about xfs). It would prevent the page cache from evicting pages
       | of deleted files without forcing writeback first, which is
       | obviously the opposite of what it was designed to do
       | (protect/trim the ssds). Not to mention cause all sorts of
       | zombifications when terminated processes still had memmapped
       | files that hadn't been flushed to disk, etc.
       | 
       | AFAIK it's still a problem, though it's been years since I
       | profiled this stuff. At peak load with io-intensive workloads,
       | you could end up with SSDs making your app run slower. Try
       | remounting without the `discard` flag (and periodically fstrim
       | manually), or use `discard=async`, and see what difference it
       | makes.
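        | 
        | For example, a hypothetical fstab entry along those lines (the
        | UUID and mount point are placeholders):

```
# Hypothetical /etc/fstab entry: replace synchronous discard with
# async discard, and pair it with the periodic fstrim.timer.
UUID=<your-fs-uuid>  /data  ext4  defaults,discard=async  0  2

# then: systemctl enable --now fstrim.timer
```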
        
         | baq wrote:
         | I'm looking at why my kvm vms are getting oom killed right now
         | and dear sir this is gold. Sounds like exactly what's happening
         | because they get killed during a nightly db maintenance job.
        
         | tankenmate wrote:
          | Another way to fix this: the one-off write of code, assets,
          | etc. at boot time should use O_DIRECT and sidestep the page
          | and/or buffer cache altogether. The performance will be
         | slower, but you won't have a massive page cache overhang for a
         | one time operation.
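          | 
          | A minimal sketch of that approach, assuming Linux and a
          | filesystem that supports O_DIRECT (tmpfs doesn't, hence the
          | fallback):

```python
# Sketch: bypass the page cache for a one-off bulk write with O_DIRECT.
import mmap
import os
import tempfile

def write_direct(path, data, align=4096):
    # O_DIRECT needs the buffer, file offset, and I/O size aligned, so
    # round the payload up to `align` and stage it in a page-aligned
    # anonymous mmap buffer.
    padded = max(align, (len(data) + align - 1) // align * align)
    buf = mmap.mmap(-1, padded)
    try:
        buf.write(data)
        flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
        try:
            fd = os.open(path, flags | getattr(os, "O_DIRECT", 0))
        except OSError:
            # Filesystem without O_DIRECT support (e.g. tmpfs):
            # fall back to a normal buffered write.
            fd = os.open(path, flags)
        try:
            os.write(fd, buf)
            os.ftruncate(fd, len(data))  # trim the alignment padding
        finally:
            os.close(fd)
    finally:
        buf.close()

# Usage: a one-off bootstrap write that (where supported) never
# leaves a page cache overhang behind.
demo = os.path.join(tempfile.gettempdir(), "bootstrap-demo.img")
write_direct(demo, b"x" * 10_000)
```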
        
           | Snild wrote:
           | Or run `sync` after the copy "finishes" -- less focused, but
           | very easy to do.
        
             | mkesper wrote:
             | The article states this doesn't change anything.
        
             | throwway120385 wrote:
             | sync is not guaranteed to do anything. And sometimes it
             | does way more than it should. Direct I/O semantics are the
             | correct thing here because it bypasses the cache entirely.
             | 
              | I've had a lot of issues writing partition and disk images
              | using dd on modern Linux systems because of caching. And
             | these all kept happening even though I would use `sync`
             | like you describe. But setting oflag=direct resolved all of
             | the issues I was having.
        
         | dasl wrote:
         | Hi, I'm the author of the article! Thank you for the awesome
         | description of the various vm.dirty_* sysctls.
         | 
         | The problem described in my post was not _directly_ related to
         | the kernel flushing dirty pages to disk. As such, I'm not sure
         | that tweaking these sysctls would have made any difference.
         | 
         | Instead, we were seeing the kernel using too much CPU when it
         | moved inodes from one cgroup to another. This is part of the
         | kernel's writeback cgroup accounting logic. I believe this is a
          | related but slightly different form of writeback problem :)
        
           | jeffbee wrote:
           | > moved inodes from one cgroup to another
           | 
           | `cgroup.memory=nokmem` avoids this.
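            | 
            | (That's a kernel boot parameter; as a sketch, assuming a
            | GRUB-based setup, it would go something like:)

```
# /etc/default/grub -- append to the kernel command line, then
# regenerate the grub config (e.g. update-grub) and reboot.
GRUB_CMDLINE_LINUX="... cgroup.memory=nokmem"
```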
        
             | dasl wrote:
             | TIL, thanks for sharing. We ended up solving our problem
             | another way by adding this `DisableControllers` stanza to
             | the service's systemd configuration: https://gist.github.co
             | m/dasl-/87b849625846aed17f1e4841b04ecc...
             | 
             | I believe the kernel's cgroup writeback accounting features
             | are enabled / disabled based on this code: https://github.c
             | om/torvalds/linux/blob/c291c9cfd76a8fb92ef3d...
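              | 
              | For anyone not following the gist link: `DisableControl
              | lers=` belongs to slice units, per systemd.resource-
              | control(5). A hypothetical stanza along those lines (the
              | author's actual config is in the gist; the controller
              | names here are illustrative):

```ini
# Hypothetical drop-in for the slice containing the service;
# controller names are illustrative, not the author's actual config.
[Slice]
DisableControllers=memory io
```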
        
           | boomskats wrote:
           | Hey, I agree that tweaking these probably wouldn't have made
           | much difference, but tuning/reducing the dirty_bytes could
           | calm the writeback stampede and smooth that bump, potentially
           | getting rid of whatever race might have been happening.
           | Regardless, disabling the cgroup accounting there is the
           | right thing to do, especially as you don't need it. Tbh, the
           | main reason I wrote most of that was as background to explain
           | the cgv1 vs v2 differences and why they're there (and because
           | I was stuck in traffic for like 45 mins :/)
           | 
           | If you're ever in the mood to revisit that problem you should
           | try disabling that discard flag and see if it makes a
           | difference. Also, if it was me, I'd have tried setting
           | LimitNOFILE to whatever it is in my shell and seeing if the
           | rsync still behaved differently.
           | 
           | Anyway - thoroughly enjoyed your article. You should write
           | more :)
        
         | jeffbee wrote:
         | > it was fixed in cgroups v2
         | 
         | I would say it was changed in cgroups v2.
         | 
         | Cgroups v1 was written by a company where only one process on a
         | machine is allowed to do block I/O, and that program is
         | carefully written to not use kernel caches.
         | 
         | Cgroups v2 was written by a company that uses lots of off-the-
         | shelf Linux applications that do ordinary block I/O in the
         | usual naive way. That's why v2 focuses so much on "pressure".
        
           | sirjee wrote:
           | BTW company-1 == Google and company-2 == FB/Meta.
           | 
           | In addition, Google has completely removed local storage from
           | their servers, so there is no disk I/O at all.
        
             | betaby wrote:
             | > In addition, Google has completely removed local storage
             | from their servers, so there is no disk I/O at all.
             | 
             | What does that mean? There should be disk somewhere anyway
             | to store gmail messages.
        
               | jeffbee wrote:
               | https://static.googleusercontent.com/media/sre.google/en/
               | /st...
               | 
               | It sounds a bit vanilla on paper, since things like NFS
               | and iSCSI have existed forever.
        
               | sirjee wrote:
                | Usually, Google applications use high-level network
                | storage services like Colossus, BigTable, or Spanner,
                | and these high-level services are backed by dedicated
                | storage appliances that bypass the kernel for SSDs and
                | use direct block IO for slow disks. For networking,
                | they are moving towards userspace networking [1].
               | 
               | [1] https://research.google/pubs/snap-a-microkernel-
               | approach-to-...
        
         | loeg wrote:
         | Yeah, even on consumer hardware, dirty ratio of 10% is waaay
         | too much. These settings can also be tuned in bytes
         | (vm.dirty_bytes and vm.dirty_background_bytes), and I tune
         | these to 128-256MB on my desktop.
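          | 
          | As a sysctl.d fragment, that tuning looks something like this
          | (note that setting the *_bytes knobs zeroes their *_ratio
          | counterparts):

```
# /etc/sysctl.d/99-writeback.conf
vm.dirty_bytes = 268435456              # 256 MiB
vm.dirty_background_bytes = 134217728   # 128 MiB
```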
        
       | eqvinox wrote:
       | It's not the infamous 2.6.32, but ... kernel 3.10 up until very
       | recently? Oof.
       | 
       | (3.10 released 2013-06-30)
        
       ___________________________________________________________________
       (page generated 2025-02-14 23:00 UTC)