[HN Gopher] Linux kernel cgroups writeback high CPU troubleshooting
___________________________________________________________________
Linux kernel cgroups writeback high CPU troubleshooting
Author : mesto1
Score : 152 points
Date : 2025-02-14 08:30 UTC (14 hours ago)
(HTM) web link (dasl.cc)
(TXT) w3m dump (dasl.cc)
| manuc66 wrote:
| "This investigation was a collaboration between myself and my
| colleagues."
| speerer wrote:
| I think this is a good thing to write on a blog that's named
| after the author himself. The author doesn't want to
| give the impression that he's taking all the credit, but he
| also doesn't want to mix up his own blog with a corporate
| voice. The team he's talking about is easily identified
| elsewhere, at the bug tracker comment he links to:
| https://bugs.launchpad.net/ubuntu/+source/linux-oem-6.5/+bug...
| steelbrain wrote:
| The skill required to cut through so many layers, going from a
| PHP application down to the Linux kernel. Impressed and jealous!
| slicktux wrote:
| Think about the dopamine release and persistence of it too!
| boomskats wrote:
| This is an interesting problem. The OP should have a look into
| how the vm.dirty_ratio, vm.dirty_background_ratio,
| vm.dirty_bytes, and vm.dirty_background_bytes (and other
| similarly prefixed) sysctl parameters control when the kernel
| starts flushing dirty pages to disk. Last time I checked,
| different distros defaulted things like dirty_ratio to somewhere
| between 10 and 50, mostly for legacy reasons.
|
| This is really not great in situations where you're bootstrapping
| a fresh server. Here's what happens:
|
| - you boot up a server with, say, 1tb RAM
|
| - your default dirty ratio is 10 (best case)
|
| - you quickly write 90gb of files to your server (images,
| whatever)
|
| - you get mad unblocked throughput as the page cache fills up &
| the kernel hasn't even tried flushing anything to disk yet
|
| - your application starts, takes up 9gb memory
|
| - starts to serve requests, writes another 1gb of mem mapped
| cache
|
| - the kernel starts to flush, realises disk is slower than it
| thought, starts to trip over itself and aggressively throttle IO
| until it can catch up
|
| - your app is now IO bound while the kernel thrashes around for a
| bit
|
| This can be tuned by adjusting the vm.dirty_* defaults, and is
| well worth doing IMO. The defaults that kernels still ship with
| are from a long time ago when we didn't have this much memory
| available.
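| As a sketch of what that tuning can look like (the byte values
| here are illustrative, not recommendations), switching from the
| ratio-based knobs to the byte-based ones in a sysctl drop-in:

```
# /etc/sysctl.d/90-writeback.conf  (path illustrative)
# Setting the *_bytes variants automatically zeroes the *_ratio
# ones -- each pair is mutually exclusive.
# Start background flushing once 64 MiB of pages are dirty...
vm.dirty_background_bytes = 67108864
# ...and block writers once 256 MiB of pages are dirty.
vm.dirty_bytes = 268435456
```

| Apply with `sysctl --system` (or a reboot). Byte-based limits
| keep the thresholds constant regardless of how much RAM the box
| has, which is the point on a 1tb machine.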
|
| My memory of this next bit is flaky at best, so happy to be
| corrected here, but I remember this also being a big problem with
| k8s. With cgroups v1, a node would get added to your cluster and
| a pod would get scheduled there. The pod would be limited to,
| say, 4gb memory - way more memory than it actually uses - but it
| would have a lot of IO operations. Because the node still had a
| ton of free memory, dirty pages stayed way below the default
| writeback ratio/bytes thresholds, so none of the IO operations
| would get flushed to disk for ages; but the dirty pages in the
| page cache were still counted towards that pod's memory usage,
| even though they weren't 'real' memory and were completely out
| of the control of the pod (or Kubernetes, really). Before you
| knew it, bOOM. Pod
| oomkilled for seemingly no reason, and no way to do anything
| about it. I remember some issues where people skirted around it
| by looking off into middle distance and saying the usual things
| about k8s not being for stateful workloads, but it was really
| lame and really not talked about enough.
|
| This might seem unrelated, but you guessed it, it was fixed in
| cgroups v2, and I imagine that the fix for that problem either
| directly or indirectly explains why OP saw a difference in
| behaviour between cgroups v1 and v2.
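| If you want to see the accounting involved, cgroup v2 exposes
| per-cgroup dirty and writeback counters (the slice path below is
| an assumption; any non-root cgroup on a v2 host works):

```shell
# memory.stat exists on non-root cgroups only; file_dirty and
# file_writeback show the pages this cgroup is currently charged
# for that are waiting on, or undergoing, writeback.
grep -E '^file_(dirty|writeback) ' \
    /sys/fs/cgroup/system.slice/memory.stat
```
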
|
| Also, slightly related, I remember discovering a while back that
| for workloads like this where you've got a high turnover of files
| & processes, having the `discard` (trim) flag set on your ssd
| mount could really mess you up (definitely in ext4, not sure
| about xfs). It would prevent the page cache from evicting pages
| of deleted files without forcing writeback first, which is
| obviously the opposite of what it was designed to do
| (protect/trim the ssds). Not to mention cause all sorts of
| zombifications when terminated processes still had memmapped
| files that hadn't been flushed to disk, etc.
|
| AFAIK it's still a problem, though it's been years since I
| profiled this stuff. At peak load with io-intensive workloads,
| you could end up with SSDs making your app run slower. Try
| remounting without the `discard` flag (and periodically fstrim
| manually), or use `discard=async`, and see what difference it
| makes.
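| For reference, a sketch of the mount-option change being
| suggested (device and mountpoint are placeholders):

```
# /etc/fstab

# Inline discard: issues a TRIM on every delete, which can hold
# deleted files' pages in the cache behind writeback:
/dev/nvme0n1p2  /data  ext4  defaults,discard  0 2

# Alternative: drop the flag and trim on a schedule instead
# (systemd ships a weekly fstrim.timer you can enable):
/dev/nvme0n1p2  /data  ext4  defaults          0 2
```

| One caveat: `discard=async` (batched trims) is a btrfs mount
| option; on ext4 the usual substitute for inline discard is the
| periodic-fstrim route above.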
| baq wrote:
| I'm looking at why my kvm vms are getting oom killed right now
| and dear sir this is gold. Sounds like exactly what's happening
| because they get killed during a nightly db maintenance job.
| tankenmate wrote:
| Another way to fix this is for your one-off write of code,
| assets, etc. at boot time to use O_DIRECT and sidestep the page
| and/or buffer cache altogether. The writes will be slower, but
| you won't have a massive page cache overhang from a one-time
| operation.
| Snild wrote:
| Or run `sync` after the copy "finishes" -- less focused, but
| very easy to do.
| mkesper wrote:
| The article states this doesn't change anything.
| throwway120385 wrote:
| sync is not guaranteed to do anything. And sometimes it
| does way more than it should. Direct I/O semantics are the
| correct thing here because it bypasses the cache entirely.
|
| I've had a lot of issues writing partition and disk images
| using dd on modern Linux systems because of caching. And
| these all kept happening even though I would use `sync`
| like you describe. But setting oflag=direct resolved all of
| the issues I was having.
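| A minimal sketch of the difference (paths are placeholders, and
| `bs` must be a multiple of the device's logical block size for
| direct I/O to work):

```shell
# Buffered: dd returns as soon as the data is in the page cache;
# the trailing sync then flushes it along with everything else
# that happens to be dirty system-wide.
dd if=disk.img of=/dev/sdX bs=4M status=progress && sync

# Direct: every write bypasses the page cache and reaches the
# device before dd reports completion.
dd if=disk.img of=/dev/sdX bs=4M oflag=direct status=progress
```
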
| dasl wrote:
| Hi, I'm the author of the article! Thank you for the awesome
| description of the various vm.dirty_* sysctls.
|
| The problem described in my post was not _directly_ related to
| the kernel flushing dirty pages to disk. As such, I'm not sure
| that tweaking these sysctls would have made any difference.
|
| Instead, we were seeing the kernel using too much CPU when it
| moved inodes from one cgroup to another. This is part of the
| kernel's writeback cgroup accounting logic. I believe this is a
| related but slightly different form of writeback problems :)
| jeffbee wrote:
| > moved inodes from one cgroup to another
|
| `cgroup.memory=nokmem` avoids this.
| dasl wrote:
| TIL, thanks for sharing. We ended up solving our problem
| another way by adding this `DisableControllers` stanza to
| the service's systemd configuration: https://gist.github.co
| m/dasl-/87b849625846aed17f1e4841b04ecc...
|
| I believe the kernel's cgroup writeback accounting features
| are enabled / disabled based on this code: https://github.c
| om/torvalds/linux/blob/c291c9cfd76a8fb92ef3d...
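| For readers who don't follow the gist: `DisableControllers=` is
| set on a slice unit and turns the listed controllers off for all
| units underneath it. A drop-in of roughly this shape (unit name
| and controller list are illustrative, not necessarily what the
| article used):

```
# /etc/systemd/system/myapp.slice.d/override.conf
[Slice]
DisableControllers=memory io
```

| followed by `systemctl daemon-reload`. With the io and memory
| controllers off for the slice, the kernel's per-cgroup writeback
| accounting no longer applies to the services inside it.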
| boomskats wrote:
| Hey, I agree that tweaking these probably wouldn't have made
| much difference, but tuning/reducing the dirty_bytes could
| calm the writeback stampede and smooth that bump, potentially
| getting rid of whatever race might have been happening.
| Regardless, disabling the cgroup accounting there is the
| right thing to do, especially as you don't need it. Tbh, the
| main reason I wrote most of that was as background to explain
| the cgv1 vs v2 differences and why they're there (and because
| I was stuck in traffic for like 45 mins :/)
|
| If you're ever in the mood to revisit that problem you should
| try disabling that discard flag and see if it makes a
| difference. Also, if it was me, I'd have tried setting
| LimitNOFILE to whatever it is in my shell and seeing if the
| rsync still behaved differently.
|
| Anyway - thoroughly enjoyed your article. You should write
| more :)
| jeffbee wrote:
| > it was fixed in cgroups v2
|
| I would say it was changed in cgroups v2.
|
| Cgroups v1 was written by a company where only one process on a
| machine is allowed to do block I/O, and that program is
| carefully written to not use kernel caches.
|
| Cgroups v2 was written by a company that uses lots of off-the-
| shelf Linux applications that do ordinary block I/O in the
| usual naive way. That's why v2 focuses so much on "pressure".
| sirjee wrote:
| BTW company-1 == Google and company-2 == FB/Meta.
|
| In addition, Google has completely removed local storage from
| their servers, so there is no disk I/O at all.
| betaby wrote:
| > In addition, Google has completely removed local storage
| from their servers, so there is no disk I/O at all.
|
| What does that mean? There should be disk somewhere anyway
| to store gmail messages.
| jeffbee wrote:
| https://static.googleusercontent.com/media/sre.google/en/
| /st...
|
| It sounds a bit vanilla on paper, since things like NFS
| and iSCSI have existed forever.
| sirjee wrote:
| Usually Google applications use high-level network storage
| services like Colossus, BigTable, or Spanner, and these
| high-level services are backed by dedicated storage
| appliances that bypass the kernel for SSDs and use direct
| block IO for slow disks. For networking, they are moving
| towards userspace networking [1].
|
| [1] https://research.google/pubs/snap-a-microkernel-
| approach-to-...
| loeg wrote:
| Yeah, even on consumer hardware, dirty ratio of 10% is waaay
| too much. These settings can also be tuned in bytes
| (vm.dirty_bytes and vm.dirty_background_bytes), and I tune
| these to 128-256MB on my desktop.
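| If you want to experiment before persisting anything, the
| byte-based knobs can be flipped at runtime (values illustrative;
| requires root):

```shell
# Read the current thresholds; the ratio and byte variants are
# mutually exclusive, so one pair always reads as 0.
sysctl vm.dirty_ratio vm.dirty_bytes

# Switch to byte-based limits: 256 MiB hard, 64 MiB background.
sysctl -w vm.dirty_bytes=$((256 * 1024 * 1024))
sysctl -w vm.dirty_background_bytes=$((64 * 1024 * 1024))
```
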
| eqvinox wrote:
| It's not the infamous 2.6.32, but ... kernel 3.10 up until very
| recently? Oof.
|
| (3.10 released 2013-06-30)
___________________________________________________________________
(page generated 2025-02-14 23:00 UTC)