[HN Gopher] High System Load with Low CPU Utilization on Linux? (2020)
___________________________________________________________________
High System Load with Low CPU Utilization on Linux? (2020)
Author : tanelpoder
Score : 59 points
  Date   : 2022-09-22 09:28 UTC (1 day ago)
(HTM) web link (tanelpoder.com)
(TXT) w3m dump (tanelpoder.com)
| raffraffraff wrote:
| > The main point of this article was to demonstrate that high
| system load on Linux doesn't come only from CPU demand, but also
| from disk I/O demand
|
 | Great article, but the summary was no surprise. I've only ever
 | seen high load come from disk I/O. When I first clicked the link
 | I thought to myself, "well, it's disk I/O, but let's see how we
 | get to the punchline".
| godshatter wrote:
 | A bit off-topic, but I'm amazed at how well Linux handles large
 | numbers of threads. I have a program that runs bots against each
 | other playing the game of Go, where each go-bot plays every other
 | go-bot one hundred times; it's highly CPU intensive. I launch one
 | go-bot at a time, and it battles every go-bot it hasn't already
 | played against, often with close to 900 threads running go-bot
 | battles at once. I have six cores with two hyperthreads each, and
 | all twelve logical CPUs are pegged at 100% usage for hours on end
 | by this program. The processes run at normal priority, i.e. not
 | changed via "nice".
 |
 | When I first tried this, I was prepared to hard-boot, since I was
 | almost sure it would make my desktop unusable, but it didn't. I
 | can even play some fairly CPU- and GPU-intensive games without
 | too many hiccups while this is going on. If I wasn't paying
 | attention, I probably wouldn't know the battles were running.
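 |
 | (A toy way to reproduce the oversubscription, not my actual
 | program: this sketch uses processes rather than threads, since
 | Python's GIL would serialize pure-Python threads, and the worker
 | count is arbitrary.)
 |
 |     # peg every core with far more busy workers than CPUs
 |     import multiprocessing as mp
 |
 |     def battle(_):
 |         x = 0
 |         for _ in range(10**8):  # busy loop standing in for a game
 |             x += 1
 |         return x
 |
 |     if __name__ == "__main__":
 |         with mp.Pool(processes=900) as pool:
 |             pool.map(battle, range(900))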
| gfv wrote:
| Great write-up about the troubleshooting process!
|
| Regarding the exact case, there is a slightly deeper issue. XFS
| enqueues inode changes to the journal buffers twice: the mtime
| change is scheduled prior to the actual data being written, and
| the inode with the updated file size is placed in the journal
| buffers just after. If the drive is overloaded, the relatively
| tiny (just a few megs) journal buffers may overflow with mtime
| changes, and the file system becomes pathologically synchronous.
 | However, since 4.1-something, XFS has supported the `lazytime`
 | mount option, which delays mtime updates until a more substantial
 | change is written. Without it, the journal queue fills up at
 | roughly the speed of your write() calls; with it, at the pace of
 | the actual data hitting the disk. So even in highly congested
 | conditions your application can keep writing asynchronously --
 | that is, until dirty_ratio stops your system dead in its tracks.
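 |
 | A quick way to check whether a mount already has it is to look
 | for the option in /proc/mounts; a minimal sketch:
 |
 |     # list XFS mounts and whether lazytime is set
 |     with open("/proc/mounts") as f:
 |         for line in f:
 |             dev, mnt, fstype, opts = line.split()[:4]
 |             if fstype == "xfs":
 |                 lazy = "lazytime" in opts.split(",")
 |                 print(mnt, "lazytime" if lazy else "no lazytime")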
| TYMorningCoffee wrote:
 | How does pSnapper translate [kworker/6:99], [kworker/6:98], etc.
 | into the pattern (kworker/*:*)? I would like similar
 | functionality for log analysis.
|
| Edit: Nevermind. I skipped over the key paragraph here:
|
| > By default, pSnapper replaces any digits in the task's comm
| field before aggregating (the comm2 field would leave them
| intact). Now it's easy to see that our extreme system load spike
| was caused by a large number of kworker kernel threads (with
| "root" as process owner). So this is not about some userland
| daemon running under root, but a kernel problem.
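 |
 | For what it's worth, the masking itself is presumably just a
 | digit-substituting regex applied before the group-by; a rough
 | sketch (not pSnapper's actual code):
 |
 |     import re
 |     from collections import Counter
 |
 |     comms = ["kworker/6:99", "kworker/6:98", "kworker/2:0",
 |              "md1_raid6"]
 |     print(Counter(re.sub(r"\d+", "*", c) for c in comms))
 |     # Counter({'kworker/*:*': 3, 'md*_raid*': 1})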
| pmontra wrote:
 | My old laptop suddenly became very slow many years ago. top told
 | me the CPU was spending 99% of its time in I/O wait. It was the
 | hard disk failing; I think it was trying again and again to read
 | some bad sector. Shutdown. Bought a new HDD, restored from
 | backup, solved. BTW, that new HDD failed years later with a bad
 | clack-clack-clack noise. Backups saved me again.
| teddyh wrote:
| I thought PSI (Pressure Stall Information) was the modern way?
|
| https://news.ycombinator.com/item?id=17580620
| gfv wrote:
 | PSI tells you whether you have a problem or not; it's up to you
 | to use the techniques described here to find the cause.
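 |
 | The "whether" part is cheap to check: the files under
 | /proc/pressure are plain text (kernel 4.20+, built with
 | CONFIG_PSI), e.g.:
 |
 |     # print the share of time tasks were stalled on I/O
 |     with open("/proc/pressure/io") as f:
 |         print(f.read(), end="")
 |     # some avg10=0.00 avg60=0.00 avg300=0.00 total=...
 |     # full avg10=0.00 avg60=0.00 avg300=0.00 total=...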
| cmurf wrote:
 | Linux counts I/O wait against the load average, so high I/O
 | pressure, as seen in `/proc/pressure`, shows up as high load. It
 | makes sense when you consider reclaim (dropped file pages that
 | later need I/O to be read back in on demand): high I/O wait
 | delays reads (and writes), slowing everything down.
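 |
 | The mechanism is that Linux's load average counts tasks in
 | uninterruptible sleep ('D' state) in addition to runnable ones.
 | A quick way to see how many there are right now (a per-process
 | sketch; it ignores extra threads under /proc/*/task):
 |
 |     # count processes currently in uninterruptible sleep ('D')
 |     import glob
 |
 |     d_state = 0
 |     for path in glob.glob("/proc/[0-9]*/stat"):
 |         try:
 |             with open(path) as f:
 |                 state = f.read().rsplit(")", 1)[1].split()[0]
 |         except OSError:
 |             continue  # process exited while we were scanning
 |         if state == "D":
 |             d_state += 1
 |     print("tasks in D state:", d_state)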
|
 | I'm liking this project:
| https://github.com/facebookincubator/below
|
| It's packaged in Fedora.
___________________________________________________________________
(page generated 2022-09-23 23:01 UTC)