[HN Gopher] High System Load with Low CPU Utilization on Linux? ...
       ___________________________________________________________________
        
       High System Load with Low CPU Utilization on Linux? (2020)
        
       Author : tanelpoder
       Score  : 59 points
       Date   : 2022-09-22 09:28 UTC (1 day ago)
        
 (HTM) web link (tanelpoder.com)
 (TXT) w3m dump (tanelpoder.com)
        
       | [deleted]
        
       | raffraffraff wrote:
       | > The main point of this article was to demonstrate that high
       | system load on Linux doesn't come only from CPU demand, but also
       | from disk I/O demand
       | 
       | Great article, but this summary came as zero surprise. I've
       | only ever seen high load caused by disk I/O. When I first
       | clicked the link I thought to myself, "well, it's disk I/O,
       | but let's see how we get to the punchline."
        
       | godshatter wrote:
       | A bit off-topic, but I'm amazed at how well Linux handles large
       | numbers of threads. I have a program that runs bots against
       | each other playing the game of go, where each go-bot plays
       | every other go-bot one hundred times, which is highly CPU
       | intensive. I launch one go-bot at a time, which battles every
       | other go-bot it hasn't already played against, often with close
       | to 900 threads running go-bot battles at once. I have six cores
       | with two hyper-threads each, and all of them are pegged at 100%
       | usage for hours on end by this program. The processes run at
       | normal priority, i.e. not changed via "nice".
       | 
       | When I first tried this, I was prepared to hard-boot since I was
       | almost sure it would make my desktop unusable, but it didn't. I
       | can even play some fairly CPU- and GPU-intensive games without
       | too many hiccups while this is going on. If I weren't paying
       | attention, I probably wouldn't know they were running.
        
       | gfv wrote:
       | Great write-up about the troubleshooting process!
       | 
       | Regarding the exact case, there is a slightly deeper issue. XFS
       | enqueues inode changes to the journal buffers twice: the mtime
       | change is scheduled prior to the actual data being written, and
       | the inode with the updated file size is placed in the journal
       | buffers just after. If the drive is overloaded, the relatively
       | tiny (just a few megs) journal buffers may overflow with mtime
       | changes, and the file system becomes pathologically synchronous.
       | However, since 4.1-something, XFS has supported the `lazytime`
       | mount option, which delays mtime updates until a more substantial
       | change is written. Without it, the journal queue fills up at
       | roughly the speed of your write() calls; with it, at the pace of
       | the actual data hitting the disk, so even in highly congested
       | conditions your application can write asynchronously -- that is,
       | until dirty_ratio stops your system dead in its tracks.
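       | 
       | To check whether `lazytime` is actually in effect on a given
       | mount, one rough approach (an illustrative Python sketch, not
       | anything from the article) is to parse the options column of
       | /proc/mounts:
       | 
       |     # Sketch: list XFS mounts and report whether the `lazytime`
       |     # mount option is present, by parsing /proc/mounts.
       |     def mounts():
       |         with open("/proc/mounts") as f:
       |             for line in f:
       |                 # fields: device mountpoint fstype options dump pass
       |                 dev, mnt, fstype, opts, *_ = line.split()
       |                 yield mnt, fstype, opts.split(",")
       | 
       |     for mnt, fstype, opts in mounts():
       |         if fstype == "xfs":
       |             print(mnt, "lazytime" if "lazytime" in opts else "no lazytime")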
        
       | TYMorningCoffee wrote:
       | How does pSnapper translate [kworker/6:99], [kworker/6:98], etc.
       | into the pattern (kworker/*:*)? I would like similar
       | functionality for log analysis.
       | 
       | Edit: Never mind. I skipped over the key paragraph here:
       | 
       | > By default, pSnapper replaces any digits in the task's comm
       | field before aggregating (the comm2 field would leave them
       | intact). Now it's easy to see that our extreme system load spike
       | was caused by a large number of kworker kernel threads (with
       | "root" as process owner). So this is not about some userland
       | daemon running under root, but a kernel problem.
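       | 
       | For the log-analysis use case, the same idea fits in a few
       | lines of Python (a hypothetical sketch, not pSnapper's actual
       | code): squash runs of digits before aggregating, so that
       | kworker/6:99 and kworker/6:98 land in the same bucket.
       | 
       |     import re
       |     from collections import Counter
       | 
       |     def squash_digits(name):
       |         # "kworker/6:99" and "kworker/6:98" -> "kworker/*:*"
       |         return re.sub(r"\d+", "*", name)
       | 
       |     names = ["kworker/6:99", "kworker/6:98", "kworker/3:1", "sshd"]
       |     print(Counter(squash_digits(n) for n in names).most_common())
       |     # [('kworker/*:*', 3), ('sshd', 1)]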
        
       | pmontra wrote:
       | My old laptop suddenly became very slow many years ago. top told
       | me it was 99% in wait status. It was the hard disk failing; I
       | think it was trying to read the same bad sector over and over.
       | Shutdown. Bought a new HDD, restored from backup, solved. BTW,
       | that new HDD failed with a bad clack clack clack noise years
       | later. Backups saved me again.
        
       | teddyh wrote:
       | I thought PSI (Pressure Stall Information) was the modern way?
       | 
       | https://news.ycombinator.com/item?id=17580620
        
         | gfv wrote:
         | PSI tells you whether you have a problem or not; it's up to
         | you to use the techniques described here to find the actual
         | cause.
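         | 
         | For reference, PSI lives in /proc/pressure/{cpu,io,memory};
         | a minimal Python sketch for reading the averages (assumes a
         | kernel built with PSI support, roughly Linux 4.20+):
         | 
         |     # Sketch: parse lines like
         |     #   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
         |     def read_psi(resource):
         |         stats = {}
         |         with open(f"/proc/pressure/{resource}") as f:
         |             for line in f:
         |                 kind, *fields = line.split()
         |                 stats[kind] = dict(x.split("=") for x in fields)
         |         return stats
         | 
         |     for res in ("cpu", "io", "memory"):
         |         print(res, read_psi(res)["some"]["avg10"])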
        
       | cmurf wrote:
       | IO wait (tasks blocked in uninterruptible sleep) is counted
       | against load by Linux, so high IO pressure, i.e. what shows up
       | in `/proc/pressure`, will cause high load. When you consider
       | reclaim (dropped file pages which then need IO to be read back
       | in on demand), it makes sense, because high IO wait will delay
       | reads (and writes), slowing everything down.
       | 
       | I'm liking this project
       | https://github.com/facebookincubator/below
       | 
       | It's packaged in Fedora.
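       | 
       | One rough way to see the connection (an illustrative Python
       | sketch, unrelated to below itself): compare the load average
       | with the number of threads in runnable ("R") and
       | uninterruptible-sleep ("D") states, both of which Linux counts
       | toward the load average.
       | 
       |     import glob
       | 
       |     with open("/proc/loadavg") as f:
       |         print("loadavg:", f.read().split()[:3])
       | 
       |     counts = {}
       |     for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
       |         try:
       |             with open(path) as f:
       |                 # state is the first field after the ")" that
       |                 # closes the comm
       |                 state = f.read().rsplit(")", 1)[1].split()[0]
       |         except OSError:
       |             continue  # thread exited while we were scanning
       |         counts[state] = counts.get(state, 0) + 1
       | 
       |     print({s: n for s, n in counts.items() if s in ("R", "D")})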
        
       ___________________________________________________________________
       (page generated 2022-09-23 23:01 UTC)