[HN Gopher] Linux Performance Analysis (2015)
       ___________________________________________________________________
        
       Linux Performance Analysis (2015)
        
       Author : benjacksondev
       Score  : 163 points
       Date   : 2025-07-29 13:15 UTC (9 hours ago)
        
 (HTM) web link (netflixtechblog.com)
 (TXT) w3m dump (netflixtechblog.com)
        
       | emmelaich wrote:
       | Nice list. sar/sysstat is underrated imho.
        
         | mmh0000 wrote:
         | Oh man. There's a blast from the past.
         | 
         | Today, you'd want something like:
         | 
         | Prometheus + Node Exporter [1]
         | 
         | [1] https://github.com/prometheus/node_exporter
        
       | mortar wrote:
       | 2015
       | 
       | Previous discussions:
       | https://news.ycombinator.com/item?id=10654681
       | https://news.ycombinator.com/item?id=10652076
        
         | danieldk wrote:
         | Yeah, I skipped the date and then saw Linux 3.13 in the
         | examples.
        
       | whalesalad wrote:
       | I quite like `iotop` as an alternative to iostat.
       | https://linux.die.net/man/1/iotop
        
       | CodeCompost wrote:
       | > At Netflix we have a massive EC2 Linux cloud
       | 
       | Wait a minute. I thought Netflix famously ran FreeBSD.
        
         | craftkiller wrote:
         | My understanding was their CDN ran on FreeBSD, but not their
         | API servers. But I don't work for Netflix.
        
           | diab0lic wrote:
           | Your understanding is correct.
        
             | achierius wrote:
             | Why did they not choose to use it for both (or neither)?
             | I.e., what reasons for using FreeBSD on CDN servers would
             | not also apply to using them for API servers?
        
               | seabrookmx wrote:
               | They are extremely different workloads so.. everything?
               | 
               | The CDN servers are basically appliances, and are often
               | embedded in various data centers (including those run by
               | ISPs) to aggressively cache content. They care about
               | high throughput and run a single workload. Being able to
               | fine tune the entire stack, right down to the TCP/IP
               | implementation is very valuable in this case. Since they
               | ship the hardware and software, they can tightly
               | integrate the two.
               | 
               | By contrast, API workloads are very heterogeneous. I'd
               | have to imagine the ability to run any standard Linux
               | software there would also be a big plus. Linux clearly
               | has much more vetting on cloud providers than FreeBSD as
               | well.
        
               | aflag wrote:
               | Can't you fine tune linux as well? Does FreeBSD perform
               | better somehow on a CDN workload? I find it difficult to
               | imagine that the reason is performance. But I don't know
               | what the reason is.
        
               | craftkiller wrote:
               | Netflix discusses their reasons starting at 18:20:
               | https://www.youtube.com/watch?v=veQwkG0WdN8&t=18m20s
               | 
               | tl;dw: the performance, the efficiency of development,
               | the community, FreeBSD is a complete operating system,
               | the code base is smaller, the ports system, and the
               | license.
               | 
               | and this video covers the optimizations Netflix has made
               | to FreeBSD: https://www.youtube.com/watch?v=36qZYL5RlgY
               | 
               | Also potentially a reason: According to drewg123, Linux's
               | kTLS was broken. I see drewg123 is also commenting in
               | this thread. Is he the "Drew on my team" mentioned in the
               | first video? Is he the speaker in the 2nd video? Idk
               | https://news.ycombinator.com/item?id=28585008
        
         | drewg123 wrote:
         | The CDN runs FreeBSD. Linux is used for nearly everything else.
        
       | __turbobrew__ wrote:
       | If you like this post, I would recommend "BPF Performance Tools"
       | and "Systems Performance: Enterprise and the Cloud" by Brendan
       | Gregg.
       | 
       | I have pulled out a few miracles using these tools (identifying
       | kernel bottlenecks or profiling programs using ebpf) and it has
       | been well worth the investment to read through the books.
        
         | yankcrime wrote:
         | Agreed, highly recommended reading. A slightly more up-to-date
         | post of his which recommends tools in such situations is:
         | https://www.brendangregg.com/blog/2024-03-24/linux-crisis-to...
        
         | wcunning wrote:
         | Literally did miracles at my last job with the first book and
         | that got me my current job, where I also did some impressive
         | proving which libraries had what performance with it again...
         | Seriously valuable stuff.
        
           | __turbobrew__ wrote:
           | Yea it is kind of cheating. I was helping someone debug why
           | their workload was soft locking. I ran the profiling tools
           | and found that cgroup accounting for the workload was taking
           | nearly all the cpu time on locks. From searches through linux
           | git logs I found that cgroup accounting in older kernels had
           | global locks. I saw that newer kernels didn't have this, so
           | we moved to a newer kernel and all the issues went away.
           | 
           | People thought I was a wizard lol.
        
       | babuloseo wrote:
       | he forgot about rusttop
        
         | AnyTimeTraveler wrote:
         | I'm pretty sure that that didn't exist in 2015 ;)
        
       | janvdberg wrote:
       | My first command is always 'w'. And I always urge young engineers
       | to do the same.
       | 
       | There is no shorter command to show uptime, load averages (1/5/15
       | minutes), logged in users. Essential for quick system health
       | checks!
        
         | Propelloni wrote:
         | Me too! So much so that I add it to my .bashrc everywhere.
        
         | mmh0000 wrote:
         | It should also be mentioned, Linux Load Average is a complex
         | beast[1]. However, a general rule of thumb that works for most
         | environments is:
         | 
         | You always want the load average to be less than the total
         | number of CPU cores. If higher, you're likely experiencing a
         | lot of waits and context switching.
         | 
         | [1] https://www.brendangregg.com/blog/2017-08-08/linux-load-aver...
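
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the helper names `load_per_core` and `looks_saturated` and the 1.0 threshold are assumptions made for the example, and, as the comment notes, Linux load averages include more than runnable tasks, so treat the result as a hint rather than a verdict.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def looks_saturated(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """Apply the rule of thumb: load above the core count suggests waiting.

    The threshold of 1.0 per core is an assumption taken from the comment,
    not a hard limit.
    """
    return load_per_core(load_avg, cores) > threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    one, five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    for label, value in (("1m", one), ("5m", five), ("15m", fifteen)):
        flag = "HIGH" if looks_saturated(value, cores) else "ok"
        print(f"{label}: load={value:.2f} per-core={load_per_core(value, cores):.2f} {flag}")
```

Comparing the three windows is often more useful than any single number: a 1 minute value well above the 15 minute value suggests the load is rising.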
        
         | chasil wrote:
         | Glances is nice. I think it is a clone of HP-UX Glance.
         | 
         | https://nicolargo.github.io/glances/
         | 
         | I have also hacked basic top to add database login details to
         | server processes.
        
       | louwrentius wrote:
       | The iostat command has always been important to observe HDD/SSD
       | latency numbers.
       | 
       | Especially SSDs are treated like _magic_ storage devices with
       | _infinite_ IOPS at Planck-scale latency.
       | 
       | Until you discover that SSDs that can do 10GB/s don't do nearly
       | so well (not even close) when you access them in a single thread
       | with random IOPS, with queue depth of 1.
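
A rough back-of-the-envelope sketch of why queue depth 1 is so limiting: with one request in flight at a time, IOPS is bounded by the reciprocal of per-request latency (Little's law). The 80 microsecond latency and 4 KiB block size below are made-up illustrative numbers, not measurements of any particular drive.

```python
def max_iops(queue_depth: int, latency_seconds: float) -> float:
    """Upper bound on IOPS from Little's law: in-flight requests / latency."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return queue_depth / latency_seconds


def throughput_bytes_per_second(iops: float, block_size_bytes: int) -> float:
    """Throughput implied by a given IOPS figure and block size."""
    return iops * block_size_bytes


if __name__ == "__main__":
    latency = 80e-6   # assumed per-request latency (80 microseconds)
    block = 4096      # assumed random-read block size (4 KiB)
    iops = max_iops(queue_depth=1, latency_seconds=latency)
    mb_per_s = throughput_bytes_per_second(iops, block) / 1e6
    print(f"QD1: about {iops:,.0f} IOPS, about {mb_per_s:.1f} MB/s")
```

Under these assumptions a single-threaded random reader tops out at tens of MB/s, orders of magnitude below a sequential multi-GB/s headline figure, which is the gap the comment describes.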
        
         | wcunning wrote:
         | That's where you start down the eBPF rabbit hole with
         | bcc/biolatency and other block device histogram tools. Further,
         | the cache hit rate and block size behavior of the SSD/NVME
         | drive can really affect things if, say, your autonomous vehicle
         | logging service uses MCAP with a chunk size much smaller than a
         | drive block... Ask me how I know
        
       | rkachowski wrote:
       | it's 10 years later - what's the 60 second equivalent in 2025?
        
         | wcunning wrote:
         | @yankcrime posted it above:
         | https://www.brendangregg.com/blog/2024-03-24/linux-crisis-to...
        
           | BlackLotus89 wrote:
           | PSI (pressure stall information) is missing.
           | 
           | I always use a configured (F2) htop (not mentioned either).
           | Always enable PSI information in htop (some red hat systems I
           | work with still don't offer them...).
           | 
           | If you have zfs enable those meters as well and htop has an
           | io tab, use it!
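
For readers without a PSI-enabled htop, the same numbers can be read directly from `/proc/pressure/cpu`, `/proc/pressure/memory` and `/proc/pressure/io` on kernels built with PSI support. The sketch below parses that format; the function name `parse_psi` and the sample text are assumptions made for illustration.

```python
from pathlib import Path

# Illustrative sample in the format used by /proc/pressure/* files.
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.20 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text into {"some": {...}, "full": {...}} with float values."""
    result: dict[str, dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


if __name__ == "__main__":
    for resource in ("cpu", "memory", "io"):
        path = Path("/proc/pressure") / resource
        if not path.exists():
            # PSI may be disabled or unavailable on this kernel.
            print(f"{resource}: PSI not available")
            continue
        for kind, stats in parse_psi(path.read_text()).items():
            print(f"{resource} {kind}: avg10={stats['avg10']:.2f}%")
```

The `avg10`, `avg60` and `avg300` fields are percentages of time that tasks were stalled on the resource over the last 10, 60 and 300 seconds; `total` is cumulative stall time in microseconds.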
        
       | ImPostingOnHN wrote:
       | Maybe I missed it, but checking available disk space is often a
       | good step in diagnosing misbehaving systems.
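
A minimal sketch of such a disk space check using Python's standard library. The 90% warning threshold and the helper names are assumptions for illustration; on a shell, `df -h` gives the same information.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total * 100.0


def usage_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used)


if __name__ == "__main__":
    threshold = 90.0  # assumed warning level, tune to taste
    for mount in ("/",):
        pct = usage_percent(mount)
        status = "WARN" if pct >= threshold else "ok"
        print(f"{mount}: {pct:.1f}% used [{status}]")
```

This only covers bytes; a filesystem can also run out of inodes while showing free space, which `df -i` reports.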
        
       | fduran wrote:
       | shameless plug: you can practice this in a free VM
       | https://docs.sadservers.com/docs/scenario-guides/practical-l...
       | (there's a typo there to keep you on your feet)
        
       | ch33zer wrote:
       | Almost all of these have been replaced for me with below:
       | https://developers.facebook.com/blog/post/2021/09/21/below-t...
       | 
       | It is excellent and contains most things you could need. Downside
       | is that it isn't yet a standard tool so you need to get it
       | installed across your fleet
        
         | benreesman wrote:
         | Oh man nostalgia city. I vividly remember meeting atop time
         | travel debugging at 3am in Menlo Park in 2012, wild times.
        
       | 5pl1n73r wrote:
       | After this article was written, `free -m` on many systems started
       | to have an "available" column that shows an estimate of
       | reclaimable plus free memory. It's nicer than the "-/+" section
       | shown in this old article.
       | 
       |     $ free -m
       |                   total        used        free      shared  buff/cache   available
       |     Mem:           3915        2116        1288          41         769        1799
       |     Swap:           974           0         974
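
The "available" figure can also be read programmatically from the `MemAvailable` field of `/proc/meminfo` on kernels that expose it. The sketch below parses that file; the function names and the sample values are made up for illustration.

```python
from pathlib import Path

# Illustrative excerpt in /proc/meminfo format (values in kB are invented).
SAMPLE = (
    "MemTotal:        4009000 kB\n"
    "MemFree:         1319000 kB\n"
    "MemAvailable:    1842176 kB\n"
)


def meminfo_kib(text: str) -> dict[str, int]:
    """Parse /proc/meminfo style text into {field: value in KiB}."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[name.strip()] = int(parts[0])
    return values


def available_mib(text: str) -> int:
    """Return MemAvailable in MiB; raises KeyError if the field is absent."""
    return meminfo_kib(text)["MemAvailable"] // 1024


if __name__ == "__main__":
    path = Path("/proc/meminfo")
    if path.exists():
        print(f"available: {available_mib(path.read_text())} MiB")
    else:
        print("/proc/meminfo not found (not a Linux system?)")
```

`MemAvailable` is the kernel's own estimate of how much memory could be used by new workloads without swapping, so it is usually a better health signal than `MemFree` alone.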
        
       | tomhow wrote:
       | Previously:
       | 
       |  _Linux Performance Analysis in 60,000 Milliseconds_ -
       | https://news.ycombinator.com/item?id=10652076 - Nov 2015 (11
       | comments)
       | 
       |  _Linux Performance Analysis_ -
       | https://news.ycombinator.com/item?id=10654681 - Dec 2015 (82
       | comments)
       | 
       |  _Linux Performance Analysis in 60k Milliseconds (2015) [pdf]_ -
       | https://news.ycombinator.com/item?id=44070741 - May 2025 (1
       | comment)
        
       ___________________________________________________________________
       (page generated 2025-07-29 23:01 UTC)