[HN Gopher] Prometheus metrics saves us from painful kernel debugging (2022)
       ___________________________________________________________________
        
       Prometheus metrics saves us from painful kernel debugging (2022)
        
       Author : goranmoomin
       Score  : 77 points
       Date   : 2024-07-21 10:34 UTC (12 hours ago)
        
 (HTM) web link (utcc.utoronto.ca)
 (TXT) w3m dump (utcc.utoronto.ca)
        
       | lordnacho wrote:
       | Sounds like they have node on their machines. Not the js
       | framework, but the prometheus/grafana package that gives you all
       | the meters for a generic system monitoring dashboard. Disk usage,
       | CPU, memory, it's all set up already, just plug and play.
       | 
       | In fact, I found a memory leak this way not long ago.
       | 
       | Super useful having this on your infra, saves a lot of time.
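        | 
        | If you already have Prometheus scraping node_exporter, the
        | leak hunt is mostly one query away. A rough sketch of pulling
        | the trend out of the Prometheus HTTP API in Python (the host,
        | port and 24h window are just assumptions for the example):
        | 
        |     # Pull a day of available-memory samples from Prometheus'
        |     # query_range API; a slow leak shows up as a steady
        |     # downward trend per instance. Assumes Prometheus on
        |     # localhost:9090 scraping node_exporter.
        |     import json, time, urllib.parse, urllib.request
        | 
        |     params = urllib.parse.urlencode({
        |         "query": "node_memory_MemAvailable_bytes",
        |         "start": time.time() - 86400,
        |         "end": time.time(),
        |         "step": "300",
        |     })
        |     url = f"http://localhost:9090/api/v1/query_range?{params}"
        |     with urllib.request.urlopen(url) as resp:
        |         result = json.load(resp)["data"]["result"]
        |     for series in result:
        |         values = [float(v) for _, v in series["values"]]
        |         print(series["metric"].get("instance"),
        |               f"min={min(values):.0f}", f"max={max(values):.0f}")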
        
         | gouthamve wrote:
         | Yup, https://github.com/prometheus/node_exporter is the
         | standard way to monitor machine metrics with Prometheus.
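          | 
          | It serves plain-text metrics over HTTP (port 9100 by
          | default), so it's easy to sanity-check before wiring up a
          | scrape job. A minimal sketch, assuming a default local
          | install:
          | 
          |     # Fetch node_exporter's plain-text exposition output and
          |     # print the memory-related series. Assumes the exporter
          |     # is running locally on its default port 9100.
          |     import urllib.request
          | 
          |     url = "http://localhost:9100/metrics"
          |     with urllib.request.urlopen(url) as resp:
          |         for line in resp.read().decode().splitlines():
          |             if line.startswith("node_memory_"):
          |                 print(line)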
        
         | alfons_foobar wrote:
         | FYI, it's called "node_exporter" ;)
        
       | tass wrote:
        | This awakened a memory from last year, when a colleague and I
        | were trying to understand where an increase in Linux memory
        | usage was coming from on machines that hadn't been rebooted in
        | a while. We were alerted to it by Prometheus metrics.
       | 
       | Even after all apps had been restarted, it persisted. Turned out
       | to be a leak of slab memory allocations by a kernel module. That
       | kernel module had since been updated, but all previous versions
       | were still loaded by the kernel so the leak persisted until the
       | next reboot.
       | 
        | The leaky kernel module was CrowdStrike's Falcon sensor. It
        | started a discussion: engineering had no option but to run
        | these things for the sake of security, yet there were no
        | instances where the sensor actually caught anything, while it
        | had the potential to cause incidents and outages.
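        | 
        | For reference, the slab numbers come from /proc/meminfo (which
        | node_exporter also exposes), so you can watch the growth by
        | hand as well. A rough sketch of the idea:
        | 
        |     # Print the kernel slab counters from /proc/meminfo. Run it
        |     # periodically: a leaking kernel module shows up as Slab /
        |     # SUnreclaim climbing while userspace memory stays flat.
        |     FIELDS = ("Slab:", "SReclaimable:", "SUnreclaim:")
        | 
        |     with open("/proc/meminfo") as f:
        |         for line in f:
        |             if line.startswith(FIELDS):
        |                 print(line.strip())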
        
         | josephcsible wrote:
         | Do you mean "engineering had no option but to run these things
         | for the sake of compliance"?
        
           | tass wrote:
           | Compliance was the reasoning they raised, yes.
           | 
           | Whether you require this particular tool to meet compliance,
           | and whether that tool needs to be deployed across the entire
           | stack might be open to interpretation though.
        
             | atomicnumber3 wrote:
             | Open to interpretation but it often ends up something like
             | this:
             | 
             | 1. "Do we need to use this shovelware? Can't we achieve
             | compliance via <a few sane steps>?
             | 
             | 2. CTO/CEO: We can either use this software and you don't
             | have to think about it, or we can do your thing and I'm
             | going to hang you out to dry the second literally anything
             | goes wrong with it.
             | 
             | 3. "Ah. Ok nevermind." [starts updating resume]
             | 
             | IMO it's part of this toxic culture against DIYing
             | literally anything. Yes NIH can be bad, but so is the
             | complete opposite where you make insane decisions
             | (technical or money-wise) because of a culture where people
             | are punished for trying to DIY when it does make sense.
             | 
             | At a particularly toxic past org of mine, we actually had a
             | "joke" (not the ha-ha kind of joke, the i can't believe we
             | actually do this kind) where for a few key decisions we
             | appointed someone to be the "Jesus" for that decision -
             | someone who was already planning to quit, so they could
             | make a decision that was correct but would get them
             | punished, then "die for our sins", leaving us with the
             | benefits of a correct decision having been made but without
             | any of the political fallout of having done something that
             | offends the sensibilities of leadership.
             | 
              | Yes, eventually you run out of messiahs to sacrifice, and
              | yes, it sucked to work there a lot. But damn did they pay
              | really well, so, y'know.
        
               | znpy wrote:
               | > against DIYing literally anything
               | 
                | I get your feeling but I largely disagree: I've seen too
                | many DIY mini-projects that implemented just enough and
                | became too important, and when they broke no one was
                | around to fix them (the implementer had left long ago and
                | left scant to no documentation at all).
                | 
                | The nice thing about using an off-the-shelf product is
                | that usually it either comes with support or there's some
                | kind of community where you can go and ask for help.
               | 
               | There is this insidious cognitive bias in (software?)
               | engineers to only consider the happy path when thinking
               | about the consequences of their actions.
        
           | nsguy wrote:
           | As far as I'm aware there's never a requirement to use a
           | specific product/technology for compliance. The standards
           | will require processes or something very general and the way
           | a company complies is their choice. It's certainly possible
           | that a company will buy a certain product because it checks
           | some compliance box but I would expect there are many other
           | ways to check that box.
        
         | znpy wrote:
         | > CrowdStrike's falcon sensor
         | 
         | Oh god, is that falcon thing from crowdstrike as well?
        
       | spiffytech wrote:
       | My team spent weeks using log-aggregated metrics to gradually
       | figure out why servers' clocks would go out of whack.
       | 
       | It turned out Docker Swarm made undocumented+ use of a UDP port
       | that some VMware product also used, and once in a while they'd
       | cross the streams.
       | 
       | We only figured it out because we put every system event we could
       | find onto a Grafana graph and narrowed down which ones kept
       | happening at the same time.
       | 
       | + I think? It's been a while, might have just been hard to find.
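        | 
        | The "kept happening at the same time" check doesn't need
        | anything fancy: bucket each event stream into time windows and
        | intersect. A toy sketch with made-up timestamps (not our
        | actual data):
        | 
        |     # Toy version of "which events keep happening at the same
        |     # time": map timestamps to fixed windows and intersect the
        |     # sets. The event lists here are made up.
        |     def windows(timestamps, width=60):
        |         """Set of window indices that contain an event."""
        |         return {int(t) // width for t in timestamps}
        | 
        |     clock_jumps = [1650000030, 1650003640, 1650007205]    # hypothetical
        |     vmware_events = [1650000010, 1650003630, 1650009999]  # hypothetical
        | 
        |     shared = windows(clock_jumps) & windows(vmware_events)
        |     print(f"{len(shared)} windows where both kinds of event fired")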
        
       | louwrentius wrote:
       | As a side note, I'm into storage performance and the node-
       | exporter data is absolutely spot on. I performed storage
        | benchmarks with FIO and the metrics matched the load and the
        | reported OS metrics (iostat) perfectly.
       | 
       | I actually made a Grafana dashboard[0] for it, but haven't used
       | this in a while myself.
       | 
       | [0]: https://grafana.com/grafana/dashboards/11801-i-o-statistics/
        
         | nsguy wrote:
         | It's all standard Linux metrics that node-exporter exposes.
         | Mostly from procfs.
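          | 
          | e.g. the disk counters are just /proc/diskstats. Something
          | like this (a rough sketch) reads the same raw numbers that
          | node_exporter's node_disk_* series and iostat are built
          | from; sector counts in that file are always 512-byte units:
          | 
          |     # Cumulative per-device I/O counters from /proc/diskstats.
          |     # Field 3 is the device name, field 6 sectors read and
          |     # field 10 sectors written (1-based); sample twice and
          |     # take deltas to get the rates iostat shows.
          |     with open("/proc/diskstats") as f:
          |         for line in f:
          |             fields = line.split()
          |             dev = fields[2]
          |             read_b = int(fields[5]) * 512
          |             written_b = int(fields[9]) * 512
          |             print(f"{dev}: read {read_b} B, written {written_b} B")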
        
       ___________________________________________________________________
       (page generated 2024-07-21 23:06 UTC)