[HN Gopher] Prometheus metrics saves us from painful kernel debu...
___________________________________________________________________
Prometheus metrics saves us from painful kernel debugging (2022)
Author : goranmoomin
Score : 77 points
Date : 2024-07-21 10:34 UTC (12 hours ago)
(HTM) web link (utcc.utoronto.ca)
(TXT) w3m dump (utcc.utoronto.ca)
| lordnacho wrote:
| Sounds like they have node on their machines. Not the js
| framework, but the prometheus/grafana package that gives you all
| the meters for a generic system monitoring dashboard. Disk usage,
| CPU, memory, it's all set up already, just plug and play.
|
| In fact, I found a memory leak this way not long ago.
|
| Super useful having this on your infra, saves a lot of time.
| gouthamve wrote:
| Yup, https://github.com/prometheus/node_exporter is the
| standard way to monitor machine metrics with Prometheus.
| alfons_foobar wrote:
| FYI, it's called "node_exporter" ;)
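
node_exporter serves its samples in the Prometheus text exposition format, by default on port 9100 at `/metrics`. As a rough illustration of what a scraper sees, here is a simplified parser for that format; the sample text and the port are assumptions for the sketch, and real label values can contain escaped commas that this deliberately does not handle.

```python
import re

# Minimal parser for the Prometheus text exposition format, the format
# node_exporter serves on http://localhost:9100/metrics (9100 is the
# exporter's default port; adjust if yours differs).
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="..."} set
    r'\s+(?P<value>\S+)$'                    # sample value
)

def parse_metrics(text):
    """Yield (name, labels_dict, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip HELP/TYPE comments
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        labels = {}
        if m.group('labels'):
            # Simplified: assumes no commas inside quoted label values.
            for pair in m.group('labels').split(','):
                k, v = pair.split('=', 1)
                labels[k] = v.strip('"')
        yield m.group('name'), labels, float(m.group('value'))

# Illustrative sample of node_exporter output (real output is far longer).
sample = """\
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.254390272e+09
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 4.2e+10
"""

for name, labels, value in parse_metrics(sample):
    print(name, labels, value)
```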
| tass wrote:
| This awakened a memory from last year, when a colleague and I
| were trying to understand where an increase in Linux memory
| usage was coming from on machines that hadn't been rebooted in
| a while. We were alerted to it by Prometheus metrics.
|
| Even after all apps had been restarted, it persisted. Turned out
| to be a leak of slab memory allocations by a kernel module. That
| kernel module had since been updated, but all previous versions
| were still loaded by the kernel so the leak persisted until the
| next reboot.
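
The slab figures in question come from /proc/meminfo (node_exporter exposes them as node_memory_Slab_bytes, node_memory_SReclaimable_bytes, and node_memory_SUnreclaim_bytes). A minimal sketch of reading them directly, with an illustrative snapshot rather than real host data:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Field:  123 kB' lines into bytes."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(':')
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == 'kB':
            value *= 1024
        fields[key] = value
    return fields

# Illustrative snapshot; on a real host use open('/proc/meminfo').read().
sample = """\
MemTotal:       32614312 kB
MemAvailable:   21050148 kB
Slab:            9318244 kB
SReclaimable:     412040 kB
SUnreclaim:      8906204 kB
"""

info = parse_meminfo(sample)
# A large, growing SUnreclaim that survives restarting every userspace
# process points at a kernel-side leak, like the one described above.
slab_pct = 100 * info['Slab'] / info['MemTotal']
print(f"Slab is {slab_pct:.1f}% of RAM, "
      f"{info['SUnreclaim'] / 2**20:.0f} MiB of it unreclaimable")
```

To see which caches are growing, `slabtop` or /proc/slabinfo breaks the total down per allocator.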
|
| The leaky kernel module was CrowdStrike's Falcon sensor. It
| started a discussion: engineering had no option but to run
| these things for the sake of security, yet there were no
| instances where the sensor actually caught anything, while it
| had the potential to cause incidents and outages.
| josephcsible wrote:
| Do you mean "engineering had no option but to run these things
| for the sake of compliance"?
| tass wrote:
| Compliance was the reasoning they raised, yes.
|
| Whether you require this particular tool to meet compliance,
| and whether that tool needs to be deployed across the entire
| stack might be open to interpretation though.
| atomicnumber3 wrote:
| Open to interpretation but it often ends up something like
| this:
|
| 1. "Do we need to use this shovelware? Can't we achieve
| compliance via <a few sane steps>?
|
| 2. CTO/CEO: We can either use this software and you don't
| have to think about it, or we can do your thing and I'm
| going to hang you out to dry the second literally anything
| goes wrong with it.
|
| 3. "Ah. Ok nevermind." [starts updating resume]
|
| IMO it's part of this toxic culture against DIYing
| literally anything. Yes NIH can be bad, but so is the
| complete opposite where you make insane decisions
| (technical or money-wise) because of a culture where people
| are punished for trying to DIY when it does make sense.
|
| At a particularly toxic past org of mine, we actually had a
| "joke" (not the ha-ha kind of joke, the i can't believe we
| actually do this kind) where for a few key decisions we
| appointed someone to be the "Jesus" for that decision -
| someone who was already planning to quit, so they could
| make a decision that was correct but would get them
| punished, then "die for our sins", leaving us with the
| benefits of a correct decision having been made but without
| any of the political fallout of having done something that
| offends the sensibilities of leadership.
|
| Yes, eventually you run out of messiahs to sacrifice, and
| yes, it sucked to work there a lot. But damn did they pay
| really well, so, y'know.
| znpy wrote:
| > against DIYing literally anything
|
| I get your feeling but I largely disagree: I've seen too
| many DIY mini-projects that implemented just enough and
| became too important, and when they broke no one was
| around to fix them (the implementer had left long ago
| and left scant to no documentation).
|
| The nice thing about using an off-the-shelf product is
| that usually it either comes with support or there's some
| kind of community where you can go and ask for help.
|
| There is this insidious cognitive bias in (software?)
| engineers to only consider the happy path when thinking
| about the consequences of their actions.
| nsguy wrote:
| As far as I'm aware there's never a requirement to use a
| specific product/technology for compliance. The standards
| will require processes or something very general and the way
| a company complies is their choice. It's certainly possible
| that a company will buy a certain product because it checks
| some compliance box but I would expect there are many other
| ways to check that box.
| znpy wrote:
| > CrowdStrike's falcon sensor
|
| Oh god, is that falcon thing from crowdstrike as well?
| spiffytech wrote:
| My team spent weeks using log-aggregated metrics to gradually
| figure out why servers' clocks would go out of whack.
|
| It turned out Docker Swarm made undocumented+ use of a UDP port
| that some VMware product also used, and once in a while they'd
| cross the streams.
|
| We only figured it out because we put every system event we could
| find onto a Grafana graph and narrowed down which ones kept
| happening at the same time.
|
| + I think? It's been a while, might have just been hard to find.
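
The "put every event on one graph and see which ones line up" approach can be sketched mechanically: bucket event timestamps and count which pairs of event types keep landing in the same bucket. The event names below are hypothetical stand-ins for the kind of streams described above, not anything from the actual incident:

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(events, bucket_seconds=60):
    """Count how often each pair of event types lands in the same
    time bucket -- a crude stand-in for eyeballing aligned spikes
    on a Grafana graph. `events` is a list of (timestamp, type)."""
    buckets = {}
    for ts, kind in events:
        buckets.setdefault(int(ts // bucket_seconds), set()).add(kind)
    pairs = Counter()
    for kinds in buckets.values():
        for a, b in combinations(sorted(kinds), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical event log: clock steps keep coinciding with UDP traffic.
events = [
    (100, "ntp_clock_step"), (110, "swarm_udp_recv"),
    (400, "ntp_clock_step"), (415, "swarm_udp_recv"),
    (700, "vmware_heartbeat"),
    (900, "ntp_clock_step"), (905, "swarm_udp_recv"),
]

print(cooccurring_pairs(events).most_common(1))
```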
| louwrentius wrote:
| As a side note, I'm into storage performance and the
| node-exporter data is absolutely spot on. I performed storage
| benchmarks with FIO and the metrics matched the loads and
| reported OS metrics (iostat) perfectly.
|
| I actually made a Grafana dashboard[0] for it, but haven't used
| this in a while myself.
|
| [0]: https://grafana.com/grafana/dashboards/11801-i-o-statistics/
| nsguy wrote:
| It's all standard Linux metrics that node-exporter exposes.
| Mostly from procfs.
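
For the disk metrics specifically, the procfs source is /proc/diskstats, and the per-device counters map more or less directly onto node_exporter's node_disk_* series. A rough sketch of reading the first eight I/O counters, using an illustrative sample line rather than real host data:

```python
# First eight per-device counters in /proc/diskstats, in column order
# (see the kernel's Documentation/admin-guide/iostats.rst):
FIELDS = ["reads_completed", "reads_merged", "sectors_read", "ms_reading",
          "writes_completed", "writes_merged", "sectors_written", "ms_writing"]

def parse_diskstats(text):
    """Map device name -> dict of the first 8 I/O counters.
    Sector counts are in 512-byte units regardless of the device's
    real sector size."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 11:
            continue
        dev = parts[2]  # columns 1-2 are major/minor device numbers
        stats[dev] = dict(zip(FIELDS, map(int, parts[3:11])))
    return stats

# Illustrative line; on a real host use open('/proc/diskstats').read().
sample = "   8       0 sda 124993 6421 9752466 55223 88213 40752 12580418 922310 0 104532 977533"

s = parse_diskstats(sample)
mib_read = s["sda"]["sectors_read"] * 512 / 2**20
print(f"sda: {s['sda']['reads_completed']} reads, {mib_read:.0f} MiB read")
```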
___________________________________________________________________
(page generated 2024-07-21 23:06 UTC)