[HN Gopher] Monitoring My Homelab, Simply
       ___________________________________________________________________
        
       Monitoring My Homelab, Simply
        
       Author : Bogdanp
       Score  : 84 points
       Date   : 2025-07-10 11:28 UTC (3 days ago)
        
 (HTM) web link (b.tuxes.uk)
 (TXT) w3m dump (b.tuxes.uk)
        
       | Tractor8626 wrote:
        | Even in a homelab you should totally monitor things like
        | 
        | - RAID health
        | 
        | - free disk space
        | 
        | - whether backup jobs are running
        | 
        | - SSL certs expiring
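A couple of the checks above fit in a short Python sketch (the paths, hosts, and thresholds are placeholders, not anything from the thread):

```python
import shutil
import socket
import ssl
import time


def disk_free_percent(path="/"):
    """Free space on `path` as a percentage of its total capacity."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def cert_days_left(host, port=443):
    """Days until the TLS certificate served by host:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # cert_time_to_seconds parses the cert's "notAfter" date to epoch seconds.
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires - time.time()) // 86400)
```

Wired into cron, you would alert only when a threshold is crossed, e.g. `disk_free_percent("/") < 10` or `cert_days_left("example.com") < 14`.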
        
         | ahofmann wrote:
          | One could also manually look through this stuff every Sunday
          | at 5 pm. In a homelab, that can be enough.
        
           | tough wrote:
            | One could also just wait for things to stop working before
            | trying to fix them.
        
             | dewey wrote:
             | For backups that's usually not the best strategy.
        
           | sthuck wrote:
            | Look, I agree, but one can also manage with an always-on PC
            | and an external hard drive instead of a homelab. It's part
            | hobby, part learning experience.
            | 
            | Also, if you have kids aged 0-6 you can't schedule anything
            | reliably.
        
       | KaiserPro wrote:
       | I understand your pain.
       | 
        | I used to have Sensu, but it was a pain to keep updated (and
        | didn't work that well on old RPis).
        | 
        | What I did find was a good alternative: Telegraf -> some sort of
        | time series DB (I still really like Graphite; InfluxQL is utter
        | horse shit, and Prometheus's fucking pull model is bollocks).
       | 
       | Then I could create alert conditions on grafana. At least that
       | was simple.
       | 
        | However, the alerting on grafana moved from "move the handle,
        | adjust a threshold, get a configurable alert" to "craft a query,
        | get loads of unfilterable metadata as an alert".
        | 
        | It's still good enough.
        
         | cyberpunk wrote:
          | Why is the pull model bollocks? I've been building monitoring
          | for stuff since Nagios and Zabbix were the new hot tools, and
          | I can't really imagine preferring the old-school ways vs. the
          | pretty-much-industry-standard promstack these days...
        
           | KaiserPro wrote:
            | Zabbix is bollocks, and so is Nagios. Having remote root
            | access to all your stuff is utter shite.
           | 
            | Prometheus as a time series DB is great; I even like its QL.
            | What I don't like is pull. Sure, there is agent mode, or
            | Telegraf/Grafana Agent. But the idea that I need to hold my
            | state and wait for Prometheus to collect it is utterly
            | stupid. The biggest annoyance is that I need to have a
            | webserver somewhere, with a single god instance (or
            | instances) that can reach out and touch it.
           | 
           | Great if you have just one network, but a bollock ache if you
           | have any kind of network isolation.
           | 
            | This means that we are using InfluxDB and its shitty Flux QL
            | (I know we could upgrade, but that's hard).
        
             | cyberpunk wrote:
              | Eh, in a standard three-tier setup you're usually okay to
              | pull up and push down, aren't you? Run it in the lower
              | network.
              | 
              | We're all Kubernetes these days, so I guess I haven't
              | thought about it a lot in recent years.
        
           | mystifyingpoi wrote:
           | Both models are totally fine, for their specific use cases.
        
       | jamesholden wrote:
        | OK... so your solution uses, at minimum, a $5/month service.
        | Yikes, I'd prefer something like Pushover before that. :/
        
         | tough wrote:
         | or a shell script
        
         | faster wrote:
         | You can self-host ntfy.sh but then you need to find a place
         | outside of your infra to host it.
        
       | Scaevolus wrote:
       | I use Prometheus + Prometheus Alertmanager + Any Free Tier paging
       | system (currently OpsGenie, might move to AlertOps).
       | 
        | Having a feature-rich TSDB backing alerting minimizes the time
        | spent adding alerts, and the UX of being able to write a
        | potential alert expression and see when in the past it would
        | have fired is amazing.
       | 
       | Just two processes to run, either bare or containerized, and you
       | can throw in a Grafana instance if you want better graphs.
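For a sense of what those alert expressions look like, here is a minimal Prometheus alerting rule; the metric names assume node_exporter, and the threshold is a placeholder:

```yaml
groups:
  - name: homelab
    rules:
      - alert: DiskAlmostFull
        # Fires when any filesystem has been under 10% free for 15 minutes.
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
```

Alertmanager then routes the firing alert to whichever paging service is configured.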
        
       | JZL003 wrote:
        | I do something kinda similar. I have a Node Express server which
        | has lots of little async jobs; it throws them all into a
        | Promise.all and, if they're all good, sends 200; if not, it
        | sends 500 and the failing jobs. Then free uptime monitors check
        | every few hours and will email me if "the site goes down" = some
        | error. Kinda like a multiplexer to stay within their free
        | monitoring limit, and it's easy to add more tests.
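The commenter's version is Node; the same multiplexer idea sketched in Python, with checks as plain callables (the names and timeout are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks(checks):
    """Run all named checks concurrently; return an HTTP-style status and
    the list of failures, mirroring the 200/500 multiplexer idea."""
    failures = []
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, fut in futures.items():
            try:
                fut.result(timeout=30)  # a check fails by raising
            except Exception as exc:
                failures.append(f"{name}: {exc}")
    return (200, []) if not failures else (500, failures)
```

A tiny HTTP handler can then return that status, so one external uptime monitor covers every internal check.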
        
       | loloquwowndueo wrote:
       | Did he reinvent monit?
       | 
       | Even a quick Prometheus + alert manager setup with two docker
       | containers is not difficult to manage - mine just works, I seldom
       | have to touch it (mainly when I need to tweak the alert queries).
       | 
       | I use pushover for easy api-driven notifications to my phone,
       | it's a one-time $7 fee or so and it was money well spent.
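Pushover's API is a single form POST to its messages endpoint; a minimal sketch, where the token and user key are placeholders you get from your Pushover account:

```python
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_payload(token, user, message):
    """Assemble the form fields Pushover expects."""
    return {"token": token, "user": user, "message": message}


def notify(token, user, message):
    """POST the alert to Pushover; returns the HTTP status code."""
    data = urllib.parse.urlencode(build_payload(token, user, message)).encode()
    with urllib.request.urlopen(PUSHOVER_URL, data=data, timeout=10) as resp:
        return resp.status
```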
        
         | atomicnumber3 wrote:
         | I have a similar setup, prometheus and grafana (alertmanager is
         | a separate thing from the normal grafana setup, right? I'm not
         | even using that), and I use discord webhooks for notifications
         | to my phone (I just @ myself or use channel notification
         | settings).
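A Discord webhook notification like the one described is just a JSON POST; a minimal sketch (the webhook URL and user ID are placeholders):

```python
import json
import urllib.request


def build_alert(message, mention_user_id=None):
    """Build the Discord webhook payload; "<@id>" produces an @-mention."""
    content = f"<@{mention_user_id}> {message}" if mention_user_id else message
    return {"content": content}


def send_alert(webhook_url, message, mention_user_id=None):
    """POST the alert to a Discord webhook URL; returns the HTTP status."""
    data = json.dumps(build_alert(message, mention_user_id)).encode()
    req = urllib.request.Request(
        webhook_url, data=data,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```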
        
       | frenchtoast8 wrote:
        | At work I use Datadog, but it's very expensive for a homelab:
        | $15/mo per host (and for cost reasons I prefer multiple cheap
        | servers to a single large one).
       | 
       | NewRelic and Grafana Cloud have pretty good free plan limits, but
       | I'm paying for that in effort because I don't use either at work
       | so it's not what I'm used to.
        
         | SteveNuts wrote:
         | The Datadog IoT agents are cheaper, but still probably more
         | than you'd want to spend on a lab.
         | 
          | You also only get system metrics, no integrations - but most
          | metrics and checks can be done remotely with a single
          | dedicated agent.
        
       | Evidlo wrote:
       | My solution is to just be OK with http status checking (run a
       | webserver on important machines), and use a service like
       | updown.io which is so cheap it's almost free.
       | 
        | E.g., for 1 machine, hourly checking is ~$0.25/year.
        
       | Havoc wrote:
        | I personally found Uptime Kuma to be easiest because it has a
        | Python API package to bulk-load stuff into it.
        | 
        | Much easier to edit a list in VS Code than click around a bunch
        | in an app.
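That bulk-load workflow could look roughly like this. The `uptime_kuma_api` import and the `add_monitor`/`login` calls are assumptions based on the community `uptime-kuma-api` package; the sites, server URL, and credentials are placeholders:

```python
# Editable list: tweak in your editor, then re-run the loader.
SITES = [
    ("blog", "https://example.com"),
    ("nas", "https://nas.local"),
]


def to_monitors(sites):
    """Convert the editable (name, url) list into monitor definitions."""
    return [{"type": "http", "name": name, "url": url} for name, url in sites]


def load_all(sites, base_url, user, password):
    """Push every monitor into Uptime Kuma (API calls are assumed, see above)."""
    from uptime_kuma_api import UptimeKumaApi  # community package, not stdlib
    with UptimeKumaApi(base_url) as api:
        api.login(user, password)
        for mon in to_monitors(sites):
            api.add_monitor(**mon)
```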
        
       | bonobocop wrote:
       | Quite like Cloudprober for this tbh:
       | https://cloudprober.org/docs/how-to/alerting/
       | 
        | Easy to configure, easy to extend with Go, and slots into
        | alerting.
        
       | jauntywundrkind wrote:
        | There's an article bias towards rejectionism, towards single-shot
        | adventures: "I didn't grok so-and-so and here are the shell
        | scripts I wrote instead".
       | 
       | Especially for home cloud, home ops, home labs: that's great!
       | That's awesome that you did for yourself, that you wrote up your
       | experience.
       | 
        | But in general I feel like there's a huge missing middle of
        | operations & sysadminery that creates a distorted, weird
        | narrative. There are few people out there starting their journey
        | with Prometheus and blogging helpfully through it. There are few
        | people midway through their k8s work talking about their
        | challenges and victories. The tales of just muddling through, of
        | perseverance, of looking for information, of trying to find
        | signal through the noise are few.
       | 
       | What we get a lot of is "this was too much for me so I wrote my
       | own thing instead". Or, "we have been doing such and such for
       | years and found such and such to shave 20% compute" or "we needed
        | this capability so added Z to our k8s cluster like so". The
        | journey is so often missing; we don't have stories of trying and
        | learning. We have stories, like this one, of making.
       | 
        | There's such a background of 'too complex' that I really worry
        | leads us spiritually astray. I'm happy for articles like this;
        | it's awesome to see ingenuity on display. But there are so many
        | good, amazing, robust tools out there that seem to have lots of
        | people happily (or at least adequately) using them, yet the
        | stories of turning back from the attempt, of eschewing the
        | battle-tested, widely adopted software, drive so much of the
        | narrative and have so much more ink spilled over them.
       | 
        | Very thankful for the Flix language putting Rich Hickey's
        | principle of _Simple isn't Easy_ first, for helping re-orient me
        | by the axis of Hickey's old grand guidance. I feel like there's
        | such a loud clamor generally for _easy,_ for scripts you throw
        | together, for the intimacy of tiny systems. And I admire a lot
        | of these principles! But I also think there's a horrible
        | backwardsness that doesn't help, that drives us away from more
        | comprehensive, capable, integrative systems that can do amazing
        | things, that are scalable both performance-wise (as Prometheus
        | certainly is) and organizationally (such that other people and
        | other experts will also lastingly use and build from them). The
        | preselection for _easy_ is attainable individually and quickly,
        | but real _simple_ requires vastly more, requires so much more
        | thought and planning and structure.
       | https://www.infoq.com/presentations/Simple-Made-Easy/
       | 
        | It's so weird to find myself such a Cathedral-but-open-source fan
        | today. Growing up, the Bazaar model made such sense, had such
        | virtue to it. And I still believe in the Bazaar, in the wide
        | world teeming with different softwares. But I worry what lessons
        | are most visible, worry what we pass along, worry about the
        | proliferation of software discontent against the really good
        | open source software that we do collaborate on en masse. It
        | feels like there's a massive self-sabotage going on, that so
        | many people are radicalized and sold a story of discontent
        | against bigger, more robust, more popular open source software.
        | I'd love to hear that view, but I want smaller individuals and
        | voices also making a chorus of happy noise about how far they
        | get, how magical, how powerful it is that we have so many
        | amazing, fantastic, bigger open source projects that so scalably
        | enable so much.
       | https://en.m.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar
        
         | cgriswald wrote:
         | This is sort of a ramble, so I apologize in advance.
         | 
         | I love the idea of writing up my ultimately-successful
         | experiences of using open source software. I'm currently
         | working on a big (for me anyway) project for my homelab
         | involving a bunch of stuff I've either never or rarely done
         | before. But... if I were to write about my experiences, a lot
         | of it would be "I'm an idiot and I spent two hours with a valid
         | but bad config because I misunderstood what the documentation
         | was telling me about the syntax and yeah, I learned a bit more
         | about reading the log file for X, but that was fundamentally
         | pointless because it didn't really answer the question." I'd
         | also have to keep track of what I did that didn't work, which
         | adds a lot more work than just keeping track of what did work.
         | 
         | There's also a social aspect there where I don't want to
         | necessarily reveal the precise nature of my idiocy to strangers
         | over the internet. This might be the whole thing here for a lot
         | of people. "Look at this awesome script I made because I'm a
         | rugged and capable individualist" is probably an easier self-
         | sell than "Despite my best efforts, I managed to scrounge
         | together a system that works using pieces made by people
         | smarter than me."
         | 
         | I think I might try. My main concern is whether it will ruin
         | the fun. When I set up Prometheus, I had a lot of fun, even
         | through the mistakes. But, would also trying to write about it
         | make it less fun? Would other people even be interested in a
         | haphazard floundering equivalent to reading about someone's
         | experience with a homework assignment? Would I learn more?
         | Would the frustrating moments be worse or would the process of
         | thinking through things (because I am going to write about it)
         | lead to my mistakes becoming apparent earlier? Will my ego
          | survive people judging my process, conclusions, _and_ writing?
          | I don't know. Maybe it'll be fun to find out.
        
       | reboot81 wrote:
        | Just love https://healthchecks.io - I set it up on all my boxes
        | with these scripts:
        | 
        | win: https://github.com/reboot81/healthchecks_service_ps/
        | 
        | macos: https://github.com/reboot81/hc_check_maker_macos
        | 
        | linux: https://healthchecks.io/docs/bash/
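The pattern behind those scripts is a dead-man's-switch ping: a success hits the check's UUID URL and a failure appends `/fail`, per the healthchecks.io pinging docs. A minimal Python wrapper (the UUID and command are placeholders):

```python
import subprocess
import urllib.request

PING_BASE = "https://hc-ping.com"  # healthchecks.io ping endpoint


def ping_url(check_uuid, ok=True):
    """Success pings hit the bare UUID URL; failures append /fail."""
    url = f"{PING_BASE}/{check_uuid}"
    return url if ok else url + "/fail"


def run_and_report(check_uuid, cmd):
    """Run a job (e.g. a backup) and report its outcome to healthchecks.io."""
    result = subprocess.run(cmd)
    urllib.request.urlopen(ping_url(check_uuid, result.returncode == 0),
                           timeout=10)
    return result.returncode
```

If the job stops running entirely, no ping arrives and healthchecks.io alerts you anyway, which is the main advantage over alerting only on explicit failures.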
        
       | danesparza wrote:
       | Uptime Kuma (https://github.com/louislam/uptime-kuma). With email
       | notifications. So much simpler, and free.
        
         | seriocomic wrote:
          | Love this - tried it. The problem, as I see it, is that these
          | still require hosting - ideally (again, as I see it), a self-
          | hosted script that monitors internal/homelab things also
          | requires its own monitoring.
          | 
          | Short of paying for a service (which somewhat goes against the
          | grain of trying to host all your own stuff), the closest I can
          | come up with is relying on a service outside your network that
          | has access to your network (via a tunnel/VPN).
          | 
          | Given that a lot of my own networking setup (DNS/domains/
          | tunnels etc.) is already managed via Cloudflare, I'm thinking
          | of using some compute at that layer to provide a monitoring
          | service. Probably something to throw next at my new LLM
          | developer...
        
       | justusthane wrote:
       | I've been facing a similar search for an ultra-simple but ultra-
       | extensible monitoring solution for my homelab. I've had the idea
       | to write a Python program where the main script is just
       | responsible for scheduling and executing the checks, logging, and
       | alerting based on set thresholds.
       | 
       | All monitoring would be handled via plugins, which would be
       | extremely easy to write.
       | 
       | It would ship with a few core plugins (ping, http, cert check,
       | maybe snmp), but you could easily write a plugin to monitor
       | anything else -- for example, you could use the existing Python
       | Minecraft library and write a plugin to monitor your Minecraft
       | server. Or maybe even the ability to write plugins in any
       | language, not just Python.
       | 
       | I'm not a developer and I'm opposed to vibe coding, so it'll be
       | slow going :)
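The plugin design described above could start as little more than a registry of check functions; a minimal sketch where every name is hypothetical:

```python
PLUGINS = {}


def plugin(name):
    """Decorator that registers a check function as a plugin."""
    def register(fn):
        PLUGINS[name] = fn
        return fn
    return register


@plugin("ping-localhost")
def ping_check():
    # A real plugin would open a socket or shell out to ping.
    return True, "localhost reachable"


def run_all():
    """Run every registered plugin; return the failures to alert on."""
    failures = []
    for name, fn in PLUGINS.items():
        try:
            ok, detail = fn()
        except Exception as exc:
            ok, detail = False, str(exc)
        if not ok:
            failures.append((name, detail))
    return failures
```

The core loop never needs to know what a plugin checks, only that it returns `(ok, detail)`, which is what keeps checks like the hypothetical Minecraft one easy to bolt on. Plugins in other languages could follow the same contract as subprocesses whose exit code and output map to `(ok, detail)`.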
        
       | Spooky23 wrote:
        | I'm using node_exporter with Prometheus and Grafana. I also use
        | Uptime Kuma, and send alerts via Pushover.
        | 
        | It's shockingly easy to set up. I have the monitoring stack
        | living on a GCP host that I have set up for various things, and
        | have it connected via Tailscale.
        | 
        | It actually paid for itself by alerting me to low-voltage events
        | via NUT. I probably would have lost some gear to poor electrical
        | conditions.
        
       ___________________________________________________________________
       (page generated 2025-07-13 23:00 UTC)