[HN Gopher] Monitoring My Homelab, Simply
___________________________________________________________________
Monitoring My Homelab, Simply
Author : Bogdanp
Score : 84 points
Date : 2025-07-10 11:28 UTC (3 days ago)
(HTM) web link (b.tuxes.uk)
(TXT) w3m dump (b.tuxes.uk)
| Tractor8626 wrote:
| Even in a homelab you should totally monitor things like
|
| - raid health
|
| - free disk space
|
| - whether backup jobs are running
|
| - ssl certs expiring
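For reference, the first two of those checks fit in a few lines of stdlib Python; the mount point, threshold, and example host below are placeholders, not anything from the thread:

```python
import shutil
import socket
import ssl
from datetime import datetime, timezone

def disk_ok(path, min_free_gb=10):
    """True if the filesystem holding `path` has at least min_free_gb free."""
    return shutil.disk_usage(path).free / 1e9 >= min_free_gb

def cert_days_left(host, port=443):
    """Days until the TLS certificate served by host:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expires.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

# Example usage (the cert check needs network, so it is commented out):
# if not disk_ok("/", min_free_gb=10): print("ALERT: low disk on /")
# if cert_days_left("example.org") < 14: print("ALERT: cert expiring soon")
```

RAID health and backup-job checks are more setup-specific (mdadm, ZFS, whatever backup tool you run), so they are left out here.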
| ahofmann wrote:
| One could also manually look through this stuff every Sunday at
| 5 pm. In a homelab, this can be enough.
| tough wrote:
| One could also just wait for things to stop working before
| trying to fix them
| dewey wrote:
| For backups that's usually not the best strategy.
| sthuck wrote:
| Look, I agree, but one can also manage with an always-on PC and
| an external hard drive instead of a homelab. It's part hobby,
| part learning experience.
|
| Also if you have kids aged 0-6 you can't schedule anything
| reliably
| KaiserPro wrote:
| I understand your pain.
|
| I used to have sensu, but it was a pain to keep updated (and
| didn't work that well on old rpis)
|
| But what I did find was that a good alternative was telegraf ->
| some sort of time series DB (I still really like graphite,
| influxQL is utter horse shit, and prometheus's fucking pull model
| is bollocks)
|
| Then I could create alert conditions on grafana. At least that
| was simple.
|
| However the alerting on grafana moved from being "move the
| handle, adjust a threshold, get a configurable alert" to "craft
| a query, get loads of unfilterable metadata as an alert".
|
| It's still good enough.
| cyberpunk wrote:
| Why is the pull model bollocks? I've been building monitoring
| for stuff since nagios and zabbix were the new hot tools; and I
| can't really imagine preferring the oldschool ways vs the
| pretty much industry standard of promstack these days...
| KaiserPro wrote:
| Zabbix is bollocks. So is nagios. Having remote root access
| to all your stuff is utter shite.
|
| Prometheus as a time series DB is great, I even like its QL.
| What I don't like is pull. Sure there is agent mode or
| telegraf/grafana agent. But the idea that I need to hold my
| state and wait for Prometheus to collect it is utterly
| stupid. The biggest annoyance is that I need to have a
| webserver somewhere, with a single god instance(s) that can
| reach out and touch it.
|
| Great if you have just one network, but a bollock ache if you
| have any kind of network isolation.
|
| This means that we are using influxdb and its shitty flux QL
| (I know we could upgrade, but thats hard)
| cyberpunk wrote:
| Eh, in a standard three-tier setup you're usually okay to pull
| up and push down, aren't you? Run it in the lower network..
|
| We're all kubernetes these days so I guess I didn't think
| about it a lot in recent years.
| mystifyingpoi wrote:
| Both models are totally fine, for their specific use cases.
| jamesholden wrote:
| ok.. so your solution is using at minimum a $5/month service.
| Yikes, I'd prefer something like pushover before that. :/
| tough wrote:
| or a shell script
| faster wrote:
| You can self-host ntfy.sh but then you need to find a place
| outside of your infra to host it.
| Scaevolus wrote:
| I use Prometheus + Prometheus Alertmanager + Any Free Tier paging
| system (currently OpsGenie, might move to AlertOps).
|
| Having a feature-rich TSDB backing alerting minimizes time adding
| alerts, and the UX of being able to write a potential alert
| expression and seeing when in the past it would fire is amazing.
|
| Just two processes to run, either bare or containerized, and you
| can throw in a Grafana instance if you want better graphs.
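For anyone unfamiliar with the Alertmanager workflow described above, a minimal Prometheus rules file looks roughly like this (assuming node_exporter metrics; the group name and threshold are illustrative):

```yaml
groups:
  - name: homelab
    rules:
      - alert: LowRootDisk
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk free on {{ $labels.instance }}"
```

Alertmanager then handles routing and deduplication to whatever pager or webhook you point it at.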
| JZL003 wrote:
| I do something kinda similar. I have a node express server which
| has lots of little async jobs; throw them all into a promise.all,
| and if they're all good, send 200, if not send 500 and the
| failing jobs.
| Then free uptime monitors check every few hours and will email me
| if "the site goes down"=some error. Kinda like a multiplexer to
| stay within their free monitoring limit and easy to add more
| tests
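The multiplexer pattern described above (the original is Node/Express) can be sketched with Python's asyncio; the check functions here are stubs standing in for the real async jobs:

```python
import asyncio

# Stub checks standing in for real async jobs (db ping, backup age, etc.)
async def check_disk():
    return True

async def check_backup():
    return True

CHECKS = {"disk": check_disk, "backup": check_backup}

async def run_checks():
    """Run every check concurrently; 200 if all pass, else 500 plus failures."""
    results = await asyncio.gather(
        *(fn() for fn in CHECKS.values()), return_exceptions=True
    )
    failed = [name for name, ok in zip(CHECKS, results) if ok is not True]
    return (200 if not failed else 500), failed

status, failed = asyncio.run(run_checks())
```

An HTTP handler would just return `status` as the response code and `failed` in the body, so one free external uptime check covers every internal check at once.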
| loloquwowndueo wrote:
| Did he reinvent monit?
|
| Even a quick Prometheus + alert manager setup with two docker
| containers is not difficult to manage - mine just works, I seldom
| have to touch it (mainly when I need to tweak the alert queries).
|
| I use pushover for easy api-driven notifications to my phone,
| it's a one-time $7 fee or so and it was money well spent.
| atomicnumber3 wrote:
| I have a similar setup, prometheus and grafana (alertmanager is
| a separate thing from the normal grafana setup, right? I'm not
| even using that), and I use discord webhooks for notifications
| to my phone (I just @ myself or use channel notification
| settings).
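The Discord webhook side of this is just a JSON POST with a content field, and <@user_id> in the content triggers the mention; a sketch (the webhook URL and user id are placeholders):

```python
import json
import urllib.request

def alert_payload(message, user_id=None):
    """Build the JSON body a Discord webhook expects; <@id> pings that user."""
    content = f"<@{user_id}> {message}" if user_id else message
    return {"content": content}

def send_alert(webhook_url, message, user_id=None):
    """POST an alert to a Discord webhook (the URL is a placeholder)."""
    body = json.dumps(alert_payload(message, user_id)).encode()
    req = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)
```

Grafana can also call a webhook like this directly via its webhook contact point, so no extra glue process is strictly needed.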
| frenchtoast8 wrote:
| At work I use Datadog, but it's very expensive for a homelab:
| $15/mo per host (and for cost I prefer multiple cheap servers
| over a single large one).
|
| NewRelic and Grafana Cloud have pretty good free plan limits, but
| I'm paying for that in effort because I don't use either at work
| so it's not what I'm used to.
| SteveNuts wrote:
| The Datadog IoT agents are cheaper, but still probably more
| than you'd want to spend on a lab.
|
| You also only get system metrics, no integrations - but most
| metrics and checks can be done remotely with a single dedicated
| agent
| Evidlo wrote:
| My solution is to just be OK with http status checking (run a
| webserver on important machines), and use a service like
| updown.io which is so cheap it's almost free.
|
| e.g. For 1 machine, hourly checking is ~$0.25/year
| Havoc wrote:
| I personally found uptime kuma to be easiest because it has a
| python api package to bulk load stuff into it.
|
| Much easier to edit a list in vscode than click around a bunch in
| an app
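The package meant here is presumably the third-party uptime-kuma-api; this sketch assumes its login/add_monitor interface (which may differ in detail) and keeps the list-diffing logic separate so the network part is confined to one function:

```python
# Desired monitors, kept as plain data that is easy to edit in vscode.
WANTED = [
    {"name": "router", "url": "http://192.168.1.1"},
    {"name": "nas", "url": "http://192.168.1.2:5000"},
]

def missing_monitors(wanted, existing_names):
    """Return the wanted entries not already present in Uptime Kuma."""
    return [m for m in wanted if m["name"] not in existing_names]

def sync_monitors(base_url, user, password, wanted=WANTED):
    """Push any missing monitors into a running Uptime Kuma instance."""
    # Assumes the third-party uptime-kuma-api package; its exact
    # signatures may differ from this sketch.
    from uptime_kuma_api import UptimeKumaApi, MonitorType
    api = UptimeKumaApi(base_url)
    api.login(user, password)
    try:
        existing = {m["name"] for m in api.get_monitors()}
        for m in missing_monitors(wanted, existing):
            api.add_monitor(type=MonitorType.HTTP, name=m["name"], url=m["url"])
    finally:
        api.disconnect()
```

Re-running the sync is then idempotent: only monitors missing from the instance get created.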
| bonobocop wrote:
| Quite like Cloudprober for this tbh:
| https://cloudprober.org/docs/how-to/alerting/
|
| Easy to configure, easy to extend with Go, and slots in to
| alerting.
| jauntywundrkind wrote:
| There's an article-bias towards rejectionism, towards single-shot
| adventures. "I didn't grok so-and-so and here are the shell
| scripts I wrote instead".
|
| Especially for home cloud, home ops, home labs: that's great!
| It's awesome that you did it for yourself, and that you wrote up
| your experience.
|
| But in general I feel like there's a huge missing middle of
| operations & sys-admin-ery that creates a distorted, weird
| narrative. There are few people out there starting their journey
| with Prometheus and blogging helpfully through it. There are few
| people midway through their k8s work talking about their
| challenges and victories. The tales of just muddling through, of
| perseverance, of looking for information, of trying to find
| signal through the noise, are few.
|
| What we get a lot of is "this was too much for me so I wrote my
| own thing instead". Or, "we have been doing such and such for
| years and found such and such to shave 20% compute" or "we needed
| this capability so added Z to our k8s cluster like so". The
| journey is so often missing, we don't have stories of trying &
| learning. We have stories like this of making.
|
| There's such a background of 'too complex' that I really worry
| leads us spiritually astray. I'm happy for articles like this -
| it's awesome to see ingenuity on display - but there are so many
| good, amazing, robust tools out there that seem to have lots of
| people happily, or at least adequately, using them. Yet it feels
| like the stories of turning back from the attempt, of eschewing
| the battle-tested, widely adopted software, drive so much of the
| narrative and have so much more ink spilled over them.
|
| Very thankful for the Flix language putting Rich Hickey's
| principle of _Simple isn't Easy_ first, for helping re-orient me
| by the axis of Hickey's old grand guidance. I feel like there's
| such a loud clamor generally for _easy_, for scripts you throw
| together, for the intimacy of tiny systems. And I admire a lot of
| these principles! But I also think there's a horrible
| backwardsness that doesn't help, that drives us away from more
| comprehensive, capable, integrative systems that can do amazing
| things, that are scalable both performance-wise (as Prometheus
| certainly is) and organizationally (that other people and other
| experts will also lastingly use and build from). The preselection
| for _easy_ is attainable individually quickly, but real _simple_
| requires vastly more, so much more thought and planning and
| structure.
| https://www.infoq.com/presentations/Simple-Made-Easy/
|
| It's so weird to find myself such a Cathedral-but-open-source fan
| today. Growing up, the Bazaar model made such sense, had such
| virtue to it. And I still believe in the Bazaar, in the wide
| world teeming with different softwares. But I worry what lessons
| are most visible, worry what we pass along, worry about the
| proliferation of software discontent against the really good open
| source software that we do collaborate together on en masse. It
| feels like there's a massive self-sabotage going on, that so many
| people are radicalized and sold a story of discontent against
| bigger, more robust, more popular open source software. I'd love
| to hear that view so much, but I want smaller individuals and
| voices also making a chorus of happy noise about how far they
| get, how magical, how powerful it is that we have so many amazing
| bigger open source projects that so scalably enable so much.
| https://en.m.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar
| cgriswald wrote:
| This is sort of a ramble, so I apologize in advance.
|
| I love the idea of writing up my ultimately-successful
| experiences of using open source software. I'm currently
| working on a big (for me anyway) project for my homelab
| involving a bunch of stuff I've either never or rarely done
| before. But... if I were to write about my experiences, a lot
| of it would be "I'm an idiot and I spent two hours with a valid
| but bad config because I misunderstood what the documentation
| was telling me about the syntax and yeah, I learned a bit more
| about reading the log file for X, but that was fundamentally
| pointless because it didn't really answer the question." I'd
| also have to keep track of what I did that didn't work, which
| adds a lot more work than just keeping track of what did work.
|
| There's also a social aspect there where I don't want to
| necessarily reveal the precise nature of my idiocy to strangers
| over the internet. This might be the whole thing here for a lot
| of people. "Look at this awesome script I made because I'm a
| rugged and capable individualist" is probably an easier self-
| sell than "Despite my best efforts, I managed to scrounge
| together a system that works using pieces made by people
| smarter than me."
|
| I think I might try. My main concern is whether it will ruin
| the fun. When I set up Prometheus, I had a lot of fun, even
| through the mistakes. But, would also trying to write about it
| make it less fun? Would other people even be interested in a
| haphazard floundering equivalent to reading about someone's
| experience with a homework assignment? Would I learn more?
| Would the frustrating moments be worse or would the process of
| thinking through things (because I am going to write about it)
| lead to my mistakes becoming apparent earlier? Will my ego
| survive people judging my process, conclusions, _and_ writing?
| I don't know. Maybe it'll be fun to find out.
| reboot81 wrote:
| Just love https://healthchecks.io - I set it up on all my boxes
| with these scripts:
| win: https://github.com/reboot81/healthchecks_service_ps/
| macos: https://github.com/reboot81/hc_check_maker_macos
| linux: https://healthchecks.io/docs/bash/
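The linked bash docs boil down to pinging a per-check URL after each job; a Python equivalent of that pattern (the check UUID is a placeholder) might look like:

```python
import subprocess
import urllib.request

PING_URL = "https://hc-ping.com/your-check-uuid"  # placeholder UUID

def ping_url(base, ok):
    """healthchecks.io convention: plain URL = success, '/fail' suffix = failure."""
    return base if ok else base + "/fail"

def run_and_report(cmd, base=PING_URL):
    """Run a job (e.g. a backup), then tell healthchecks.io how it went."""
    ok = subprocess.run(cmd).returncode == 0
    urllib.request.urlopen(ping_url(base, ok), timeout=10)
    return ok

# Example: run_and_report(["rsync", "-a", "/data", "/mnt/backup"])
```

If no ping arrives within the check's grace period, healthchecks.io alerts you, which covers the "did my backup even run" case.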
| danesparza wrote:
| Uptime Kuma (https://github.com/louislam/uptime-kuma). With email
| notifications. So much simpler, and free.
| seriocomic wrote:
| Love this - tried it. The problem as I see it is that these
| still require hosting, and (again, as I see it) a self-hosted
| script that monitors internal/homelab things also requires its
| own monitoring.
|
| Short of paying for a service (which somewhat goes against the
| grain of trying to host all your own stuff), the closest I can
| come up with is relying on a service outside your network that
| has access to your network (via a tunnel/vpn).
|
| Given that a lot of my own networking set-up (DNS/domains/tunnels,
| etc.) is already managed via Cloudflare, I'm thinking of using
| some compute at that layer to provide a monitoring service.
| Probably something to throw next at my new LLM developer...
| justusthane wrote:
| I've been facing a similar search for an ultra-simple but ultra-
| extensible monitoring solution for my homelab. I've had the idea
| to write a Python program where the main script is just
| responsible for scheduling and executing the checks, logging, and
| alerting based on set thresholds.
|
| All monitoring would be handled via plugins, which would be
| extremely easy to write.
|
| It would ship with a few core plugins (ping, http, cert check,
| maybe snmp), but you could easily write a plugin to monitor
| anything else -- for example, you could use the existing Python
| Minecraft library and write a plugin to monitor your Minecraft
| server. Or maybe even the ability to write plugins in any
| language, not just Python.
|
| I'm not a developer and I'm opposed to vibe coding, so it'll be
| slow going :)
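A minimal sketch of the scheduler-plus-plugins idea, assuming checks are plain callables registered by a decorator (all names here are illustrative, not from the comment):

```python
# Plugin registry: a check is any callable returning (ok, detail).
PLUGINS = {}

def plugin(name):
    """Decorator that registers a check function under a name."""
    def register(fn):
        PLUGINS[name] = fn
        return fn
    return register

@plugin("ping-localhost")
def ping_localhost():
    # Stub; a real plugin might shell out to ping or open a socket.
    return True, "ok"

def run_once(alert=print):
    """Run every registered plugin once, alerting on any failure."""
    results = {}
    for name, fn in PLUGINS.items():
        try:
            ok, detail = fn()
        except Exception as exc:  # a broken plugin must not kill the loop
            ok, detail = False, repr(exc)
        results[name] = (ok, detail)
        if not ok:
            alert(f"CHECK FAILED: {name}: {detail}")
    return results

# The main loop would just call run_once() on a schedule, forever.
```

Plugins in other languages could then be a special case of this: one built-in plugin that executes an external command and treats a non-zero exit code as a failure, which is essentially the nagios check contract.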
| Spooky23 wrote:
| I'm using node exporter to Prometheus and grafana. I also use
| uptime kuma, and send alerts via pushover.
|
| It's shockingly easy to set up. I have the monitoring stack
| living on a GCP host that I set up for various things, connected
| via tailscale.
|
| It actually paid for itself by alerting me to low voltage events
| via NUT. I probably would have lost some gear to poor electrical
| conditions.
___________________________________________________________________
(page generated 2025-07-13 23:00 UTC)