[HN Gopher] Observability's past, present, and future
___________________________________________________________________
Observability's past, present, and future
Author : shcallaway
Score : 41 points
Date : 2026-01-05 16:34 UTC (6 hours ago)
(HTM) web link (blog.sherwoodcallaway.com)
(TXT) w3m dump (blog.sherwoodcallaway.com)
| buchanae wrote:
| I share a lot of this sentiment, although I struggle more with
| the setup and maintenance than the diagnosis.
|
| It's baffling to me that it can still take _so_much_work_ to set
| up a good baseline of observability (not to mention the time we
| spend on tweaking alerting). I recently spent an inordinate
| amount of time trying to make sense of our telemetry setup and
| fill in the gaps. It took weeks. We had data in many systems,
| many different instrumentation frameworks (all stepping on each
| other), noisy alerts, etc.
|
| Part of my problem is that the ecosystem is big. There's too much
| to learn: OpenTelemetry, OpenTracing, Zipkin, Micrometer, eBPF,
| auto-instrumentation, OTel SDK vs Datadog Agent, and on and on. I
| don't know, maybe I'm biased by the JVM-heavy systems I've been
| working in.
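|
| (Even the "simple" path isn't trivial. Here's a sketch of a
| minimal manual OTel SDK setup, in Python rather than a JVM
| language for brevity; the endpoint is a placeholder, and this is
| before any metrics, logs, or alerting get wired up.)
|
|     from opentelemetry import trace
|     from opentelemetry.sdk.trace import TracerProvider
|     from opentelemetry.sdk.trace.export import BatchSpanProcessor
|     from opentelemetry.exporter.otlp.proto.grpc.trace_exporter \
|         import OTLPSpanExporter
|
|     # Point a tracer provider at an OTLP collector endpoint
|     provider = TracerProvider()
|     provider.add_span_processor(BatchSpanProcessor(
|         OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
|     trace.set_tracer_provider(provider)
|
|     # One span around one unit of work
|     tracer = trace.get_tracer("checkout")
|     with tracer.start_as_current_span("charge-card"):
|         pass  # business logic goes here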
|
| I worked for New Relic for years, and even at an observability
| company, our own observability was a lot of work to maintain, and
| even then traces were not heavily used.
|
| I can definitely imagine having Claude debug an issue faster than
| I can type and click around dashboards and query UIs. That sounds
| fun.
| pphysch wrote:
| > Part of my problem is that the ecosystem is big. There's too
| much to learn: OpenTelemetry, OpenTracing, Zipkin, Micrometer,
| eBPF, auto-instrumentation, OTel SDK vs Datadog Agent, and on
| and on. I don't know, maybe I'm biased by the JVM-heavy systems
| I've been working in.
|
| We've had success keeping things simple with the VictoriaMetrics
| stack, and avoiding what we perceive as unnecessary complexity
| in some of the fancier tools/standards.
| tech_ken wrote:
| > Observability made us very good at producing signals, but only
| slightly better at what comes after: interpreting them,
| generating insights, and translating those insights into
| reliability.
|
| I'm a data professional who's kind of SRE-adjacent for a big
| corpo's infra arm, and wow does this post ring true for me. I'm
| tempted to just say "well duh, producing telemetry was always the
| low-hanging fruit; it's the 'generating insights' part that's
| truly hard", but I think that's too pithy. My more reflective
| take is that generating reliability from data lives in a weird
| hybrid space of domain knowledge and data management, and most
| orgs' headcount strategies don't account for this. SWEs pretend
| that data scientists are just SQL jockeys minutes from being
| replaced by an LLM agent; data scientists pretend that stats is
| the only "hard" thing and that all domain knowledge can be
| learned with sufficient motivation and documentation. In reality
| I think both are equally hard, that it's rare to find someone who
| can do both, and that doing both is really what's required for
| true "observability".
|
| At a high level I'd say there are three big areas where orgs (or
| at least my org) tend to fall short:
|
| * extremely sound data engineering and org-wide normalization (to
| support correlating diverse signals from highly disparate sources
| during root-cause analysis; a toy sketch follows this list)
|
| * telemetry that's truly capable of capturing the problem (i.e.,
| it's not helpful to monitor disk usage if CPU is the bottleneck)
|
| * true 'sleuths' who understand how to leverage the first two
| things to produce insights, and have the org-wide clout to get
| those insights turned into action
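|
| (A toy sketch of what the first point buys you, with hypothetical
| signal shapes: once every source is normalized onto a shared join
| key like a trace id, correlation is a query rather than an
| archaeology project.)
|
|     from collections import defaultdict
|
|     # Two "systems" emitting telemetry, already normalized to
|     # carry the same trace_id field (the hard org-wide part)
|     app_logs = [{"trace_id": "t1", "level": "ERROR",
|                  "msg": "payment timeout"}]
|     db_metrics = [{"trace_id": "t1", "lock_wait_ms": 950}]
|
|     by_trace = defaultdict(dict)
|     for row in app_logs:
|         by_trace[row["trace_id"]]["log"] = row
|     for row in db_metrics:
|         by_trace[row["trace_id"]]["db"] = row
|
|     # Root-cause candidates: errors that coincide with lock waits
|     for tid, signals in by_trace.items():
|         if "log" in signals and "db" in signals:
|             print(tid, signals["log"]["msg"], "after",
|                   signals["db"]["lock_wait_ms"], "ms lock wait")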
|
| I think most orgs tend to pick two of these, and cheap out on the
| third, and the result is what you describe in your post. Maybe
| they have some rockstar engineers who understand how to overcome
| the data ecosystem shortcomings to produce a root-cause analysis,
| or maybe they pay through the nose for some telemetry/dashboard
| platform that they then hand over to contract workers who brute-
| force reliability through tons of work hours. Even when they do
| create dedicated reliability teams, it seems like they are more
| often than not hamstrung by not having any leverage with the
| people who actually build the product. And when everything is a
| distributed system it might actually be 5 or 6 teams who you have
| no leverage with, so even if you win over 1 or 2 critical POCs
| you're left with an incomplete patchwork of telemetry systems
| which meet the owning team's (teams') needs and nothing else.
|
| All this to say that I think reliability is still ultimately an
| incentive problem. You can have the best observability tooling in
| the world, but if you don't have folks at every level of the org
| who (a) understand what 'reliable' concretely looks like for your
| product and (b) have the power to effect the necessary changes,
| then you're going to get a lot of churn with little benefit.
| ghaff wrote:
| It's a long-running topic in a lot of areas. I remember back
| when data warehousing was the hot thing: collecting and cleaning
| all this data was supposed to be the key to insights that would
| unlock juicy profits. It basically didn't happen.
| vrnvu wrote:
| First: love that more tools like Honeycomb (amazing) are popping
| up in the space. I agree with the post.
|
| But IMO, statistics and probability can't be replaced with
| tooling, just as software engineering can't be replaced with
| no-code services for building applications...
|
| If you need to profile some bug or troubleshoot complex systems
| (distributed systems, databases), you must do your math homework
| consistently as part of the job.
|
| If you don't understand the distribution of your data, its
| seasonality, and noise vs. signal, how can you measure anything
| valuable? How can you ask the right questions?
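|
| (A toy illustration with made-up numbers: a mean can look healthy
| while the tail is on fire.)
|
|     import random
|     import statistics
|
|     random.seed(0)
|     # 98% fast requests plus a 2% slow path (hypothetical, in ms)
|     latencies = ([random.gauss(50, 5) for _ in range(980)]
|                  + [random.gauss(2000, 200) for _ in range(20)])
|
|     q = statistics.quantiles(latencies, n=100)
|     print(f"mean={statistics.mean(latencies):.0f}ms "
|           f"p50={q[49]:.0f}ms p99={q[98]:.0f}ms")
|     # mean comes out near 89ms and p50 near 50ms, but p99 is in
|     # the ~2000ms cluster: the signal the mean averaged away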
| Veserv wrote:
| Of course that sucks. Just enable full time-travel recording in
| production and then you can use a standard multi-program trace
| visualizer and time-travel debugger to identify the exact
| execution down to the instruction and precisely identify root
| causes in the code.
|
| Everything is then instrumented automatically and exhaustively
| analyzable using standard tools. At most you might need to add in
| some manual instrumentation to indicate semantic layers, but even
| that can frequently be done after the fact with automated search
| and annotation on the full (instruction-level) recording.
| esafak wrote:
| We need more automation. Less data, more insight. We're at the
| firehose stage, and nobody's got time for that. ML-based anomaly
| detection is not widespread, and automated root-cause analysis
| barely exists. We'll have solved the problem when AI detects the
| problem and submits the bug fix before the engineers wake up.
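|
| (A minimal sketch of the detection half, with the window size and
| threshold as arbitrary assumptions: flag points that land far
| outside a rolling baseline.)
|
|     from collections import deque
|     import statistics
|
|     def anomalies(series, window=60, threshold=3.0):
|         """Yield (index, value) for points more than `threshold`
|         stddevs away from the mean of the prior `window` points."""
|         recent = deque(maxlen=window)
|         for i, x in enumerate(series):
|             if len(recent) == window:
|                 mu = statistics.mean(recent)
|                 # `or 1e-9` guards flat windows with zero stddev
|                 sigma = statistics.pstdev(recent) or 1e-9
|                 if abs(x - mu) / sigma > threshold:
|                     yield i, x
|             recent.append(x)
|
| A real system needs seasonality-aware baselines on top of this,
| which is exactly the statistics point made upthread.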
| antsou wrote:
| Observability and APM (application performance monitoring) are
| way, way older than this simplistic post depicts!
| camel_gopher wrote:
| 2006 - Bryan Cantrill publishes this work on software
| observability: https://queue.acm.org/detail.cfm?id=1117401
|
| 2015 - Ben Sigelman (one of the Dapper folks) cofounds Lightstep
___________________________________________________________________
(page generated 2026-01-05 23:01 UTC)