[HN Gopher] Observability's past, present, and future
___________________________________________________________________
Observability's past, present, and future
Author : shcallaway
Score : 41 points
Date : 2026-01-05 16:34 UTC (6 hours ago)
(HTM) web link (blog.sherwoodcallaway.com)
(TXT) w3m dump (blog.sherwoodcallaway.com)
| buchanae wrote:
| I share a lot of this sentiment, although I struggle more with
| the setup and maintenance than the diagnosis.
|
| It's baffling to me that it can still take _so_much_work_ to set
| up a good baseline of observability (not to mention the time we
| spend on tweaking alerting). I recently spent an inordinate
| amount of time trying to make sense of our telemetry setup and
| fill in the gaps. It took weeks. We had data in many systems,
| many different instrumentation frameworks (all stepping on each
| other), noisy alerts, etc.
|
| Part of my problem is that the ecosystem is big. There's too much
| to learn: OpenTelemetry, OpenTracing, Zipkin, Micrometer, eBPF,
| auto-instrumentation, OTel SDK vs Datadog Agent, and on and on. I
| don't know, maybe I'm biased by the JVM-heavy systems I've been
| working in.
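|
| (Even the "simple" path isn't trivial. Here's a sketch of a
| minimal manual OTel SDK setup, in Python rather than a JVM
| language for brevity; the endpoint is a placeholder, and this is
| before any metrics, logs, or alerting get wired up.)
|
|     from opentelemetry import trace
|     from opentelemetry.sdk.trace import TracerProvider
|     from opentelemetry.sdk.trace.export import BatchSpanProcessor
|     from opentelemetry.exporter.otlp.proto.grpc.trace_exporter \
|         import OTLPSpanExporter
|
|     # Point a tracer provider at an OTLP collector endpoint
|     provider = TracerProvider()
|     provider.add_span_processor(BatchSpanProcessor(
|         OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
|     trace.set_tracer_provider(provider)
|
|     # One span around one unit of work
|     tracer = trace.get_tracer("checkout")
|     with tracer.start_as_current_span("charge-card"):
|         pass  # business logic goes here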
|
| I worked for New Relic for years, and even at an observability
| company, our own observability was a lot of work to maintain, and
| even then traces were not heavily used.
|
| I can definitely imagine having Claude debug an issue faster than
| I can type and click around dashboards and query UIs. That sounds
| fun.
| pphysch wrote:
| > Part of my problem is that the ecosystem is big. There's too
| much to learn: OpenTelemetry, OpenTracing, Zipkin, Micrometer,
| eBPF, auto-instrumentation, OTel SDK vs Datadog Agent, and on
| and on. I don't know, maybe I'm biased by the JVM-heavy systems
| I've been working in.
|
| We've had success keeping things simple with the VictoriaMetrics
| stack, and avoiding what we perceive as unnecessary complexity
| in some of the fancier tools/standards.
| tech_ken wrote:
| > Observability made us very good at producing signals, but only
| slightly better at what comes after: interpreting them,
| generating insights, and translating those insights into
| reliability.
|
| I'm a data professional who's kind of SRE-adjacent for a big
| corpo's infra arm, and wow does this post ring true for me. I'm
| tempted to just say "well duh, producing telemetry was always the
| low-hanging fruit; it's the 'generating insights' part that's
| truly hard", but I think that's too pithy. My more reflective
| take is that generating reliability from data lives in a weird
| hybrid space of domain knowledge and data management, and most
| orgs' headcount strategies don't account for this. SWEs pretend
| that data scientists are just SQL jockeys minutes from being
| replaced by an LLM agent; data scientists pretend that stats is
| the only "hard" thing and that all domain knowledge can be
| learned with sufficient motivation and documentation. In reality
| I think both are equally hard, that it's rare to find someone who
| can do both, and that doing both is really what's required for
| true "observability".
|
| At a high level I'd say there are three big areas where orgs (or
| at least my org) tend to fall short:
|
| * extremely sound data engineering and org-wide normalization (to
| support correlating diverse signals from highly disparate sources
| during root-cause analysis; a toy sketch follows this list)
|
| * telemetry that's truly capable of capturing the problem (i.e.,
| it's not helpful to monitor disk usage if CPU is the bottleneck)
|
| * true 'sleuths' who understand how to leverage the first two
| things to produce insights, and have the org-wide clout to get
| those insights turned into action
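|
| (A toy sketch of what the first point buys you, with hypothetical
| signal shapes: once every source is normalized onto a shared join
| key like a trace id, correlation is a query rather than an
| archaeology project.)
|
|     from collections import defaultdict
|
|     # Two "systems" emitting telemetry, already normalized to
|     # carry the same trace_id field (the hard org-wide part)
|     app_logs = [{"trace_id": "t1", "level": "ERROR",
|                  "msg": "payment timeout"}]
|     db_metrics = [{"trace_id": "t1", "lock_wait_ms": 950}]
|
|     by_trace = defaultdict(dict)
|     for row in app_logs:
|         by_trace[row["trace_id"]]["log"] = row
|     for row in db_metrics:
|         by_trace[row["trace_id"]]["db"] = row
|
|     # Root-cause candidates: errors that coincide with lock waits
|     for tid, signals in by_trace.items():
|         if "log" in signals and "db" in signals:
|             print(tid, signals["log"]["msg"], "after",
|                   signals["db"]["lock_wait_ms"], "ms lock wait")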
|
| I think most orgs tend to pick two of these, and cheap out on the
| third, and the result is what you describe in your post. Maybe
| they have some rockstar engineers who understand how to overcome
| the data ecosystem shortcomings to produce a root-cause analysis,
| or maybe they pay through the nose for some telemetry/dashboard
| platform that they then hand over to contract workers who brute-
| force reliability through tons of work hours. Even when they do
| create dedicated reliability teams, it seems like they are more
| often than not hamstrung by not having any leverage with the
| people who actually build the product. And when everything is a
| distributed system it might actually be 5 or 6 teams who you have
| no leverage with, so even if you win over 1 or 2 critical POCs
| you're left with an incomplete patchwork of telemetry systems
| which meet the owning team's (teams') needs and nothing else.
|
| All this to say that I think reliability is still ultimately an
| incentive problem. You can have the best observability tooling in
| the world, but if you don't have folks at every level of the org
| who (a) understand what 'reliable' concretely looks like for your
| product and (b) have the power to effect the necessary changes,
| then you're going to get a lot of churn with little benefit.
| ghaff wrote:
| It's a long-running topic in a lot of areas. I remember back
| when data warehousing was the hot thing: collecting and cleaning
| all this data was supposed to be the key to insights that would
| unlock juicy profits. It basically didn't happen.
| vrnvu wrote:
| First: love that more tools like Honeycomb (amazing) are popping
| up in the space. I agree with the post.
|
| But IMO, statistics and probability can't be replaced with
| tooling, just as software engineering can't be replaced with
| no-code services for building applications...
|
| If you need to profile some bug or troubleshoot complex systems
| (distributed systems, databases), you must do your math homework
| consistently as part of the job.
|
| If you don't understand the distribution of your data, its
| seasonality, and noise vs. signal, how can you measure anything
| valuable? How can you ask the right questions?
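|
| (A toy illustration with made-up numbers: a mean can look healthy
| while the tail is on fire.)
|
|     import random
|     import statistics
|
|     random.seed(0)
|     # 98% fast requests plus a 2% slow path (hypothetical, in ms)
|     latencies = ([random.gauss(50, 5) for _ in range(980)]
|                  + [random.gauss(2000, 200) for _ in range(20)])
|
|     q = statistics.quantiles(latencies, n=100)
|     print(f"mean={statistics.mean(latencies):.0f}ms "
|           f"p50={q[49]:.0f}ms p99={q[98]:.0f}ms")
|     # mean comes out near 89ms and p50 near 50ms, but p99 is in
|     # the ~2000ms cluster: the signal the mean averaged away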
| Veserv wrote:
| Of course that sucks. Just enable full time-travel recording in
| production and then you can use a standard multi-program trace
| visualizer and time-travel debugger to identify the exact
| execution down to the instruction and precisely identify root
| causes in the code.
|
| Everything is then instrumented automatically and exhaustively
| analyzable using standard tools. At most you might need to add in
| some manual instrumentation to indicate semantic layers, but even
| that can frequently be done after the fact with automated search
| and annotation on the full (instruction-level) recording.
| esafak wrote:
| We need more automation. Less data, more insight. We're at the
| firehose stage, and nobody's got time for that. ML-based anomaly
| detection is not widespread, and automated root-cause analysis
| barely exists. We'll have solved the problem when AI detects the
| problem and submits the bug fix before the engineers wake up.
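|
| (A minimal sketch of the detection half, with the window size and
| threshold as arbitrary assumptions: flag points that land far
| outside a rolling baseline.)
|
|     from collections import deque
|     import statistics
|
|     def anomalies(series, window=60, threshold=3.0):
|         """Yield (index, value) for points more than `threshold`
|         stddevs away from the mean of the prior `window` points."""
|         recent = deque(maxlen=window)
|         for i, x in enumerate(series):
|             if len(recent) == window:
|                 mu = statistics.mean(recent)
|                 # `or 1e-9` guards flat windows with zero stddev
|                 sigma = statistics.pstdev(recent) or 1e-9
|                 if abs(x - mu) / sigma > threshold:
|                     yield i, x
|             recent.append(x)
|
| A real system needs seasonality-aware baselines on top of this,
| which is exactly the statistics point made upthread.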
| antsou wrote:
| Observability and APM (application performance monitoring) are
| way, way older than this simplistic post depicts!
| camel_gopher wrote:
| 2006 - Bryan Cantrill publishes this work on software
| observability: https://queue.acm.org/detail.cfm?id=1117401
|
| 2015 - Ben Sigelman (one of the Dapper folks) cofounds Lightstep
___________________________________________________________________
(page generated 2026-01-05 23:01 UTC)