[HN Gopher] Is it time to version observability?
       ___________________________________________________________________
        
       Is it time to version observability?
        
       Author : RyeCombinator
       Score  : 56 points
       Date   : 2024-08-08 04:58 UTC (1 days ago)
        
 (HTM) web link (charity.wtf)
 (TXT) w3m dump (charity.wtf)
        
       | datadrivenangel wrote:
       | So the core idea is to move to arbitrarily wide logs?
       | 
       | Seems good in theory, except in practice it just defers the pain
       | to later, like schema on read document databases.
        
       | flockonus wrote:
       | > Y'all, Datadog and Prometheus are the last, best metrics-backed
       | tools that will ever be built. You can't catch up to them or beat
       | them at that; no one can. Do something different. Build for the
       | next generation of software problems, not the last generation.
       | 
       | Heard a very similar thing from Plenty Of Fish creator in 2012, I
       | unfortunately believed him; "the dating space was solved". Turns
       | out it never was, and like every space, solutions will keep on
       | changing.
        
         | abeppu wrote:
         | ... is that a good example? I think people who use the dating
         | apps today mostly hate them, and find that they have misaligned
         | incentives and/or encourage poor behavior. There's been
         | generation of other services that shift in popularity (and
         | network effects mean that lots of people shift) but I'm not
         | convinced that this has ever involved delivering a better
         | solution.
        
           | michaelt wrote:
           | People mostly hate observability tooling too.
           | 
           | The point isn't that people like or dislike - it's that the
           | fact a system someone in the industry tells you isn't worth
           | even trying to compete with might be replaced a handful of
           | years later.
        
             | abeppu wrote:
             | The post author didn't claim that no one else would make
             | money or attract customers in metrics-backed tools after
             | Datadog and Prometheus -- but that they were the last and
             | best. The "at that" in "You can't catch up to them or beat
             | them at that" seems pretty clearly about "best", i.e.
             | quality of the solution.
             | 
             | I claim that the in the intervening decade, dating apps
             | have _changed_ but not gotten better which suggests to me
             | that the Plenty of Fish person may have been right, and
             | this example is not convincingly making the point that
             | flockonus wants to make.
        
           | bloodyplonker22 wrote:
           | Indeed. I hate to say this, but most people hate the dating
           | apps because they're ugly. The top 10% are getting all the
           | dates on these apps and the rest are left with endless
           | swiping and only likes from scammers, bots, and pig
           | butcherers. Trust me, I know because I'm ugly.
        
         | suyash wrote:
         | I'll put InfluxDB right up there as well.
        
         | phillipcarter wrote:
         | IMO dating apps didn't evolve because they got better at
         | matchmaking based on stated preferences in a profile (something
         | Plenty of Fish nailed quite well!), they shifted the paradigm
         | towards swiping on pictures and an evolving matchmaking
         | algorithm based on user interactions within the app.
         | 
         | This is _sort of_ what the article is getting at. For the
         | purposes of gathering, aggregating, sending, and analyzing a
         | bunch of metrics, you 'll be hard-pressed to beat Datadog at
         | this game. They're _extremely_ good at this and, by virtue of
         | having many teams with tons of smart people on them, have
         | figured out many of the best ways to squeeze as much analysis
         | value as you can with this kind of data. The post is arguing
         | that better observability demands a paradigm shift away from
         | metrics as the source of truth for things, and with that, many
         | more possibilities open up.
        
       | archenemybuntu wrote:
       | Id gonna break a nerve and say most orgs overengineer
       | observability. There's the whole topology of otel tools,
       | Prometheus tools and bunch of Long term storage / querying
       | solutions. Very complicated tracing setups. All these are fine if
       | you have a team for maintaining observability only. But your avg
       | product development org can sacrifice most of it and do with
       | proper logging with a request context, plus some important
       | service level metrics + grafana + alarms.
       | 
       | Problem with all these above tools is that, they all seem like
       | essential features to have but once you have the whole topology
       | of 50 half baked CNCF containers set up in "production" shit
       | starts to break in very mysterious ways and also these
       | observability products tend to cost a lot.
        
         | datadrivenangel wrote:
         | The ratio of 'metadata' to data is often hundreds or thousands
         | to one, which translates to cost, especially if you're a using
         | a licensed service. I've been at companies where the analytics
         | and observability costs are 20x the actual cost of the
         | application for cloud hosting. Datadog seems to have switched
         | to revenue extraction in a way that would make oracle proud.
        
           | lincolnq wrote:
           | Is that 20x cost... actually bad though? (I mean, I know
           | _Datadog_ is bad. I used to use it and I hated its cost
           | structure.)
           | 
           | But maybe it's worth it. or at least, the good ones would be
           | worth it. I can imagine great metadata (and platforms to
           | query and explore it) saves more engineering time than it
           | costs in server time. So to me this ratio isn't that
           | material, even though it looks a little weird.
        
             | never_inline wrote:
             | I would be curious to know, what's the ratio of AWS bill to
             | programmer salary in J random grocery delivery startup.
        
             | ElevenLathe wrote:
             | The trouble is that the o11y costs in developer time too.
             | I've seen both traps:
             | 
             | Trap 1: "We MUST have PERFECT information about EVERY
             | request and how it was serviced, in REALTIME!"
             | 
             | This is bad because it ends up being hella expensive, both
             | in engineering time and in actual server (or vendor) bills.
             | Yes, this is what we'd want if cost were no object, but it
             | sometimes actually is an object, even for very important or
             | profitable systems.
             | 
             | Trap 2: "We can give customer support our pager number so
             | they can call us if somebody complains."
             | 
             | This is bad because you're letting your users suffer errors
             | that you could have easily caught and fixed for relatively
             | cheap.
             | 
             | There is diminishing returns with this stuff, and a lot of
             | the calculus depends on the nature of your application,
             | your relationship with consumers of it, your business
             | model, and a million other factors.
        
         | fishtoaster wrote:
         | For what it's worth, I found it almost trivial to set up open
         | telemetry and point it honeycomb. It took me an afternoon about
         | a month ago for a medium-sized python web-app. I've found that
         | I can replace a lot of tooling and manual work needed in the
         | past. At previous startups it's usually like
         | 
         | 1. Set up basic logging (now I just use otel events)
         | 
         | 2. Make it structured logging (Get that for free with otel
         | events)
         | 
         | 3. Add request contexts that's sent along with each log (Also
         | free with otel)
         | 
         | 4. Manually set up tracing ids in my codebase and configure it
         | in my tooling (all free with otel spans)
         | 
         | Really, I was _expecting_ to wind up having to get really into
         | the new observability philosophy to get value out of it, but I
         | found myself really loving this setup with minimal work and
         | minimal koolade-drinking. I 'll probably do something like this
         | over "logs, request context, metrics, and alarms" at future
         | startups.
        
           | JoshTriplett wrote:
           | I've currently done this, and I'm seriously considering
           | _undoing_ it in favor of some other logging solution. My
           | biggest reason: OpenTelemetry fundamentally doesn 't handle
           | events that aren't part of a span, and doesn't handle spans
           | that don't close. So, if you crash, you don't get telemetry
           | to help you debug the crash.
           | 
           | I wish "span start" and "span end" were just independent
           | events, and OTel tools handled and presented unfinished spans
           | or events that don't appear within a span.
        
             | bamboozled wrote:
             | Isn't the problem here that your code is crashing and
             | you're relying on the wrong tool to help you solve that ?
        
               | JoshTriplett wrote:
               | Logging solves this problem. If OTel and observability is
               | attempting to position itself as a better alternative to
               | logging, it needs to solve the problems that logging
               | already solves. I'm not going to use completely separate
               | tools for logging and observability.
               | 
               | Also, "crash" here doesn't necessarily mean "segfault" or
               | equivalent. It can also mean "hang and not finish (and
               | thus not end the span)", or "have a network issue that
               | breaks the ability to submit observability data" (but
               | _after_ an event occurred, which could have been
               | submitted if OTel didn 't wait for spans to end first).
               | There are any number of reasons why a span might start
               | but not finish, most of which are bugs, and OTel and
               | tools built upon it provide zero help when debugging
               | those.
        
             | growse wrote:
             | Can you give an example of an event that's not part of a
             | span / trace?
        
               | Spivak wrote:
               | Unhandled exceptions is a pretty normal one. You get
               | kicked out to your app's topmost level and you lost your
               | span. My wishlist to solve this (and I actually wrote an
               | implementation in Python which leans heavily on
               | reflection) is to be able to attach arbitrary data to
               | stack frames and exceptions when they occur merge all the
               | data top-down and send it up to your handler.
               | 
               | Signal handlers are another one and are a whole other
               | beast simply because they're completely devoid of
               | context.
        
             | jononor wrote:
             | Can't you close the span on an exception?
        
               | JoshTriplett wrote:
               | See https://news.ycombinator.com/item?id=41205665 for
               | more details.
               | 
               | And even in the case of an actual crash, that doesn't
               | necessarily mean the application is in a state to
               | successfully submit additional OTel data.
        
       | amelius wrote:
       | I can't even run valgrind on many libraries and Python modules
       | because they weren't designed with valgrind in mind. Let's work
       | on observability before we version it.
        
       | jrockway wrote:
       | I like the wide log model. At work, we write software that
       | customers run for themselves. When it breaks, we can't exactly
       | ssh in and mutate stuff until it works again, so we need some
       | sort of information that they can upload to us. Logs are the
       | easiest way to do that, and because logs are a key part of our
       | product (batch job runner for k8s), we already have
       | infrastructure to store and retrieve logs. (What's built into k8s
       | is sadly inadequate. The logs die when the pod dies.)
       | 
       | Anyway, from this we can get metrics and traces. For traces, we
       | log the start and end of requests, and generate a unique ID at
       | the start. Server logging contexts have the request's ID.
       | Everything that happens for that request gets logged along with
       | the request ID, so you can watch the request transit the system
       | with "rg 453ca13b-aa96-4204-91df-316923f5f9ae" or whatever on an
       | unpacked debug dump, which is rather efficient at moderate scale.
       | For metrics, we just log stats when we know them; if we have some
       | io.Writer that we're writing to, it can log "just wrote 1234
       | bytes", and then you can post-process that into useful statistics
       | at whatever level of granularity you want ("how fast is the
       | system as a whole sending data on the network?", "how fast is
       | node X sending data on the network?", "how fast is request
       | 453ca13b-aa96-4204-91df-316923f5f9ae sending data to the
       | network?"). This doesn't scale quite as well, as a busy system
       | with small writes is going to write a lot of logs. Our metrics
       | package has per-context.Context aggregation, which cleans this up
       | without requiring any locking across requests like Prometheus
       | does.
       | https://github.com/pachyderm/pachyderm/blob/master/src/inter...
       | 
       | Finally, when I get tired of having 43 terminal windows open with
       | a bunch of "less" sessions over the logs, I hacked something
       | together to do a light JSON parse on each line and send the logs
       | to Postgres:
       | https://github.com/pachyderm/pachyderm/blob/master/src/inter....
       | It is slow to load a big dump, but the queries are surprisingly
       | fast. My favorite thing to do is the "select * from logs where
       | json->'x-request-id' = '453ca13b-aa96-4204-91df-316923f5f9ae'
       | order by time asc" or whatever. Then I don't have 5 different log
       | files open to watch a single request, it's just all there in my
       | psql window.
       | 
       | As many people will say, this analysis method doesn't scale in
       | the same way as something like Jaeger (which scales by deleting
       | 99% of your data) or Prometheus (which scales by throwing away
       | per-request information), but it does let you drill down as deep
       | as necessary, which is important when you have one customer that
       | had one bad request and you absolutely positively have to fix it.
       | 
       | My TL;DR is that if you're a 3 person team writing some software
       | from scratch this afternoon, "print" is a pretty good
       | observability stack. You can add complexity later. Just capture
       | what you need to debug today, and this will last you a very long
       | time. (I wrote the monitoring system for Google Fiber CPE
       | devices... they just sent us their logs every minute and we did
       | some very simple analysis to feed an alerting system; for
       | everything else, a quick MapReduce or dremel invocation over the
       | raw log lines was more than adequate for anything we needed to
       | figure out.)
        
       | Veserv wrote:
       | They do not appear to understand the fundamental difference
       | between logs, traces, and metrics. Sure, if you can log every
       | event you want to record, then everything is just events (I will
       | ignore the fact that they are still stuck on formatted text
       | strings as a event format). The difference is what do you do when
       | you can not record everything you want to either at build time or
       | runtime.
       | 
       | Logs are independent. When you can not store every event, you can
       | drop them randomly. You lose a perfect view of every logged
       | event, but you still retain a statistical view. As we have
       | already assumed you _can not_ log everything, this is the best
       | you can do anyways.
       | 
       | Traces are for correlated events where you want every correlated
       | event (a trace) or none of them (or possibly the first N in a
       | trace). Losing events within a trace makes the entire trace (or
       | at least the latter portions) useless. When you can not store
       | every event, you want to drop randomly at the whole trace level.
       | 
       | Metrics are for situations where you know you can not log
       | everything. You aggregate your data at log time, so instead of
       | getting a statistically random sample you instead get aggregates
       | that incorporate all of your data at the cost of precision.
       | 
       | Note that for the purposes of this post, I have ignored the
       | reason why you can not store every event. That is an orthogonal
       | discussion and techniques that relieve that bottleneck allow more
       | opportunities to stay on the happy path of "just events with
       | post-processed analysis" that the author is advocating for.
        
         | lukev wrote:
         | Yes, I'm quite sure the CTO of a leading observability platform
         | is simply confused about terminology.
        
           | pclmulqdq wrote:
           | It is not impossible that this is the case (at least in GP's
           | view). Companies in the space argue between logs and events
           | as structured or unstructured data and how much to exploit
           | that structure. Unstructured is the simple way, and appears
           | to be the approach that TFA prefers, while deep exploitation
           | of structured event collection actually appears to be better
           | for many technical reasons but is more complex.
           | 
           | From what I can tell, Honeycomb is staffed up with operators
           | (SRE types). The GP is thinking about logs, traces, and
           | metrics like a mathematician, and I am not sure that anyone
           | at Honeycomb actually thinks that way.
        
           | MrDarcy wrote:
           | To be fair, the first heading and section of TFA says no one
           | knows what the terminology means anymore.
           | 
           | Which, to GP's point, is kind of BS.
        
         | growse wrote:
         | > Logs are independent. When you can not store every event, you
         | can drop them randomly. You lose a perfect view of every logged
         | event, but you still retain a statistical view. As we have
         | already assumed you can not log everything, this is the best
         | you can do anyways.
         | 
         | Usually, every log message occurs as part of a process (no
         | matter how short lived), and so I'm not sure it's ever the case
         | that a given log is truly independent from others. "Sampling
         | log events" is a smell that indicates that you _will_ have
         | incomplete views on any given txn /process.
         | 
         | What I find works better is to always correlate logs to traces,
         | and when just drop the majority of non-error traces (keep 100%
         | of actionable error traces).
         | 
         | > When you can not store every event, you want to drop randomly
         | at the whole trace level.
         | 
         | Yes, this.
        
         | cmgriffing wrote:
         | I am no expert but my take was that in many cases, logs,
         | metrics, and traces are stored in individually queryable
         | tables/stores.
         | 
         | Their proposal is to have a single "Event" table/store with
         | fields for spans, traces, and metrics that can be sparsely
         | populated similar to a dynamodb row.
         | 
         | Again, I might have missed the point, though.
        
         | otterley wrote:
         | > what do you do when you can not record everything you want to
         | 
         | The thesis of the article is that you _should_ use events as
         | the source of truth, and derive logs, metrics, and traces from
         | them. I see nothing logically or spiritually wrong with that
         | fundamental approach and it 's been Honeycomb's approach since
         | Day 1.
         | 
         | I do feel like Charity is ignoring the elephant in the room of
         | transit and storage cost: "Logs are infinitely more powerful,
         | useful _and cost-effective_ than metrics. " Weeeelllllll....
         | 
         | Honeycomb has been around a while. If transit and storage were
         | free, using Honeycomb and similar solutions would be a no-
         | brainer. Storage is cheap, but probably not as cheap as it
         | ought to be in the cloud.[1] And certain kinds of transit are
         | still pretty pricey in the cloud. Even if you get transit for
         | free by keeping it local, using your primary network interface
         | for shipping events reduces the amount of bandwidth remaining
         | for the primary purpose of doing real work (i.e., handling
         | requests).
         | 
         | Plus, I think people are aware--even if they don't specifically
         | say so--that data that is processed, stored, and never used
         | again is waste. Since we can't have perfect prior knowledge
         | whether some data will be valuable later, the logical thing
         | course of action is to retain everything. But since doing so
         | has a cost, people will naturally tend towards trying to
         | capture as little as they can get away with yet still be able
         | to do their job.
         | 
         | [1] I work for AWS, but any opinions stated herein are mine and
         | not necessarily those of my colleagues or the company I work
         | for.
        
       | zellyn wrote:
       | A few questions:
       | 
       | a) You're dismissing OTel, but if you _do_ want to do flame
       | graphs, you need traces and spans, and standards (W3C Trace-
       | Context, etc.) to propagate them.
       | 
       | b) What's the difference between an "Event" and a "Wide Log with
       | Trace/Span attached"? Is it that you don't have to think of it
       | only in the context of traces?
       | 
       | c) Periodically emitting wide events for metrics, once you had
       | more than a few, would almost inevitably result in creating a
       | common API for doing it, which would end up looking almost just
       | like OTel metrics, no?
       | 
       | d) If you're clever, metrics histogram sketches can be combined
       | usefully, unlike adding averages
       | 
       | e) Aren't you just talking about storing a hell of a lot of data?
       | Sure, it's easy not to worry, and just throw anything into the
       | Wide Log, as long as you don't have to care about the storage.
       | But that's exactly that happens with every logging system I've
       | used. Is sampling the answer? Like, you still have to send all
       | the data, even from _very_ high QPS systems, so you can tail-
       | sample later after the 24 microservice graph calls all complete?
       | 
       | Don't get me wrong, my years-long inability to adequately and
       | clearly settle the simple theoretical question of "What's the
       | difference between a normal old-school log, and a log attached to
       | a trace/span, and which should I prefer?" has me biased towards
       | your argument :-)
        
       | xyzzy_plugh wrote:
       | I was excited by the title and thought that this was going to be
       | about versioning the observability contracts of services,
       | dashboards, alerts, etc., which are typically exceptionally
       | brittle. Boy am I disappointed.
       | 
       | I get what Charity is shouting. And Honeycomb is incredible. But
       | I think this framing overly simplifies things.
       | 
       | Let's step back and imagine everything emitted JSON only. No
       | other form of telemetry is allowed. This is functionally
       | equivalent to wide events albeit inherently flawed and
       | problematic as I'll demonstrate.
       | 
       | Every time something happens somewhere you emit an Event object.
       | You slurp these to a central place, and now you can count them,
       | connect them as a graph, index and search, compress, transpose,
       | etc. etc.
       | 
       | I agree, this works! Let's assume we build it and all the
       | necessary query and aggregation tools, storage, dashboards,
       | whatever. Hurray! But sooner or later you will have this problem:
       | a developer comes to you and says "my service is falling over"
       | and you'll look and see that for every 1 MiB of traffic it
       | receives, it also sends roughly 1 MiB of traffic, but it produces
       | 10 MiB of JSON Event objects. Possibly more. Look, this is a very
       | complex service, or so they tell you.
       | 
       | You smile and tell them "not a problem! We'll simply pre-
       | aggregate some of these events in the service and emit a periodic
       | summary." Done and done.
       | 
       | Then you find out there's a certain request that causes problems,
       | so you add more Events, but this also causes an unacceptable
       | amount of Event traffic. Not to worry, we can add a special flag
       | to only emit extra logs for certain requests, or we'll randomly
       | add extra logging ~5% of the time. That should do it.
       | 
       | Great! It all works. That's the end of this story, but the result
       | is that you've re-invented metrics and traces. Sure, logs -- or
       | "wide events" that are for the sake of this example the same
       | thing -- work well enough for almost everything, except of course
       | for all the places they don't. And now where they don't, you have
       | to reinvent all this _stuff_.
       | 
       | Metrics and traces solve these problems upfront in a way that's
       | designed to accommodate scaling problems before you suffer an
       | outage, without necessarily making your life significantly harder
       | along the way. At least that's the intention, regardless of
       | whether or not that's true in practice -- certainly not addressed
       | by TFA.
       | 
       | What's more is that in practice metrics and traces _today_ are in
       | fact _wide events_. They 're _metrics_ events, or _tracing_
       | events. It doesn 't really matter if a metric ends up scraped by
       | a Prometheus metrics page or emitted as a JSON log line. That's
       | besides the point. The point is they are fit for purpose.
       | 
       | Observability 2.0 doesn't fix this, it just shifts the problem
       | around. Remind me, how did we do things _before_ Observability
       | 1.0? Because as far as I can tell it 's strikingly similar in
       | appearance to Observability 2.0.
       | 
       | So forgive me if my interpretation of all of this is lipstick on
       | the pig that is Observability 0.1
       | 
       | And finally, I _get_ you _can_ make it work. Google certainly
       | gets that. But then they built Monarch anyways. Why? It 's worth
       | understanding if you ask me. Perhaps we should start by educating
       | the general audience on this matter, but then I'm guessing that
       | would perhaps not aid in the sale of a solution that eschews
       | those very learnings.
        
       | firesteelrain wrote:
       | It took me a bit to really understand the versioning angle and I
       | think I understand.
       | 
       | The blog discusses the idea of evolving observability practices,
       | suggesting a move from traditional methods (metrics, logs,
       | traces) to a new approach where structured log events serve as a
       | central, unified source of truth. The argument is that this shift
       | represents a significant enough change to be considered a new
       | version of observability, similar to how software is versioned
       | when it undergoes major updates. This evolution would enable more
       | precise and insightful software development and operations.
       | 
       | Unlike separate metrics, logs, and traces, structured log events
       | combine these data types into a single, comprehensive source,
       | simplifying analysis and troubleshooting.
       | 
       | Structured events capture more detailed context, making it easier
       | to understand the "why" behind system behavior, not just the
       | "what."
        
       | moomin wrote:
       | We came up with a buzzword to market our product. The industry
       | made this buzzword meaningless. Now we're coming up with a new
       | one. We're sure the same thing won't happen again.
        
       ___________________________________________________________________
       (page generated 2024-08-09 23:00 UTC)