[HN Gopher] Is it time to version observability?
___________________________________________________________________
Is it time to version observability?
Author : RyeCombinator
Score : 56 points
Date : 2024-08-08 04:58 UTC (1 day ago)
(HTM) web link (charity.wtf)
(TXT) w3m dump (charity.wtf)
| datadrivenangel wrote:
| So the core idea is to move to arbitrarily wide logs?
|
| Seems good in theory, except in practice it just defers the pain
| to later, like schema on read document databases.
| flockonus wrote:
| > Y'all, Datadog and Prometheus are the last, best metrics-backed
| tools that will ever be built. You can't catch up to them or beat
| them at that; no one can. Do something different. Build for the
| next generation of software problems, not the last generation.
|
| Heard a very similar thing from the Plenty of Fish creator in
| 2012, and I unfortunately believed him: "the dating space was
| solved". Turns
| out it never was, and like every space, solutions will keep on
| changing.
| abeppu wrote:
| ... is that a good example? I think people who use the dating
| apps today mostly hate them, and find that they have misaligned
| incentives and/or encourage poor behavior. There have been
| generations of other services that shift in popularity (and
| network effects mean that lots of people shift) but I'm not
| convinced that this has ever involved delivering a better
| solution.
| michaelt wrote:
| People mostly hate observability tooling too.
|
| The point isn't whether people like or dislike it - it's that
| a system someone in the industry tells you isn't worth even
| trying to compete with might be replaced a handful of years
| later.
| abeppu wrote:
| The post author didn't claim that no one else would make
| money or attract customers in metrics-backed tools after
| Datadog and Prometheus -- but that they were the last and
| best. The "at that" in "You can't catch up to them or beat
| them at that" seems pretty clearly about "best", i.e.
| quality of the solution.
|
| I claim that in the intervening decade, dating apps have
| _changed_ but not gotten better, which suggests to me
| that the Plenty of Fish person may have been right, and
| this example is not convincingly making the point that
| flockonus wants to make.
| bloodyplonker22 wrote:
| Indeed. I hate to say this, but most people hate the dating
| apps because they're ugly. The top 10% are getting all the
| dates on these apps and the rest are left with endless
| swiping and only likes from scammers, bots, and pig
| butcherers. Trust me, I know because I'm ugly.
| suyash wrote:
| I'll put InfluxDB right up there as well.
| phillipcarter wrote:
| IMO dating apps didn't evolve because they got better at
| matchmaking based on stated preferences in a profile (something
| Plenty of Fish nailed quite well!); they shifted the paradigm
| towards swiping on pictures and an evolving matchmaking
| algorithm based on user interactions within the app.
|
| This is _sort of_ what the article is getting at. For the
| purposes of gathering, aggregating, sending, and analyzing a
| bunch of metrics, you'll be hard-pressed to beat Datadog at
| this game. They're _extremely_ good at this and, by virtue of
| having many teams with tons of smart people on them, have
| figured out many of the best ways to squeeze as much analysis
| value as you can out of this kind of data. The post is arguing
| that better observability demands a paradigm shift away from
| metrics as the source of truth for things, and with that, many
| more possibilities open up.
| archenemybuntu wrote:
| Id gonna break a nerve and say most orgs overengineer
| observability. There's the whole topology of otel tools,
| Prometheus tools and bunch of Long term storage / querying
| solutions. Very complicated tracing setups. All these are fine if
| you have a team for maintaining observability only. But your avg
| product development org can sacrifice most of it and do with
| proper logging with a request context, plus some important
| service level metrics + grafana + alarms.
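|
| Roughly what I mean by "logging with a request context", as a
| stdlib-only Python sketch (names and the format string are made
| up):
|
|   import contextvars, logging, uuid
|
|   request_id = contextvars.ContextVar("request_id", default="-")
|
|   class RequestContextFilter(logging.Filter):
|       def filter(self, record):
|           # stamp every record with the current request's id
|           record.request_id = request_id.get()
|           return True
|
|   logging.basicConfig(format=
|       "%(asctime)s %(levelname)s [%(request_id)s] %(message)s")
|   logging.getLogger().addFilter(RequestContextFilter())
|
|   # at the top of each request handler:
|   request_id.set(str(uuid.uuid4()))
|   logging.warning("payment provider timed out")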
|
| Problem with all these above tools is that, they all seem like
| essential features to have but once you have the whole topology
| of 50 half baked CNCF containers set up in "production" shit
| starts to break in very mysterious ways and also these
| observability products tend to cost a lot.
| datadrivenangel wrote:
| The ratio of 'metadata' to data is often hundreds or thousands
| to one, which translates to cost, especially if you're using
| a licensed service. I've been at companies where the analytics
| and observability costs are 20x the actual cost of the
| application for cloud hosting. Datadog seems to have switched
| to revenue extraction in a way that would make Oracle proud.
| lincolnq wrote:
| Is that 20x cost... actually bad though? (I mean, I know
| _Datadog_ is bad. I used to use it and I hated its cost
| structure.)
|
| But maybe it's worth it. Or at least, the good ones would be
| worth it. I can imagine great metadata (and platforms to
| query and explore it) saves more engineering time than it
| costs in server time. So to me this ratio isn't that
| material, even though it looks a little weird.
| never_inline wrote:
| I would be curious to know, what's the ratio of AWS bill to
| programmer salary in J random grocery delivery startup.
| ElevenLathe wrote:
| The trouble is that o11y costs developer time too.
| I've seen both traps:
|
| Trap 1: "We MUST have PERFECT information about EVERY
| request and how it was serviced, in REALTIME!"
|
| This is bad because it ends up being hella expensive, both
| in engineering time and in actual server (or vendor) bills.
| Yes, this is what we'd want if cost were no object, but it
| sometimes actually is an object, even for very important or
| profitable systems.
|
| Trap 2: "We can give customer support our pager number so
| they can call us if somebody complains."
|
| This is bad because you're letting your users suffer errors
| that you could have easily caught and fixed for relatively
| cheap.
|
| There are diminishing returns with this stuff, and a lot of
| the calculus depends on the nature of your application,
| your relationship with consumers of it, your business
| model, and a million other factors.
| fishtoaster wrote:
| For what it's worth, I found it almost trivial to set up
| OpenTelemetry and point it at Honeycomb. It took me an afternoon
| about a month ago for a medium-sized Python web app. I've found
| that it replaces a lot of the tooling and manual work needed in
| the past. At previous startups it usually went like:
|
| 1. Set up basic logging (now I just use otel events)
|
| 2. Make it structured logging (Get that for free with otel
| events)
|
| 3. Add request contexts that are sent along with each log (also
| free with otel)
|
| 4. Manually set up tracing ids in my codebase and configure it
| in my tooling (all free with otel spans)
|
| Really, I was _expecting_ to wind up having to go deep into the
| new observability philosophy to get value out of it, but I found
| myself loving this setup with minimal work and minimal Kool-Aid
| drinking. I'll probably do something like this over "logs,
| request context, metrics, and alarms" at future startups.
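|
| For the curious, the whole setup was roughly this (a minimal
| sketch from memory; the Honeycomb endpoint/header, service name,
| and span/attribute names are just examples - check their docs):
|
|   from opentelemetry import trace
|   from opentelemetry.sdk.resources import Resource
|   from opentelemetry.sdk.trace import TracerProvider
|   from opentelemetry.sdk.trace.export import BatchSpanProcessor
|   from opentelemetry.exporter.otlp.proto.http.trace_exporter \
|       import OTLPSpanExporter
|
|   provider = TracerProvider(
|       resource=Resource.create({"service.name": "my-web-app"}))
|   provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
|       endpoint="https://api.honeycomb.io/v1/traces",
|       headers={"x-honeycomb-team": "MY_API_KEY"})))
|   trace.set_tracer_provider(provider)
|
|   tracer = trace.get_tracer("checkout")
|   with tracer.start_as_current_span("charge_card") as span:
|       span.set_attribute("user.id", "u_123")   # request context
|       span.add_event("card_charged", {"cents": 1999})  # a "log"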
| JoshTriplett wrote:
| I've currently done this, and I'm seriously considering
| _undoing_ it in favor of some other logging solution. My
| biggest reason: OpenTelemetry fundamentally doesn't handle
| events that aren't part of a span, and doesn't handle spans
| that don't close. So, if you crash, you don't get telemetry
| to help you debug the crash.
|
| I wish "span start" and "span end" were just independent
| events, and OTel tools handled and presented unfinished spans
| or events that don't appear within a span.
| bamboozled wrote:
| Isn't the problem here that your code is crashing and
| you're relying on the wrong tool to help you solve that?
| JoshTriplett wrote:
| Logging solves this problem. If OTel-based observability is
| attempting to position itself as a better alternative to
| logging, it needs to solve the problems that logging
| already solves. I'm not going to use completely separate
| tools for logging and observability.
|
| Also, "crash" here doesn't necessarily mean "segfault" or
| equivalent. It can also mean "hang and not finish (and
| thus not end the span)", or "have a network issue that
| breaks the ability to submit observability data" (but
| _after_ an event occurred, which could have been
| submitted if OTel didn't wait for spans to end first).
| There are any number of reasons why a span might start
| but not finish, most of which are bugs, and OTel and
| tools built upon it provide zero help when debugging
| those.
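|
| The closest I've gotten to a workaround is force-flushing from a
| crash hook so whatever spans already ended at least get exported
| (a sketch against the Python SDK; it does nothing for the still-
| open span, for hangs, or for lost network):
|
|   import sys
|   from opentelemetry import trace
|
|   def crash_hook(exc_type, exc, tb):
|       provider = trace.get_tracer_provider()
|       flush = getattr(provider, "force_flush", None)
|       if flush is not None:
|           flush(timeout_millis=2000)   # best-effort export
|       sys.__excepthook__(exc_type, exc, tb)
|
|   sys.excepthook = crash_hook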
| growse wrote:
| Can you give an example of an event that's not part of a
| span / trace?
| Spivak wrote:
| Unhandled exceptions are a pretty normal one. You get kicked
| out to your app's topmost level and you lose your span. My
| wishlist for solving this (and I actually wrote an
| implementation in Python which leans heavily on reflection) is
| to be able to attach arbitrary data to stack frames and
| exceptions as they occur, merge all the data top-down, and send
| it up to your handler.
|
| Signal handlers are another one and are a whole other
| beast simply because they're completely devoid of
| context.
| jononor wrote:
| Can't you close the span on an exception?
| JoshTriplett wrote:
| See https://news.ycombinator.com/item?id=41205665 for
| more details.
|
| And even in the case of an actual crash, that doesn't
| necessarily mean the application is in a state to
| successfully submit additional OTel data.
| amelius wrote:
| I can't even run valgrind on many libraries and Python modules
| because they weren't designed with valgrind in mind. Let's work
| on observability before we version it.
| jrockway wrote:
| I like the wide log model. At work, we write software that
| customers run for themselves. When it breaks, we can't exactly
| ssh in and mutate stuff until it works again, so we need some
| sort of information that they can upload to us. Logs are the
| easiest way to do that, and because logs are a key part of our
| product (batch job runner for k8s), we already have
| infrastructure to store and retrieve logs. (What's built into k8s
| is sadly inadequate. The logs die when the pod dies.)
|
| Anyway, from this we can get metrics and traces. For traces, we
| log the start and end of requests, and generate a unique ID at
| the start. Server logging contexts have the request's ID.
| Everything that happens for that request gets logged along with
| the request ID, so you can watch the request transit the system
| with "rg 453ca13b-aa96-4204-91df-316923f5f9ae" or whatever on an
| unpacked debug dump, which is rather efficient at moderate scale.
| For metrics, we just log stats when we know them; if we have some
| io.Writer that we're writing to, it can log "just wrote 1234
| bytes", and then you can post-process that into useful statistics
| at whatever level of granularity you want ("how fast is the
| system as a whole sending data on the network?", "how fast is
| node X sending data on the network?", "how fast is request
| 453ca13b-aa96-4204-91df-316923f5f9ae sending data to the
| network?"). This doesn't scale quite as well, as a busy system
| with small writes is going to write a lot of logs. Our metrics
| package has per-context.Context aggregation, which cleans this up
| without requiring any locking across requests like Prometheus
| does.
| https://github.com/pachyderm/pachyderm/blob/master/src/inter...
|
| Finally, when I got tired of having 43 terminal windows open with
| a bunch of "less" sessions over the logs, I hacked something
| together to do a light JSON parse on each line and send the logs
| to Postgres:
| https://github.com/pachyderm/pachyderm/blob/master/src/inter....
| It is slow to load a big dump, but the queries are surprisingly
| fast. My favorite thing to do is the "select * from logs where
| json->>'x-request-id' = '453ca13b-aa96-4204-91df-316923f5f9ae'
| order by time asc" or whatever. Then I don't have 5 different log
| files open to watch a single request, it's just all there in my
| psql window.
|
| As many people will say, this analysis method doesn't scale in
| the same way as something like Jaeger (which scales by deleting
| 99% of your data) or Prometheus (which scales by throwing away
| per-request information), but it does let you drill down as deep
| as necessary, which is important when you have one customer that
| had one bad request and you absolutely positively have to fix it.
|
| My TL;DR is that if you're a 3 person team writing some software
| from scratch this afternoon, "print" is a pretty good
| observability stack. You can add complexity later. Just capture
| what you need to debug today, and this will last you a very long
| time. (I wrote the monitoring system for Google Fiber CPE
| devices... they just sent us their logs every minute and we did
| some very simple analysis to feed an alerting system; for
| everything else, a quick MapReduce or dremel invocation over the
| raw log lines was more than adequate for anything we needed to
| figure out.)
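|
| In Python terms, the whole pattern boils down to something like
| this (a sketch; the real implementation is the Go code linked
| above, and the field names are illustrative):
|
|   import contextvars, json, sys, time, uuid
|
|   request_id = contextvars.ContextVar("request_id", default=None)
|
|   def log(msg, **fields):
|       # one wide JSON line per event, always carrying the id
|       fields["time"] = time.time()
|       fields["msg"] = msg
|       fields["x-request-id"] = request_id.get()
|       sys.stdout.write(json.dumps(fields) + "\n")
|
|   def handle_request(req):
|       request_id.set(str(uuid.uuid4()))
|       log("request start", method=req["method"], path=req["path"])
|       n = send_response(req)
|       log("wrote bytes", n=n)   # post-process into throughput
|       log("request end", status=200)
|
|   def send_response(req):       # stand-in for the real work
|       return 1234
|
| Then "rg <id>" or the Postgres query above works over the dump.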
| Veserv wrote:
| They do not appear to understand the fundamental difference
| between logs, traces, and metrics. Sure, if you can log every
| event you want to record, then everything is just events (I will
| ignore the fact that they are still stuck on formatted text
| strings as an event format). The difference is what you do when
| you can not record everything you want to either at build time or
| runtime.
|
| Logs are independent. When you can not store every event, you can
| drop them randomly. You lose a perfect view of every logged
| event, but you still retain a statistical view. As we have
| already assumed you _can not_ log everything, this is the best
| you can do anyways.
|
| Traces are for correlated events where you want every correlated
| event (a trace) or none of them (or possibly the first N in a
| trace). Losing events within a trace makes the entire trace (or
| at least the latter portions) useless. When you can not store
| every event, you want to drop randomly at the whole trace level.
|
| Metrics are for situations where you know you can not log
| everything. You aggregate your data at log time, so instead of
| getting a statistically random sample you instead get aggregates
| that incorporate all of your data at the cost of precision.
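|
| Concretely, the three degradation strategies look roughly like
| this (an illustrative sketch, not any particular library; ship()
| stands in for whatever backend you use):
|
|   import random
|   from collections import defaultdict
|
|   KEEP = 0.10              # budget: keep ~10% of events
|
|   def ship(item):          # stand-in for the real pipeline
|       pass
|
|   # Logs: independent events, so drop each one at random and
|   # retain a statistically representative sample.
|   def emit_log(event):
|       if random.random() < KEEP:
|           ship(event)
|
|   # Traces: correlated events, so make one keep/drop decision
|   # per trace id and apply it to every span in that trace.
|   def emit_span(trace_id, span):
|       if int(trace_id, 16) % 100 < KEEP * 100:  # hex trace id
|           ship(span)
|
|   # Metrics: aggregate at record time; every observation is
|   # counted, but only the aggregate survives, not the event.
|   latency_sum = defaultdict(float)
|   latency_count = defaultdict(int)
|   def record_latency(endpoint, ms):
|       latency_sum[endpoint] += ms
|       latency_count[endpoint] += 1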
|
| Note that for the purposes of this post, I have ignored the
| reason why you can not store every event. That is an orthogonal
| discussion and techniques that relieve that bottleneck allow more
| opportunities to stay on the happy path of "just events with
| post-processed analysis" that the author is advocating for.
| lukev wrote:
| Yes, I'm quite sure the CTO of a leading observability platform
| is simply confused about terminology.
| pclmulqdq wrote:
| It is not impossible that this is the case (at least in GP's
| view). Companies in the space argue over whether logs and events
| are structured or unstructured data and how much to exploit
| that structure. Unstructured is the simple way, and appears
| to be the approach that TFA prefers, while deep exploitation
| of structured event collection actually appears to be better
| for many technical reasons but is more complex.
|
| From what I can tell, Honeycomb is staffed up with operators
| (SRE types). The GP is thinking about logs, traces, and
| metrics like a mathematician, and I am not sure that anyone
| at Honeycomb actually thinks that way.
| MrDarcy wrote:
| To be fair, the first heading and section of TFA says no one
| knows what the terminology means anymore.
|
| Which, to GP's point, is kind of BS.
| growse wrote:
| > Logs are independent. When you can not store every event, you
| can drop them randomly. You lose a perfect view of every logged
| event, but you still retain a statistical view. As we have
| already assumed you can not log everything, this is the best
| you can do anyways.
|
| Usually, every log message occurs as part of a process (no
| matter how short lived), and so I'm not sure it's ever the case
| that a given log is truly independent from others. "Sampling
| log events" is a smell that indicates that you _will_ have
| incomplete views on any given txn/process.
|
| What I find works better is to always correlate logs to traces,
| and then just drop the majority of non-error traces (keep 100%
| of actionable error traces).
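|
| i.e. a tail-sampling decision at the end of each trace, roughly
| (sketch):
|
|   import random
|
|   def keep_trace(spans, keep_healthy=0.05):
|       # keep every trace containing an error, plus a small
|       # random sample of healthy ones for baseline comparison
|       has_error = any(s.get("error") for s in spans)
|       return has_error or random.random() < keep_healthy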
|
| > When you can not store every event, you want to drop randomly
| at the whole trace level.
|
| Yes, this.
| cmgriffing wrote:
| I am no expert but my take was that in many cases, logs,
| metrics, and traces are stored in individually queryable
| tables/stores.
|
| Their proposal is to have a single "Event" table/store with
| fields for spans, traces, and metrics that can be sparsely
| populated, similar to a DynamoDB row.
|
| Again, I might have missed the point, though.
| otterley wrote:
| > what do you do when you can not record everything you want to
|
| The thesis of the article is that you _should_ use events as
| the source of truth, and derive logs, metrics, and traces from
| them. I see nothing logically or spiritually wrong with that
| fundamental approach and it's been Honeycomb's approach since
| Day 1.
|
| I do feel like Charity is ignoring the elephant in the room of
| transit and storage cost: "Logs are infinitely more powerful,
| useful _and cost-effective_ than metrics." Weeeelllllll....
|
| Honeycomb has been around a while. If transit and storage were
| free, using Honeycomb and similar solutions would be a no-
| brainer. Storage is cheap, but probably not as cheap as it
| ought to be in the cloud.[1] And certain kinds of transit are
| still pretty pricey in the cloud. Even if you get transit for
| free by keeping it local, using your primary network interface
| for shipping events reduces the amount of bandwidth remaining
| for the primary purpose of doing real work (i.e., handling
| requests).
|
| Plus, I think people are aware--even if they don't specifically
| say so--that data that is processed, stored, and never used
| again is waste. Since we can't have perfect prior knowledge
| of whether some data will be valuable later, the logical course
| of action is to retain everything. But since doing so
| has a cost, people will naturally tend towards trying to
| capture as little as they can get away with yet still be able
| to do their job.
|
| [1] I work for AWS, but any opinions stated herein are mine and
| not necessarily those of my colleagues or the company I work
| for.
| zellyn wrote:
| A few questions:
|
| a) You're dismissing OTel, but if you _do_ want to do flame
| graphs, you need traces and spans, and standards (W3C Trace-
| Context, etc.) to propagate them.
|
| b) What's the difference between an "Event" and a "Wide Log with
| Trace/Span attached"? Is it that you don't have to think of it
| only in the context of traces?
|
| c) Periodically emitting wide events for metrics, once you had
| more than a few, would almost inevitably result in creating a
| common API for doing it, which would end up looking almost just
| like OTel metrics, no?
|
| d) If you're clever, metrics histogram sketches can be combined
| usefully, unlike naively averaging averages (see the sketch
| below).
|
| e) Aren't you just talking about storing a hell of a lot of data?
| Sure, it's easy not to worry, and just throw anything into the
| Wide Log, as long as you don't have to care about the storage.
| But that's exactly what happens with every logging system I've
| used. Is sampling the answer? Like, you still have to send all
| the data, even from _very_ high QPS systems, so you can tail-
| sample later after the 24 microservice graph calls all complete?
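|
| For (d), a toy example of why merged histograms beat averaged
| averages (the numbers are made up):
|
|   # Two services report latency with the same bucket bounds (ms).
|   bounds = [10, 50, 100, 500]
|   svc_a = {"counts": [900, 80, 15, 5], "sum": 9_000.0}  # 1000 req
|   svc_b = {"counts": [1, 2, 3, 4], "sum": 3_000.0}      # 10 req
|
|   # Histograms merge exactly: add counts and sums element-wise.
|   merged = {
|       "counts": [a + b for a, b in
|                  zip(svc_a["counts"], svc_b["counts"])],
|       "sum": svc_a["sum"] + svc_b["sum"],
|   }
|   true_avg = merged["sum"] / sum(merged["counts"])      # ~11.9 ms
|
|   # Averaging the per-service averages weights a 10-request
|   # service the same as a 1000-request one: (9 + 300) / 2
|   naive_avg = (svc_a["sum"] / sum(svc_a["counts"]) +
|                svc_b["sum"] / sum(svc_b["counts"])) / 2  # 154.5 ms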
|
| Don't get me wrong, my years-long inability to adequately and
| clearly settle the simple theoretical question of "What's the
| difference between a normal old-school log, and a log attached to
| a trace/span, and which should I prefer?" has me biased towards
| your argument :-)
| xyzzy_plugh wrote:
| I was excited by the title and thought that this was going to be
| about versioning the observability contracts of services,
| dashboards, alerts, etc., which are typically exceptionally
| brittle. Boy am I disappointed.
|
| I get what Charity is shouting. And Honeycomb is incredible. But
| I think this framing overly simplifies things.
|
| Let's step back and imagine everything emitted JSON only. No
| other form of telemetry is allowed. This is functionally
| equivalent to wide events albeit inherently flawed and
| problematic as I'll demonstrate.
|
| Every time something happens somewhere you emit an Event object.
| You slurp these to a central place, and now you can count them,
| connect them as a graph, index and search, compress, transpose,
| etc. etc.
|
| I agree, this works! Let's assume we build it and all the
| necessary query and aggregation tools, storage, dashboards,
| whatever. Hurray! But sooner or later you will have this problem:
| a developer comes to you and says "my service is falling over"
| and you'll look and see that for every 1 MiB of traffic it
| receives, it also sends roughly 1 MiB of traffic, but it produces
| 10 MiB of JSON Event objects. Possibly more. Look, this is a very
| complex service, or so they tell you.
|
| You smile and tell them "not a problem! We'll simply pre-
| aggregate some of these events in the service and emit a periodic
| summary." Done and done.
|
| Then you find out there's a certain request that causes problems,
| so you add more Events, but this also causes an unacceptable
| amount of Event traffic. Not to worry, we can add a special flag
| to only emit extra logs for certain requests, or we'll randomly
| add extra logging ~5% of the time. That should do it.
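|
| Concretely, both fixes end up being a handful of lines like this
| (sketch; emit() stands in for shipping JSON somewhere central,
| and I'm ignoring locking):
|
|   import random, threading, time
|   from collections import Counter
|
|   def emit(event):              # ship a JSON Event object
|       print(event)
|
|   # the "periodic summary": counts aggregated in-process and
|   # flushed every 10s -- i.e. a metric
|   summary = Counter()
|   def flush_summary():
|       while True:
|           time.sleep(10)
|           emit({"type": "summary", **summary})
|           summary.clear()
|   threading.Thread(target=flush_summary, daemon=True).start()
|
|   def handle(request):
|       summary["requests"] += 1
|       # the "extra logging for certain requests": a sampled set
|       # of per-request correlated events -- i.e. a trace
|       if request.get("debug") or random.random() < 0.05:
|           emit({"type": "event", "step": "parse",
|                 "req": request["id"]})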
|
| Great! It all works. That's the end of this story, but the result
| is that you've re-invented metrics and traces. Sure, logs -- or
| "wide events" that are for the sake of this example the same
| thing -- work well enough for almost everything, except of course
| for all the places they don't. And now where they don't, you have
| to reinvent all this _stuff_.
|
| Metrics and traces solve these problems upfront in a way that's
| designed to accommodate scaling problems before you suffer an
| outage, without necessarily making your life significantly harder
| along the way. At least that's the intention, regardless of
| whether or not that's true in practice -- certainly not addressed
| by TFA.
|
| What's more, in practice metrics and traces _today_ are in fact
| _wide events_. They're _metrics_ events, or _tracing_ events. It
| doesn't really matter if a metric ends up scraped from a
| Prometheus metrics page or emitted as a JSON log line. That's
| beside the point. The point is they are fit for purpose.
|
| Observability 2.0 doesn't fix this, it just shifts the problem
| around. Remind me, how did we do things _before_ Observability
| 1.0? Because as far as I can tell it's strikingly similar in
| appearance to Observability 2.0.
|
| So forgive me if my interpretation of all of this is lipstick on
| the pig that is Observability 0.1.
|
| And finally, I _get_ that you _can_ make it work. Google
| certainly gets that. But then they built Monarch anyway. Why?
| It's worth understanding, if you ask me. Perhaps we should start
| by educating the general audience on this matter, but then I'm
| guessing that would not aid in the sale of a solution that
| eschews those very learnings.
| firesteelrain wrote:
| It took me a bit to really understand the versioning angle, but
| I think I get it now.
|
| The blog discusses the idea of evolving observability practices,
| suggesting a move from traditional methods (metrics, logs,
| traces) to a new approach where structured log events serve as a
| central, unified source of truth. The argument is that this shift
| represents a significant enough change to be considered a new
| version of observability, similar to how software is versioned
| when it undergoes major updates. This evolution would enable more
| precise and insightful software development and operations.
|
| Unlike separate metrics, logs, and traces, structured log events
| combine these data types into a single, comprehensive source,
| simplifying analysis and troubleshooting.
|
| Structured events capture more detailed context, making it easier
| to understand the "why" behind system behavior, not just the
| "what."
| moomin wrote:
| We came up with a buzzword to market our product. The industry
| made this buzzword meaningless. Now we're coming up with a new
| one. We're sure the same thing won't happen again.
___________________________________________________________________
(page generated 2024-08-09 23:00 UTC)