[HN Gopher] All you need is Wide Events, not "Metrics, Logs and ...
___________________________________________________________________
All you need is Wide Events, not "Metrics, Logs and Traces"
Author : talboren
Score : 49 points
Date : 2024-02-27 20:53 UTC (2 hours ago)
(HTM) web link (isburmistrov.substack.com)
(TXT) w3m dump (isburmistrov.substack.com)
| swader999 wrote:
| This seems like event sourcing with a nice tool to inspect,
| filter and visualize the event stream. The sampling rate idea is
| a decent tactic I hadn't heard of.
| veeralpatel979 wrote:
| Great article. Here's a Python notebook I created earlier to
| show how you can capture such wide events:
|
| https://colab.research.google.com/drive/1Y65qXXogoDgOnXFBDyF...
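|
| For a sense of the shape: a "wide event" is just one flat,
| wide record per unit of work, with every field you might later
| want to filter on attached up front. A minimal sketch (all
| field names here are made up):
|
|     import json, time, uuid
|
|     def handle_request(user_id, endpoint):
|         # one record per request, as many fields as possible
|         event = {
|             "timestamp": time.time(),
|             "request_id": str(uuid.uuid4()),
|             "endpoint": endpoint,
|             "user_id": user_id,
|             "region": "eu-west-1",
|             "build_sha": "abc123",
|             "status": 200,
|             "duration_ms": 42,
|         }
|         print(json.dumps(event))  # ship to your event sink
|
|     handle_request("u-7", "/checkout")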
| timthelion wrote:
| At the company I work for we send JSON to Kafka and
| subsequently to Elasticsearch, with great effect. That's
| basically "wide events". The magical thing about hooking up a
| bunch of pipelines with Kafka is that all of a sudden your
| observability/metrics system becomes an amazing API for
| extending systems with additional automations. Want to do
| something when a router connects to a network? Just subscribe
| to this Kafka topic here. It doesn't matter that the topic was
| originally intended just to log some events. We even created
| an open source library for writing and running these pipelines
| in Jupyter. Here's a super simple example:
| https://github.com/bitswan-space/BitSwan/blob/master/example...
|
| People tend to think Kafka is hard, but as you can see from
| the example, it can be extremely easy.
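|
| A minimal sketch of that subscription pattern, using the
| kafka-python client (the topic name and event fields are made
| up; the BitSwan library linked above has its own API):
|
|     import json
|     from kafka import KafkaConsumer
|
|     # subscribe to a topic that was originally only meant to
|     # log some events
|     consumer = KafkaConsumer(
|         "router-events",  # hypothetical topic name
|         bootstrap_servers="localhost:9092",
|         value_deserializer=lambda v: json.loads(v.decode()),
|     )
|
|     for message in consumer:
|         event = message.value
|         if event.get("type") == "router_connected":
|             # run whatever automation you like here
|             print("provisioning router", event.get("router_id"))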
| jeffbee wrote:
| I'm glad that works for you but to me it sounds really
| expensive. At small scale you can do this any way you want but
| if you build an observability system with linear cost and a
| high coefficient it will become an issue if you run into some
| success.
| timthelion wrote:
| The only expensive part is the hardware for the Elasticsearch
| servers. Kafka is cheap to run. We have an on-prem
| Elasticsearch cluster pulling in tens of thousands of events
| per second. On-prem servers aren't _that_ expensive. It's
| really just 6 servers with 20 TB each and another 40 TB for
| backups. And it's not like you have to store everything
| forever... Compare that data flow to everyone watching YouTube
| all the time. It's really nothing...
| ricardobeat wrote:
| I can name a single company in my area that runs their own
| servers, and they've been in the middle of a migration to
| the cloud for the past five years.
| hibikir wrote:
| This works well for a while. But eventually you get big, and
| have little to no idea of what is in your downstream. Then
| every single format change in any event you write must be
| treated like open heart surgery, because tracing your data
| dependencies is unreliable.
|
| Sometimes it seems fixable by "just having a list of people
| listening", and then you look and all that some of them do is
| mildly transform your data and pass it along. It doesn't take
| long before people realize that "just logging some events" is
| making future promises to other teams you don't know about,
| and people start being terrified of emitting anything.
|
| This is a story I've seen in at least 4 places in my career.
| Making data available to other people is not any less scary in
| kafka than it was back in the days where applications shared a
| giant database, and you'd see yearlong projects to do some mild
| changes to a data model, which was originally designed in 5
| minutes.
|
| As for Kafka being easy: it's not quite as hard as some people
| say, but it's both a pub/sub system and a distributed
| database. When your clusters get large, it definitely isn't
| easy.
| meandmycode wrote:
| I think this is the crux of it: if something works for a
| while, then actually that's fine. As an industry we over-index
| and scare new developers towards complexity. The converse is
| true too: what works at scale doesn't at non-scale - not
| because of the tech, but because holistically you're asking
| for a lot: a lot of knowledge, a lot of complex tech to be
| deployed by a small team.
| dgellow wrote:
| ZooKeeper in production can be a real pain to maintain...
| jeffbee wrote:
| The isomorphism of traces and logs is clear. You can flatten a
| trace to a log and you can perfectly reconstruct the trace graph
| from such a log. I don't see the unifying theme that brings
| metrics into this framework, though. Metrics feel
| fundamentally different: they are a way to inspect the
| internal state of your program, not necessarily driven by
| exogenous events.
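|
| A sketch of that isomorphism: if every log line carries a span
| id and a parent span id, the trace tree falls out of a single
| group-by (field names are illustrative):
|
|     from collections import defaultdict
|
|     log = [  # a trace, flattened into ordinary log records
|         {"span": "a", "parent": None, "op": "GET /checkout"},
|         {"span": "b", "parent": "a", "op": "db.query"},
|         {"span": "c", "parent": "a", "op": "cache.get"},
|     ]
|
|     children = defaultdict(list)
|     for line in log:
|         children[line["parent"]].append(line)
|
|     def print_trace(parent=None, depth=0):
|         # walk the reconstructed graph depth-first
|         for span in children[parent]:
|             print("  " * depth + span["op"])
|             print_trace(span["span"], depth + 1)
|
|     print_trace()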
|
| But I definitely agree with the theme of the article that leaving
| a big company can feel like you got your memory erased in a time
| machine mishap. Inside a FANG you might become normalized to
| logging hundreds of thousands of informational statements, per
| second, per core. You might have got used to every endpoint
| exposing thirty million metric time series. As soon as you walk
| out the door some guy will chew you out about "cardinality" if
| you have 100 metrics.
| HeyImAlex wrote:
| I think all metrics can be reconstructed as "wide events",
| since they're just a bunch of arbitrary data? Counts, gauges,
| and histograms at least seem pretty straightforward to me.
|
| It seems like the main motivation for metrics is that sending +
| storing + querying wide events for everything is cost
| prohibitive and/or performance intensive. If you can afford it
| and it works well, wide events are definitely more flexible. A
| metric is kinda just a pre-aggregation on the event stream.
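|
| That pre-aggregation framing is easy to make concrete (a
| sketch over made-up events with status and duration fields):
|
|     import statistics
|
|     events = [
|         {"endpoint": "/checkout", "status": 500, "duration_ms": 31},
|         {"endpoint": "/checkout", "status": 200, "duration_ms": 12},
|         {"endpoint": "/checkout", "status": 200, "duration_ms": 90},
|     ]
|
|     # a counter is just a filtered count over the stream...
|     errors = sum(1 for e in events if e["status"] >= 500)
|
|     # ...and a histogram/quantile is an aggregation on a field
|     p50 = statistics.median(e["duration_ms"] for e in events)
|
|     print(errors, p50)  # 1 31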
| growse wrote:
| If you think of a metric as an event representing the act of
| measuring (along with the result of that measurement), then it
| becomes the same as any other event.
| Osmose wrote:
| This isn't an unknown idea outside of Meta; it's just really
| expensive, especially if you're using a vendor instead of
| building your own tooling. Prohibitively so, even with
| sampling.
| ricardobeat wrote:
| Exactly. When they say
|
| > Unlike with prometheus, however, with Wide Events approach we
| don't need to worry about cardinality
|
| they are hinting at the hidden reason why not everyone does
| it. You have to "worry" about cardinality because Prometheus
| pre-aggregates data so you can visualize it fast, and
| optimizes storage. If you want the same speed on a massive
| PB-scale data lake, with an infinite amount of unstructured
| data, and in the cloud instead of your own datacenters, it's
| gonna cost you _a lot_, and for most companies that is not a
| sensible expense.
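|
| The cardinality "worry" is simple multiplication: a
| pre-aggregating store like Prometheus keeps one time series
| per distinct label combination, so (with made-up numbers):
|
|     endpoints, statuses, regions = 200, 10, 20
|     series = endpoints * statuses * regions
|     print(series)            # 40000 series: fine
|
|     # add one high-cardinality label such as user_id and the
|     # series count explodes
|     print(series * 100_000)  # 4000000000 series: not fine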
|
| It does work at smaller scale though; we once had an in-house
| system like this that worked well. Eventually user events were
| moved to Mixpanel, and everything else to Datadog
| (metrics/logs/traces) plus a migration to OpenTelemetry. It
| took months and added 2-digit monthly bills, and in the end
| debugging or resolving incidents wasn't much improved over
| having instant access to events and business metrics. Whoever
| figures out a system that can do "wide events" in a cost-
| effective way from startup to unicorn scale will absolutely
| make a killing.
| mlhpdx wrote:
| I don't know that's true. My last two very-not-meta-sized
| companies have both had systems that were very cost effective
| and essentially what the article describes. It's not the
| simplest thing to put in place, but far from unapproachable.
|
| I think one of the big hills is moving to a culture that
| values observability (or whatever you choose to call it; I
| prefer "forensic debugging"). It's another thing to understand
| and worry about, and it helps tremendously if there are good,
| highly visible examples of it.
|
| Edit: Typo.
| gtirloni wrote:
| Could you share some specifics of how it could be approached?
| rekwah wrote:
| > just put it there, it might be useful later
|
| > Also note that we have never mentioned anything about
| cardinality. Because it doesn't matter - any field can be of any
| cardinality. Scuba works with raw events and doesn't pre-
| aggregate anything, and so cardinality is not an issue.
|
| This is how we end up with very large, very expensive data
| swamps.
| _visgean wrote:
| That depends on the sampling rate, no? I would much rather
| have a rich log record sampled at 1% than more records that
| don't contain enough info to debug.
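|
| A sketch of that trade-off, assuming simple head sampling:
|
|     import random
|
|     SAMPLE_RATE = 0.01  # keep 1% of requests...
|
|     def maybe_emit(event, sink=print):
|         if random.random() < SAMPLE_RATE:
|             # ...but keep every field on the ones we keep, and
|             # record the rate so counts can be scaled back up
|             event["sample_rate"] = SAMPLE_RATE
|             sink(event)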
| growse wrote:
| The people feeling the pain of (and paying for) the expensive
| data swamp are often not the same people who are yolo'ing the
| sample rate to 100% in their apps, because why wouldn't you
| want to store every event?
|
| Put another way, you're in charge of a large telemetry event
| sink. How do you incentivise the correct sampling behaviour
| by your users?
| ojkelly wrote:
| Observability as a shared concept has followed Agile and DevOps.
|
| Something with a real meaning that enables a step-change in
| development practices. Adoption is organic initially because
| the pain it solves is very real.
|
| But as awareness of the idea grows it threatens established
| institutions and vendors, who must co-opt the concept and
| redefine it such that they are included.
|
| If they can't be explicitly included (logs, metrics,
| traces)[0], then they at least make sure the definition
| becomes so vague and confused that they are not explicitly
| excluded[1].
|
| Wide events and a good means to query them cover everything,
| but not if you, as a vendor, cannot store and query wide
| events.
|
| [0] As the article notes, one of these is not like the other.
|
| [1] Is Scrum Agile? What do you mean a standup can't go for an
| hour? See also DevOps as a role.
| guhcampos wrote:
| Incredible what you can do with infinite money!
|
| For everyone else, more specific data structures, sampling and
| careful consideration of what to record are essential.
___________________________________________________________________
(page generated 2024-02-27 23:00 UTC)