[HN Gopher] All you need is Wide Events, not "Metrics, Logs and ...
       ___________________________________________________________________
        
       All you need is Wide Events, not "Metrics, Logs and Traces"
        
       Author : talboren
       Score  : 49 points
       Date   : 2024-02-27 20:53 UTC (2 hours ago)
        
 (HTM) web link (isburmistrov.substack.com)
 (TXT) w3m dump (isburmistrov.substack.com)
        
       | swader999 wrote:
       | This seems like event sourcing with a nice tool to inspect,
       | filter and visualize the event stream. The sampling rate idea is
       | a decent tactic I hadn't heard of.
        
       | veeralpatel979 wrote:
       | Great article, here is a Python notebook I created earlier to
       | show you how you can capture such wide events:
       | 
       | https://colab.research.google.com/drive/1Y65qXXogoDgOnXFBDyF...
        
       | timthelion wrote:
        | At the company I work for we send json to kafka and subsequently
        | to Elasticsearch with great effect. That's basically 'wide
        | events'. The magical thing about hooking up a bunch of pipelines
        | with kafka is that all of a sudden your observability/metrics
        | system becomes an amazing API for extending systems with
        | additional automations. Want to do something when a router
        | connects to a network? Just subscribe to this kafka topic here.
        | It doesn't matter that the topic was originally intended just to
        | log some events. We even created an open source library for
        | writing and running these pipelines in jupyter. Here's a super
       | simple example https://github.com/bitswan-
       | space/BitSwan/blob/master/example...
       | 
       | People tend to think kafka is hard, but as you can see from the
       | example, it can be extremely easy.
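        | 
        | A minimal sketch of the pattern (topic and field names here are
        | hypothetical; in production the callback below would be fed by
        | a Kafka consumer subscribed to the topic):

```python
import json

# Hypothetical automation hook. In production this would be driven by a
# Kafka consumer, e.g. with kafka-python:
#   for msg in KafkaConsumer("network.events", ...):
#       on_event(json.loads(msg.value))
def on_event(event: dict):
    """React to the events we care about; ignore the rest."""
    if event.get("type") == "router_connected":
        return f"provision monitoring for {event['router_id']}"
    return None

raw = '{"type": "router_connected", "router_id": "r-42", "site": "fra1"}'
print(on_event(json.loads(raw)))  # -> provision monitoring for r-42
```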
        
         | jeffbee wrote:
          | I'm glad that works for you, but to me it sounds really
          | expensive. At small scale you can do this any way you want,
          | but if you build an observability system with linear cost and
          | a high coefficient, it will become an issue if you run into
          | some success.
        
           | timthelion wrote:
            | The only expensive part is the hardware for the elastic
            | servers. Kafka is cheap to run. We have an on-prem elastic
            | db pulling in tens of thousands of events per second.
            | On-prem servers aren't _that_ expensive. It's really just 6
            | servers with 20tb each and another 40tb for backups. And
            | it's not like you have to store everything forever...
            | Compare that data flow to everyone watching youtube all the
            | time. It's really nothing...
        
             | ricardobeat wrote:
             | I can name a single company in my area that runs their own
             | servers, and they've been in the middle of a migration to
             | the cloud for the past five years.
        
         | hibikir wrote:
         | This works well for a while. But eventually you get big, and
         | have little to no idea of what is in your downstream. Then
         | every single format change in any event you write must be
         | treated like open heart surgery, because tracing your data
         | dependencies is unreliable.
         | 
          | Sometimes it seems fixable by 'just having a list of people
          | listening', and then you look, and all some of them do is
          | mildly transform your data and pass it along. It doesn't take
          | long before people realize that 'just logging some events' is
          | making future promises to other teams you don't know about,
          | and people start being terrified of emitting anything.
         | 
         | This is a story I've seen in at least 4 places in my career.
         | Making data available to other people is not any less scary in
         | kafka than it was back in the days where applications shared a
         | giant database, and you'd see yearlong projects to do some mild
         | changes to a data model, which was originally designed in 5
         | minutes.
         | 
          | As for kafka being easy, it's not quite as hard as some people
          | say, but it's both a pub-sub system and a distributed database.
          | When your clusters get large, it definitely isn't easy.
        
           | meandmycode wrote:
            | I think this is the crux of it: if something works for a
            | while, then actually that's fine. As an industry we
            | over-index and scare new developers towards complexity. The
            | converse is true too: what works at scale doesn't work at
            | non-scale - not because of the tech, but because
            | holistically you're asking for a lot: a lot of knowledge, a
            | lot of complex tech to be deployed by a small team.
        
         | dgellow wrote:
          | Zookeeper in production can really be a pain to maintain...
        
       | jeffbee wrote:
       | The isomorphism of traces and logs is clear. You can flatten a
       | trace to a log and you can perfectly reconstruct the trace graph
       | from such a log. I don't see the unifying theme that brings
       | metrics into this framework, though. Metrics feels fundamentally
       | different, as a way to inspect the internal state of your
       | program, not necessarily driven by exogenous events.
       | 
       | But I definitely agree with the theme of the article that leaving
       | a big company can feel like you got your memory erased in a time
       | machine mishap. Inside a FANG you might become normalized to
       | logging hundreds of thousands of informational statements, per
       | second, per core. You might have got used to every endpoint
       | exposing thirty million metric time series. As soon as you walk
       | out the door some guy will chew you out about "cardinality" if
       | you have 100 metrics.
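        | 
        | The flatten-and-reconstruct claim can be sketched like this
        | (the span_id/parent_id field names are illustrative, not any
        | specific log schema):

```python
# Each log line carries its span id and its parent's id - enough to
# rebuild the trace tree from a flat log.
flat_log = [
    {"span_id": "a", "parent_id": None, "name": "GET /checkout"},
    {"span_id": "b", "parent_id": "a", "name": "auth"},
    {"span_id": "c", "parent_id": "a", "name": "charge_card"},
    {"span_id": "d", "parent_id": "c", "name": "db.insert"},
]

def rebuild(entries):
    """Map each parent span id to the ids of its children."""
    children = {}
    for e in entries:
        children.setdefault(e["parent_id"], []).append(e["span_id"])
    return children

tree = rebuild(flat_log)
# Root spans hang off parent_id=None; "a" has children "b" and "c".
```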
        
         | HeyImAlex wrote:
         | I think all metrics can be reconstructed as "wide events" since
         | they're just a bunch of arbitrary data? Counts, gauges, and
          | histograms at least seem pretty straightforward to me.
         | 
         | It seems like the main motivation for metrics is that sending +
         | storing + querying wide events for everything is cost
         | prohibitive and/or performance intensive. If you can afford it
         | and it works well, wide events is definitely more flexible. A
         | metric is kinda just a pre-aggregation on the event stream.
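          | 
          | For instance (events and field names hypothetical), a counter
          | is just a group-by over the raw event stream, computed up
          | front instead of at query time:

```python
from collections import Counter

# Hypothetical wide events for three requests.
events = [
    {"endpoint": "/api", "status": 200, "latency_ms": 12},
    {"endpoint": "/api", "status": 500, "latency_ms": 40},
    {"endpoint": "/api", "status": 200, "latency_ms": 9},
]

# The "metrics" are pre-aggregations over the raw events.
status_counts = Counter(e["status"] for e in events)  # counter
max_latency = max(e["latency_ms"] for e in events)    # gauge-style
```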
        
         | growse wrote:
         | If you think of a metric as an event representing the act of
         | measuring (along with the result of that measurement), then it
         | becomes the same as any other event.
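          | 
          | Concretely (all names here hypothetical): a gauge read is
          | emitted as just another event recording the act of measuring:

```python
import time

def measure_queue_depth(depth: int) -> dict:
    """Record the act of measurement as a plain event."""
    return {
        "event": "measurement",
        "name": "queue_depth",
        "value": depth,
        "at": time.time(),
    }

evt = measure_queue_depth(17)
```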
        
       | Osmose wrote:
       | This isn't an unknown idea outside of Meta, it's just really
       | expensive, especially if you're using a vendor and not building
       | your own tooling. Prohibitively so, even with sampling.
        
         | ricardobeat wrote:
         | Exactly. When they say
         | 
         | > Unlike with prometheus, however, with Wide Events approach we
         | don't need to worry about cardinality
         | 
         | This is hinting at the hidden reason why not everyone does it.
         | You have to 'worry' about cardinality because Prometheus is
         | pre-aggregating data so you can visualize it fast, and
         | optimizing storage. If you want the same speed on a massive PB-
         | scale data lake, with an infinite amount of unstructured data,
         | and in the cloud instead of your own datacenters, it's gonna
          | cost you _a lot_, and for most companies it is not a sensible
         | expense.
         | 
         | It does work at smaller scale though, we once had an in-house
         | system like this that worked well. Eventually user events were
         | moved to MixPanel, and everything else to Datadog,
         | metrics/logs/traces + a migration to OpenTel. It took months
         | and added 2-digit monthly bills, and in the end debugging or
         | resolving incidents wasn't much improved over having instant
         | access to events and business metrics. Whoever figures out a
         | system that can do "wide events" in a cost-effective way from
         | startup to unicorn scale will absolutely make a killing.
        
         | mlhpdx wrote:
         | I don't know that's true. My last two very-not-meta-sized
         | companies have both had systems that were very cost effective
         | and essentially what the article describes. It's not the
         | simplest thing to put in place, but far from unapproachable.
         | 
          | I think one of the big hills is moving to a culture that values
         | observability (or whatever you choose to call it, I prefer
         | forensic debugging). It's another thing to understand and worry
         | about and it helps tremendously if there are good, highly
         | visible examples of it.
         | 
         | Edit: Typo.
        
           | gtirloni wrote:
           | Could you share some specifics of how it could be approached?
        
       | rekwah wrote:
       | > just put it there, it might be useful later
       | 
       | > Also note that we have never mentioned anything about
       | cardinality. Because it doesn't matter - any field can be of any
       | cardinality. Scuba works with raw events and doesn't pre-
       | aggregate anything, and so cardinality is not an issue.
       | 
       | This is how we end up with very large, very expensive data
       | swamps.
        
         | _visgean wrote:
          | That depends on the sampling rate, no? I would much rather
          | have a rich log record sampled at 1% than more records that
          | don't contain enough info to debug...
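          | 
          | One common shape for this (a sketch; the function and field
          | names are hypothetical): keep an event with probability
          | `rate` and stamp the rate on it, so query-time counts can be
          | scaled back up by 1/rate:

```python
import random

def sample(event: dict, rate: float = 0.01, rng=random.random):
    """Head sampling: keep the full, rich event `rate` of the time."""
    if rng() < rate:
        # Record the rate so aggregations can weight this event by 1/rate.
        return {**event, "sample_rate": rate}
    return None

# Deterministic demo: pin the rng below/above the threshold.
kept = sample({"route": "/pay"}, rate=0.01, rng=lambda: 0.001)
dropped = sample({"route": "/pay"}, rate=0.01, rng=lambda: 0.5)
```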
        
           | growse wrote:
           | The people feeling the pain of (and paying for) the expensive
           | data swamp are often not the same people who are yolo'ing the
           | sample rate to 100% in their apps, because why wouldn't you
           | want to store every event?
           | 
           | Put another way, you're in charge of a large telemetry event
           | sink. How do you incentivise the correct sampling behaviour
           | by your users?
        
       | ojkelly wrote:
       | Observability as a shared concept has followed Agile and DevOps.
       | 
        | Something with a real meaning that enables a step-change in
        | development practices. Adoption is organic initially because
        | the pain it solves is very real.
       | 
       | But as awareness of the idea grows it threatens established
       | institutions and vendors, who must co-opt the concept and
       | redefine it such that they are included.
       | 
       | If they can't be explicitly included (logs, metrics, traces)[0],
        | then they at least make sure the definition becomes so vague
       | and confused that they are not explicitly excluded[1].
       | 
       | Wide events and a good means to query them covers everything, but
       | not if you as a vendor cannot store and query wide events.
       | 
       | [0] as the article notes, one of these is not like the other. [1]
       | Is Scrum Agile? What do you mean a standup can't go for an hour?
       | See also DevOps as a role.
        
       | guhcampos wrote:
       | Incredible what you can do with infinite money!
       | 
       | For everyone else, more specific data structures, sampling and
       | careful consideration of what to record are essential.
        
       ___________________________________________________________________
       (page generated 2024-02-27 23:00 UTC)