[HN Gopher] Logging Sucks
       ___________________________________________________________________
        
       Logging Sucks
        
       Author : FlorinSays
       Score  : 459 points
       Date   : 2025-12-21 18:09 UTC (4 hours ago)
        
 (HTM) web link (loggingsucks.com)
 (TXT) w3m dump (loggingsucks.com)
        
       | firefoxd wrote:
       | Good write up.
       | 
       | Gonna go on a tangent here. Why the single purpose domain?
        | Especially since the author has a blog. My blog is full of links
        | to single-post domains that no longer exist.
        
         | OsrsNeedsf2P wrote:
         | Because it's an ad
        
           | thewisenerd wrote:
           | it's an ad, for what?
           | 
           | i do not see a product upsell anywhere.
           | 
           | if it's an ad for the author themselves, then it's a very
           | good one.
        
             | KomoD wrote:
              | At the end there's a form where you can get a "personalized
              | report". I have a feeling that'll advertise some kind of
              | service; that's usually the case.
        
       | danielfalbo wrote:
       | I see more and more blog posts that contain interactive elements.
        | Despite the general enshittification of the average blog and the
        | internet, this feels like a 'modern' touch that actually adds
        | something valuable to the otherwise sufficient ad-free, no-popups
        | old blog style.
        
       | heinrichhartman wrote:
       | A post on this topic feels incomplete without a shout-out to
       | Charity Majors - she has been preaching this for a decade,
       | branded the term "wide events" and "observability", and built
       | honeycomb.io around this concept.
       | 
        | Also worth pointing out that you can implement this method with a
        | lot of tools these days. Both structured logs and traces lend
        | themselves to capturing wide events. Just make sure to use a tool
        | that supports general query patterns and has rich visualizations
        | (time-series, histograms).
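
As a concrete illustration of the "wide events" idea discussed above, here is a minimal sketch; all field names and values are invented for the example, not taken from any particular tool:

```python
import json
import time


def handle_request(user_id: str) -> dict:
    """Handle a request, accumulating context into one wide event."""
    start = time.monotonic()
    # ... real work would happen here, adding fields as it goes ...
    event = {
        "timestamp": time.time(),
        "service": "checkout",        # illustrative field names
        "user_id": user_id,
        "db_calls": 3,
        "cache_hit": True,
        "status": 200,
    }
    event["duration_ms"] = (time.monotonic() - start) * 1000
    # Emit ONE wide, structured event per request at the end,
    # instead of scattering narrow log lines through the handler.
    print(json.dumps(event))
    return event


handle_request("u-123")
```

The point is that context accumulates in one place and is emitted once per request, which is what makes general query patterns (group by any field) possible downstream.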
        
         | the_mitsuhiko wrote:
         | > A post on this topic feels incomplete without a shout-out to
         | Charity Majors
         | 
         | I concur. In fact, I strongly recommend anyone who has been
         | working with observability tools or in the industry to read her
          | blog, and the back story that led to Honeycomb. They were the
         | first to recognize the value of this type of observability and
         | have been a huge inspiration for many that came after.
        
           | loevborg wrote:
           | I've learned more from Charity about telemetry than from
           | anyone else. Her book is great, as are her talks and blog
           | posts. And Honeycomb, as a tool, is frankly pretty amazing
           | 
           | Yep, I'm a fan.
        
           | dcminter wrote:
           | Could you drop a few specific posts here that you think are
           | good for someone (me) who hasn't read her stuff before? Looks
           | like there's a decade of stuff on her blog and I'm not sure I
           | want to start at the very beginning...
        
             | simonw wrote:
             | A few of my favourites:
             | 
             | - Software Sprawl, The Golden Path, and Scaling Teams With
             | Agency: https://charity.wtf/2018/12/02/software-sprawl-the-
             | golden-pa... - introduces the idea of the "golden path",
             | where you tell engineers at your company that if they use
             | the approved stack of e.g. PostgreSQL + Django + Redis then
             | the ops team will support that for them, but if they want
             | to go off path and use something like MongoDB they can do
             | that but they'll be on the hook for ops themselves.
             | 
             | - Generative AI is not going to build your engineering team
             | for you: https://stackoverflow.blog/2024/12/31/generative-
             | ai-is-not-g... - why generative AI doesn't mean you should
             | stop hiring junior programmers.
             | 
             | - I test in prod: https://increment.com/testing/i-test-in-
             | production/ - on how modern distributed systems WILL have
             | errors that only show up in production, hence why you need
             | to have great instrumentation in place. "No pull request
             | should ever be accepted unless the engineer can answer the
             | question, "How will I know if this breaks?""
             | 
             | - Advice for Engineering Managers Who Want to Climb the
             | Ladder: https://charity.wtf/2022/06/13/advice-for-
             | engineering-manage...
             | 
             | - The Engineer/Manager Pendulum:
             | https://charity.wtf/2017/05/11/the-engineer-manager-
             | pendulum... - I LOVE this one, it's about how it's OK to
             | have a career where you swing back and forth between
             | engineering management and being an "IC".
        
           | Aurornis wrote:
           | > They were the first to recognize the value of this type of
           | observability
           | 
           | With all due respect to her great writing, I think there's a
           | mix of revisionist history blended with PR claims going on in
           | this thread. The blog has some good reading, but let's not
           | get ahead of ourselves in rewriting history around this one
           | person/company.
        
             | the_mitsuhiko wrote:
             | > I think there's a mix of revisionist history blended with
             | PR claims going on in this thread.
             | 
             | I can only speak for myself. I worked for a company that is
             | somewhere in the observability space (Sentry) and Charity
             | was a person I looked up to my entire time working on
             | Sentry. Both for how she ran the company, for the design
              | they picked and for the approaches they took. There might
              | be others that have worked on wide events (after all,
              | Honeycomb is famously inspired by Facebook's Scuba), but
              | she is for sure the voice that made it popular.
        
         | vasco wrote:
         | She has good content but no single person branded the term
         | "observability", what the heck. You can respect someone without
         | making wild claims.
        
         | rjbwork wrote:
          | Nick Blumhardt has been preaching this for a while longer than
          | that, as "structured logging", with Seq and Serilog as the
          | enabling software and library in the .NET ecosystem.
        
           | layer8 wrote:
           | The article emphasizes that their recommendation is different
           | from structured logging.
        
         | fishtoaster wrote:
         | This post was so in-line with her writing that I was _really_
         | expecting it to turn into an ad for Honeycomb at the end. I was
          | pretty surprised when it turned out the author was
          | unaffiliated!
        
         | Aurornis wrote:
         | > she has been preaching this for a decade, branded the term
         | "wide events" and "observability",
         | 
         | With all due respect to her other work, she most certainly did
         | not coin the term "observability". Observability has been a
         | topic in multiple fields for a very long time and has had
         | widespread usage in computing for decades.
         | 
         | I'm sure you meant well by your comment, but I doubt this is a
         | claim she even makes for herself.
         | 
         | She has been an influential writer on the topic and founded a
         | company in this space, but she didn't actually create the
         | concept or terminology of observability.
        
       | alexwennerberg wrote:
       | > Logs were designed for a different era. An era of monoliths,
       | single servers, and problems you could reproduce locally. Today,
       | a single user request might touch 15 services, 3 databases, 2
       | caches, and a message queue. Your logs are still acting like it's
       | 2005.
       | 
       | If a user request is hitting that many things, in my view, that
       | is a deeply broken architecture.
        
         | the_mitsuhiko wrote:
         | > If a user request is hitting that many things, in my view,
         | that is a deeply broken architecture.
         | 
          | Whether we want it or not, a lot of modern software looks like
          | that. I am also not a particular fan of building software this
          | way, but it's a reality we're facing. In part it's because
          | quite a few services that people used to build in-house are now
          | outsourced to PaaS solutions. Even basic things such as
          | authentication are more and more moving to third parties.
        
           | worik wrote:
           | > but it's a reality we're facing.
           | 
           | Yes. Most software is bad
           | 
           | The incentives between managers and technicians are all wrong
           | 
           | Bad software is more profitable, over the time frames
           | managers care about, than good software
        
             | the_mitsuhiko wrote:
              | I don't think the reason we end up with very complex
              | systems is the incentives between "managers and
              | technicians". If I had to put my finger on it, I would say
              | it's the very technicians who argued themselves into a
              | world where increased complexity and more dependencies are
              | seen as a good thing.
             | 
             | Fighting complexity is deeply unpopular.
        
               | 0x3f wrote:
               | At least in my place of work, my non-technical manager is
               | actually on board with my crusade against complex
               | nonsense. Mostly because he agrees it would increase
               | feature velocity to not have to touch 5 services per
               | minor feature. The other engineers love the horrific mess
               | they've built. It's almost like they're roleplaying
               | working at Google and I'm ruining the fun.
        
         | dxdm wrote:
         | > If a user request is hitting that many things, in my view,
         | that is a deeply broken architecture.
         | 
         | Things can add up quickly. I wouldn't be surprised if some
         | requests touch a lot of bases.
         | 
         | Here's an example: a user wants to start renting a bike from
         | your public bike sharing service, using the app on their phone.
         | 
         | This could be an app developed by the bike sharing company
         | itself, or a 3rd party app that bundles mobility options like
         | ride sharing and public transport tickets in one place.
         | 
          | You need to authenticate the request and figure out which
         | customer account is making the request. Is the account allowed
         | to start a ride? They might be blocked. They might need to
         | confirm the rules first. Is this ride part of a group ride, and
         | is the customer allowed to start multiple rides at once? Let's
         | also get a small deposit by putting a hold of a small sum on
         | their credit card. Or are they a reliable customer? Then let's
         | not bother them. Or is there a fraud risk? And do we need to
         | trigger special code paths to work around known problems for
         | payment authorization for cards issued by this bank?
         | 
         | Everything good so far? Then let's start the ride.
         | 
         | First, let's lock in the necessary data. Which rental pricing
         | did the customer agree to? Is that actually available to this
         | customer, this geographical zone, for this bike, at this time,
         | or do we need to abort with an error? Otherwise, let's remember
         | this, so we can calculate the correct rental fee at the end.
         | 
         | We normally charge an unlock fee in addition to the per-minute
         | price. Are we doing that in this case? If yes, does the
         | customer have any free unlock credit that we need to consume or
         | reserve now, so that the app can correctly show unlock costs if
         | the user wants to start another group ride before this one
         | ends?
         | 
         | Ok, let's unlock the bike and turn on the electric motor. We
         | need to make sure it's ready to be used and talk to the IoT box
         | on the bike, taking into account the kind of bike, kind of box
         | and software version. Maybe this is a multistep process,
         | because the particular lock needs manual action by the
         | customer. The IoT box might have to know that we're in a zone
         | where we throttle the max speed more than usual.
         | 
         | Now let's inform some downstream data aggregators that a ride
         | started successfully. BI (business intelligence) will want to
         | know, and the city might also require us to report this to
         | them. The customer was referred by a friend, and this is their
         | first ride, so now the friend gets his referral bonus in the
         | form of app credit.
         | 
          | Did we charge a non-refundable unlock fee? We might want to
         | invoice that already (for whatever reason; otherwise this will
         | happen after the ride). Let's record the revenue, create the
         | invoice data and the PDF, email it, and report this to the
         | country's tax agency, because that's required in the country
         | this ride is starting in.
         | 
         | Or did things go wrong? Is the vehicle broken? Gotta mark it
         | for service to swing by, and let's undo any payment holds. Or
         | did the deposit fail, because the credit card is marked as
         | stolen? Maybe block the customer and see if we have other
         | recent payments using the same card fingerprint that we might
         | want to proactively refund.
         | 
         | That's just off the top of my head, there may be more for a
         | real life case. Some of these may happen synchronously, others
         | may hit a queue or event bus. The point is, they are all tied
         | to a single request.
         | 
         | So, depending on how you cut things, you might need several
         | services that you can deploy and develop independently.
         | 
          | - auth,
          | - core customer management, permissions, ToS agreement,
          | - pricing,
          | - geo zone definitions,
          | - zone rules,
          | - benefit programs,
          | - payments and payment provider integration,
          | - app credits,
          | - fraud handling,
          | - ride management,
          | - vehicle management,
          | - IoT integration,
          | - invoicing,
          | - emails,
          | - BI integration,
          | - city hall integration,
          | - tax authority integration,
          | - and an API gateway that fronts the app request.
         | 
         | These do not have to be separate services, but they are
         | separate enough to warrant it. They wouldn't be exactly micro
         | either.
         | 
         | Not every product will be this complicated, but it's also not
         | that out there, I think.
        
           | 0x3f wrote:
           | > These do not have to be separate services, but they are
           | separate enough to warrant it.
           | 
           | All of this arises from your failure to question this basic
           | assumption though, doesn't it?
        
             | dxdm wrote:
             | > All of this arises from your failure to question this
             | basic assumption though, doesn't it?
             | 
             | Haha, no. "All of this" is a scenario I consider quite
             | realistic in terms of what needs to happen. The question
             | is, how should you split this up, if at all?
             | 
             | Mind that these concerns will be involved in other ways
             | with other requests, serving customers and internal users.
             | There are enough different concerns at different levels of
             | abstraction that you might need different domain experts to
             | develop and maintain them, maybe using different
             | programming languages, depending on who you can get. There
             | will definitely be multiple teams. It may be beneficial to
             | deploy and scale some functions independently; they have
             | different load and availability requirements.
             | 
              | Of course you can slice things differently. Which
              | assumptions have _you_ questioned recently? I think you've
              | been given some material. No need to be rude.
        
               | 0x3f wrote:
               | I don't think I was rude. You're overcomplicating the
               | architecture here for no good reason. It might be common
               | to do so, but that doesn't make it good practice. And
               | ultimately I think it's your job as a professional to
               | question it, which makes not doing so a form of
               | 'failure'. Sorry if that seems harsh; I'm sharing what I
               | believe to be genuine and valuable wisdom.
               | 
               | Happy to discuss why you think this is all necessary.
               | Open to questioning assumptions of my own too, if you
               | have specifics.
               | 
               | As it is, you're just quoting microservices dogma. Your
               | auth service doesn't need a different programming
               | language from your invoicing system. Nor does it need to
               | be scaled independently. Why would it?
        
       | jdpage wrote:
       | Tangential, but I wonder if the given example might be straying a
       | step too far? Normally we want to keep sensitive data out of
       | logs, but the example includes a user.lifetime_value_cents field.
       | I'd want to have a chat with the rest of the business before
       | sticking something like that in logs.
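
One way to have that chat and then enforce its outcome is a deny-list applied before events leave the process. A minimal sketch; the field names, including the `user.lifetime_value_cents` one from the article's example, are illustrative:

```python
import json

# Hypothetical deny-list; which fields count as sensitive is a
# business decision, as the comment above suggests.
SENSITIVE_FIELDS = {"user.lifetime_value_cents", "user.email"}


def redact(event: dict) -> dict:
    """Drop deny-listed keys (dotted names) from a flat wide event."""
    return {k: v for k, v in event.items() if k not in SENSITIVE_FIELDS}


event = {
    "request_id": "abc123",
    "user.id": "u-1",
    "user.lifetime_value_cents": 120_000,
}
print(json.dumps(redact(event)))
```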
        
         | nightpool wrote:
         | In some companies, this type of information is often very
         | important and very easily available to everyone at all levels
         | of the business to help prioritize and understand customer
         | value. I would not consider it "sensitive" in the same way that
         | e.g. PII would be.
        
           | jdpage wrote:
           | Good to know! At previous jobs, that information wasn't
           | available to me (and it didn't matter because the customer
           | bases were small enough that every customer was top
           | priority), so I assumed it was considered more sensitive than
           | it perhaps is.
        
       | zkmon wrote:
       | > Logs were designed for a different era. An era of monoliths,
       | single servers, and problems you could reproduce locally. Today,
       | a single user request might touch 15 services, 3 databases, 2
       | caches, and a message queue. Your logs are still acting like it's
       | 2005.
       | 
       | Logs are fine. The job of local logs is to record the talk of a
       | local process. They are doing this fine. Local logs were never
       | meant to give you a picture of what's going on some other server.
       | For such context, you need a transaction tracing that can stitch
       | the story together across all processes involved.
       | 
       | Usually, looking at the logs at right place should lead you to
       | the root cause.
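
The stitching described here usually works by propagating one correlation ID through every hop and stamping it on every event. A minimal sketch, assuming JSON log lines; in practice the ID would travel between services in a header such as the W3C `traceparent`:

```python
import json
import uuid


def make_log(trace_id: str, service: str, message: str) -> str:
    """Every service logs the same trace_id so a query can stitch
    the request's story together across processes."""
    return json.dumps(
        {"trace_id": trace_id, "service": service, "msg": message}
    )


# The edge service mints the ID; downstream services receive it.
trace_id = uuid.uuid4().hex
print(make_log(trace_id, "gateway", "request received"))
print(make_log(trace_id, "payments", "hold placed"))
```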
        
         | holoduke wrote:
          | APN/Kibana. All I need for inspecting logs.
        
           | devmor wrote:
           | Shoutout to Kibana. Absolutely my favorite UI tool for trying
           | to figure out what went wrong (and sometimes, IF anything
           | went wrong in the first place)
        
         | venturecruelty wrote:
         | >Today, a single user request might touch 15 services, 3
         | databases, 2 caches, and a message queue.
         | 
         | Not if I have anything to say about it.
         | 
         | >Your logs are still acting like it's 2005.
         | 
         | Yeah, because that's just before software development went
         | absolutely insane.
        
         | otterley wrote:
         | One of the points the author is trying to make (although he
         | doesn't make it well, and his attitude makes it hard to read)
         | is that logs aren't just for root-causing incidents.
         | 
         | When properly seasoned with context, logs give you useful
         | information like who is impacted (not every incident impacts
         | every customer the same way), correlations between component
         | performance and inputs, and so forth. When connected to
         | analytical engines, logs with rich context can help you figure
         | out things like behaviors that lead to abandonment, the impact
         | of security vulnerability exploits, and much more. And in their
         | never-ending quest to improve their offerings and make more
         | money, product managers love being able to test their theories
         | against real data.
        
           | ivan_gammel wrote:
            | It's a wild violation of SRP to suggest that. Separating
            | concerns is way more efficient. A database can handle an
            | audit trail and some key metrics much better, with no special
            | tools needed, and you can join the transaction log with
            | domain tables as a bonus.
        
             | otterley wrote:
             | Are you assuming they're all stored identically? If so,
             | that's not necessarily the case.
             | 
             | Once the logs have entered the ingestion endpoint, they can
             | take the most optimal path for their use case. Metrics can
             | be extracted and sent off to a time-series metric database,
             | while logs can be multiplexed to different destinations,
             | including stored raw in cheap archival storage, or matched
             | to schemas, indexed, stored in purpose-built search engines
             | like OpenSearch, and stored "cooked" in Apache
             | Iceberg+Parquet tables for rapid querying with Spark,
             | Trino, or other analytical engines.
             | 
             | Have you ever taken, say, VPC flow logs, saved them in
             | Parquet format, and queried them with DuckDB? I just
             | experimented with this the other day and it was mind-
             | blowingly awesome--and _fast_. I, for one, am glad the days
             | of writing parsers and report generators myself are over.
        
               | ivan_gammel wrote:
               | Good joke.
        
       | ohans wrote:
       | This was a brilliant write up, and loved the interactivity.
       | 
       | I do think "logs are broken" is a bit overstated. The real
       | problem is unstructured events + weak conventions + poor
       | correlation.
       | 
       | Brilliant write up regardless
        
       | the__alchemist wrote:
       | From what I gather: This is referring to Web sites or other HTTP
       | applications which are internally implemented as a collection of
       | separate applications/ micro-services?
        
       | cowsandmilk wrote:
       | Horrid advice at the end about logging every error, exception,
       | slow request, etc if you are sampling healthy requests.
       | 
       | Taking slow requests as an example, a dependency gets slower and
       | now your log volume suddenly goes up 100x. Can your service
       | handle that? Are you causing a cascading outage due to increased
       | log volumes?
       | 
       | Recovery is easier if your service is doing the same or less work
       | in a degraded state. Increasing logging by 20-100x when degraded
       | is not that.
        
         | otterley wrote:
         | It's an important architectural requirement for a production
         | service to be able to scale out their log ingestion
         | capabilities to meet demand.
         | 
         | Besides, a little local on-disk buffering goes a long way, and
         | is cheap to boot. It's an antipattern to flush logs directly
         | over the network.
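
A sketch of that local buffering using only Python's standard library `logging.handlers.MemoryHandler`; the file target here stands in for a local spool that a separate shipper process would forward:

```python
import logging
import logging.handlers
import tempfile

# Stand-in for a local spool file that a log shipper would tail.
logpath = tempfile.mkstemp(suffix=".log")[1]
target = logging.FileHandler(logpath)

# Buffer records in memory and flush in batches: the target only
# sees writes when the buffer fills or an ERROR-level record arrives.
buffered = logging.handlers.MemoryHandler(
    capacity=1000,                 # flush after 1000 records...
    flushLevel=logging.ERROR,      # ...or immediately on ERROR
    target=target,
)

log = logging.getLogger("app")
log.addHandler(buffered)
log.setLevel(logging.INFO)

log.info("buffered until the batch fills")
log.error("errors flush the whole buffer right away")
buffered.close()  # flushes any remainder on shutdown
```

Batching like this keeps the hot path cheap while still getting errors out promptly; the network hop happens out-of-band from the spool.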
        
         | trevor-e wrote:
          | Yea, that was my thought too. I like the idea in principle,
          | but these magic thresholds can really bite you. It claims to be
          | a p99, probably based on some historical measurement, but
          | that's only accurate if it's updated dynamically. Maybe this
          | could periodically query the OTel provider for the real number
          | to at least limit the window in which something bad can happen.
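
One way to make the threshold track reality is a sliding window of recent latencies; the window size and percentile below are arbitrary choices for illustration:

```python
from collections import deque


class RollingP99:
    """Track recent request latencies and expose a moving p99 threshold,
    so 'slow' is defined by current behavior, not a stale constant."""

    def __init__(self, window: int = 1000):
        self.samples = deque(maxlen=window)  # oldest samples age out

    def observe(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p99(self) -> float:
        if not self.samples:
            return float("inf")  # nothing observed yet, nothing is "slow"
        ordered = sorted(self.samples)
        # index of the 99th percentile within the sorted window
        idx = min(len(ordered) - 1, int(0.99 * len(ordered)))
        return ordered[idx]


tracker = RollingP99()
for ms in range(1, 101):          # observe latencies of 1..100 ms
    tracker.observe(float(ms))
print(tracker.p99())              # threshold derived from recent data
```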
        
         | debazel wrote:
         | My impression was that you would apply this filter after the
          | logs have reached your log destination, so there should be no
         | difference for your services unless you host your own log
         | infra, in which case there might be issues on that side. At
         | least that's how we do it with Datadog because ingestion is
         | cheap but indexing and storing logs long term is the expensive
         | part.
        
         | Veserv wrote:
         | I do not see how logging could bottleneck you in a degraded
         | state unless your logging is terribly inefficient. A properly
         | designed logging system can record on the order of 100 million
         | logs per second per core.
         | 
         | Are you actually contemplating handling 10 million requests per
         | second per core that are failing?
        
           | otterley wrote:
           | Generation and publication is just the beginning (never mind
           | the fact that resources consumed by an application to log
           | something are no longer available to do real work). You have
           | to consider the scalability of each component in the logging
           | architecture from end to end. There's ingestion, parsing,
           | transformation, aggregation, derivation, indexing, and
           | storage. Each one of those needs to scale to meet demand.
        
             | Veserv wrote:
             | I already accounted for consumed resources when I said 10
             | million instead of 100 million. I allocated 10% to logging
             | overhead. If your service is within 10% of overload you are
             | already in for a bad time. And frankly, what systems are
             | you using that are handling 10 million requests per second
             | per core (100 nanoseconds per request)? Hell, what services
             | are you deploying that you even have 10 million requests
             | per second per core to handle?
             | 
             | All of those other costs are, again, trivial with proper
             | design. You can easily handle billions of events per second
             | on the backend with even a modest server. This is done
             | regularly by time traveling debuggers which actually need
             | to handle these data rates. So again, what are we even
             | deploying that has billions of events per second?
        
               | otterley wrote:
               | In my experience working at AWS and with customers, you
               | don't need billions of TPS to make an end-to-end logging
               | infrastructure keel over. It takes much less than that.
               | As a working example, you can host your own end-to-end
               | infra (the LGTM stack is pretty easy to deploy in a
               | Kubernetes cluster) and see what it takes to bring yours
               | to a grind with a given set of resources and TPS/volume.
        
               | Veserv wrote:
               | I prefaced all my statements with the assumption that the
               | chosen logging system is not poorly designed and terribly
               | inefficient. Sounds like their logging solutions are
               | poorly designed and terribly inefficient then.
               | 
               | It is, in fact, a self-fulfilling prophecy to complain
               | that logging can be a bottleneck if you then choose
               | logging that is 100-1000x slower than it should be. What
               | a concept.
        
               | otterley wrote:
               | At the end of the day, it comes down to what sort of
               | functionality you want out of your observability. Modest
               | needs usually require modest resources: sure, you could
               | just append to log files on your application hosts and
               | ship them to a central aggregator where they're stored
               | as-is. That's cheap and fast, but you won't get a lot of
               | functionality out of it. If you want more, like real-time
               | indexing, transformation, analytics, alerting, etc., it
               | requires more resources. Ain't no such thing as a free
               | lunch.
        
               | dpark wrote:
               | Surely you aren't doing real time indexing,
               | transformation, analytics, etc in the same service that
               | is producing the logs.
               | 
               | A catastrophic increase in logging could certainly take
               | down your log processing pipeline but it should not
               | create cascading failures that compromise your service.
        
               | otterley wrote:
               | Of course not. Worst case should be backpressure, which
               | means processing, indexing, and storage delays. Your
               | service might be fine but your visibility will be
               | reduced.
        
               | dpark wrote:
                | For sure. You can definitely tip over your logging
               | pipeline and impact visibility.
               | 
               | I just wanted to make sure we weren't still talking about
               | "causing a cascading outage due to increased log volumes"
               | as was mentioned above, which would indicate a
               | significant architectural issue.
        
         | Cort3z wrote:
          | Just implement exponential backoff for slow-request logging,
         | or some other heuristic, to control it. I definitely agree it
         | is a concern though.
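
A sketch of that backoff idea: emit only the 1st, 2nd, 4th, 8th, ... occurrence of a noisy condition, so a sudden flood grows the log only logarithmically. The doubling policy is one arbitrary heuristic among many:

```python
class BackoffLogger:
    """Decide whether to emit the Nth occurrence of a repeating event,
    doubling the gap between emissions each time."""

    def __init__(self):
        self.count = 0
        self.next_emit = 1

    def should_log(self) -> bool:
        self.count += 1
        if self.count >= self.next_emit:
            self.next_emit *= 2   # back off: wait twice as long next time
            return True
        return False


logger = BackoffLogger()
emitted = []
for i in range(1, 1001):          # simulate 1000 slow requests
    if logger.should_log():
        emitted.append(i)
print(emitted)  # occurrences 1, 2, 4, 8, ..., 512 get logged
```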
        
         | XCSme wrote:
         | Good point. It also reminded me of when I was trying to
          | optimize my app for some scenarios, then I realized it's better
          | to optimize it for ALL scenarios, so it works fast and the
          | servers can handle it no matter what. To be more specific, I
          | decided NOT to cache any common queries, but instead to make
          | sure that all queries are as fast as possible.
        
         | golem14 wrote:
         | For high volume services, you can still log a sample of healthy
         | requests, e.g., trace_id mod 100 == 0. That keeps log growth
         | under control. The higher the volume, the smaller percentage
         | you can use.
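
A sketch of that sampling rule, with an "always keep errors" escape hatch bolted on. The hash function and rate are arbitrary; any stable hash of the trace ID works:

```python
import zlib

SAMPLE_ONE_IN = 100  # tune per service volume


def keep(trace_id: str, is_error: bool) -> bool:
    """Always keep errors; keep a deterministic 1-in-N slice of the rest.
    Hashing the trace_id means every service on the request path makes
    the same keep/drop decision without coordinating."""
    if is_error:
        return True
    return zlib.crc32(trace_id.encode()) % SAMPLE_ONE_IN == 0


# Over many healthy requests, roughly 1 in SAMPLE_ONE_IN survives.
kept = sum(keep(f"trace-{i}", False) for i in range(10_000))
print(kept)
```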
        
       | kgklxksnrb wrote:
       | Logfiles are a user interface.
        
       | otterley wrote:
       | The substance of this post is outstanding.
       | 
       | The framing is not, though. Why does it have to sound so dramatic
       | and provocative? It's insulting to its audience. Grumpiness, in
       | the long term, is a career-limiting attitude.
        
         | b0ringdeveloper wrote:
         | I get the AI feeling from it.
        
           | otterley wrote:
           | It might have been AI-assisted, and it might not have been.
           | It doesn't really matter. The author is ultimately
           | responsible for the end result.
        
         | rglover wrote:
         | Career-limiting perhaps (if expressing normal human emotion is
         | a minus inside of an organization, it may be time to bail) but
         | some of the best minds I've met/observed were absolute
         | _curmudgeons_ (with purpose--they were properly bothered by a
         | problem and refused to go along with the  "sweep it under the
         | rug" behavior).
         | 
         | Sure, I've dealt with plenty of assholes, too, but the grumps
         | are usually just tired of their valid insight being ignored by
         | more foolish, orthogonally incentivized types (read: "playing
         | the game" not "making it work well").
        
           | otterley wrote:
           | We've all tolerated the grumpy genius at some point in our
           | careers. Nevertheless, most of us would prefer to work with a
           | person who's both smart and kind over someone who's smart and
           | curmudgeonly. It is possible to be both smart and kind, and
           | I've had the pleasure of working with such people.
           | 
           | Assholes can sap an organization's strength faster than any
           | productive value their intelligence can provide. I'm not
           | suggesting the author is an asshole, though; there's not
           | enough evidence from this post.
        
       | jupin wrote:
       | Some excellent points raised in this article.
        
       | charcircuit wrote:
       | This article is attacking a strawman. It makes up terrible logs
       | and then says they are bad. Even if this was a single monolith
       | the logs still don't include even something like a thread id, to
       | avoid mixing different requests together.
        
         | blinded wrote:
         | I see logs worse than that on the daily.
        
       | dcminter wrote:
       | I've generally found that structured logs that include a
       | correlation ID make it quite easy to narrow down the general area
       | or exact cause of problems. Usually (in enterprise orgs) via
       | Splunk or Datadog.
       | 
       | Where I've had problems it's usually been one of:
       | 
       | There wasn't anything logged in the error block. A comment saying
       | "never happens" is often discovered later :)
       | 
       | Too much was logged and someone mandated dialing the logging down
       | to save costs. Sigh.
       | 
       | A new thread was started and the thread-local details including
       | the correlation ID got lost, then the error occurred downstream
       | of that. I'd like better solutions for that one.
       | 
       | Edit: Incidentally a correlation ID is not (necessarily) the same
       | thing as a request ID. An API often needs to allow for the caller
       | making multiple calls to achieve an objective; 5 request IDs
       | might be tied to a single correlation ID.
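For the lost-thread-local problem, one workaround (a Python sketch; the equivalent idea exists in other stacks) is to carry the correlation ID in a context that is explicitly copied into worker threads:

```python
import contextvars
import threading

# The correlation ID travels with the context, not the thread.
correlation_id = contextvars.ContextVar("correlation_id", default=None)

def log(msg: str) -> str:
    # Stand-in for a real structured logger.
    return f'{{"correlation_id": "{correlation_id.get()}", "msg": "{msg}"}}'

def start_worker(target, *args) -> threading.Thread:
    """Start a thread that inherits the caller's context.

    A bare threading.Thread starts with a fresh context, which is
    exactly how the correlation ID gets lost downstream.
    """
    ctx = contextvars.copy_context()
    t = threading.Thread(target=lambda: ctx.run(target, *args))
    t.start()
    return t
```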
        
         | loglog wrote:
         | Java has a solution for the thread problem: Scoped Values [0].
         | If only the logging+tracing libraries would start using it...
         | 
         | [0] https://openjdk.org/jeps/506
        
           | dcminter wrote:
           | Oh, excellent, these slipped under my radar. Sounds extremely
           | promising and I do mostly work in Java!
        
       | Spivak wrote:
       | Slapping on OpenTelemetry actually will solve your problem.
       | 
       | Point #1 isn't true, auto instrumentation exists and is really
       | good. When I integrate OTel I add my own auto instrumentors
       | wherever possible to automatically add lots of context. Which
       | gets into point #2.
       | 
       | Point #2 also isn't true. It can add business context in a
       | hierarchical manner and ship wide events. You shouldn't have to
       | tell every span all the information again. Just where it appears
       | naturally the first time.
       | 
       | Point #3 also also isn't true because OTel libs make it really
       | annoying to just write a log message and very strongly push you
       | into a hierarchy of nested context managers.
       | 
       | Like the author's ideal setup is basically using OTel with
       | Honeycomb. You get the querying and everything. And unlike
       | rawdogging wide events all your traces are connected, can _span_
       | multiple services and do timing for you.
        
       | yujzgzc wrote:
       | You might also need different systems for low-cardinality, low-
       | latency production monitoring (where you want to throw alerts
       | quickly and high cardinality fields would just get in the way),
       | and medium to long term logging with wide events.
       | 
       | Also if you're going to log wide events, for the sake of the
       | person querying them after you, please don't let your schema be
       | an ad hoc JSON dict of dicts, put some thought into the schema
       | structure (and better have a logging system that enforces the
       | schema).
        
       | m3047 wrote:
       | I agree with this statement: "Instead of logging what your code
       | is doing, log what happened to this request." but the impression
       | I can't shake is that this person lacks experience, or more
       | likely has a lot of experience doing the same thing over and
       | over.
       | 
       | "Bug parts" (as in "acceptable number of bug parts per candy
       | bar") logging should include the precursors of processing
       | metrics. I think what he calls "wide events" I call bug parts
       | logging in order to emphasize that it _also_ may include signals
       | pertaining to which code paths were taken, how many times, and
       | how long it took.
       | 
       | Logging is not metrics is not auditing. In particular processing
       | can continue if logging (temporarily) fails but not if auditing
       | has failed. I prefer the terminology "observables" to "logging"
       | and "evaluatives" to "metrics".
       | 
       | In mature SCADA systems there is the well-worn notion of a
       | "historian". Read up on it.
       | 
       | A fluid level sensor on CANbus sending events 10x a second isn't
       | telling me whether or not I have enough fuel to get to my
       | destination (a significant question); however, that granularity
       | might be helpful for diagnosing a stuck sensor (or bad
       | connection). It would be impossibly fatiguing and hopelessly
       | distracting to try to answer the significant question from this
       | firehose of low-information events. Even a de-noised fuel gauge
       | doesn't directly diagnose my desired evaluative (will I get there
       | or not?).
       | 
       | Does my fuel gauge need to also serve as the debugging interface
       | for the sensor? No, it does not. Likewise, send metrics /
       | evaluatives to the cloud not logging / observables; when
       | something goes sideways the real work is getting off your ass and
       | taking a look. Take the time to think about what that looks like:
       | maybe that's the best takeaway.
        
         | otterley wrote:
         | > Logging is not metrics is not auditing.
         | 
         | I espouse a "grand theory of observability" that, like matter
         | and energy, treats logs, metrics, and audits alike. At the end
         | of the day, they're streams of bits, and so long as no fidelity
         | is lost, they can be converted between each other. Audit trails
         | are certainly carried over logs. Metrics are streams of time-
         | series numeric data; they can be carried over log channels or
         | embedded inside logs (as they often are).
         | 
         | How these signals are stored, transformed, queried, and
         | presented may differ, but at the end of the day, the
         | consumption endpoint and mechanism can be the same regardless
         | of origin. Doing so simplifies both the conceptual framework
         | and design of the processing system, and makes it flexible
         | enough to suit any conceivable set of use cases. Plus, storing
         | the ingested logs as-is in inexpensive long-term archival
         | storage allows you to reprocess them later however you like.
        
           | Veserv wrote:
           | Saying they are all the same when no fidelity is lost is
           | missing the point. The _only_ distinction between logs,
           | traces, and metrics is literally what to do when fidelity is
           | lost.
           | 
           | If you have insufficient ingestion rate:
           | 
           | Logs are for events that can be independently sampled and be
           | coherent. You can drop arbitrary logs to stay within
           | ingestion rate.
           | 
           | Traces are for correlated sequences of events where the
           | entire sequence needs to be retained to be useful/coherent.
           | You can drop arbitrary whole sequences to stay within
           | ingestion rate.
           | 
           | Metrics are pre-aggregated collections of events. You pre-
           | limited your emission rate to fit your ingestion rate at the
           | cost of upfront loss of fidelity.
           | 
           | If you have adequate ingestion rate, then you just emit your
           | events bare and post-process/visualize your events however
           | you want.
        
             | otterley wrote:
             | > If you have insufficient ingestion rate
             | 
             | I would rather fix this problem than every other problem.
             | If I'm seeing backpressure, I'd prefer to buffer locally on
             | disk until the ingestion system can get caught up. If I
             | need to prioritize signal delivery once the backpressure
             | has resolved itself, I can do that locally as well by
             | separating streams (i.e. priority queueing). It doesn't
             | change the fundamental nature of the system, though.
        
           | lll-o-lll wrote:
           | Auditing is fundamentally different because it has different
           | durability and consistency requirements. I can buffer my
           | logs, but I might need to transact my audit.
        
             | otterley wrote:
             | For most cases, buffering audit logs on local storage is
             | fine. What matters is that the data is available and
             | durable _somewhere_ in the path, not that it be
             | transactionally durable at the final endpoint.
        
             | chickensong wrote:
             | You could have the log shipper filter events and create a
             | separate audit stream with different behavior and
             | destination.
        
               | cluckindan wrote:
               | Really, have sane log message types and include "audit"
               | as one of them.
               | 
               | Log levels could be considered an anti-pattern.
        
       | mrkeen wrote:
       | > Your logs are lying to you. Not maliciously. They're just not
       | equipped to tell the truth.
       | 
       | The best way to equip logs to tell the truth is to have other
       | parts of the system consume them as their source of truth.
       | 
       | Firstly: "what the system does" and "what the logs say" can't be
       | two different things.
       | 
       | Secondly: developers can't put less info into the logs than they
       | should, because their feature simply won't work without it.
        
         | 8n4vidtmkvmk wrote:
         | That doesn't sound like a good plan. You're coupling logging
         | with business logic. I don't want to have to think about
         | whether changing a debug string will break something.
        
           | andoando wrote:
           | Your logic wouldn't be dependent on a debug string, but some
           | enum in a structured field. Ex, event_type:
           | CREATED_TRANSACTION.
           | 
           | Seeing logging as debugging is flawed imo. A log is
           | technically just a record of what happened in your database.
        
           | SoftTalker wrote:
           | You're also assuming your log infrastructure is a lot more
           | durable than most are. Generally, logging is not a guaranteed
           | action. Writing a log message is not normally something where
           | you wait for a disk sync before proceeding. Dropping a log
           | message here or there is not a fatal error. Logs get rotated
           | and deleted automatically. They are designed for retroactive
           | use and best effort event recording, not assumed to be a
           | flawless record of everything the system did.
        
       | tetha wrote:
       | One thing this is missing: standardization, and probably ECS's
       | idea of "related" fields.
       | 
       | A common problem in a log aggregation is the question if you
       | query for user.id, user_id, userID, buyer.user.id, buyer.id,
       | buyer_user_id, buyer_id, ... Every log aggregation ends up being
       | plagued by this. You need standard field names there, or it
       | becomes a horrible mess.
       | 
       | And for centralized aggregation, I like ECS's idea of "related".
       | If you have a buyer and a seller, both with user IDs, you'd have
       | a `related.user.id` with both IDs in there. This makes it very
       | simple to say "hey, give me everything related to request X" or
       | "give me everything involving user Y in this time frame" (as long
       | as this is kept up to date, naturally)
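As a sketch in Python (the `related.user.id` convention is ECS's; the surrounding event and field names here are made up):

```python
import json

def purchase_event(buyer_id: str, seller_id: str) -> str:
    """Build a wide event where both parties also appear under a
    shared related.user.id, so one query finds either role."""
    return json.dumps({
        "event": {"action": "purchase-completed"},
        "buyer": {"user": {"id": buyer_id}},
        "seller": {"user": {"id": seller_id}},
        # One field to answer "everything involving user Y":
        "related": {"user": {"id": sorted({buyer_id, seller_id})}},
    })
```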
        
         | ttoinou wrote:
         | I always wondered why we don't have some kind of fuzzy English
         | word-search regex/tool that is robust to typing mistakes,
         | spelling mistakes, synonyms, plurals, conjugation, etc.
        
         | j-pb wrote:
         | I actually wrote my bachelor's thesis on this topic, but instead of
         | going the ECS route (which still has redundant fields in
         | different components) I went in the RDF direction. That system
         | has shifted towards more of a middleware/database hybrid over
         | time (https://github.com/triblespace/triblespace-rs). I always
         | wonder if we'd actually need logging if we had more data-
         | oriented stacks where the logs fall out as a natural byproduct
         | of communication and storage.
        
       | thevinter wrote:
       | The presentation is fantastic and I loved the interactive
       | examples!
       | 
       | Too bad that all of this effort is spent arguing something which
       | can be summarised as "add structured tags to your logs"
       | 
       | Generally speaking my biggest gripe with wide logs (and other
       | "innovative" solutions to logging) is that whatever perceived
       | benefit you argue for doesn't justify the increased complexity
       | and loss of readability.
       | 
       | We're throwing away `grep "uid=user-123" application.log` to get
       | what? The shipping method of the user attached to every log?
       | Doesn't feel like an improvement to me...
       | 
       | P.S. The checkboxes in the wide event builder don't work for me
       | (brave - android)
        
         | dannyfreeman wrote:
         | Do you really lose the ability to grep? You can still search
         | for json fragments `grep '"uid": "user-123"' application.log`
         | 
         | If the JSON logged isn't pretty-printed, everything should
         | still be on one line. You can also grep with the `--context`
         | flag to get more surrounding lines.
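And when grep on raw JSON does get brittle (key ordering, whitespace, the value appearing inside some other field), a few lines of Python restore an exact-match filter; the field names here are just the article's example:

```python
import json

def filter_logs(lines, field, value):
    """Yield JSON log lines where `field` equals `value` exactly.

    Unlike grep'ing for '"uid": "user-123"', this is immune to key
    order, whitespace, and accidental matches inside other fields.
    """
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as stack traces
        if isinstance(event, dict) and event.get(field) == value:
            yield line
```

Usage would look like `filter_logs(open("application.log"), "uid", "user-123")`.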
        
       | bambax wrote:
       | > _Logging Sucks_
       | 
       | But does it? Or is it bad logging, or excessive logging, or
       | unsearchable logs?
       | 
       | A client of mine uses SnapLogic, which is a middleware / ETL
       | that's supposed to run pipelines in batch mode to pass data
       | around between systems. It generates an enormous amount of logs
       | that are so difficult to access, search, and read that they may
       | as well not exist.
       | 
       | We're replacing all of that with simple Python scripts that do
       | the same thing and generate normal simple logs with simple errors
       | when something's truly wrong or the data is in the wrong format.
       | 
       | Terse logging is what you want, not an exhaustive (and
       | exhausting) torrent of irrelevant information.
        
       | asdev wrote:
       | this is the best lead generation form i've ever seen
        
       | roncesvalles wrote:
       | AI slop blogvert. The first example is disingenuous btw. Everyone
       | these days uses request IDs to be able to query all log lines
       | emitted by a single request, usually set by the first backend
       | service to receive the request and then propagated using headers
       | (and also set in the server response).
       | 
       | There isn't anything radical about his proposed solutions either.
       | Most log storage can be set with a rule where all warning logs or
       | above can be retained, but only a sample of info and debug logs.
       | 
       | The "key insight" is also flawed. The reason why we log at every
       | step is because sometimes your request never completes and it
       | could be for 1000 reasons but you really need to know how far it
       | got in your system. Logging only a summary at the end is happy
       | path thinking.
        
       | mnahkies wrote:
       | That was difficult to read; it smelt very AI-assisted. The
       | message was worthwhile, though it could've been shorter and more
       | to the point.
       | 
       | A few things I've been thinking about recently:
       | 
       | - we have authentication everywhere in our stack, so I've started
       | including the user id on every log line. This makes getting a
       | holistic view of what a user experienced much easier.
       | 
       | - logging an error as a separate log line to the request log is a
       | pain. You can filter for the trace, but it makes it hard to
       | surface "show me all the logs for 5xx requests and the error
       | associated" - it's doable, but it's more difficult than filtering
       | on the status code of the request log
       | 
       | - it's not enough to just start including that context, you have
       | to educate your coworkers that it's now present. I've seen people
       | making life hard for themselves because they didn't realize we'd
       | added this context
        
         | spike021 wrote:
         | If your codebase has the concept of a request ID, you could
         | also feasibly use that to trace what a user has been doing with
         | more specificity.
        
           | mnahkies wrote:
           | We do have both a span id and trace id - but I personally
           | find this more cumbersome than filtering on a user id. YMMV;
           | if you're interested in a single trace, then you'd filter for
           | that, but I find you often also care what happened "around" a
           | trace.
        
           | ivan_gammel wrote:
           | ...and the same ID can be displayed to the user on HTTP 500
           | with the support contact, making everyone's life much easier.
        
             | dexwiz wrote:
             | I have seen pushback on this kind of behavior because
             | "users don't like error codes" or other such nonsense. UX
             | and Product like to pretend nothing will ever break, and
             | when it does they want some funny little image, not useful
             | output.
             | 
             | A good compromise is to log whenever a user would see the
             | error code, and treat those events with very high priority.
        
               | spockz wrote:
               | We put the error code behind a kind of message/dialog
               | that invites the user to contact us if the problem
               | persists and then report that code.
               | 
               | It's my long standing wish to be able to link
               | traces/errors automatically to callers when they call the
               | helpdesk. We have all the required information. It's just
               | that the helpdesk has actually very little use for this
               | level of detail. So they can only attach it to the ticket
               | so that actual application teams don't have to search for
               | it.
        
               | ivan_gammel wrote:
               | Nah, that's an easy problem to solve with UX copy:
               | "Something went wrong. Try again or contact support.
               | Your support request number is XXXX XXXX" (base 58
               | version of UUID).
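A sketch of that encoding in Python, using the Bitcoin Base58 alphabet (which omits the look-alikes 0/O/I/l); the 4-character grouping is just for readability:

```python
import uuid

# Base58 alphabet: no 0, O, I, or l, so codes survive being read aloud.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def support_code(u: uuid.UUID) -> str:
    """Encode a UUID as Base58 and group it into 4-character chunks."""
    n = int.from_bytes(u.bytes, "big")
    digits = []
    while n:
        n, rem = divmod(n, 58)
        digits.append(ALPHABET[rem])
    code = "".join(reversed(digits)) or ALPHABET[0]
    return " ".join(code[i:i + 4] for i in range(0, len(code), 4))
```

A full 128-bit UUID comes out around 21-22 Base58 characters, so for a support code you might encode only part of the UUID to keep it shorter.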
        
           | nine_k wrote:
           | ...if it does not, you should add it. A request ID, trace ID,
           | correlation key, whatever you call it, you should thread it
           | through every remote call, if you value your sanity.
        
           | kulahan wrote:
           | If you care about this more than anything else (e.g. if you
           | care about audits a LOT and need them perfect), you can
           | simply code the app via action paths, rather than for
           | modularity. It makes changes harder down the road, but for
           | codebases that don't change much, this can be a viable
           | tradeoff to significantly improve tracing and logging.
        
         | xmprt wrote:
         | On the other hand, investing in better tracing tools unlocks a
         | whole nother level of logging and debugging capabilities that
         | aren't feasible with just request logs. It's kind of like you
         | mentioned with using the user id as a "trace" in your first
         | message but on steroids.
        
           | dexwiz wrote:
           | These tools tend to be very expensive in my experience unless
           | you are running your own monitoring cloud. Either you end up
           | sampling traces at low rates to save on costs, or your
           | observability bill is more than your infrastructure bill.
        
             | dietr1ch wrote:
             | Doing stuff like turning on tracing for clients that saw
             | errors in the last 2 minutes, or for requests that were
             | retried should only gather a small portion of your data.
             | Maybe you can include other sessions/requests at random if
             | you want to have a baseline to compare against.
        
             | jonasdegendt wrote:
             | We self host Grafana Tempo and whilst the cost isn't
             | negligible (at 50k spans per second), the money saved in
             | developer time when debugging an error, compared to having
             | to sift through and connect logs, is easily an order of
             | magnitude higher.
        
         | khazhoux wrote:
         | > That was difficult to read, smelt very AI assisted though the
         | message was worthwhile...
         | 
         | It won't be long before _ad computem_ comments like this are
         | frowned upon.
        
           | bccdee wrote:
           | Why? "This was written badly" is a perfectly normal thing to
           | say; "this was written badly because you didn't put in the
           | effort of writing it yourself" doubly so.
        
             | 0xbadcafebee wrote:
             | Say they used AI to write it, it came out bad, and they
             | published it anyway. They had the opportunity to "make it
             | better" before publishing, but didn't. The only conclusion
             | for this is, they just aren't good at writing. So whether
             | AI is used or not, it'll suck either way. So there's no
             | need to complain about the AI.
             | 
             | It's like complaining that somebody typed a crappy letter
             | rather than hand-wrote it. Either way the letter's gonna
             | suck, so why complain that it was typed?
        
           | alwa wrote:
           | I read it as a more-or-less kind comment: "even though you'll
           | notice that they let an AI make the writing terrible, the
           | underlying point is good enough to be worth struggling
           | through that and discussing"
        
         | giancarlostoro wrote:
         | TIDs are good here too. If you generate one and enforce it
         | across all your services spanning various teams and APIs,
         | anyone on any team can grab a TID you provide and get the full
         | end-to-end of one transaction.
        
       | bob1029 wrote:
       | > Logs were designed for a different era. An era of monoliths,
       | single servers, and problems you could reproduce locally.
       | 
       | I worked with enterprise message bus loggers in a semiconductor
       | manufacturing context wherein we had _thousands_ of participants
       | on the message bus. It generated something like 300-400 megabytes
       | per hour. Despite the insane volume we made this work really well
       | using just grep and other basic CLI tools.
       | 
       | The logs were mere time series of events. Figuring out the detail
       | about specific events (e.g. a list of all the tools a lot
       | visited) required writing queries into the Oracle monster. You
       | _could_ derive history from the event logs if you had enough
       | patience & disk space, but that would have been very silly given
       | the alternative option. We used them predominantly to establish a
       | causal chain between events while the details were still
       | preliminary. Identifying suspects and such. Actually resolving
       | really complicated business usually requires more than a
       | perfectly detailed log file.
        
         | iLoveOncall wrote:
         | > It generated something like 300-400 megabytes per hour.
         | Despite the insane volume we made this work really well using
         | just grep and other basic CLI tools.
         | 
         | 400MB of logs an hour is nothing at all, that's why a naive
         | grep can work. You don't even need to rotate your log files
         | frequently in this situation.
        
         | fsniper wrote:
         | At last, a sane person. Logs are for identifying the event
         | timeline, not for capturing the whole request/response data.
         | Putting every detail into the logs - in my experience - makes
         | understanding issues harder. Logs tell a story: when, and what
         | happened, not how or why it happened. Why is in the code; how
         | is in the combination of data, logs, events, and code.
         | 
         | And loosely related, I also dislike log interfaces like the ELK
         | stack. They make following the trail of events really hard.
         | Most of the time you do not know what you are looking for, just
         | a vague sense of why you are looking into the logs. So a line
         | that passed 3 microseconds ago may be your eureka moment, one
         | that no search could find; only intuition and following the
         | logs diligently can.
        
       | eterm wrote:
       | Overly dismissive of OTLP without proper substance to the
       | criticism.
        
         | tuetuopay wrote:
         | In some languages the tracing frameworks are a godsend. In Rust
         | the instrument macro will automatically record all function
         | arguments as span tags. Plonk anything in e.g jaeger and any
         | full trace can be looked up from pretty much any value.
        
       | ivan_gammel wrote:
       | The problem statement in this article sounds weird. I thought in
       | 2025 everyone logs at least thread id and context id (user id,
       | request id etc), and in microservice architecture at least
       | transaction or saga id. You don't need structured logging,
       | because grep by this id is sufficient for incident investigation.
       | And for analytics and metrics databases of events and requests
       | make more sense.
        
       | ardme wrote:
       | Maybe better written and simplified to: "microservices suck".
        
       | exabrial wrote:
       | Our logging guidance is: "Don't write comments, write logs" and
       | that serves us pretty well. The point being: don't write "clever
       | code", write obvious code, and try to make it similar to
       | everything else that's been done, regardless of whether you agree
       | with it.
        
       | lstroud wrote:
       | Sounds like he's just asking for an old-school Inmon-style
       | transaction log.
        
       | UltraSane wrote:
       | Splunk is expensive but it makes searching logs so much faster
       | and more effective. I think of it as SQL for unstructured data.
        
         | preisschild wrote:
         | loki works great too and is FOSS
        
           | UltraSane wrote:
           | We really need an open-source implementation of the Splunk
           | Query Language. The query language is what lets you actually
           | find the few dozen relevant lines out of the billions of
           | lines logged.
        
       | theodpHN wrote:
       | Just out of curiosity, how have you seen risk/compliance,
       | regulatory, and audit departments at organizations deal with the
       | disconnect between security and privacy for something like
       | mainframe logging (e.g., JES2, JES3), which is typically
       | inherently governed, and modern distributed logging, which is
       | typically inherently permissive? Both are vastly different
       | approaches, but each is somehow considered 'compliant.' Btw,
       | employees at a company I was at were once investigated for
       | insider trading simply because it was discovered the company used
       | pooled logs that were accessible by production support
       | programmers (the company decided to override the default
       | mainframe security), which was deemed a possible source of
       | insider trading information that could be tapped into by those
       | who had log access (programmers were eventually cleared once it was
       | discovered their small personal trades were immaterial and just
       | coincidental with the company's trading, but the investigation
       | led to uncomfortable confrontations for some!).
        
       | shireboy wrote:
       | Kinda get what he's saying: provide more metadata with structured
       | logging as opposed to lots of string only logs. Ok, modern
       | logging frameworks steer you towards that anyway. But as a
       | counterpoint: often it can be hard to safely enrich logging like
       | that. In the example they include subscription age, user info,
       | etc. More than once I've seen logging code look up metadata or
       | assume it existed, only to cause perf issues or outright errors
       | as expected data didn't exist. Similar with sampling, it can be
       | frustrating when the thing you need gets sampled out. In the end
       | "it depends" on scenario, but I still find myself not logging
       | enough or else logging too much
        
       | Lord_Zero wrote:
       | Structured logging is not just JSON. It's the use of templates
       | with context. It solves 90% of what this article complains about
       | if you log the template, the variables, and the message
       | separately, and log the right stuff, e.g. `"User {username}
       | created order {orderid}"`.
        
       | nacozarina wrote:
       | AI writing sucks even more, get rekt
        
       | adamddev1 wrote:
       | > Today, a single user request might touch 15 services, 3
       | databases, 2 caches, and a message queue.
       | 
       | And this is why _the internet_ today sucks.
        
       | redleader55 wrote:
       | The article, AI or not, is extremely naive. It doesn't mention
       | any premise or any problem to solve; it proposes a solution and
       | just goes with it. What if your monster of an event is lost when
       | your service crashes or is dropped by the logging
       | library/service/etc?
       | What if you're interested in measuring, post factum, how long
       | each step takes? What if you want to trace a log through several
       | (micro-)services and maybe between a mobile app and some batch
       | job executor that runs once a day?
       | 
       | "Logging sucks" when you don't understand the problem you're
       | trying to solve.
        
       | gfody wrote:
       | the best implementation of structured logging I've seen is dotnet
       | build's binlogs (https://msbuildlog.com), I would love to see it
       | evolve into a general purpose logging solution
        
       | kalmyk wrote:
       | distributed event id and you are all set
        
       | grekowalski wrote:
       | "Logs were designed for a different era. An era of monoliths,
       | single servers, and problems you could reproduce locally."
       | 
       | But the next era will be like the previous one. Today a monolith
       | is enough for most apps.
        
       | yoan9224 wrote:
       | Logs were designed for a different era. An era of monoliths,
       | single servers, and problems you could reproduce locally. Today,
       | a single user request might touch 15 services, 3 databases, 2
       | caches, and a message queue.
       | 
       | If a user request is hitting that many things, in my view, that
       | is a deeply broken architecture.
       | 
       | I'm building an analytics SaaS and we made the conscious decision
       | to keep it simple: Next.js API routes + Supabase + minimal
       | external services. A single page view hits maybe 3 components max
       | (CDN -> App -> Database).
       | 
       | That said, I agree completely on structured logging with rich
       | context. We include user_id, session_id, and event_type on every
       | log line. Makes debugging infinitely easier.
       | 
       | The "wide events" concept is solid, but the real win is just
       | having consistent, searchable structure. You don't need a
       | revolutionary new paradigm - just stop logging random strings and
       | use JSON with a schema.
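One hedged way to get that "JSON with a schema" in Python is a dataclass that forces the shared keys onto every line (the `user_id`/`session_id`/`event_type` names come from the comment above; everything else is illustrative):

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class LogLine:
    """Every line carries the same searchable keys; request-specific
    details go into `context` rather than ad-hoc top-level keys."""
    user_id: str
    session_id: str
    event_type: str
    ts: float = field(default_factory=time.time)
    context: dict = field(default_factory=dict)

    def emit(self) -> str:
        return json.dumps(asdict(self))

line = LogLine(user_id="u_1", session_id="s_9", event_type="page_view",
               context={"path": "/pricing"}).emit()
```

Constructing a `LogLine` without the required keys fails immediately at the call site, which is what makes the structure consistent enough to search.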
        
       | gijoeyguerra wrote:
       | Persisting a data schema that represents business events is a
       | great idea. That's more about Event Sourcing though and doing
       | that can answer a ton of questions about the system without doing
       | it in log messages.
       | 
       | Wide events as a strategy is expensive, even with sampling, and
       | doesn't address the fundamental problem - why do we log messages?
       | 
       | I was hoping the article would enumerate why we log messages.
       | Nailing down those scenarios first will lead to a happy life.
       | 
       | Why do we log?
       | - proof of life - is the system running?
       | - what is the state (in memory) when an error occurred?
       | - when did an error occur?
       | - do I need to get up at 2 am and fix something?
       | - what do I need to fix?
       | 
       | I feel like every team operating a system has their own reasons
       | for logging.
        
       | fny wrote:
       | This seems like a classic time vs space trade off.
       | 
       | Instead of reconstructing a "wide event" from multiple log lines
       | with the same request id, the suggestion seems to be logging wide
       | events repeatedly to simplify reconstruction from request ids.
       | 
       | I personally don't see the advantage, and in either scenario, if
       | you're not logging what's needed, you're screwed.
        
       | Hackbraten wrote:
       | > No grep-ing.
       | 
       | How is grep a bad thing? I find myself using it all the time.
       | 
       | I'm not into graphical user interfaces. They overwhelm me. By the
       | time I've clicked myself through the GUI or written some horrible
       | proprietary $COMPANY Query Language string, I might have already
       | figured out the bug using tried and tested CLI tools.
        
       | thangalin wrote:
       | Use events instead of repetitious logging calls.
       | 
       | https://dave.autonoma.ca/blog/2022/01/08/logging-code-smell/
        
       | XCSme wrote:
       | I've recently added error tracking to my self-hosted analytics
       | app (UXWizz), and the way I did it is simply add extra events to
       | each user/session. Once you have the concept of a session or
       | user, you can simply attach errors or logs as Events stored for
       | that user. This solves the main problem mentioned in the article,
       | where you don't know what happened, plus being an Event stored in
       | a MySQL database, you can still query it.
       | 
       | Why not simply use Events for logging, instead of plain strings?
        
       | Illniyar wrote:
       | While I agree with some of it, I feel like there's a big gotcha
       | here that isn't addressed. Having 1 single wide event, at the end
       | of a request, means that if something unexpected happens in the
       | middle (stack overflow, some bug that throws an error that
       | bypasses your logging system, lambda times out etc...) you don't
       | get any visibility into what happens.
       | 
       | You also most likely lose out on a lot of logging frameworks your
       | language has that your dependencies might use.
       | 
       | I would say this is a good layer to put on top of your regular
       | logs. Make sure you have a request/session wide id and aggregate
       | all those in your clickhouse or whatever into a single "log".
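A sketch of that aggregation layer, assuming every ordinary log line is already tagged with a `request_id` (ClickHouse would do this with a GROUP BY; plain Python shown here for illustration):

```python
from collections import defaultdict

def to_wide_events(lines: list[dict]) -> dict[str, dict]:
    """Fold per-step log lines into one merged record per request.
    Later lines win on key collisions, so give steps distinct keys."""
    wide: dict[str, dict] = defaultdict(dict)
    for line in lines:
        rid = line["request_id"]
        wide[rid].update({k: v for k, v in line.items() if k != "request_id"})
    return dict(wide)

lines = [
    {"request_id": "r1", "user_id": "u_1"},
    {"request_id": "r1", "db_ms": 12},
    {"request_id": "r2", "user_id": "u_2"},
]
events = to_wide_events(lines)
# events["r1"] == {"user_id": "u_1", "db_ms": 12}
```

The upside of building the wide event after the fact is exactly the point made above: if the request dies mid-flight, the partial lines are still there.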
        
         | wesammikhail wrote:
         | The way I have solved for this in my own framework in PHP is by
         | having a Logging class with the following interface:
         | 
         |     interface LoggerInterface {
         |         // calls $this->system(LEVEL_ERROR, ...);
         |         public function exception(Throwable $e): void;
         | 
         |         // Typical system logs
         |         public function system(string $level, string $message,
         |             ?string $category = null, mixed ...$extra): void;
         | 
         |         // User-specific logs that can be seen in the
         |         // user's "my history"
         |         public function log(string $event,
         |             int|string|null $user_id = null,
         |             ?string $category = null, ?string $message = null,
         |             mixed ...$extra): void;
         |     }
         | 
         | I also have a global exception handler that is registered at
         | application bootstrap time that takes any exception that
         | happens at runtime and runs $logger->exception($e);
         | 
         | There is obviously a bit more boilerplate to this to ensure
         | reliability, but it works so well that I can't live without it
         | anymore. The logs are then inserted into a wide DB table with
         | all the fields one could ever want to examine, thanks to the
         | variadic parameter.
        
         | pastage wrote:
         | Having logs in the format "connection X:Y accepted at Z ns for
         | http request XXX" and then a "connection X:Y closed at Z ns for
         | http response XXX" is rather nice when debugging issues on slow
         | systems.
        
       | paragraft wrote:
       | I've recently come off a team that was racking up a huge Splunk
       | bill with ~70 log events for each request on a high traffic
       | service, and this is all very resonant (except the bit about
       | sampling, I never gave that much thought - reducing our Splunk
       | bill 70x was ambitious enough for me!).
       | 
       | Hadn't heard the "wide event" name, but I had settled on the same
       | idea myself in that time (called them "top-level events" - i.e.
       | we would gather information from the duration of the request and
       | only log it at the "top" of the stack at the end), and
       | evangelised them internally mostly on the basis it gave you
       | fantastic correlation ability.
       | 
       | In theory if you've got a trace id in Splunk you can do
       | correlated queries anyway, but we were working in Spring and
       | forever having issues with losing our MDC after doing cross-
       | thread dispatch and forgetting to copy the MDC thread global
       | across. This wasn't obvious from the top-level, and usually only
       | during an incident would you realise you weren't seeing all the
       | loglines you expected for a given trace. So absent a better
       | solution there, tracking debug info more explicitly was
       | appealing.
       | 
       | Also used these top-level events to store sub-durations (e.g. for
       | calling downstream services, invoking a model etc), and with
       | Splunk if you record not just the length of a sub-process but its
       | absolute start, you can reconstruct a hacky waterfall chart of
       | where time was spent in your query.
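A minimal sketch of such a top-level event in Python: sub-durations accumulate during the request, and each records its absolute start so a waterfall can be rebuilt later (all names here are illustrative, not from the commenter's codebase):

```python
import time

class TopLevelEvent:
    """Collects sub-process timings during a request; everything is
    emitted once, at the top of the stack, when the request ends."""
    def __init__(self):
        self.spans = []

    def timed(self, name: str):
        event = self
        class _Span:
            def __enter__(self):
                self.t0 = time.time()
            def __exit__(self, *exc):
                event.spans.append({
                    "name": name,
                    "start": self.t0,  # absolute start enables the waterfall
                    "duration_ms": (time.time() - self.t0) * 1000,
                })
        return _Span()

evt = TopLevelEvent()
with evt.timed("downstream_call"):
    time.sleep(0.01)
with evt.timed("model_invoke"):
    time.sleep(0.01)
# evt.spans now holds one entry per sub-process, ready to log as one event
```

Logging `evt.spans` as a single record at the end of the request is what keeps the per-request line count near 1 instead of ~70.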
        
       | efitz wrote:
       | Because of the nature of how software is built and deployed
       | nowadays, it's generally not possible to write single log entries
       | that tell the "whole story" of "what happened".
       | 
       | I could write about this for hours, but instead I'll just discuss
       | two concepts that you need in modern logging: vertical
       | correlation and horizontal correlation.
       | 
       | Within a system, requests tend to go "up" and "down" stacks of
       | software. It is very useful in these scenarios to have "vertical
       | correlation" fields shared between adjacent layers, so that
       | activity in one layer can be unambiguously attributed to activity
       | in the adjacent layers. But sharing such a correlation value
       | requires passing the value between layers, which might be a
       | breaking api change. Occasionally it's possible to construct a
       | correlation value at each adjacent layer by transforming existing
       | parameters in exactly the same way on the calling side and called
       | side.
       | 
       | Additionally, software on one system converses with software on
       | other systems; in those cases you need to have pairwise
       | correlation values between adjacent peer layers. Again, same
       | limitations apply to carrying such a correlation value via the
       | API or protocol.
       | 
       | Really foresighted devs can anticipate these requirements and
       | generate unique transaction ids that can be shared between
       | machines and up and down the stack.
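The propagation described above can be sketched in Python, assuming an HTTP-style header carries the id (the header name and layer functions are made up for illustration; OpenTelemetry's `traceparent` header standardizes the real thing):

```python
import uuid

def handle_request(headers: dict) -> dict:
    """Edge layer: reuse the caller's correlation id if it sent one,
    otherwise mint a fresh transaction id, then hand it down."""
    txn_id = headers.get("X-Txn-Id") or str(uuid.uuid4())
    result = storage_layer(txn_id)
    return {"txn_id": txn_id, **result}

def storage_layer(txn_id: str) -> dict:
    # Each layer logs with the same id, so its lines correlate
    # vertically with the layers above and below it.
    return {"logged_with": txn_id}

out = handle_request({"X-Txn-Id": "abc-123"})
# out["txn_id"] == out["logged_with"] == "abc-123"
```

Accepting an inbound id (rather than always minting one) is what gives horizontal correlation across peer systems as well.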
        
       | simonw wrote:
       | I hope registering an entire domain name for a blog post doesn't
       | become a trend. I like linking to things that are likely to last
       | a long time - a personal blog is one thing, but expecting people
       | to keep paying the renewal fee every year for a single article
       | feels less likely to me.
       | 
       | A good alternative here is subdomains, since those don't have an
       | additional annual fee. https://logging-sucks.boristane.com/ could
       | work well here.
        
         | willempienaar wrote:
         | I kind of agree, but the message in this particular post does
         | border on https://simonwillison.net/2024/Jul/13/give-people-
         | something-...
        
           | simonw wrote:
           | Maybe I should have written "Give people something to link to
           | that they can expect to stick around for a very long time"
        
         | flockonus wrote:
         | Sorry, this is not a "blog post" - it's far closer to digital
         | marketing. A lead attractor as the author is trying to sell a
         | service very clearly by the end of his page (no disrespect
         | meant to either).
        
       | etamponi wrote:
       | > Logs were designed for a different era. An era of monoliths,
       | single servers, and problems you could reproduce locally. Today,
       | a single user request might touch 15 services, 3 databases, 2
       | caches, and a message queue. Your logs are still acting like it's
       | 2005.
       | 
       | Perhaps it's time to take back the good things from 2005.
        
       | 0xbadcafebee wrote:
       | > Here's the mental model shift that changes everything: Instead
       | of logging what your code is doing, log what happened to this
       | request.
       | 
       | Yeah that doesn't magically fix everything. Logging is still an
       | arbitrary, clunky, unintuitive process that requires intentional
       | design and extra systems to be useful.
       | 
       | The "Wide Event log" example is 949 bytes, which isn't
       | unmanageably large, but it is 3x larger than most log messages
       | which are about 300 bytes. And in that blob of data might be key
       | insights, but it is left up to an extra engineering process to
       | discover what might be unusual in that blob. It lacks things like
       | code line numbers, stack trace, and context given by the program
       | about its particular functions (rather than assumptions based on
       | a few pieces of metadata). And it's excessively verbose, as it
       | has a trace and request ID and service name, but duplicates
       | information already available to tracing systems based on those 3
       | metrics.
       | 
       | > Wide events are a philosophy: one comprehensive event per
       | request, with all context attached.
       | 
       | That's simply impossible. You cannot have all context from
       | viewing a single point in the network, regardless of how hard you
       | try to record or pass on information. That's the whole point of
       | tracing: you correlate the context of different network points,
       | specifically because that's the only way to discover the missing
       | details.
       | 
       | > Modern columnar databases (ClickHouse, BigQuery, etc.) are
       | specifically designed for high-cardinality, high-dimensionality
       | data. The tooling has caught up. Your practices should too.
       | 
       | You should not depend on a space shuttle to get to the grocery
       | store. Logging is intended to be an abstracted component which
       | can be built on by other systems. Your app should work just as
       | well running from Docker on your laptop as it does in the cloud.
        
         | zX41ZdbW wrote:
         | ClickHouse is a tiny component - a single binary that runs on a
         | laptop.
        
       | scolvin wrote:
       | Correction, logging used to suck - now it's fixed
       | https://pydantic.dev/logfire :-)
        
       | groundzeros2015 wrote:
       | > An era of monoliths, single servers, and problems you could
       | reproduce locally.
       | 
       | Actually this is the problem. It's extremely difficult to debug
       | when you do not preserve basic properties that allow you to read
       | code and reason about events.
        
       | KaiserPro wrote:
       | I agree that logging sucks bollocks, especially when most of the
       | time you really want metrics.
       | 
       | Opentel is great, but sadly a lot of the stuff that I am using
       | doesn't support it.
       | 
       | The thing that made it much more bearable, even easy is loki and
       | a decent log parser.
       | 
       | I know a lot of kids like using SQL to interact with things, but
       | Loki's explore interface beats the living shit out of SQL (in my
       | opinion). It's really simple to just isolate and slice logs
       | interactively. You can build your query really simply.
       | 
       | It beats splunk/sumo/scuba(facebook's log system) hands down in
       | terms of searchability.
        
       | mkarrmann wrote:
       | I broadly agree with the article.
       | 
       | The described pattern is standard in Meta. This, along with the
       | infrastructure and tooling to support it, was the single largest
       | "devx quality of life improvement" in my experience moving to big
       | tech.
        
       | hoppp wrote:
       | "Today, a single user request might touch 15 services, 3
       | databases, 2 caches, and a message queue."
       | 
       | This right here is the fundamental problem because the way it's
       | done is highly inefficient, complex and I believe only exists so
       | cloud providers can sell their expensive offerings.
       | 
       | A monolith is fine 90% of the time.
        
         | sgarland wrote:
         | That, and everyone thinks they have to do things this way, so
         | it's a terrible cycle.
         | 
         | So many problems would be solved if service calls were IPC
         | instead of network calls.
        
       | juancn wrote:
       | Logging is one tool of many, you need logs, metrics and
       | distributed tracing at the very least for any significant piece
       | of modern infra.
       | 
       | If you really want to get serious, you also want some kind of
       | continuous profiling (like pyroscope) or at the very least some
       | periodic thread dump collector (for serious degradation
       | diagnostics once every couple of minutes is enough).
       | 
       | But logging is still a great tool.
        
       | iamwil wrote:
       | Content marketing and lead generation are getting sneakier
        
       | nostrademons wrote:
       | Google solved most of these problems around 2005, with tools like
       | LOG_EVERY_N (now part of absl [1]), Dapper [2], and several other
       | tools that aren't public yet. You can trace an individual request
       | through every internal system, view the request/response
       | protobufs, every log that the server emitted, timing details,
       | etc. More to the point, you can _share_ this trace, which means
       | that it's possible for one person to discover the bug, reproduce
       | it, and then have another person in a completely different
       | office/timezone/country debug it, _even if the latter cannot
       | reproduce the bug themselves_. This has proved hugely useful;
       | just last week I was tasked with reproducing a bug on sparsely-
       | available prerelease hardware so that a distant team could
       | diagnose what went wrong.
       | 
       | The key insight that this article hints at but doesn't quite get
       | to: you should treat your logs as a _product_ whose customers
       | are _the rest of the devs in your company_. The way you log
       | things is intimately connected with what you want to do with
       | them, and you need to build systems to generate useful insights
       | from the log statements. In some cases it literally is part of
       | the product: many of the machine learning systems that generate
       | recommendations, search results, spam filtering, abuse detection,
       | traffic direction, etc. are all based on the logs for the
       | product, and you need to consider them as first-class citizens
       | that you absolutely cannot break while adding new features. Logs
       | are not just for debugging.
       | 
       | [1] https://absl.readthedocs.io/en/latest/absl.logging.html
       | 
       | [2] https://research.google/pubs/dapper-a-large-scale-
       | distribute...
        
       ___________________________________________________________________
       (page generated 2025-12-21 23:00 UTC)