[HN Gopher] Logging Sucks
___________________________________________________________________
Logging Sucks
Author : FlorinSays
Score : 459 points
Date : 2025-12-21 18:09 UTC (4 hours ago)
(HTM) web link (loggingsucks.com)
(TXT) w3m dump (loggingsucks.com)
| firefoxd wrote:
| Good write up.
|
| Gonna go on a tangent here. Why the single purpose domain?
| Especially since the author has a blog. My blog is full of links
| to single-post domains that no longer exist.
| OsrsNeedsf2P wrote:
| Because it's an ad
| thewisenerd wrote:
| it's an ad, for what?
|
| i do not see a product upsell anywhere.
|
| if it's an ad for the author themselves, then it's a very
| good one.
| KomoD wrote:
| At the end there's a form where you can get a "personalized
| report", I have a feeling that'll advertise some kind of
| service, it's usually the case.
| danielfalbo wrote:
| I see more and more blog posts that contain interactive elements.
| Despite the general enshittification of the average blog and the
| internet, this feels like a 'modern' touch that actually adds
| something valuable to the sufficient ad-free no-popups old blog
| style.
| heinrichhartman wrote:
| A post on this topic feels incomplete without a shout-out to
| Charity Majors - she has been preaching this for a decade,
| branded the term "wide events" and "observability", and built
| honeycomb.io around this concept.
|
| Also worth pointing out that you can implement this method with a
| lot of tools these days. Both structured logs and traces lend
| themselves to capturing wide events. Just make sure to use a tool that
| supports general query patterns and has rich visualizations
| (time-series, histograms).
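| To make "wide events" concrete: instead of many small log lines, each
| request emits one structured record carrying all of its context. A
| minimal stdlib sketch (the field names are illustrative, not from any
| particular tool):

```python
import json
import time

def build_wide_event(user_id: str) -> dict:
    """Accumulate per-request context into a single flat record."""
    start = time.monotonic()
    event = {"user.id": user_id}
    event["cache.hit"] = False          # illustrative fields, added as
    event["db.rows_returned"] = 42      # the request progresses
    event["duration_ms"] = (time.monotonic() - start) * 1000
    return event

# One JSON line per request, emitted once at the end:
print(json.dumps(build_wide_event("u_123")))
```

| Any backend that supports ad-hoc queries over such records (group-bys,
| histograms) can then slice by whichever field turns out to matter.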
| the_mitsuhiko wrote:
| > A post on this topic feels incomplete without a shout-out to
| Charity Majors
|
| I concur. In fact, I strongly recommend anyone who has been
| working with observability tools or in the industry to read her
| blog, and the back story that led to Honeycomb. They were the
| first to recognize the value of this type of observability and
| have been a huge inspiration for many that came after.
| loevborg wrote:
| I've learned more from Charity about telemetry than from
| anyone else. Her book is great, as are her talks and blog
| posts. And Honeycomb, as a tool, is frankly pretty amazing
|
| Yep, I'm a fan.
| dcminter wrote:
| Could you drop a few specific posts here that you think are
| good for someone (me) who hasn't read her stuff before? Looks
| like there's a decade of stuff on her blog and I'm not sure I
| want to start at the very beginning...
| simonw wrote:
| A few of my favourites:
|
| - Software Sprawl, The Golden Path, and Scaling Teams With
| Agency: https://charity.wtf/2018/12/02/software-sprawl-the-
| golden-pa... - introduces the idea of the "golden path",
| where you tell engineers at your company that if they use
| the approved stack of e.g. PostgreSQL + Django + Redis then
| the ops team will support that for them, but if they want
| to go off path and use something like MongoDB they can do
| that but they'll be on the hook for ops themselves.
|
| - Generative AI is not going to build your engineering team
| for you: https://stackoverflow.blog/2024/12/31/generative-
| ai-is-not-g... - why generative AI doesn't mean you should
| stop hiring junior programmers.
|
| - I test in prod: https://increment.com/testing/i-test-in-
| production/ - on how modern distributed systems WILL have
| errors that only show up in production, hence why you need
| to have great instrumentation in place. "No pull request
| should ever be accepted unless the engineer can answer the
| question, "How will I know if this breaks?""
|
| - Advice for Engineering Managers Who Want to Climb the
| Ladder: https://charity.wtf/2022/06/13/advice-for-
| engineering-manage...
|
| - The Engineer/Manager Pendulum:
| https://charity.wtf/2017/05/11/the-engineer-manager-
| pendulum... - I LOVE this one, it's about how it's OK to
| have a career where you swing back and forth between
| engineering management and being an "IC".
| Aurornis wrote:
| > They were the first to recognize the value of this type of
| observability
|
| With all due respect to her great writing, I think there's a
| mix of revisionist history blended with PR claims going on in
| this thread. The blog has some good reading, but let's not
| get ahead of ourselves in rewriting history around this one
| person/company.
| the_mitsuhiko wrote:
| > I think there's a mix of revisionist history blended with
| PR claims going on in this thread.
|
| I can only speak for myself. I worked for a company that is
| somewhere in the observability space (Sentry) and Charity
| was a person I looked up to my entire time working on
| Sentry. Both for how she ran the company, for the design
| they picked and for the approaches they took. There might
| be others that have worked on wide events (after all,
| Honeycomb is famously inspired by Facebook's Scuba), but she
| is for sure the voice that made it popular.
| vasco wrote:
| She has good content but no single person branded the term
| "observability", what the heck. You can respect someone without
| making wild claims.
| rjbwork wrote:
| Nick Blumhardt has been preaching this for even longer, as
| "structured logging", with Seq and Serilog as the enabling
| tool and library in the .NET ecosystem.
| layer8 wrote:
| The article emphasizes that their recommendation is different
| from structured logging.
| fishtoaster wrote:
| This post was so in-line with her writing that I was _really_
| expecting it to turn into an ad for Honeycomb at the end. I was
| pretty surprised when it turned out the author was
| unaffiliated!
| Aurornis wrote:
| > she has been preaching this for a decade, branded the term
| "wide events" and "observability",
|
| With all due respect to her other work, she most certainly did
| not coin the term "observability". Observability has been a
| topic in multiple fields for a very long time and has had
| widespread usage in computing for decades.
|
| I'm sure you meant well by your comment, but I doubt this is a
| claim she even makes for herself.
|
| She has been an influential writer on the topic and founded a
| company in this space, but she didn't actually create the
| concept or terminology of observability.
| alexwennerberg wrote:
| > Logs were designed for a different era. An era of monoliths,
| single servers, and problems you could reproduce locally. Today,
| a single user request might touch 15 services, 3 databases, 2
| caches, and a message queue. Your logs are still acting like it's
| 2005.
|
| If a user request is hitting that many things, in my view, that
| is a deeply broken architecture.
| the_mitsuhiko wrote:
| > If a user request is hitting that many things, in my view,
| that is a deeply broken architecture.
|
| Whether we want it or not, a lot of modern software looks like that.
| I am also not a particular fan of building software this way,
| but it's a reality we're facing. In part it's because quite a
| few services that people used to build in-house are now
| outsourced to PaaS solutions. Even basic things such as
| authentication are more and more moving to third parties.
| worik wrote:
| > but it's a reality we're facing.
|
| Yes. Most software is bad
|
| The incentives between managers and technicians are all wrong
|
| Bad software is more profitable, over the time frames
| managers care about, than good software
| the_mitsuhiko wrote:
| I don't think the reason we end up with very complex systems
| is incentives between "managers and technicians". If I had to
| put my finger on it, I would say it's the very technicians who
| argued themselves into a world where increased complexity and
| more dependencies are seen as a good thing.
|
| Fighting complexity is deeply unpopular.
| 0x3f wrote:
| At least in my place of work, my non-technical manager is
| actually on board with my crusade against complex
| nonsense. Mostly because he agrees it would increase
| feature velocity to not have to touch 5 services per
| minor feature. The other engineers love the horrific mess
| they've built. It's almost like they're roleplaying
| working at Google and I'm ruining the fun.
| dxdm wrote:
| > If a user request is hitting that many things, in my view,
| that is a deeply broken architecture.
|
| Things can add up quickly. I wouldn't be surprised if some
| requests touch a lot of bases.
|
| Here's an example: a user wants to start renting a bike from
| your public bike sharing service, using the app on their phone.
|
| This could be an app developed by the bike sharing company
| itself, or a 3rd party app that bundles mobility options like
| ride sharing and public transport tickets in one place.
|
| You need to authenticate the request and figure out which
| customer account is making the request. Is the account allowed
| to start a ride? They might be blocked. They might need to
| confirm the rules first. Is this ride part of a group ride, and
| is the customer allowed to start multiple rides at once? Let's
| also get a small deposit by putting a hold of a small sum on
| their credit card. Or are they a reliable customer? Then let's
| not bother them. Or is there a fraud risk? And do we need to
| trigger special code paths to work around known problems for
| payment authorization for cards issued by this bank?
|
| Everything good so far? Then let's start the ride.
|
| First, let's lock in the necessary data. Which rental pricing
| did the customer agree to? Is that actually available to this
| customer, this geographical zone, for this bike, at this time,
| or do we need to abort with an error? Otherwise, let's remember
| this, so we can calculate the correct rental fee at the end.
|
| We normally charge an unlock fee in addition to the per-minute
| price. Are we doing that in this case? If yes, does the
| customer have any free unlock credit that we need to consume or
| reserve now, so that the app can correctly show unlock costs if
| the user wants to start another group ride before this one
| ends?
|
| Ok, let's unlock the bike and turn on the electric motor. We
| need to make sure it's ready to be used and talk to the IoT box
| on the bike, taking into account the kind of bike, kind of box
| and software version. Maybe this is a multistep process,
| because the particular lock needs manual action by the
| customer. The IoT box might have to know that we're in a zone
| where we throttle the max speed more than usual.
|
| Now let's inform some downstream data aggregators that a ride
| started successfully. BI (business intelligence) will want to
| know, and the city might also require us to report this to
| them. The customer was referred by a friend, and this is their
| first ride, so now the friend gets his referral bonus in the
| form of app credit.
|
| Did we charge an unrefundable unlock fee? We might want to
| invoice that already (for whatever reason; otherwise this will
| happen after the ride). Let's record the revenue, create the
| invoice data and the PDF, email it, and report this to the
| country's tax agency, because that's required in the country
| this ride is starting in.
|
| Or did things go wrong? Is the vehicle broken? Gotta mark it
| for service to swing by, and let's undo any payment holds. Or
| did the deposit fail, because the credit card is marked as
| stolen? Maybe block the customer and see if we have other
| recent payments using the same card fingerprint that we might
| want to proactively refund.
|
| That's just off the top of my head, there may be more for a
| real life case. Some of these may happen synchronously, others
| may hit a queue or event bus. The point is, they are all tied
| to a single request.
|
| So, depending on how you cut things, you might need several
| services that you can deploy and develop independently.
|
| - auth,
| - core customer management, permissions, ToS agreement,
| - pricing,
| - geo zone definitions,
| - zone rules,
| - benefit programs,
| - payments and payment provider integration,
| - app credits,
| - fraud handling,
| - ride management,
| - vehicle management,
| - IoT integration,
| - invoicing,
| - emails,
| - BI integration,
| - city hall integration,
| - tax authority integration,
| - and an API gateway that fronts the app request.
|
| These do not have to be separate services, but they are
| separate enough to warrant it. They wouldn't be exactly micro
| either.
|
| Not every product will be this complicated, but it's also not
| that out there, I think.
| 0x3f wrote:
| > These do not have to be separate services, but they are
| separate enough to warrant it.
|
| All of this arises from your failure to question this basic
| assumption though, doesn't it?
| dxdm wrote:
| > All of this arises from your failure to question this
| basic assumption though, doesn't it?
|
| Haha, no. "All of this" is a scenario I consider quite
| realistic in terms of what needs to happen. The question
| is, how should you split this up, if at all?
|
| Mind that these concerns will be involved in other ways
| with other requests, serving customers and internal users.
| There are enough different concerns at different levels of
| abstraction that you might need different domain experts to
| develop and maintain them, maybe using different
| programming languages, depending on who you can get. There
| will definitely be multiple teams. It may be beneficial to
| deploy and scale some functions independently; they have
| different load and availability requirements.
|
| Of course you can slice things differently. Which
| assumptions have _you_ questioned recently? I think you've
| been given some material. No need to be rude.
| 0x3f wrote:
| I don't think I was rude. You're overcomplicating the
| architecture here for no good reason. It might be common
| to do so, but that doesn't make it good practice. And
| ultimately I think it's your job as a professional to
| question it, which makes not doing so a form of
| 'failure'. Sorry if that seems harsh; I'm sharing what I
| believe to be genuine and valuable wisdom.
|
| Happy to discuss why you think this is all necessary.
| Open to questioning assumptions of my own too, if you
| have specifics.
|
| As it is, you're just quoting microservices dogma. Your
| auth service doesn't need a different programming
| language from your invoicing system. Nor does it need to
| be scaled independently. Why would it?
| jdpage wrote:
| Tangential, but I wonder if the given example might be straying a
| step too far? Normally we want to keep sensitive data out of
| logs, but the example includes a user.lifetime_value_cents field.
| I'd want to have a chat with the rest of the business before
| sticking something like that in logs.
| nightpool wrote:
| In some companies, this type of information is often very
| important and very easily available to everyone at all levels
| of the business to help prioritize and understand customer
| value. I would not consider it "sensitive" in the same way that
| e.g. PII would be.
| jdpage wrote:
| Good to know! At previous jobs, that information wasn't
| available to me (and it didn't matter because the customer
| bases were small enough that every customer was top
| priority), so I assumed it was considered more sensitive than
| it perhaps is.
| zkmon wrote:
| > Logs were designed for a different era. An era of monoliths,
| single servers, and problems you could reproduce locally. Today,
| a single user request might touch 15 services, 3 databases, 2
| caches, and a message queue. Your logs are still acting like it's
| 2005.
|
| Logs are fine. The job of local logs is to record the talk of a
| local process. They are doing this fine. Local logs were never
| meant to give you a picture of what's going on some other server.
| For such context, you need a transaction tracing that can stitch
| the story together across all processes involved.
|
| Usually, looking at the logs at right place should lead you to
| the root cause.
| holoduke wrote:
| APN/Kibana. All I need for inspecting logs.
| devmor wrote:
| Shoutout to Kibana. Absolutely my favorite UI tool for trying
| to figure out what went wrong (and sometimes, IF anything
| went wrong in the first place)
| venturecruelty wrote:
| >Today, a single user request might touch 15 services, 3
| databases, 2 caches, and a message queue.
|
| Not if I have anything to say about it.
|
| >Your logs are still acting like it's 2005.
|
| Yeah, because that's just before software development went
| absolutely insane.
| otterley wrote:
| One of the points the author is trying to make (although he
| doesn't make it well, and his attitude makes it hard to read)
| is that logs aren't just for root-causing incidents.
|
| When properly seasoned with context, logs give you useful
| information like who is impacted (not every incident impacts
| every customer the same way), correlations between component
| performance and inputs, and so forth. When connected to
| analytical engines, logs with rich context can help you figure
| out things like behaviors that lead to abandonment, the impact
| of security vulnerability exploits, and much more. And in their
| never-ending quest to improve their offerings and make more
| money, product managers love being able to test their theories
| against real data.
| ivan_gammel wrote:
| It's a wild violation of SRP to suggest that. Separating
| concerns is way more efficient. Database can handle audit
| trail and some key metrics much better, no special tools
| needed, you can join transaction log with domain tables as a
| bonus.
| otterley wrote:
| Are you assuming they're all stored identically? If so,
| that's not necessarily the case.
|
| Once the logs have entered the ingestion endpoint, they can
| take the most optimal path for their use case. Metrics can
| be extracted and sent off to a time-series metric database,
| while logs can be multiplexed to different destinations,
| including stored raw in cheap archival storage, or matched
| to schemas, indexed, stored in purpose-built search engines
| like OpenSearch, and stored "cooked" in Apache
| Iceberg+Parquet tables for rapid querying with Spark,
| Trino, or other analytical engines.
|
| Have you ever taken, say, VPC flow logs, saved them in
| Parquet format, and queried them with DuckDB? I just
| experimented with this the other day and it was mind-
| blowingly awesome--and _fast_. I, for one, am glad the days
| of writing parsers and report generators myself are over.
| ivan_gammel wrote:
| Good joke.
| ohans wrote:
| This was a brilliant write up, and loved the interactivity.
|
| I do think "logs are broken" is a bit overstated. The real
| problem is unstructured events + weak conventions + poor
| correlation.
|
| Brilliant write up regardless
| the__alchemist wrote:
| From what I gather: This is referring to Web sites or other HTTP
| applications which are internally implemented as a collection of
| separate applications/ micro-services?
| cowsandmilk wrote:
| Horrid advice at the end about logging every error, exception,
| slow request, etc if you are sampling healthy requests.
|
| Taking slow requests as an example, a dependency gets slower and
| now your log volume suddenly goes up 100x. Can your service
| handle that? Are you causing a cascading outage due to increased
| log volumes?
|
| Recovery is easier if your service is doing the same or less work
| in a degraded state. Increasing logging by 20-100x when degraded
| is not that.
| otterley wrote:
| It's an important architectural requirement for a production
| service to be able to scale out their log ingestion
| capabilities to meet demand.
|
| Besides, a little local on-disk buffering goes a long way, and
| is cheap to boot. It's an antipattern to flush logs directly
| over the network.
| trevor-e wrote:
| Yeah, that was my thought too. I like the idea in principle, but
| these magic thresholds can really bite you. The threshold claims
| to be the p99, probably based on some historical measurement, but
| that's only accurate if it's updated dynamically. Maybe it could
| periodically query the OTel provider for the real number, to at
| least limit the time window in which something bad can happen.
| debazel wrote:
| My impression was that you would apply this filter after the
| logs have reached your log destination, so there should be no
| difference for your services unless you host your own log
| infra, in which case there might be issues on that side. At
| least that's how we do it with Datadog because ingestion is
| cheap but indexing and storing logs long term is the expensive
| part.
| Veserv wrote:
| I do not see how logging could bottleneck you in a degraded
| state unless your logging is terribly inefficient. A properly
| designed logging system can record on the order of 100 million
| logs per second per core.
|
| Are you actually contemplating handling 10 million requests per
| second per core that are failing?
| otterley wrote:
| Generation and publication is just the beginning (never mind
| the fact that resources consumed by an application to log
| something are no longer available to do real work). You have
| to consider the scalability of each component in the logging
| architecture from end to end. There's ingestion, parsing,
| transformation, aggregation, derivation, indexing, and
| storage. Each one of those needs to scale to meet demand.
| Veserv wrote:
| I already accounted for consumed resources when I said 10
| million instead of 100 million. I allocated 10% to logging
| overhead. If your service is within 10% of overload you are
| already in for a bad time. And frankly, what systems are
| you using that are handling 10 million requests per second
| per core (100 nanoseconds per request)? Hell, what services
| are you deploying that you even have 10 million requests
| per second per core to handle?
|
| All of those other costs are, again, trivial with proper
| design. You can easily handle billions of events per second
| on the backend with even a modest server. This is done
| regularly by time traveling debuggers which actually need
| to handle these data rates. So again, what are we even
| deploying that has billions of events per second?
| otterley wrote:
| In my experience working at AWS and with customers, you
| don't need billions of TPS to make an end-to-end logging
| infrastructure keel over. It takes much less than that.
| As a working example, you can host your own end-to-end
| infra (the LGTM stack is pretty easy to deploy in a
| Kubernetes cluster) and see what it takes to bring yours
| to a grind with a given set of resources and TPS/volume.
| Veserv wrote:
| I prefaced all my statements with the assumption that the
| chosen logging system is not poorly designed and terribly
| inefficient. Sounds like their logging solutions are
| poorly designed and terribly inefficient then.
|
| It is, in fact, a self-fulfilling prophecy to complain
| that logging can be a bottleneck if you then choose
| logging that is 100-1000x slower than it should be. What
| a concept.
| otterley wrote:
| At the end of the day, it comes down to what sort of
| functionality you want out of your observability. Modest
| needs usually require modest resources: sure, you could
| just append to log files on your application hosts and
| ship them to a central aggregator where they're stored
| as-is. That's cheap and fast, but you won't get a lot of
| functionality out of it. If you want more, like real-time
| indexing, transformation, analytics, alerting, etc., it
| requires more resources. Ain't no such thing as a free
| lunch.
| dpark wrote:
| Surely you aren't doing real time indexing,
| transformation, analytics, etc in the same service that
| is producing the logs.
|
| A catastrophic increase in logging could certainly take
| down your log processing pipeline but it should not
| create cascading failures that compromise your service.
| otterley wrote:
| Of course not. Worst case should be backpressure, which
| means processing, indexing, and storage delays. Your
| service might be fine but your visibility will be
| reduced.
| dpark wrote:
| For sure. Your can definitely tip over your logging
| pipeline and impact visibility.
|
| I just wanted to make sure we weren't still talking about
| "causing a cascading outage due to increased log volumes"
| as was mentioned above, which would indicate a
| significant architectural issue.
| Cort3z wrote:
| Just implement exponential backoff for slow requests logging,
| or some other heuristic, to control it. I definitely agree it
| is a concern though.
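| One way to bound the blow-up, sketched as a stdlib token bucket (the
| rates here are made-up): drop or just count excess slow-request log
| lines instead of emitting all of them.

```python
import time

class LogBudget:
    """Token bucket that caps log emission rate, so a slow dependency
    can't multiply log volume 100x during an incident."""
    def __init__(self, per_second: float, burst: int):
        self.rate = per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # drop (or just count) this line instead of emitting it

budget = LogBudget(per_second=10, burst=100)
if budget.allow():
    print("slow request: 1203ms")
```

| Dropped lines can still be counted and reported as a single metric, so
| you know how much you shed.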
| XCSme wrote:
| Good point. It also reminded me of when I was trying to
| optimize my app for some scenarios, then I realized it's better
| to optimize it for ALL scenarios, so it works fast and the
| servers can handle no matter what. To be more specific, I
| decided NOT to cache any common queries, but instead make sure
| that all queries are fast as possible.
| golem14 wrote:
| For high volume services, you can still log a sample of healthy
| requests, e.g., trace_id mod 100 == 0. That keeps log growth
| under control. The higher the volume, the smaller percentage
| you can use.
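| A hedged sketch of that idea: hash the trace_id so every service logging
| the same request reaches the same keep/drop decision, and always keep
| failures regardless of sampling.

```python
import zlib

def keep_request(trace_id: str, ok: bool, sample_rate: int = 100) -> bool:
    """Always keep failed requests; keep ~1/sample_rate of healthy ones.
    CRC32 of the trace_id makes the decision deterministic, so every
    log line sharing the trace_id gets the same verdict."""
    if not ok:
        return True
    return zlib.crc32(trace_id.encode()) % sample_rate == 0

# Errors always survive sampling:
assert keep_request("trace-abc", ok=False)
```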
| kgklxksnrb wrote:
| Logfiles are a user interface.
| otterley wrote:
| The substance of this post is outstanding.
|
| The framing is not, though. Why does it have to sound so dramatic
| and provocative? It's insulting to its audience. Grumpiness, in
| the long term, is a career-limiting attitude.
| b0ringdeveloper wrote:
| I get the AI feeling from it.
| otterley wrote:
| It might have been AI-assisted, and it might not have been.
| It doesn't really matter. The author is ultimately
| responsible for the end result.
| rglover wrote:
| Career-limiting perhaps (if expressing normal human emotion is
| a minus inside of an organization, it may be time to bail) but
| some of the best minds I've met/observed were absolute
| _curmudgeons_ (with purpose--they were properly bothered by a
| problem and refused to go along with the "sweep it under the
| rug" behavior).
|
| Sure, I've dealt with plenty of assholes, too, but the grumps
| are usually just tired of their valid insight being ignored by
| more foolish, orthogonally incentivized types (read: "playing
| the game" not "making it work well").
| otterley wrote:
| We've all tolerated the grumpy genius at some point in our
| careers. Nevertheless, most of us would prefer to work with a
| person who's both smart and kind over someone who's smart and
| curmudgeonly. It is possible to be both smart and kind, and
| I've had the pleasure of working with such people.
|
| Assholes can sap an organization's strength faster than any
| productive value their intelligence can provide. I'm not
| suggesting the author is an asshole, though; there's not
| enough evidence from this post.
| jupin wrote:
| Some excellent points raised in this article.
| charcircuit wrote:
| This article is attacking a strawman. It makes up terrible logs
| and then says they are bad. Even if this was a single monolith
| the logs still don't include even something like a thread id, to
| avoid mixing different requests together.
| blinded wrote:
| I see logs worse than that on the daily.
| dcminter wrote:
| I've generally found that structured logs that include a
| correlation ID make it quite easy to narrow down the general area
| or exact cause of problems. Usually (in enterprise orgs) via
| Splunk or Datadog.
|
| Where I've had problems it's usually been one of:
|
| There wasn't anything logged in the error block. A comment saying
| "never happens" is often discovered later :)
|
| Too much was logged and someone mandated dialing the logging down
| to save costs. Sigh.
|
| A new thread was started and the thread-local details including
| the correlation ID got lost, then the error occurred downstream
| of that. I'd like better solutions for that one.
|
| Edit: Incidentally a correlation ID is not (necessarily) the same
| thing as a request ID. An API often needs to allow for the caller
| making multiple calls to achieve an objective; 5 request IDs
| might be tied to a single correlation ID.
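| For the lost-thread-local case, Python's contextvars (as one example)
| can carry the correlation ID across a thread hop; a minimal sketch:

```python
import contextvars
import json
from concurrent.futures import ThreadPoolExecutor

# Unlike plain thread-locals, a contextvars snapshot can be replayed
# inside a worker thread, so the correlation ID survives the hop.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

def log(msg: str) -> str:
    line = json.dumps({"correlation_id": correlation_id.get(), "msg": msg})
    print(line)
    return line

def handle(cid: str) -> str:
    correlation_id.set(cid)
    ctx = contextvars.copy_context()        # capture before spawning work
    with ThreadPoolExecutor(max_workers=1) as pool:
        # replay the captured context in the worker thread
        return pool.submit(ctx.run, log, "downstream step").result()

handle("corr-5a2f")
```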
| loglog wrote:
| Java has a solution for the thread problem: Scoped Values [0].
| If only the logging+tracing libraries would start using it...
|
| [0] https://openjdk.org/jeps/506
| dcminter wrote:
| Oh, excellent, these slipped under my radar. Sounds extremely
| promising and I do mostly work in Java!
| Spivak wrote:
| Slapping on OpenTelemetry actually will solve your problem.
|
| Point #1 isn't true, auto instrumentation exists and is really
| good. When I integrate OTel I add my own auto instrumentors
| wherever possible to automatically add lots of context. Which
| gets into point #2.
|
| Point #2 also isn't true. It can add business context in a
| hierarchal manner and ship wide events. You shouldn't have to
| tell every span all the information again. Just where it appears
| naturally the first time.
|
| Point #3 also also isn't true because OTel libs make it really
| annoying to just write a log message and very strongly push you
| into a hierarchy of nested context managers.
|
| Like the author's ideal setup is basically using OTel with
| Honeycomb. You get the querying and everything. And unlike
| rawdogging wide events all your traces are connected, can _span_
| multiple services and do timing for you.
| yujzgzc wrote:
| You might also need different systems for low-cardinality, low-
| latency production monitoring (where you want to throw alerts
| quickly and high cardinality fields would just get in the way),
| and medium to long term logging with wide events.
|
| Also if you're going to log wide events, for the sake of the
| person querying them after you, please don't let your schema be
| an ad hoc JSON dict of dicts, put some thought into the schema
| structure (and better have a logging system that enforces the
| schema).
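| One cheap way to get that, sketched with a stdlib dataclass (field
| names are illustrative): a fixed record type instead of an ad hoc dict
| of dicts, so typos fail at construction time rather than at query time.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RequestEvent:
    trace_id: str
    route: str
    status_code: int
    duration_ms: float
    user_id: str = "-"

event = RequestEvent(trace_id="t1", route="/rent", status_code=200,
                     duration_ms=41.7)
print(json.dumps(asdict(event)))   # every event has the same shape
```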
| m3047 wrote:
| I agree with this statement: "Instead of logging what your code
| is doing, log what happened to this request." but the impression
| I can't shake is that this person lacks experience, or more
| likely has a lot of experience doing the same thing over and
| over.
|
| "Bug parts" (as in "acceptable number of bug parts per candy
| bar") logging should include the precursors of processing
| metrics. I think what he calls "wide events" I call bug parts
| logging in order to emphasize that it _also_ may include signals
| pertaining to which code paths were taken, how many times, and
| how long it took.
|
| Logging is not metrics is not auditing. In particular processing
| can continue if logging (temporarily) fails but not if auditing
| has failed. I prefer the terminology "observables" to "logging"
| and "evaluatives" to "metrics".
|
| In mature SCADA systems there is the well-worn notion of a
| "historian". Read up on it.
|
| A fluid level sensor on CANbus sending events 10x a second isn't
| telling me whether or not I have enough fuel to get to my
| destination (a significant question); however, that granularity
| might be helpful for diagnosing a stuck sensor (or bad
| connection). It would be impossibly fatiguing and hopelessly
| distracting to try to answer the significant question from this
| firehose of low-information events. Even a de-noised fuel gauge
| doesn't directly diagnose my desired evaluative (will I get there
| or not?).
|
| Does my fuel gauge need to also serve as the debugging interface
| for the sensor? No, it does not. Likewise, send metrics /
| evaluatives to the cloud not logging / observables; when
| something goes sideways the real work is getting off your ass and
| taking a look. Take the time to think about what that looks like:
| maybe that's the best takeaway.
| otterley wrote:
| > Logging is not metrics is not auditing.
|
| I espouse a "grand theory of observability" that, like matter
| and energy, treats logs, metrics, and audits alike. At the end
| of the day, they're streams of bits, and so long as no fidelity
| is lost, they can be converted between each other. Audit trails
| are certainly carried over logs. Metrics are streams of time-
| series numeric data; they can be carried over log channels or
| embedded inside logs (as they often are).
|
| How these signals are stored, transformed, queried, and
| presented may differ, but at the end of the day, the
| consumption endpoint and mechanism can be the same regardless
| of origin. Doing so simplifies both the conceptual framework
| and design of the processing system, and makes it flexible
| enough to suit any conceivable set of use cases. Plus, storing
| the ingested logs as-is in inexpensive long-term archival
| storage allows you to reprocess them later however you like.
| Veserv wrote:
| Saying they are all the same when no fidelity is lost is
| missing the point. The _only_ distinction between logs,
| traces, and metrics is literally what to do when fidelity is
| lost.
|
| If you have insufficient ingestion rate:
|
| Logs are for events that can be independently sampled and be
| coherent. You can drop arbitrary logs to stay within
| ingestion rate.
|
| Traces are for correlated sequences of events where the
| entire sequence needs to be retained to be useful/coherent.
| You can drop arbitrary whole sequences to stay within
| ingestion rate.
|
| Metrics are pre-aggregated collections of events. You pre-
| limited your emission rate to fit your ingestion rate at the
| cost of upfront loss of fidelity.
|
| If you have adequate ingestion rate, then you just emit your
| events bare and post-process/visualize your events however
| you want.
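| A minimal Python sketch (not from the thread) of the distinction above: log events can be sampled independently, while trace sampling must keep or drop whole correlated sequences.

```python
import random
from collections import Counter

def sample_logs(events, keep_prob):
    """Logs: each event is independently coherent, so arbitrary ones can be dropped."""
    return [e for e in events if random.random() < keep_prob]

def sample_traces(events, keep_prob):
    """Traces: keep or drop whole correlated sequences, decided once per trace_id."""
    decision = {}
    kept = []
    for e in events:
        tid = e["trace_id"]
        if tid not in decision:
            decision[tid] = random.random() < keep_prob
        if decision[tid]:
            kept.append(e)
    return kept

# Three traces of three spans each.
events = [{"trace_id": t, "step": s} for t in ("a", "b", "c") for s in range(3)]
kept = sample_traces(events, 0.5)
# Any trace that survives sampling is complete: all three of its spans remain.
assert all(n == 3 for n in Counter(e["trace_id"] for e in kept).values())
```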
| otterley wrote:
| > If you have insufficient ingestion rate
|
| I would rather fix this problem than every other problem.
| If I'm seeing backpressure, I'd prefer to buffer locally on
| disk until the ingestion system can get caught up. If I
| need to prioritize signal delivery once the backpressure
| has resolved itself, I can do that locally as well by
| separating streams (i.e. priority queueing). It doesn't
| change the fundamental nature of the system, though.
| lll-o-lll wrote:
| Auditing is fundamentally different because it has different
| durability and consistency requirements. I can buffer my
| logs, but I might need to transact my audit.
| otterley wrote:
| For most cases, buffering audit logs on local storage is
| fine. What matters is that the data is available and
| durable _somewhere_ in the path, not that it be
| transactionally durable at the final endpoint.
| chickensong wrote:
| You could have the log shipper filter events and create a
| separate audit stream with different behavior and
| destination.
| cluckindan wrote:
| Really, have sane log message types and include "audit"
| as one of them.
|
| Log levels could be considered an anti-pattern.
| mrkeen wrote:
| > Your logs are lying to you. Not maliciously. They're just not
| equipped to tell the truth.
|
| The best way to equip logs to tell the truth is to have other
| parts of the system consume them as their source of truth.
|
| Firstly: "what the system does" and "what the logs say" can't be
| two different things.
|
| Secondly: developers can't put less info into the logs than they
| should, because their feature simply won't work without it.
| 8n4vidtmkvmk wrote:
| That doesn't sound like a good plan. You're coupling logging
| with business logic. I don't want to have to think if i change
| a debug string am i going to break something.
| andoando wrote:
| Your logic wouldn't be dependent on a debug string, but some
| enum in a structured field. Ex, event_type:
| CREATED_TRANSACTION.
|
| Seeing logging as debugging is flawed imo. A log is
| technically just a record of what happened in your database.
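| A hypothetical Python sketch of that idea: downstream logic keys off a stable enum field, never the free-text message, so rewording a debug string can't break anything.

```python
import json
from enum import Enum

class EventType(str, Enum):
    # Stable identifiers that consumers may depend on.
    CREATED_TRANSACTION = "CREATED_TRANSACTION"
    REFUNDED_TRANSACTION = "REFUNDED_TRANSACTION"

def log_event(event_type: EventType, **fields):
    """Emit a structured record; the enum is the contract, the text is free to change."""
    record = {"event_type": event_type.value, **fields}
    print(json.dumps(record))
    return record

rec = log_event(EventType.CREATED_TRANSACTION,
                user_id="user-123", amount_cents=4200)
assert rec["event_type"] == "CREATED_TRANSACTION"
```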
| SoftTalker wrote:
| You're also assuming your log infrastructure is a lot more
| durable than most are. Generally, logging is not a guaranteed
| action. Writing a log message is not normally something where
| you wait for a disk sync before proceeding. Dropping a log
| message here or there is not a fatal error. Logs get rotated
| and deleted automatically. They are designed for retroactive
| use and best effort event recording, not assumed to be a
| flawless record of everything the system did.
| tetha wrote:
| One thing this is missing: Standardization and probably the ECS'
| idea of "related" fields.
|
| A common problem in a log aggregation is the question if you
| query for user.id, user_id, userID, buyer.user.id, buyer.id,
| buyer_user_id, buyer_id, ... Every log aggregation ends up being
| plagued by this. You need standard field names there, or it
| becomes a horrible mess.
|
| And for a centralized aggregation, I like ECS' idea of "related".
| If you have a buyer and a seller, both with user IDs, you'd have
| a `related.user.id` with both IDs in there. This makes it very
| simple to say "hey, give me everything related to request X" or
| "give me everything involving user Y in this time frame" (as long
| as this is kept up to date, naturally)
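| A small Python sketch (field names assumed, modeled on the ECS `related` idea described above) of collecting every user ID in an event under one queryable key:

```python
def add_related(event):
    """Gather every user id in the event under related.user.id (ECS-style),
    so 'everything involving user Y' is a single-field query."""
    ids = set()
    for role in ("buyer", "seller"):
        uid = event.get(role, {}).get("user", {}).get("id")
        if uid:
            ids.add(uid)
    event["related"] = {"user": {"id": sorted(ids)}}
    return event

event = add_related({
    "buyer": {"user": {"id": "u-1"}},
    "seller": {"user": {"id": "u-2"}},
})
assert event["related"]["user"]["id"] == ["u-1", "u-2"]
```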
| ttoinou wrote:
| I always wondered why we didn't have some kind of fuzzy
| English-word search regex/tool that is robust to typos,
| spelling mistakes, synonyms, plurals, conjugation, etc.
| j-pb wrote:
| I actually wrote my bachelor's thesis on this topic, but instead of
| going the ECS route (which still has redundant fields in
| different components) I went in the RDF direction. That system
| has shifted towards more of a middleware/database hybrid over
| time (https://github.com/triblespace/triblespace-rs). I always
| wonder if we'd actually need logging if we had more data-
| oriented stacks where the logs fall out as a natural byproduct
| of communication and storage.
| thevinter wrote:
| The presentation is fantastic and I loved the interactive
| examples!
|
| Too bad that all of this effort is spent arguing something which
| can be summarised as "add structured tags to your logs"
|
| Generally speaking my biggest gripe with wide logs (and other
| "innovative" solutions to logging) is that whatever perceived
| benefit you argue for doesn't justify the increased complexity
| and loss of readability.
|
| We're throwing away `grep "uid=user-123" application.log` to get
| what? The shipping method of the user attached to every log?
| Doesn't feel an improvement to me...
|
| P.S. The checkboxes in the wide event builder don't work for me
| (brave - android)
| dannyfreeman wrote:
| Do you really lose the ability to grep? You can still search
| for JSON fragments: `grep '"uid": "user-123"' application.log`
|
| If the logged JSON isn't pretty-printed, everything should still
| be on one line. You can also grep with the `--context` flag to
| get more surrounding lines.
| bambax wrote:
| > _Logging Sucks_
|
| But does it? Or is it bad logging, or excessive logging, or
| unsearchable logs?
|
| A client of mine uses SnapLogic, a middleware / ETL that's
| supposed to run pipelines in batch mode to pass data around
| between systems. It generates an enormous amount of logs that are
| so difficult to access, search, and read that they might as well
| not exist.
|
| We're replacing all of that with simple Python scripts that do
| the same thing and generate normal simple logs with simple errors
| when something's truly wrong or the data is in the wrong format.
|
| Terse logging is what you want, not an exhaustive (and
| exhausting) torrent of irrelevant information.
| asdev wrote:
| this is the best lead generation form i've ever seen
| roncesvalles wrote:
| AI slop blogvert. The first example is disingenuous btw. Everyone
| these days uses requestIDs to be able to query all log lines
| emanated by a single request, usually set by the first backend
| service to receive the request and then propagated using headers
| (and also set in the server response).
|
| There isn't anything radical about his proposed solutions either.
| Most log storage can be set with a rule where all warning logs or
| above can be retained, but only a sample of info and debug logs.
|
| The "key insight" is also flawed. The reason why we log at every
| step is because sometimes your request never completes and it
| could be for 1000 reasons but you really need to know how far it
| got in your system. Logging only a summary at the end is happy
| path thinking.
| mnahkies wrote:
| That was difficult to read and smelt very AI-assisted. Though the
| message was worthwhile, it could've been shorter and more to the
| point.
|
| A few things I've been thinking about recently:
|
| - we have authentication everywhere in our stack, so I've started
| including the user id on every log line. This makes getting a
| holistic view of what a user experienced much easier.
|
| - logging an error as a separate log line to the request log is a
| pain. You can filter for the trace, but it makes it hard to
| surface "show me all the logs for 5xx requests and the error
| associated" - it's doable, but it's more difficult than filtering
| on the status code of the request log
|
| - it's not enough to just start including that context, you have
| to educate your coworkers that it's now present. I've seen people
| making life hard for themselves because they didn't realize we'd
| added this context
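| A minimal Python sketch (logger name and field names assumed) of the first point: bind the user ID once per request via a context variable, and a logging filter stamps it onto every line automatically.

```python
import contextvars
import logging

# Set once at the top of request handling; read by every log line after that.
user_id_var = contextvars.ContextVar("user_id", default=None)

class ContextFilter(logging.Filter):
    def filter(self, record):
        # Attach the current user id to the record so formatters can use it.
        record.user_id = user_id_var.get()
        return True

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('{"msg": "%(message)s", "user_id": "%(user_id)s"}'))
logger.addHandler(handler)
logger.addFilter(ContextFilter())
logger.setLevel(logging.INFO)

user_id_var.set("user-123")
logger.info("checkout started")  # this line now carries user_id with no extra effort
```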
| spike021 wrote:
| If your codebase has the concept of a request ID, you could
| also feasibly use that to trace what a user has been doing with
| more specificity.
| mnahkies wrote:
| We do have both a span id and trace id - but I personally
| find this more cumbersome over filtering on a user id. YMMV
| if you're interested in a single trace then you'd filter for
| that, but I find you often also care what happened "around" a
| trace
| ivan_gammel wrote:
| ...and the same ID can be displayed to user on HTTP 500 with
| the support contact, making life of everyone much easier.
| dexwiz wrote:
| I have seen pushback on this kind of behavior because
| "users don't like error codes" or other such nonsense. UX
| and Product like to pretend nothing will ever break, and
| when it does they want some funny little image, not useful
| output.
|
| A good compromise is to log whenever a user would see the
| error code, and treat those events with very high priority.
| spockz wrote:
| We put the error code behind a kind of message/dialog
| that invites the user to contact us if the problem
| persists and then report that code.
|
| It's my long standing wish to be able to link
| traces/errors automatically to callers when they call the
| helpdesk. We have all the required information. It's just
| that the helpdesk has actually very little use for this
| level of detail. So they can only attach it to the ticket
| so that actual application teams don't have to search for
| it.
| ivan_gammel wrote:
| Nah, that's an easy problem to solve with UX copy:
| "Something went wrong. Try again or contact support.
| Your support request number is XXXX XXXX" (base-58
| version of a UUID).
| nine_k wrote:
| ...if it does not, you should add it. A request ID, trace ID,
| correlation key, whatever you call it, you should thread it
| through every remote call, if you value your sanity.
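| A tiny Python sketch of that threading (header name and helper names are hypothetical): reuse an incoming correlation ID, or mint one at the edge, and forward it on every remote call.

```python
import uuid

HEADER = "X-Request-Id"  # assumed header name; any stable key works

def ensure_request_id(headers):
    """Reuse an incoming correlation id, or mint one at the system edge."""
    rid = headers.get(HEADER) or str(uuid.uuid4())
    headers[HEADER] = rid
    return rid

def outgoing_headers(incoming):
    """Thread the same id through every downstream remote call."""
    return {HEADER: incoming[HEADER]}

incoming = {}                     # request arrived with no correlation id
rid = ensure_request_id(incoming)
assert outgoing_headers(incoming)[HEADER] == rid
```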
| kulahan wrote:
| If you care about this more than anything else (e.g. if you
| care about audits a LOT and need them perfect), you can
| simply code the app via action paths, rather than for
| modularity. It makes changes harder down the road, but for
| codebases that don't change much, this can be a viable
| tradeoff to significantly improve tracing and logging.
| xmprt wrote:
| On the other hand, investing in better tracing tools unlocks a
| whole nother level of logging and debugging capabilities that
| aren't feasible with just request logs. It's kind of like you
| mentioned with using the user id as a "trace" in your first
| message but on steroids.
| dexwiz wrote:
| These tools tend to be very expensive in my experience unless
| you are running your own monitoring cloud. Either you end up
| sampling traces at low rates to save on costs, or your
| observability bill is more than your infrastructure bill.
| dietr1ch wrote:
| Doing stuff like turning on tracing for clients that saw
| errors in the last 2 minutes, or for requests that were
| retried should only gather a small portion of your data.
| Maybe you can include other sessions/requests at random if
| you want to have a baseline to compare against.
| jonasdegendt wrote:
| We self host Grafana Tempo and whilst the cost isn't
| negligible (at 50k spans per second), the money saved in
| developer time when debugging an error, compared to having
| to sift through and connect logs, is easily an order of
| magnitude higher.
| khazhoux wrote:
| > That was difficult to read, smelt very AI assisted though the
| message was worthwhile...
|
| It won't be long before _ad computem_ comments like this are
| frowned upon.
| bccdee wrote:
| Why? "This was written badly" is a perfectly normal thing to
| say; "this was written badly because you didn't put in the
| effort of writing it yourself" doubly so.
| 0xbadcafebee wrote:
| Say they used AI to write it, it came out bad, and they
| published it anyway. They had the opportunity to "make it
| better" before publishing, but didn't. The only conclusion
| for this is, they just aren't good at writing. So whether
| AI is used or not, it'll suck either way. So there's no
| need to complain about the AI.
|
| It's like complaining that somebody typed a crappy letter
| rather than hand-wrote it. Either way the letter's gonna
| suck, so why complain that it was typed?
| alwa wrote:
| I read it as a more-or-less kind comment: "even though you'll
| notice that they let an AI make the writing terrible, the
| underlying point is good enough to be worth struggling
| through that and discussing"
| giancarlostoro wrote:
| TIDs (transaction IDs) are good here too. If you generate one and
| enforce it across all your services, spanning various teams and
| APIs, anyone on any team can grab a TID you provide and get the
| full end-to-end of one transaction.
| bob1029 wrote:
| > Logs were designed for a different era. An era of monoliths,
| single servers, and problems you could reproduce locally.
|
| I worked with enterprise message bus loggers in semiconductor
| manufacturing context wherein we had _thousands_ of participants
| on the message bus. It generated something like 300-400 megabytes
| per hour. Despite the insane volume we made this work really well
| using just grep and other basic CLI tools.
|
| The logs were mere time series of events. Figuring out the detail
| about specific events (e.g. a list of all the tools a lot
| visited) required writing queries into the Oracle monster. You
| _could_ derive history from the event logs if you had enough
| patience & disk space, but that would have been very silly given
| the alternative option. We used them predominantly to establish a
| causal chain between events while the details were still
| preliminary. Identifying suspects and such. Actually resolving
| really complicated business usually requires more than a
| perfectly detailed log file.
| iLoveOncall wrote:
| > It generated something like 300-400 megabytes per hour.
| Despite the insane volume we made this work really well using
| just grep and other basic CLI tools.
|
| 400MB of logs an hour is nothing at all, that's why a naive
| grep can work. You don't even need to rotate your log files
| frequently in this situation.
| fsniper wrote:
| At last, a sane person. Logs are for identifying the event
| timeline, not for acquiring the whole request/response data.
| Putting every detail into the logs, in my experience, makes
| understanding issues harder. Logs tell a story: when, and what
| happened; not how or why it happened. Why is in the code; how is
| in the combination of data, logs, events, and code.
|
| And loosely related, I also dislike log interfaces like the ELK
| stack. They make following the trail of events really hard. Most
| of the time you do not know what you are looking for, just a
| vague sense of why you are looking into the logs. So a line that
| passed 3 microseconds ago may be your eureka moment, one no
| search could identify; only intuition and following the logs
| diligently can.
| eterm wrote:
| Overly dismissive of OTLP without proper substance to the
| criticism.
| tuetuopay wrote:
| On some languages the tracing frameworks are a godsend. In Rust
| the instrument macro will automatically record all function
| arguments as span tags. Plonk anything in e.g jaeger and any
| full trace can be looked up from pretty much any value.
| ivan_gammel wrote:
| The problem statement in this article sounds weird. I thought in
| 2025 everyone logs at least thread id and context id (user id,
| request id etc), and in microservice architecture at least
| transaction or saga id. You don't need structured logging,
| because grep by this id is sufficient for incident investigation.
| And for analytics and metrics databases of events and requests
| make more sense.
| ardme wrote:
| Maybe better written and simplified to: "microservices suck".
| exabrial wrote:
| Our logging guidance is: "Don't write comments, write logs" and
| that serves us pretty well. The point being: don't write "clever"
| code, write obvious code, and try to make it similar to everything
| else that's been done, regardless of whether you agree with it.
| lstroud wrote:
| Sounds like he's just asking for an old school Inman style
| transaction log.
| UltraSane wrote:
| Splunk is expensive but it makes searching logs so much faster
| and more effective. I think of it as SQL for unstructured data.
| preisschild wrote:
| loki works great too and is FOSS
| UltraSane wrote:
| We really need an open-source implementation of the Splunk
| Query Language. The query language is what lets you actually
| find the few dozen relevant lines out of the billions of
| lines logged.
| theodpHN wrote:
| Just out of curiosity, how have you seen risk/compliance,
| regulatory, and audit departments at organizations deal with the
| disconnect between security and privacy for something like
| mainframe logging (e.g., JES2, JES3), which is typically
| inherently governed, and modern distributed logging, which is
| typically inherently permissive? Both are vastly different
| approaches, but each is somehow considered 'compliant.' Btw,
| employees at a company I was at were once investigated for
| insider trading simply because it was discovered the company used
| pooled logs that were accessible by production support
| programmers (the company decided to override the default
| mainframe security), which was deemed a possible source of
| insider trading information that could be tapped into by those
| who had log access (programmers were eventually cleared if it was
| discovered their small personal trades were immaterial and just
| coincidental with the company's trading, but the investigation
| led to uncomfortable confrontations for some!).
| shireboy wrote:
| Kinda get what he's saying: provide more metadata with structured
| logging as opposed to lots of string only logs. Ok, modern
| logging frameworks steer you towards that anyway. But as a
| counterpoint: often it can be hard to safely enrich logging like
| that. In the example they include subscription age, user info,
| etc. More than once I've seen logging code lookup metadata or
| assume it existed, only to cause perf issues or outright errors
| as expected data didn't exist. Similar with sampling, it can be
| frustrating when the thing you need gets sampled out. In the end
| "it depends" on scenario, but I still find myself not logging
| enough or else logging too much
| Lord_Zero wrote:
| Structured Logging is not just JSON. It's the use of templates
| with context. It solves 90% of what this article complains about
| if you just log the template along with the variables and the
| message separately. Along with logging the right stuff. IE `"User
| {username} created order {orderid}"`
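| A hypothetical Python sketch of that pattern: keep the template, the rendered message, and the variables as separate fields, so you can aggregate on the template and filter on the params.

```python
import json

def log_templated(template, **params):
    """Log template, rendered message, and variables separately:
    group-by on 'template', filter on the individual params."""
    record = {
        "template": template,
        "message": template.format(**params),
        **params,
    }
    print(json.dumps(record))
    return record

rec = log_templated("User {username} created order {orderid}",
                    username="alice", orderid="ord-42")
assert rec["message"] == "User alice created order ord-42"
assert rec["template"] == "User {username} created order {orderid}"
```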
| nacozarina wrote:
| AI writing sucks even more, get rekt
| adamddev1 wrote:
| > Today, a single user request might touch 15 services, 3
| databases, 2 caches, and a message queue.
|
| And this is why _the internet_ today sucks.
| redleader55 wrote:
| The article, AI or not, is extremely naive. It doesn't mention
| any premise or any problem to solve. Proposes a solution and just
| goes with it. What if your monster of an event is lost when your
| service crashes or is lost by the logging library/service/etc?
| What if you're interested in measuring, post factum, how long
| each step takes? What if you want to trace a log through several
| (micro-)services and maybe between a mobile app and some batch
| job executor that runs once a day?
|
| "Logging sucks" when you don't understand the problem you're
| trying to solve.
| gfody wrote:
| the best implementation of structured logging I've seen is dotnet
| build's binlogs (https://msbuildlog.com), I would love to see it
| evolve into a general purpose logging solution
| kalmyk wrote:
| distributed event id and you are all set
| grekowalski wrote:
| "Logs were designed for a different era. An era of monoliths,
| single servers, and problems you could reproduce locally."
|
| But the next era will look like the previous one. Today a
| monolith is enough for most apps.
| yoan9224 wrote:
| Logs were designed for a different era. An era of monoliths,
| single servers, and problems you could reproduce locally. Today,
| a single user request might touch 15 services, 3 databases, 2
| caches, and a message queue.
|
| If a user request is hitting that many things, in my view, that
| is a deeply broken architecture.
|
| I'm building an analytics SaaS and we made the conscious decision
| to keep it simple: Next.js API routes + Supabase + minimal
| external services. A single page view hits maybe 3 components max
| (CDN -> App -> Database).
|
| That said, I agree completely on structured logging with rich
| context. We include user_id, session_id, and event_type on every
| log line. Makes debugging infinitely easier.
|
| The "wide events" concept is solid, but the real win is just
| having consistent, searchable structure. You don't need a
| revolutionary new paradigm - just stop logging random strings and
| use JSON with a schema.
| gijoeyguerra wrote:
| Persisting a data schema that represents business events is a
| great idea. That's more about Event Sourcing though and doing
| that can answer a ton of questions about the system without doing
| it in log messages.
|
| Wide events as a strategy is expensive, even with sampling, and
| doesn't address the fundamental problem - why do we log messages?
|
| I was hoping the article would enumerate why we log messages.
| Nailing down those scenarios first will lead to a happy life.
|
| Why do we log? - proof of life - is the system running? - what is
| the state (in memory) when an error occurred? - when did an error
| occur? - do I need to get up at 2 am and fix something? - what do
| I need to fix?
|
| I feel like every team operating a system has their own reasons
| for logging.
| fny wrote:
| This seems like a classic time vs space trade off.
|
| Instead of reconstructing a "wide event" from multiple log lines
| with the same request id, the suggestion seems to be logging wide
| events repeatedly to simplify reconstruction from request ids.
|
| I personally don't see the advantage, and in either scenario, if
| you're not logging what's needed, you're screwed.
| Hackbraten wrote:
| > No grep-ing.
|
| How is grep a bad thing? I find myself using it all the time.
|
| I'm not into graphical user interfaces. They overwhelm me. By the
| time I've clicked myself through the GUI or written some horrible
| proprietary $COMPANY Query Language string, I might have already
| figured out the bug using tried and tested CLI tools.
| thangalin wrote:
| Use events instead of repetitious logging calls.
|
| https://dave.autonoma.ca/blog/2022/01/08/logging-code-smell/
| XCSme wrote:
| I've recently added error tracking to my self-hosted analytics
| app (UXWizz), and the way I did it is simply add extra events to
| each user/session. Once you have the concept of a session or
| user, you can simply attach errors or logs as Events stored for
| that user. This solves the main problem mentioned in the article,
| where you don't know what happened, plus being an Event stored in
| a MySQL database, you can still query it.
|
| Why not simply use Events for logging, instead of plain strings?
| Illniyar wrote:
| While I agree with some of it, I feel like there's a big gotcha
| here that isn't addressed. Having 1 single wide event, at the end
| of a request, means that if something unexpected happens in the
| middle (stack overflow, some bug that throws an error that
| bypasses your logging system, lambda times out etc...) you don't
| get any visibility into what happens.
|
| You also most likely lose out on a lot of logging frameworks your
| language has that your dependencies might use.
|
| I would say this is a good layer to put on top of your regular
| logs. Make sure you have a request/session wide id and aggregate
| all those in your clickhouse or whatever into a single "log".
| wesammikhail wrote:
| The way I have solved for this in my own framework in PHP is by
| having a Logging class with the following interface:
|
|     interface LoggerInterface {
|         // calls $this->system(LEVEL_ERROR, ...);
|         public function exception(Throwable $e): void;
|
|         // Typical system logs
|         public function system(string $level, string $message,
|             ?string $category = null, mixed ...$extra): void;
|
|         // User-specific logs that can be seen in the user's "my history"
|         public function log(string $event, int|string|null $user_id = null,
|             ?string $category = null, ?string $message = null,
|             mixed ...$extra): void;
|     }
|
| I also have a global exception handler that is registered at
| application bootstrap time that takes any exception that
| happens at runtime and runs $logger->exception($e);
|
| There is obviously a tiny bit more of boilerplating to this to
| ensure reliability, but it works so well that I can't live
| without it anymore. The logs are then inserted into a wide DB
| table with all the fields one could ever want to examine, thanks
| to the variadic parameter.
| pastage wrote:
| Having logs in the format "connection X:Y accepted at Z ns for
| http request XXX" and then a "connection X:Y closed at Z ns for
| http response XXX" is rather nice when debugging issues on slow
| systems.
| paragraft wrote:
| I've recently come off a team that was racking up a huge Splunk
| bill with ~70 log events for each request on a high traffic
| service, and this is all very resonant (except the bit about
| sampling, I never gave that much thought - reducing our Splunk
| bill 70x was ambitious enough for me!).
|
| Hadn't heard the "wide event" name, but I had settled on the same
| idea myself in that time (called them "top-level events" - i.e.
| we would gather information from the duration of the request and
| only log it at the "top" of the stack at the end), and
| evangelised them internally mostly on the basis it gave you
| fantastic correlation ability.
|
| In theory if you've got a trace id in Splunk you can do
| correlated queries anyway, but we were working in Spring and
| forever having issues with losing our MDC after doing cross-
| thread dispatch and forgetting to copy the MDC thread global
| across. This wasn't obvious from the top-level, and usually only
| during an incident would you realise you weren't seeing all the
| loglines you expected for a given trace. So absent a better
| solution there, tracking debug info more explicitly was
| appealing.
|
| Also used these top-level events to store sub-durations (e.g. for
| calling downstream services, invoking a model etc), and with
| Splunk if you record not just the length of a sub-process but its
| absolute start, you can reconstruct a hacky waterfall chart of
| where time was spent in your query.
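| A minimal Python sketch (class and field names assumed) of such a top-level event: accumulate sub-durations plus their absolute starts during the request, and emit one wide record at the end.

```python
import json
import time
from contextlib import contextmanager

class TopLevelEvent:
    """Accumulate context during a request; emit one wide event at the end."""
    def __init__(self, **fields):
        self.fields = fields
        self.start = time.monotonic()

    @contextmanager
    def timed(self, name):
        t0 = time.monotonic()
        # Record the absolute offset from request start, so a waterfall
        # of where time went can be reconstructed later in the query tool.
        self.fields[f"{name}_start_ms"] = round((t0 - self.start) * 1000, 3)
        try:
            yield
        finally:
            self.fields[f"{name}_ms"] = round((time.monotonic() - t0) * 1000, 3)

    def emit(self):
        self.fields["total_ms"] = round((time.monotonic() - self.start) * 1000, 3)
        print(json.dumps(self.fields))
        return self.fields

evt = TopLevelEvent(request_id="req-1", service="checkout")
with evt.timed("downstream_call"):
    time.sleep(0.01)  # stand-in for calling a downstream service
out = evt.emit()
```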
| efitz wrote:
| Because of the nature of how software is built and deployed
| nowadays, it's generally not possible to write single log entries
| that tell the "whole story" of "what happened".
|
| I could write about this for hours, but instead I'll just discuss
| two concepts that you need in modern logging: vertical
| correlation and horizontal correlation.
|
| Within a system, requests tend to go "up" and "down" stacks of
| software. It is very useful in these scenarios to have "vertical
| correlation" fields shared between adjacent layers, so that
| activity in one layer can be unambiguously attributed to activity
| in the adjacent layers. But sharing such a correlation value
| requires passing the value between layers, which might be a
| breaking api change. Occasionally it's possible to construct a
| correlation value at each adjacent layer by transforming existing
| parameters in exactly the same way on the calling side and called
| side.
|
| Additionally, software on one system converses with software on
| other systems; in those cases you need to have pairwise
| correlation values between adjacent peer layers. Again, same
| limitations apply to carrying such a correlation value via the
| API or protocol.
|
| Really foresighted devs can anticipate these requirements and
| generate unique transaction ids that can be shared between
| machines and up and down the stack.
| simonw wrote:
| I hope registering an entire domain name for a blog post doesn't
| become a trend. I like linking to things that are likely to last
| a long time - a personal blog is one thing, but expecting people
| to keep paying the renewal fee every year for a single article
| feels less likely to me.
|
| A good alternative here is subdomains, since those don't have an
| additional annual fee. https://logging-sucks.boristane.com/ could
| work well here.
| willempienaar wrote:
| I kind of agree, but the message in this particular post does
| border on https://simonwillison.net/2024/Jul/13/give-people-
| something-...
| simonw wrote:
| Maybe I should have written "Give people something to link to
| that they can expect to stick around for a very long time"
| flockonus wrote:
| Sorry, this is not a "blog post" - it's far closer to digital
| marketing. A lead attractor as the author is trying to sell a
| service very clearly by the end of his page (no disrespect
| meant to either).
| etamponi wrote:
| > Logs were designed for a different era. An era of monoliths,
| single servers, and problems you could reproduce locally. Today,
| a single user request might touch 15 services, 3 databases, 2
| caches, and a message queue. Your logs are still acting like it's
| 2005.
|
| Perhaps it's time to take back the good things from 2005.
| 0xbadcafebee wrote:
| > Here's the mental model shift that changes everything: Instead
| of logging what your code is doing, log what happened to this
| request.
|
| Yeah that doesn't magically fix everything. Logging is still an
| arbitrary, clunky, unintuitive process that requires intentional
| design and extra systems to be useful.
|
| The "Wide Event log" example is 949 bytes, which isn't
| unmanageably large, but it is 3x larger than most log messages
| which are about 300 bytes. And in that blob of data might be key
| insights, but it is left up to an extra engineering process to
| discover what might be unusual in that blob. It lacks things like
| code line numbers, stack trace, and context given by the program
| about its particular functions (rather than assumptions based on
| a few pieces of metadata). And it's excessively verbose, as it
| has a trace and request ID and service name, but duplicates
| information already available to tracing systems based on those 3
| metrics.
|
| > Wide events are a philosophy: one comprehensive event per
| request, with all context attached.
|
| That's simply impossible. You cannot have all context from
| viewing a single point in the network, regardless of how hard you
| try to record or pass on information. That's the whole point of
| tracing: you correlate the context of different network points,
| specifically because that's the only way to discover the missing
| details.
|
| > Modern columnar databases (ClickHouse, BigQuery, etc.) are
| specifically designed for high-cardinality, high-dimensionality
| data. The tooling has caught up. Your practices should too.
|
| You should not depend on a space shuttle to get to the grocery
| store. Logging is intended to be an abstracted component which
| can be built on by other systems. Your app should work just as
| well running from Docker on your laptop as it does in the cloud.
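For concreteness, the "wide event" pattern being debated above can be sketched as one record that is enriched throughout a request and emitted once at the end. This is a minimal illustration, not anyone's actual implementation; the handler, field names, and values are all hypothetical:

```python
import json
import time
import uuid


def handle_checkout(user_id: str, cart_total: float) -> dict:
    """Handle a (hypothetical) checkout request, accumulating one wide event."""
    event = {
        "timestamp": time.time(),
        "service": "checkout",
        "trace_id": uuid.uuid4().hex,  # would normally come from the tracing system
        "user_id": user_id,
        "cart_total": cart_total,
    }
    # ... do the actual work, enriching the event as you go ...
    event["payment_provider"] = "stripe"  # hypothetical detail
    event["db_query_count"] = 3
    event["cache_hit"] = True
    event["duration_ms"] = 42
    event["status"] = "ok"
    # One comprehensive, queryable line per request instead of many fragments.
    print(json.dumps(event))
    return event
```

The trade-off both sides describe is visible here: a single self-contained record is easy to query by any dimension, but it still lacks stack traces and per-function context, and several of its fields overlap with what a tracing system already records.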
| zX41ZdbW wrote:
| ClickHouse is a tiny component - a single binary that runs on a
| laptop.
| scolvin wrote:
| Correction, logging used to suck - now it's fixed
| https://pydantic.dev/logfire :-)
| groundzeros2015 wrote:
| > An era of monoliths, single servers, and problems you could
| reproduce locally.
|
| Actually this is the problem. It's extremely difficult to debug
| when you do not preserve basic properties that allow you to read
| code and reason about events.
| KaiserPro wrote:
| I agree that logging sucks bollocks, especially when most of
| the time you really want metrics.
|
| OpenTelemetry is great, but sadly a lot of the stuff that I am
| using doesn't support it.
|
| The thing that made it much more bearable, even easy, is Loki
| and a decent log parser.
|
| I know a lot of kids like using SQL to interact with things,
| but Loki's explore interface beats the living shit out of SQL
| (in my opinion); it's really simple to interactively isolate
| and slice logs, and you can build up your query step by step.
|
| It beats Splunk/Sumo/Scuba (Facebook's log system) hands down
| in terms of searchability.
| mkarrmann wrote:
| I broadly agree with the article.
|
| The described pattern is standard at Meta. This, along with the
| infrastructure and tooling to support it, was the single largest
| "devx quality of life improvement" in my experience moving to big
| tech.
| hoppp wrote:
| "Today, a single user request might touch 15 services, 3
| databases, 2 caches, and a message queue."
|
| This right here is the fundamental problem: the way it's done
| is highly inefficient and complex, and I believe it only
| exists so cloud providers can sell their expensive offerings.
|
| A monolith is fine 90% of the time.
| sgarland wrote:
| That, and everyone thinks they have to do things this way, so
| it's a terrible cycle.
|
| So many problems would be solved if service calls were IPC
| instead of network calls.
| juancn wrote:
| Logging is one tool of many; you need logs, metrics, and
| distributed tracing at the very least for any significant
| piece of modern infra.
|
| If you really want to get serious, you also want some kind of
| continuous profiling (like Pyroscope) or at the very least a
| periodic thread dump collector (for diagnosing serious
| degradation, once every couple of minutes is enough).
|
| But logging is still a great tool.
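The "periodic thread dump collector" mentioned above can be sketched in a few lines of stdlib Python: capture a stack trace for every live thread on a timer. A minimal sketch, not a production collector (a real one would ship dumps to storage rather than stderr; `sys._current_frames` is also a CPython implementation detail):

```python
import sys
import threading
import traceback


def dump_all_threads() -> str:
    """Render a stack trace for every live thread (one 'thread dump')."""
    lines = []
    for thread_id, frame in sys._current_frames().items():
        lines.append(f"--- thread {thread_id} ---")
        lines.extend(l.rstrip() for l in traceback.format_stack(frame))
    return "\n".join(lines)


def start_dump_collector(interval_s: float = 120.0) -> threading.Timer:
    """Write a thread dump every `interval_s` seconds.

    Dumps go to stderr here; a real collector would persist them
    so you can diff dumps across a slow degradation.
    """
    def tick() -> None:
        print(dump_all_threads(), file=sys.stderr)
        start_dump_collector(interval_s)  # re-arm the timer

    timer = threading.Timer(interval_s, tick)
    timer.daemon = True  # don't keep the process alive just for dumps
    timer.start()
    return timer
```

Comparing dumps taken a couple of minutes apart quickly shows which threads are stuck on the same frame, which is usually enough for the degradation diagnostics the comment describes.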
| iamwil wrote:
| Content marketing and lead generation are getting sneakier
| nostrademons wrote:
| Google solved most of these problems around 2005, with tools like
| LOG_EVERY_N (now part of absl [1]), Dapper [2], and several other
| tools that aren't public yet. You can trace an individual request
| through every internal system, view the request/response
| protobufs, every log that the server emitted, timing details,
| etc. More to the point, you can _share_ this trace, which means
| that it's possible for one person to discover the bug, reproduce
| it, and then have another person in a completely different
| office/timezone/country debug it, _even if the latter cannot
| reproduce the bug themselves_. This has proved hugely useful;
| just last week I was tasked with reproducing a bug on sparsely
| available prerelease hardware so that a distant team could
| diagnose what went wrong.
|
| The key insight that this article hints at but doesn't quite get
| to: you should treat your logs as a _product_ whose customers
| are _the rest of the devs in your company_. The way you log
| things is intimately connected with what you want to do with
| them, and you need to build systems to generate useful insights
| from the log statements. In some cases it literally is part of
| the product: many of the machine learning systems that generate
| recommendations, search results, spam filtering, abuse detection,
| traffic direction, etc. are all based on the logs for the
| product, and you need to consider them as first-class citizens
| that you absolutely cannot break while adding new features. Logs
| are not just for debugging.
|
| [1] https://absl.readthedocs.io/en/latest/absl.logging.html
|
| [2] https://research.google/pubs/dapper-a-large-scale-
| distribute...
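The LOG_EVERY_N tool mentioned above rate-limits logging per call site: it emits the 1st occurrence and then every Nth one after that. A minimal stdlib sketch of that semantics (absl-py provides the real thing, keyed on the caller's file and line; the explicit `site` argument here is a simplification):

```python
import logging
from collections import defaultdict

logging.basicConfig(level=logging.INFO)

# Per-site call counters; absl keys these on the caller's file:line.
_call_counts: defaultdict = defaultdict(int)


def log_every_n(n: int, msg: str, *args, site: str = "default") -> bool:
    """Emit `msg` on the 1st call and then every nth call for `site`.

    Returns whether the message was actually logged (handy for tests).
    """
    count = _call_counts[site]
    _call_counts[site] += 1
    if count % n == 0:
        logging.info(msg, *args)
        return True
    return False
```

On a hot path this keeps the log volume bounded while still confirming the code is being exercised, which is the niche LOG_EVERY_N fills.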
___________________________________________________________________
(page generated 2025-12-21 23:00 UTC)