[HN Gopher] Tracing: Structured logging, but better
___________________________________________________________________
Tracing: Structured logging, but better
Author : pondidum
Score : 188 points
Date : 2023-09-18 21:52 UTC (2 days ago)
(HTM) web link (andydote.co.uk)
(TXT) w3m dump (andydote.co.uk)
| crabbone wrote:
| > The second problem with writing logs to stdout
|
| Who on Earth does that? Logs are almost always written to
| stderr... in part to prevent the very problems the author is
| talking about (e.g. mixing with the output generated by the
| application).
|
| I don't understand why this has to be either/or... If you store
| the trace output somewhere you get a log... (let's call it an
| "unannotated" log, since a trace won't have the human-readable
| message part). Tracing is great when examining the application
| interactively, but if you use the same exact tool and save the
| results for later, you get logs, with all the same problems the
| author ascribes to logs.
| FridgeSeal wrote:
| I do, as does everyone at my work? Along with basically
| everyone I've ever worked with, ever?
|
| Like, I develop CLI apps, so what else would go to stdout
| that you suppose will interfere?
| dalyons wrote:
| Been doing it for a decade-plus, ever since the 12-factor app
| concept became popular. It's way more common, imho, for web apps
| than stderr logging.
| OJFord wrote:
| Loads of people, it drives me around the twist too (especially
| when there's inevitably custom parsing to separate the log
| messages from the output) but it happens, probably well
| correlated with people that use more GUI tools, not that
| there's anything wrong with that, just I think the more you use
| a CLI the more you're probably aware of this being an issue, or
| other lesser best practices that might make life easier like
| newline and tab separation.
| alkonaut wrote:
| I like a log to read like a book if it's the result of a task
| taking a finite time, such as an installation, a compilation, or
| the loading of a browser page. Users are going to look into it
| for clues about what happened, and they a) aren't always the
| people who wrote the tools and b) don't have access to the source
| code or any special log analytics/querying tools.
|
| That's when you want a _log_ and that's what the big traditional
| log frameworks were designed to handle.
|
| A web backend/service is basically the opposite. End users don't
| have access to the log, those who analyze it can cross reference
| with system internals like source code or db state and the log is
| basically infinite. In that situation a structured log and
| querying obviously wins.
|
| It's honestly not even clear that these systems are that closely
| related.
| WatchDog wrote:
| It's a good distinction to make: logging for client-based
| systems is essentially UI design.
|
| For a web app serving lots of concurrent users, the logs are
| essentially unreadable without tools, so you may as well
| optimise them for tool-based consumption.
| layer8 wrote:
| > Log Levels are meaningless. Is a log line debug, info, warning,
| error, fatal, or some other shade in between?
|
| I partly agree and disagree. In terms of severity, there are only
| three levels:
|
| - info: not a problem
|
| - warning: potential problem
|
| - error: actual problem (operational failure)
|
| Other levels like "debug" are not about severity, but about level
| of detail.
|
| In addition, something that is an error in a subcomponent may
| only be a warning or even just an info on the level of the
| superordinate component. Thus the severity has to be interpreted
| relative to the source component.
|
| The latter can be an issue if the severity is only interpreted
| globally. Either it will be wrong for the global level, or
| subcomponents have to know the global context they are running in
| to use the severity appropriate for that context. The latter
| causes undesirable dependencies on a global context. Meaning, the
| developer of a lower-level subcomponent would have to know the
| exact context in which that component is used in order to choose
| the appropriate log level. And what if the component is used in
| different contexts entailing different severities?
|
| So one might conclude that the severity indication is useless
| after all, but IMO one should rather conclude that severity needs
| to be interpreted relative to the component. This also means that
| a lower-level error may have to be logged again in the higher-
| level context if it's still an error there, so that it doesn't
| get ignored if e.g. monitoring only looks at errors on the
| higher-level context.
|
| Differences between "fatal" and "error" are really nesting
| differences between components/contexts. An error is always fatal
| on the level where it originates.
| abraae wrote:
| > In addition, something that is an error in a subcomponent may
| only be a warning or even just an info on the level of the
| superordinate component.
|
| Or, keep it simple.
|
| - error means someone is alerted urgently to look at the
| problem
|
| - warning means someone should be looking into it eventually,
| with a view to reclassifying as info/debug or resolving it.
|
| IMO many people don't care much about their logs, until the
| shit hits the fan. Only then, in production, do they realise
| just how much harder their overly verbose (or inadequate)
| logging is making things.
|
| The simple filter of "all errors send an alert" can go a long
| way to encouraging a bit of ownership and correctness on
| logging.
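A minimal sketch of that "all errors send an alert" filter, using Python's stdlib logging. The `AlertHandler` name and the list-backed alert sink are illustrative; in practice `alert_fn` would be a pager or webhook client.

```python
import logging

class AlertHandler(logging.Handler):
    """Fires an alert callback for every ERROR-or-worse record."""

    def __init__(self, alert_fn):
        super().__init__(level=logging.ERROR)  # WARNING and below pass through silently
        self.alert_fn = alert_fn

    def emit(self, record):
        self.alert_fn(self.format(record))

alerts = []
logger = logging.getLogger("orders")
logger.addHandler(AlertHandler(alerts.append))

logger.error("payment gateway timed out")  # alerts
logger.warning("retrying payment")         # logged, but no alert
```

The handler-level filter is the whole mechanism: anything logged as an error pages someone, so teams quickly stop logging non-problems as errors.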
| layer8 wrote:
| > - error means someone is alerted urgently to look at the
| problem
|
| The issue is that the code that encounters the problem may
| not have the knowledge/context to decide whether it warrants
| alerting. The code higher up that does have the knowledge, on
| the other hand, often doesn't have the lower-level
| information that is useful to have in the log for analyzing
| the failure. So how do you link the two? When you write
| modular code that minimizes assumptions about its context,
| that situation is a common occurrence.
| abraae wrote:
| If the code detecting the error is a library/subordinate
| service then the same rule can be followed - should this be
| immediately brought to a human's attention?
|
| The answer for a library will often be no, since the
| library doesn't "have the knowledge/context to decide
| whether it warrants alerting".
|
| So in that case the library can log as info, and leave it
| to the caller to log as error if warranted (after learning
| about the error from return code/http status etc.).
|
| When investigating the error, the human has access to the
| info details from the subordinate service.
| SkyPuncher wrote:
| I agree with your premise, but do consider debug to be a fourth
| level.
|
| Info is things like "processing X"
|
| Debug is things like "variable is Y" or "made it to this point"
| Hermitian909 wrote:
| The OP is wrong, log levels are very valuable if you leverage
| them.
|
| Here's a classic problem as an illustration: The storage cost
| of your logs is really prohibitive. You would like to cut out
| some of your logs from storage but cannot lower retention below
| some threshold (say 2 weeks, maybe). For this example, assume
| that tracing is also enabled and every log has a traceId.
|
| A good answer is to run a compaction job that inspects each
| trace. If it contains an error, preserve it. Remove X% of all
| other traces.
|
| Log levels make the ergonomics for this excellent and it can
| save millions of dollars a year at sufficient scale.
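A sketch of that compaction job in Python, assuming each log record is a dict with `traceId` and `level` fields (the record shape is an assumption, not any particular vendor's format):

```python
import random
from collections import defaultdict

def compact(logs, keep_fraction=0.1, rng=random.random):
    """Group log lines by traceId; keep every trace that contains an
    ERROR line, and only a sample of the error-free traces."""
    traces = defaultdict(list)
    for line in logs:
        traces[line["traceId"]].append(line)

    kept = []
    for trace_id, lines in traces.items():
        if any(l["level"] == "ERROR" for l in lines) or rng() < keep_fraction:
            kept.extend(lines)
    return kept

logs = [
    {"traceId": "a", "level": "INFO", "msg": "start"},
    {"traceId": "a", "level": "ERROR", "msg": "boom"},
    {"traceId": "b", "level": "INFO", "msg": "ok"},
]
# With rng forced to 1.0 (never sample), only the erroring trace survives.
assert compact(logs, rng=lambda: 1.0) == logs[:2]
```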
| BillinghamJ wrote:
| I tend to think of "warning" as - "something unexpected
| happened, but it was handled safely"
|
| And then "error" as - "things are not okay, a developer is
| going to need to intervene"
|
| And errors then split roughly between "must be fixed sometime",
| and "must be fixed now/ASAP"
| layer8 wrote:
| > I tend to think of "warning" as - "something unexpected
| happened, but it was handled safely"
|
| It was handled safely at the level where it occurred, but
| because it was unusual/unexpected, the underlying cause may
| cause issues later on or higher up.
|
| If one were sure it would 100% not indicate any issue, one
| wouldn't need to warn about it.
| waffletower wrote:
| There are logging libraries that include syntactically scoped
| timers, such as mulog (https://github.com/BrunoBonacci/mulog).
| While a great library, we preferred timbre
| (https://github.com/taoensso/timbre) and rolled our own logging
| timer macro that interoperates with it. More convenient to have
| such niceties in a Lisp of course. Since we also have
| OpenTelemetry available, it would also be easy to wrap traces
| around code form boundaries as well. Thanks OP for the idea!
| mrkeen wrote:
| > If you're writing log statements, you're doing it wrong.
|
| I too use this bait statement.
|
| Then I follow it up with (the short version):
|
| 1) Rewrite your log statements so that they're machine readable
|
| 2) Prove they're machine-readable by having the down-stream
| services read them instead of the REST call you would have
| otherwise sent.
|
| 3) Switch out log4j for Kafka, which will handle the persistence
| & multiplexing for you.
|
| Voila, you got yourself a reactive, event-driven system with
| accurate "logs".
|
| If you're like me and you read the article thinking "I like the
| result but I hate polluting my business code with all that
| tracing code", well now you can create an _independent_ reader of
| your kafka events which just focuses on turning events into
| traces.
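The "independent reader" idea can be sketched without any Kafka machinery at all; here a plain list stands in for the topic, and the event shape (`traceId`, `spanId`, `kind`) is an illustrative assumption:

```python
def events_to_spans(events):
    """Fold a stream of business events into spans, keyed by
    (traceId, spanId), without the producing code knowing about tracing."""
    spans = {}
    for e in events:
        key = (e["traceId"], e["spanId"])
        if e["kind"] == "start":
            spans[key] = {"name": e["name"], "start": e["ts"], "end": None}
        elif e["kind"] == "end":
            spans[key]["end"] = e["ts"]
    return spans

events = [
    {"traceId": "t1", "spanId": "s1", "kind": "start", "name": "checkout", "ts": 100},
    {"traceId": "t1", "spanId": "s1", "kind": "end", "ts": 250},
]
span = events_to_spans(events)[("t1", "s1")]
assert span["end"] - span["start"] == 150
```

In a real system this loop would sit in a consumer reading the event topic; the business code only ever publishes events.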
| rewmie wrote:
| > 3) Switch out log4j for Kafka, which will handle the
| persistence & multiplexing for you.
|
| I don't think this is a reasonable statement. There are already
| a few logging agents that support structured logging without
| dragging in heavyweight dependencies such as Kafka. Bringing up
| Kafka sounds like a case of a solution looking for a problem.
| ahoka wrote:
| I think OP meant event sourcing.
| rewmie wrote:
| > I think OP meant event sourcing.
|
| That is really beside the point. Logging and tracing have
| always been fundamentally event sourcing, but that never
| forced anyone to onboard onto freaking Kafka of all event
| streaming/messaging platforms.
|
| This kind of suggestion sounds an awful lot like resume-
| driven development instead of actually putting together a
| logging service.
| mrkeen wrote:
| > There are already a few logging agents that support
| structured logging without dragging in heavyweight
| dependencies such as Kafka.
|
| What are they? Because admittedly I've lost a little love for
| the operational side of Kafka, and I wish the client side
| were a little "dumber", so I could match it better to my use
| cases.
| bowsamic wrote:
| How to get me to leave your company 101
| mrkeen wrote:
| I did write a pretty glib description of what to do ;)
|
| That said, I've had conflicts with a previous team-mate about
| this. He couldn't wrap his head around Kafka being a source
| of truth. But when I asked him whether he'd trust our Kafka
| or our Postgres if they disagreed, he conceded that he'd
| believe Kafka's side of things.
| amelius wrote:
| This is stuff that a debugger is supposed to do for you, for
| free.
|
| This should not require code at the application level, but it
| should be implemented at the tooling level.
| goalieca wrote:
| Logging is essential for security. I think tracing is wonderful
| and so are metrics. I see these as more of a triad for
| observability.
| waffletower wrote:
| Indeed, the three legs (metrics, logs, traces) of
| OpenTelemetry's telescope. https://opentelemetry.io
| candiddevmike wrote:
| Something missing from OTel IMO is a standard way of linking
| all three together. It seems like an exercise left to the
| reader, but I feel like there should be standard metadata for
| showing a relationship between traces, metrics, and logs.
| Right now each of these functions is on an island (same with
| the tooling and storage of the data, but that's another
| rant).
| discodachshund wrote:
| Isn't that the trace ID? For metrics, it's in the form of
| exemplars, and for logs it's the log context.
| candiddevmike wrote:
| That might be dependent on the library then, there isn't
| an official OTel Go logging library yet. Seems you have
| to add the trace ID exemplars manually too
| phillipcarter wrote:
| Go is behind several of the languages in OTel right now.
| Just a consequence of a very difficult implementation and
| its load-bearing nature as being the language (and
| library) of choice for CNCF infrastructure. If you use
| Java or .NET, for example, it's quite fleshed out.
| jen20 wrote:
| One would hope that there will not _be_ an Open Telemetry
| logging library for Go. Unlike last time there was a
| thread about this, there is now a standard - `slog` in
| the stdlib.
| spullara wrote:
| It drives me insane that the standardized tracing libraries have
| you only report closed spans. What if it crashes? What if it
| stalls? Why should I keep open spans in memory when I can just
| write an end span event?
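One way to get what the commenter is asking for is to emit separate start and end events immediately rather than buffering an open span; a crash mid-span then still leaves the start event behind. A sketch in Python (the event shape is made up for illustration):

```python
import time
import uuid

def emit(event, sink):
    sink.append(event)  # in practice: flush to the trace backend immediately

def traced(name, sink, fn):
    """Emit a span-start event before running fn and a span-end event
    after. Nothing is held in memory between the two, so if the process
    dies mid-span, the start event has already been written."""
    span_id = uuid.uuid4().hex
    emit({"span": span_id, "name": name, "event": "start", "ts": time.time()}, sink)
    try:
        return fn()
    finally:
        emit({"span": span_id, "name": name, "event": "end", "ts": time.time()}, sink)

sink = []
traced("query", sink, lambda: 42)
assert [e["event"] for e in sink] == ["start", "end"]
```

The backend then pairs start/end events by span id; an unpaired start is exactly the stalled-or-crashed span the commenter wants to see.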
| jauntywundrkind wrote:
| What's most incredible to me is how close tracing feels in spirit
| to event-sourcing.
|
| Here's this log of every frame of compute going on, plus data or
| metadata about the frame... but afaik we have yet to start using
| the same stream of computation for business processes as we do
| for its excellent observability.
| alexisread wrote:
| Any of the Clickhouse-based Otel stores can do event sourcing -
| just set up materialised views on the trace tables. I know the
| following use CH: https://uptrace.dev/ https://signoz.io/
| https://github.com/hyperdxio/hyperdx
| juliogreff wrote:
| As a matter of fact, at a previous job we used traces as a data
| source for event sourcing. One use case: we tracked usage of
| certain features in API calls in traces, and a batch job, run
| at whatever frequency, aggregated which users were using which
| features. While it was far from real time because of the sheer
| amount of data, it was so simple to implement that we had
| dozens of use cases built like that.
| skybrian wrote:
| How would a hobbyist programmer get started with tracing for a
| simple web app? Where do the traces end up and how do I query it?
| Can tracing be used in a development environment?
|
| Context: the last thing I wrote used Deno and Deno Deploy.
| curioussavage wrote:
| Just install opentelemetry libs. I found this example with a
| quick search: https://dev.to/grunet/leveraging-opentelemetry-
| in-deno-45bj
|
| OpenTelemetry has a collector service you can run that gathers
| the telemetry data, and you can export it to something like
| Prometheus, which can store it and let you query it. Example
| here: https://github.com/open-telemetry/opentelemetry-collector-
| co...
|
| Typically in dev environments trace spans are just emitted to
| stdout, like logs. I sometimes turn that off too, though,
| because it gets noisy.
| andersrs wrote:
| I have a side project that I run in Kubernetes with a postgres
| database and a few Go/Nodejs apps. Recommend me a lightweight
| otel backend that isn't going to blow out my cloud costs.
| perpil wrote:
| I was recently musing about the 2 different types of logs:
|
| 1. application logs, emitted multiple times per request and serve
| as breadcrumbs
|
| 2. request logs emitted once per request and include latencies,
| counters and metadata about the request and response
|
| The application logs were useless to me except during
| development. However, the request logs I could run aggregations
| on, which made them far more useful for answering questions. What
| the author explains very well is that the problem with
| application logs is that they aren't very human-readable, which
| is where visualizing a request with tracing shines. If you don't
| have tracing, creating request logs will get you most of the way
| there; it's certainly better than application logs.
| https://speedrun.nobackspacecrew.com/blog/2023/09/08/logging...
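A sketch of the kind of aggregation once-per-request logs enable, e.g. p95 latency per route. The record shape and nearest-rank percentile are illustrative choices, not from the linked post:

```python
def percentile(values, p):
    """Nearest-rank percentile (integer arithmetic, no interpolation):
    fine for a log-aggregation sketch."""
    ordered = sorted(values)
    return ordered[min(len(ordered) * p // 100, len(ordered) - 1)]

def p95_by_route(request_logs):
    """Fold once-per-request records into a p95 latency per route."""
    by_route = {}
    for rec in request_logs:
        by_route.setdefault(rec["route"], []).append(rec["latency_ms"])
    return {route: percentile(vals, 95) for route, vals in by_route.items()}

logs = [{"route": "/checkout", "latency_ms": ms} for ms in range(1, 101)]
assert p95_by_route(logs) == {"/checkout": 96}
```

This query is impossible over breadcrumb-style application logs, because no single line carries the whole request's latency.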
| benreesman wrote:
| As a historical critic of Rust-mania (and if I'm honest, kind of
| an asshole about it too many times, fail), I've recently bumped
| into stuff like tokio-tracing, eyre, tokio-console, and some
| others.
|
| And while my historical gripes are largely still the status quo:
| stack traces in multi-threaded, evented/async code that _actually
| show real line numbers_? Span-based tracing that makes concurrent
| introspection possible _by default_?
|
| I'm in. I apologize for everything bad I ever said and don't care
| whatever other annoying thing.
|
| That's the whole show. Unless it deletes my hard drive I don't
| really care about anything else by comparison.
| zoogeny wrote:
| One thing about logging and tracing is the inevitable cost (in
| real money).
|
| I love observability probably more than most. And my initial
| reaction to this article is the obvious: why not both?
|
| In fact, I tend to think more in terms of "events" when writing
| both logs and tracing code. How that event is notified, stored,
| transmitted, etc. is in some ways divorced from the activity. I
| don't care if it is going to stdout, or over udp to an
| aggregator, or turning into trace statements, or ending up in
| Kafka, etc.
|
| But inevitably I bump up against cost. For even medium sized
| systems, the amount of data I would like to track gets quite
| expensive. For example, many tracing services charge for the tags
| you add to traces. So doing `trace.String("key", value)` becomes
| something I think about from a cost perspective. I worked at a
| place that had a $250k/year New Relic bill and we were avoiding
| any kind of custom attributes. Just getting APM metrics for
| servers and databases was enough to get to that cost.
|
| Logs are cheap, easy, reliable and don't lock me in to an
| expensive service to start. I mean, maybe you end up integrating
| splunk or perhaps self-hosting kibana, but you can get 90% of the
| benefits just by dumping the logs into Cloudwatch or even S3 for
| a much cheaper price.
| alexisread wrote:
| Any of the Clickhouse-based Otel stores can dump the traces to
| s3 for long-term storage, and can be self-hosted. I know the
| following use CH: https://uptrace.dev/ https://signoz.io/
| https://github.com/hyperdxio/hyperdx
| phillipcarter wrote:
| FWIW part of the reason you're seeing that is, at least
| traditionally, APM companies rebranding as Observability
| companies stuffed trace data into metrics data stores, which
| becomes prohibitively expensive to query with custom
| tags/attributes/fields. Newer tools/companies have a different
| approach that makes cost far more predictable and generally
| lower.
|
| Luckily, some of the larger incumbents are also moving away
| from this model, especially as OpenTelemetry is making tracing
| more widespread as a baseline of sorts for data. And you can
| definitely bet they're hearing about it from their customers
| right now, and they want to keep their customers.
|
| Cost is still a concern but it's getting addressed as well.
| Right now every vendor has different approaches (e.g., the one
| I work for has a robust sampling proxy you can use), but that
| too is going the way of standardization. OTel is defining how
| to propagate sampling metadata in signals so that downstream
| tools can use the metadata about population representativeness
| to show accurate counts for things and so on.
| thinkharderdev wrote:
| > I mean, maybe you end up integrating splunk or perhaps self-
| hosting kibana
|
| I think this is the issue. Both Splunk and OpenSearch (even
| self-hosted OpenSearch) get really pricy as well especially
| with large volumes of log data. Cloudwatch can also get
| ludicrously expensive. They charge something like $0.50 per GB
| (!) and another $0.03 per GB to store. I've seen situations at
| a previous employer where someone accidentally deployed a
| lambda function with debug logging and ran up a few thousand $$
| in Cloudwatch bills overnight.
|
| You should look at Coralogix (disclaimer: I work there). We've
| built a platform that allows you to store your observability
| data in S3 and query it through our infrastructure. It can be
| dramatically more cost-effective than other providers in this
| space.
| jameshart wrote:
| Observability costs feel high when everything's working fine.
| When something snaps and everything is down and you need to
| know why in a hurry... those observability premiums you've been
| paying all along can pay off fast.
| thangalin wrote:
| > In fact, I tend to think more in terms of "events" when
| writing both logs and tracing code.
|
| They are events[1]. For my text editor, KeenWrite, events can
| be logged either to the console when run from the command-line
| or displayed in a dialog when running in GUI mode. By changing
| "logger.log()" statements to "event.publish()" statements, a
| number of practical benefits are realized, including:
|
| * Decoupled logging implementation from the system (swap one
| line of code to change loggers).
|
| * Publish events on a message bus (e.g., D-Bus) to allow
| extending system functionality without modifying the existing
| code base.
|
| * Standard logging format, which can be machine parsed, to help
| trace in-field production problems.
|
| * Ability to assign unique identifiers to each event, allowing
| for publication of problem/solution documentation based on
| those IDs (possibly even seeding LLMs these days).
|
| [1]: https://dave.autonoma.ca/blog/2022/01/08/logging-code-
| smell/
| jameshart wrote:
| But events that another system relies upon are now an _API_.
| Be careful not to lock together things that are only
| superficially similar, as it affects your ability to change
| them independently.
| thangalin wrote:
| Architecturally, the decoupling works as follows:
|
|   Event -> Bus -> UI Subscriber -> Dialog (table)
|   Event -> Bus -> Log Subscriber -> Console (text)
|   Event -> Bus -> D-Bus Subscriber -> Relay -> D-Bus -> Publish (TCP/IP)
|
| With D-Bus, published messages are versioned, allowing for
| API changes without breaking third-party consumers. The
| D-Bus Subscriber provides a layer of isolation between the
| application and the published messages so that the two can
| vary independently.
| hosh wrote:
| I have made use of tracing, metrics, and logging all together
| and find each of them have its own place, as well as synergies
| of being able to work with all three together.
|
| Cost is a real issue, and not just in terms of how much the
| vendor costs you. When tracing becomes a noticeable fraction of
| CPU or memory usage relative to the application, it's time to
| rethink doing 100% sampling. In practice, if you are sampling
| thousands of requests per second, you're very unlikely to
| actually look through each one of those thousands (thousands of
| req/s may not be a lot for some sites, but it already exceeds
| human scale without tooling). To keep accurate, useful
| statistics with sampling, you end up using metrics to capture
| trace statistics prior to sampling.
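That "count before you sample" idea can be sketched like this (pure Python; the class and field names are illustrative):

```python
import random

class SampledTracer:
    """Increment a request counter for every trace, then keep only a
    fraction of the traces themselves, so rates stay accurate even
    though most trace detail is dropped."""

    def __init__(self, sample_rate, rng=random.random):
        self.sample_rate = sample_rate
        self.rng = rng
        self.request_count = 0  # metric: recorded pre-sampling
        self.kept_traces = []

    def record(self, trace):
        self.request_count += 1          # always counted
        if self.rng() < self.sample_rate:
            self.kept_traces.append(trace)  # only sometimes stored

t = SampledTracer(0.5, rng=lambda: 0.99)  # rng forced high: drop everything
for i in range(1000):
    t.record({"traceId": i})
assert t.request_count == 1000 and t.kept_traces == []
```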
| hosh wrote:
| That's weird. I use both logging and tracing where I can. And
| metrics.
|
| While there are better tools for alerting, metrics, or
| aggregations, it helps a lot in debugging and troubleshooting.
| aero142 wrote:
| I think the author's point is that tracing is a better
| implementation of both logs and metrics, and I think it's a
| valid point.
|
| * Metrics are pre-aggregated into timeseries data, which makes
| cardinality expensive. You could also aggregate a value from a
| trace statement.
|
| * Logs are hand-crafted and unique, and are usually improved by
| adding structured attributes. Structured attributes are better
| as traces because you can have execution context and well-
| defined attributes that provide better detail.
|
| Traces can be aggregated or sampled to provide all of the
| information available from logs, but in a more flexible way.
|
| * Certain traces can be retained at 100%. This is equivalent to
| logs.
|
| * Certain trace attributes can be converted to timeseries data.
| This is equivalent to metrics.
|
| * Certain traces can be sampled and/or queried with streaming
| infrastructure. This is a way to observe data with high
| cardinality without hitting the high cost.
| hosh wrote:
| There are things you can do with metrics and logging that you
| cannot do with traces. These usually fall outside of
| debugging application performance and bottlenecks. So I think
| what the author says is true if you are only thinking about
| application, and not for gaining a holistic understanding of
| the entire system, including infrastructure.
|
| Probably the biggest tradeoff with traces is that, in
| practice, you are not retaining 100% of all traces. In order
| to keep accurate statistics, it generally gets ingested as
| metrics before sampling. The other is that traces are not
| stored in such a way where you are looking at what is
| happening at a point-in-time -- which is what logging does
| well. If I want to ensure I have execution context for
| logging, I make the effort to add trace and span ids so that
| traces and logging can be correlated.
|
| To be fair, I live in the devops world more often than not,
| and my colleagues on the dev teams rarely have to venture
| outside of traces.
|
| I don't mind the points this author is making. My main
| criticism is that it is scoped to the world of applications
| -- which is fine -- but then taken as universal for all of
| software engineering.
| fnordpiglet wrote:
| Tracing is poor at very long-lived traces and at stream
| processing, and most tracing implementations are too heavy to run
| in computationally bound tasks beyond a very coarse level.
| Logging is nice in that it carries no context and little
| overhead, is generally very cheap to compose and emit, and, with
| a transaction id included and a structured format, gives you most
| of what tracing does without all the other baggage.
|
| That said for the spaces where tracing works well, it works
| unreasonably well.
| cschneid wrote:
| When I worked at ScoutAPM, that list covers basically the exact
| areas we had trouble supporting. We didn't do full-on
| tracing in the OpenTracing kind of way, but the agent was
| pretty similar, with spans (mostly automatically inserted), and
| annotations on those spans with timing, parentage, and extra
| info (like the sql query this represented in Active record).
|
| The really hard things, which we had reasonable answers for,
| but never quite perfect:
|
| * Rails websockets (actioncable)
|
| * very long-running background jobs (we stopped collecting at
| some limit, to prevent unbounded memory)
|
| * trying to profile code; we used a modified version of
| Stackprof to do sampling instead of exact profiling. That
| worked surprisingly well at finding hotspots, with low
| overhead.
|
| All sorts of other tricks came along too. I should go look at
| that codebase again to remind me. That'd be good for my
| resume.... :)
|
| https://github.com/scoutapp/scout_apm_ruby
| riv991 wrote:
| I think OpenTelemetry has solved the stream-processing problem
| with span links[1]. Treating each unit of work as an
| individual trace but being able to combine them and see a
| causal relationship. Slack published a blog about it pretty
| recently [2]
|
| [1]
| https://opentelemetry.io/docs/concepts/signals/traces/#span-...
|
| [2] https://slack.engineering/tracing-notifications/
| phillipcarter wrote:
| Hmmm, for long-lived processes and stream processing we use
| tracing just fine. What we do is set a cutoff of 60 seconds,
| so that each chunk is its own trace. But our backend queries
| trace data directly, so we can still analyze the aggregate,
| long-term behavior and then dig into a particular 60-second
| chunk if it's problematic.
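The 60-second chunking described above might look roughly like this (the span shape and keying are assumptions, not the commenter's actual backend):

```python
def chunk_spans(spans, cutoff_s=60):
    """Split a long-lived process's spans into one trace per
    60-second window, keyed by (process id, window index)."""
    traces = {}
    for span in spans:
        window = int(span["ts"] // cutoff_s)
        traces.setdefault((span["pid"], window), []).append(span)
    return traces

spans = [{"pid": "worker-1", "ts": t, "name": "poll"} for t in (5, 59, 61, 130)]
chunks = chunk_spans(spans)
# Timestamps 5 and 59 share window 0; 61 falls in window 1; 130 in window 2.
assert sorted(w for _, w in chunks) == [0, 1, 2]
```

Each window becomes an independent, bounded trace, while aggregate queries can still stitch the windows back together by process id.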
| ducharmdev wrote:
| Minor nitpick, but I wish this post started with defining what we
| mean by logging vs tracing, since some people use these
| interchangeably. The reader instead has to infer this from the
| criticisms of logging.
| ryanklee wrote:
| I've never encountered this confusion anywhere, so I wouldn't
| ever think to dispel it. Which isn't to say that I disagree
| with the more general point that defining your terms is good
| thing.
|
| In any case, the post itself (which is not long) illustrates
| and marks out many of the differences.
| jlokier wrote:
| I agree. I'm working with code that uses 'verbose "message"'
| for level 1 verbosity logs and 'trace "message"' for level 2
| verbosity. Makes sense in its world, but it's not the same
| meaning as how cloud-devops-observability culture uses those
| words.
| vkoskiv wrote:
| Nit to the author: 'rapala' seems like a mistranslation. It is a
| brand name of a company that makes fishing lures, as far as I can
| tell. It is not the Finnish word for "to bait", and is therefore
| only used to refer to that particular brand. I'm not sure what
| the purpose of the text in parentheses is here, but 'houkutella'
| would be the most apt translation in this case.
| lambda_garden wrote:
| Couldn't this be injected into the runtime so that no code
| changes are required?
|
| Perhaps really performance critical stuff could have a "notrace"
| annotation.
| thinkharderdev wrote:
| Sure, and a lot of tools will do this in one way or another.
| Either instrument code directly or provide annotations/macros
| to trace a specific method (something like tokio-tracing in the
| Rust ecosystem).
|
| However, tracing literally every method call would probably be
| prohibitively expensive so typically you have either:
|
| 1. Instrumentation which "understands" common
| frameworks/libraries and knows what to instrument (e.g. request
| handlers in web frameworks)
|
| 2. Full opt-in. They make it easy to add a trace for a method
| invocation with a simple annotation but nothing gets
| instrumented by default
| austinsharp wrote:
| Yes, OTel has autoinstrumentation libraries for some languages
| that can pick up a fair amount by default. Though it's unlikely
| that that would ever be sufficient, it's a nice start.
|
| For Java:
| https://opentelemetry.io/docs/instrumentation/java/automatic...
| imiric wrote:
| There are several projects that leverage eBPF for automatic
| instrumentation[1].
|
| How accurate and useful these are vs. doing this manually will
| depend on the use case, but I reckon the automatic approach
| gets you most of the way there, and you can add the missing
| traces yourself, so if nothing else it saves a lot of work.
|
| [1]: https://ebpf.io/applications/
| hardwaresofton wrote:
| I think there's an alternate universe out there where:
|
| - we collectively realized that logs, events, traces, metrics,
| and errors are actually all just logs
|
| - we agreed on a single format that encapsulated all that
| information in a structured manner
|
| - we built firehose/stream processing tooling to provide modern
| o11y creature comforts
|
| I can't tell if that universe is better than this one, or worse.
| phillipcarter wrote:
| That's more or less the model Honeycomb uses. Every signal type
| is just a structured event. Reality is a bit messier, though.
| In particular, metrics are the oddball in this world and
| required a lot of work to make economical.
| dalyons wrote:
| Is that really an alternate universe? That's the universe that
| Splunk and friends are selling: everything's a log. It's really
| expensive.
| andrewstuart2 wrote:
| Traces are just distributed "logs" (in the data structure
| sense; data ordered only by its appearance in _something_ )
| where you also pass around the tiniest bit of correlation
| context between apps. Traces are structured, timestamped, and
| can be indexed into much more debug-friendly structures like a
| call tree. But you could just as easily ignore all the data and
| print them out in streaming sorted order without any
| correlation.
|
| Honestly it sounds like you're pitching opentelemetry/otlp but
| where you only trace and leave all the other bits for later
| inside your opentelemetry collector, which can turn traces into
| metrics or traces into logs.
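The comment above describes traces as ordinary log records plus a little correlation context, which can then be indexed into a call tree or simply printed in order. A minimal sketch of that idea (the field names are hypothetical, not a real wire format):

```python
from collections import defaultdict

# Each span is just a structured log record carrying a little
# correlation context: a trace id, its own id, and a parent id.
records = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "handle_request"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "query_db"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "render"},
]

def build_call_tree(records):
    """Index flat records into a parent-id -> children mapping."""
    children = defaultdict(list)
    for r in records:
        children[r["parent_id"]].append(r)
    return children

def print_tree(tree, parent=None, depth=0):
    """Walk from the root (parent_id=None), indenting by depth."""
    for r in tree.get(parent, []):
        print("  " * depth + r["name"])
        print_tree(tree, r["span_id"], depth + 1)

tree = build_call_tree(records)
print_tree(tree)
```

Ignoring `parent_id` and printing the records in arrival order gives you back a plain log stream, which is the equivalence the comment is pointing at.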
| thegrizzlyking wrote:
| Logs are mostly "Hi I reached this line of code, here is some
| metadata"
| jasonjmcghee wrote:
| I really enjoyed the content- it's a great article.
|
| Note to author: all but the last code block have a very odd
| mixture of rather large font sizes (at least on mobile) which
| vary line to line that make them pretty difficult to read.
|
| Also the link to "Observability Driven Development." was a blank
| slide deck AFAICT
| hello1234567 wrote:
| The person writing this learned something they didn't know
| earlier and decided to convert their light-bulb moment into a
| blog post. Not bad, but they fail to understand that logs are
| the generalisation of the very thing they are talking about.
| jeffbee wrote:
| This is a great article because everyone should understand the
| similarity between logging and tracing. One thing worth pondering
| though is the differences in cost. If I am not planning to
| centrally collect and index informational logs, free-form text
| logging is extremely cheap. Even a complex log line with
| formatted strings and numbers can be emitted in < 1us on modern
| machines. If you are handling something like 100s or 1000s of
| requests per second per core, which is pretty respectable,
| putting a handful of informational log statements in the critical
| path won't hurt anyone.
|
| Off-the-shelf tracing libraries on the other hand are pretty
| expensive. You have one additional mandatory read of the system
| clock, to establish the span duration, plus you are still paying
| for a clock read on every span event, if you use span events.
| Every span has a PRNG call, too. Distributed tracing is worthless
| if you don't send the spans somewhere, so you have to budget for
| encoding your span into json, msgpack, protobuf, or whatever.
| It's a completely different ball game in terms of efficiency.
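The per-span costs listed above (a PRNG call for the id, two clock reads to establish the duration, and encoding for export) can be sketched as follows; the function and field names are illustrative, not any real tracing library's API:

```python
import json
import random
import time

def traced_span(name, attributes):
    """Rough sketch of the mandatory per-span work a tracing
    library performs, independent of the application itself."""
    span_id = random.getrandbits(64)  # PRNG call per span
    start = time.monotonic_ns()       # clock read #1
    # ... application work would happen here ...
    end = time.monotonic_ns()         # clock read #2: span duration
    # Encoding for export (json/msgpack/protobuf) is the other
    # unavoidable cost, since spans are worthless unless sent on.
    return json.dumps({
        "name": name,
        "span_id": span_id,
        "duration_ns": end - start,
        "attributes": attributes,
    })

encoded = traced_span("GET /users", {"status": 200})
```

By contrast, a free-form text log line skips the second clock read, the PRNG call, and the structured encoding entirely, which is where the efficiency gap comes from.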
| nithril wrote:
| It is actually simpler to conceptualize the difference: one is
| stateless, the other is stateful.
|
| And structured logging has existed for years, for example in
| Java:
| https://github.com/logfellow/logstash-logback-encoder
| xyzzy_plugh wrote:
| I will agree that conceptually logging can be much cheaper than
| tracing ever can, but in practice any semi-serious attempt at
| structured logging ends up looking very, very close to tracing.
| In fact I'd go so far as to say that the two are effectively
| interchangeable at a point. What you do with that information,
| whether you index it or build a graph, is up to you -- and that
| is where the cost creeps in.
|
| Adding timestamps and UUIDs and an encoding is par for the
| course in logging these days, I don't think that is the right
| angle to criticize efficiency.
|
| Tracing can be very cheap if you "simply" (and I'm glossing
| over a lot here) search for all messages in a liberal window
| matching each "span start" message and index the result sets.
| Offering a way to view results as a tree is just a bonus.
|
| Of course, in practice this ends up meaning something
| completely different, and far costlier. Why that is I cannot
| fathom.
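The "liberal window" approach described above might look like the following: pair span-start and span-end log messages that fall within a time window, and index the resulting durations by span id. The field names and window size are assumptions for illustration:

```python
WINDOW_S = 60.0  # assumed pairing window, in seconds

logs = [
    {"ts": 1.0, "event": "span_start", "span": "s1", "name": "checkout"},
    {"ts": 1.2, "event": "span_start", "span": "s2", "name": "charge_card"},
    {"ts": 1.8, "event": "span_end",   "span": "s2"},
    {"ts": 2.5, "event": "span_end",   "span": "s1"},
]

def index_spans(logs, window=WINDOW_S):
    """Pair start/end messages per span id and index durations."""
    starts, spans = {}, {}
    for rec in sorted(logs, key=lambda r: r["ts"]):
        if rec["event"] == "span_start":
            starts[rec["span"]] = rec
        elif rec["event"] == "span_end":
            start = starts.pop(rec["span"], None)
            # Only pair an end with a start seen inside the window.
            if start and rec["ts"] - start["ts"] <= window:
                spans[rec["span"]] = {
                    "name": start["name"],
                    "duration": rec["ts"] - start["ts"],
                }
    return spans

spans = index_spans(logs)
```

Everything beyond this, sampling, context propagation across services, tree rendering, is where real tracing systems accumulate their extra cost.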
| hyperpape wrote:
| I don't generally disagree, but using json for structured logs
| is a growing thing as well.
| h1fra wrote:
| Tracing is much more actionable but barely usable without a
| platform, which makes local development dependent on a third
| party. It also requires passing context, or having a way to
| recover the context, in every function that needs it, which can
| be daunting.
|
| On my side I have opted for mixed structured/text: a generic
| message that can be easily understood while glancing over the
| logs, and a data object attached for more details.
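The mixed structured/text style described above can be sketched with Python's standard logging module; the formatter and the `data` attribute are one possible arrangement, not a prescribed format:

```python
import json
import logging

class MixedFormatter(logging.Formatter):
    """Emit a human-readable message followed by a JSON data
    object, so the line is glanceable and machine-parseable."""
    def format(self, record):
        data = getattr(record, "data", {})
        return f"{record.levelname} {record.getMessage()} {json.dumps(data)}"

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(MixedFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The generic message reads naturally; the details ride along.
logger.info("payment accepted",
            extra={"data": {"order_id": 42, "amount_cents": 1999}})
```

The `extra` mechanism is standard-library behaviour: any non-reserved key becomes an attribute on the `LogRecord`, which the formatter can then pick up.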
| candiddevmike wrote:
| You can add Jaeger to your local dev containers and run it in
| memory, it's really lightweight and easy to use.
| hinkley wrote:
| Someone got me excited about tracing and I started tweaking our
| stats API to optionally add tracing. Retrofitted it into a
| mature app, then immediately discovered that all of the data
| was being dropped because AWS only likes very tiny traces.
| Depth or fanout or both break it rather quickly.
|
| And OpenTelemetry has a very questionable implementation. For a
| nested trace, events fire when the trace closes, meaning that a
| parent ID is reported before it is seen in the stream. That
| can't be good for processing. Would be better to have a leading
| edge event (also helps with errors throwing and the parent
| never being reported).
|
| Kind of a bummer. Needs work.
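The "leading edge" idea above, emitting an event when a span opens as well as when it closes so a parent id always appears in the stream before its children reference it, could look like this toy sketch (not OpenTelemetry's actual API):

```python
events = []  # stand-in for an exported event stream

class Span:
    """Emits a start event on entry and an end event on exit, so
    consumers see a parent's id before any child references it."""
    _next_id = 0

    def __init__(self, name, parent=None):
        Span._next_id += 1
        self.id = Span._next_id
        self.name = name
        self.parent = parent

    def __enter__(self):
        events.append(("start", self.id, self.parent, self.name))
        return self

    def __exit__(self, *exc):
        # The end event still carries the duration/error outcome;
        # it just isn't the first time the span id is seen.
        events.append(("end", self.id))

with Span("parent") as p:
    with Span("child", parent=p.id):
        pass
```

A start event also means the parent gets reported even if an exception prevents a child from ever closing, which addresses the error case mentioned above.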
| pcthrowaway wrote:
| > OpenTelemetry has a very questionable implementation
|
| The nice thing about OpenTelemetry is that it's a standard.
| The questionable implementation you're referencing isn't a
| source of truth. There isn't some canonical "questionable"
| implementation.
|
| There are many, slightly different, questionable
| implementations.
| hinkley wrote:
| If the wire protocol has a bug, that's not something an
| implementation can fix.
|
| I'm saying the wire protocol is wrong.
___________________________________________________________________
(page generated 2023-09-20 23:00 UTC)