https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/ Jeremy Morrell blog about rss Oct 22, 2024 A Practitioner's Guide to Wide Events Adopting Wide Event-style instrumentation has been one of the highest-leverage changes I've made in my engineering career. The feedback loop on all my changes tightened and debugging systems became so much easier. Systems that were scary to work on suddenly seemed a lot more manageable. Lately there have been a lot of good blog posts on what "Wide Events" mean and why they are important. Here are some of my recent favorites: * All you need is Wide Events, not "Metrics, Logs and Traces" by Ivan Burmistrov * Observability wide events 101 by Boris Tane * Is it time to version Observability? (Signs point to yes) by Charity Majors The tl;dr is that for each unit-of-work in your system (usually, but not always an HTTP request / response) you emit one "event" with all of the information you can collect about that work. "Event" is an over-loaded term in telemetry so replace that with "log line" or "span" if you like. They are all effectively the same thing. Charity Majors has been promoting this approach lately under the name "Observability 2.0", creating some new momentum around the concept, however, it is not a new idea. Brandur Leach wrote about "Canonical Log Lines" both on his own blog in 2016 and as used by Stripe in 2019 . And AWS has recommended it as a best-practice for ages. Okay... I think I get the idea... but how do I do "wide events"? This is where I find a lot of developers get tripped up. The idea sounds good in theory, and we should totally try that one day! But I have this stack of features to ship, that bug that's been keeping me up at night, and 30 new AI tools that came out yesterday to learn about. And like... where do you even start? What data should I add? Like anything in software, there are a lot of options for how to approach this, but I'll talk through one approach that has worked for me. We'll cover how to approach this in tooling and code, an extensive list of attributes to add, and I'll respond to some frequent objections that come up when discussing this approach. For this post we'll focus on web services, but you would apply a similar approach to any workload. Choose your tools We will need some way to instrument your code (traces or structured log lines) and somewhere to send the instrumentation to in order to query and visualize it. This approach is best paired with a tool that lets you query your data in quick iterations. I like Honeycomb for this, but any Observability tool backed by a modern OLAP database is likely going to work in a pinch. * Honeycomb has Retriever * DataDog has Husky * New Relic has NRDB * Baselime uses ClickHouse * SigNoz uses ClickHouse Honeycomb, New Relic, and DataDog built their own columnar OLAP data stores, though now with the availability of ClickHouse, InfluxDB IOx, Apache Pinot, and DuckDB there are new Observability tools popping up all the time. If you aren't constrained, I highly recommend defaulting to using OpenTelemetry and Honeycomb. Your life will be easier. However even if you are stuck in a corporate environment with a strong allergy to technology built after 2010 you can leverage log search tools like ElasticSearch in a pinch. Stripe's blog post goes over how to use Splunk for this. In any tool you want to focus on getting proficient at 3 core techniques in order to sift through your events. The faster you are able to apply these, iterate, and ask questions of your data, the better you'll be able to debug issues and see what your system is really doing. When observability folks refer to "slicing and dicing" data, this is what they are generally referring to. I'll represent queries using a made-up SQL dialect, but you should be able to find equivalents in your tool's query language. Visualizing Existing in a human body comes with its fair share of downsides, but the human visual cortex is really, really good at recognizing patterns. Give it a fighting chance by getting really good at summoning visualizations of the data your system is emitting. COUNT, COUNT_DISTINCT, HEATMAP, P90, MAX, MIN, Histogram. Learn to leverage whatever graphs your tool makes available to you. Practice it. Get fast. A Honeycomb screenshot of heatmap A Splunk screenshot of histogram Grouping With each new annotation that we add to our wide events, we create another dimension along which we can slice our data. GROUP BY allows us to look along that dimension and see if the values along that dimension match our expectations. GROUP BY instance.id GROUP BY client.OS, client.version Filtering Once we've narrowed in one dimension that is interesting, we usually want to dig further into that data. Filtering down so that we're only looking at data from one endpoint, or from one IP address, or sent by the iOS app, or only from users with a specific feature flag turned on allows us to narrow our focus to a very specific segment of traffic. WHERE http.route = "/user/account" WHERE http.route != "/health" WHERE http.user_agent_header contains "Android" Write a middleware to help you If you are using an OpenTelemetry SDK it is already creating a wrapping span around the request and response. You can access it by asking for the active span at any point during the processing of the request. let span = opentelemetry.trace.getActiveSpan(); span.setAttributes({ "user_agent.original": c.req.header("User-Agent"), }); However if anyone wraps any of your code in a child span the "active span" will change to be that new wrapping span! There is no first-class way of addressing this original "main" span in OpenTelemetry. However, we can work around this by saving a reference to this specific span in the context so we can always have access to the "main" wrapping span. // create a reference to store the span on the opentelemetry context object const MAIN_SPAN_CONTEXT_KEY = createContextKey("main_span_context_key"); function mainSpanMiddleware(req, res, next) { // pull the active span created by the http instrumentation let span = trace.getActiveSpan(); // get the current context let ctx = context.active(); // set any attributes we always want on the main span span.setAttribute("main", true); // OpenTelemetry context is immutable, so to modify it we create // a new version with our span added let newCtx = ctx.setValue(MAIN_SPAN_CONTEXT_KEY, span); // set that new context as active for the duration of the request context.with(newCtx, () => { next(); }); } // create another function that allows you to annotate this saved span easily function setMainSpanAttributes(attributes) { let mainSpan = context.active().getValue(MAIN_SPAN_CONTEXT_KEY); if (mainSpan) { mainSpan.setAttributes(attributes); } } Now our annotation code can look a little simpler, and we can always know that we're setting these attributes on the wrapping span. setMainSpanAttributes({ "user.id": "123", "user.type": "enterprise", "user.auth_method": "oauth", }); You can play around with a minimal running example here. At Heroku we had internal OpenTelemetry Distributions that set this up for you automatically and added as many automatic annotations as possible to these spans. If you are not using OpenTelemetry here's a gist that might help you get started. My previous post may help you put this logic together. What do I add to this "main" span? And how many dimensions do you plan to emit and pack into your wide events? MANY. Hundreds! The more you have, the better you can detect and correlate rare conditions with precision. As you adjust to the joys of debugging with rich context, you will itch for it everywhere. [?] -- Charity Majors (@mipsytipsy) January 9, 2024 We need to add attributes about the request, and there are likely far more of these than you would expect. It's easy to come up with a dozen or so, but in a well-instrumented code base there will be hundreds of attributes. Note that while this is a long list, it is definitely not exhaustive. OpenTelemetry defines sets of attribute names as Semantic Conventions that can also be used for inspiration. I have tried to follow these in my naming where possible. A convention to filter out everything else Traces contain lots of spans, so it's helpful to have a convention for identifying and searching for these "wide events". root and canon were floated as options, but I've landed on calling them main spans. Attribute Examples Description Present only for spans designated as a "wide main true event", usually wrapping a request / response, or a background job This convention allows you to quickly figure out "what does the traffic to this service look like?" with a single query: SELECT COUNT(*) WHERE main = true GROUP BY http.route Graph of traffic grouped by route over a week. There is an anomally. Service metadata Of course we need to add some information about the service we're running. Consider adding additional metadata about which team owns the system, or which Slack channel the owning team hangs out in, though note that this can be tedious to update if your workplace experiences frequent re-orgs. Tying these to a service catalog like Backstage is left as an exercise to the reader. Attribute Examples Description service.name api What is the name of this service? shoppingcart production service.environment staging Where is this service running? development web-services Which team owns this service. service.team dev-ex Useful for knowing who to page in during incidents. service.slack_channel web-services If I discover an issue with this dev-ex service, where should I reach out? How many services does each team run? SELECT COUNT_DISTINCT(service.name) WHERE service.environment = "production" GROUP BY service.team Ever look at the load on a system and then wonder "Is that appropriate for the machine this is running on?", and now you have to look through other tools or config files to get that information. Throw that context on the wide event so that it's available when you need it. Attribute Examples Description An ID that maps to this instance.id 656993bd-40e1-4c76-baff-0e50e158c6eb one instance of the service How much RAM instance.memory_mb 12336 is available to this service? 4 How many instance.cpu_count 8 cores are 196 available to this service? Does your vendor have a instance.type m6i.xlarge name for this type of instance? What are the services with the most memory that we run? What instance types do they use? SELECT service.name, instance.memory_mb, instance.type ORDER BY instance.memory_mb DESC GROUP BY service.name, instance.type LIMIT 10 However you're orchestrating your systems make sure that all of the relevant information is added. I've included some examples from the Kubernetes semantic conventions for inspiration. Attribute Examples Description container.id a3bf90e006b2 An ID used to identify Docker containers container.name nginx-proxy Container name used by wordpress-app container runtime Name of the kubernetes k8s.cluster.name api-cluster cluster your service is running in Name of the kubernetes k8s.pod.name nginx-2723453542-065rx pod your service is running in cloud.availability_zone us-east-1c AZ where you're running your service cloud.region us-east-1 Region where you're running your service But even if you're using a Platform-as-a-Service you can still pull out a lot of useful information! Attribute Examples Description heroku.dyno web.1 The env var DYNO that is set on worker.3 your app at runtime web The first part of the DYNO env var heroku.dyno_type worker before the .. Separating this makes it easier to query 1 The second part of the DYNO env heroku.dyno_index 3 var after the .. Separating this makes it easier to query heroku.dyno_size performance-m The selected dyno size heroku.space my-private-space The name of the private space that your are deployed into heroku.region virginia Which region is this app located oregon in? How many dynos are we running? What dyno types are they? For which services? SELECT COUNT_DISTINCT(heroku.dyno_index) GROUP BY service.name, heroku.dyno_type, instance.type Build info Inevitably some of the first questions asked in any incident are "Did something just go out?" or "What changed?". Instead of jumping to your deployment tool or looking through GitHub repositories, add that data to your telemetry. Threading this data from your build system through to your production system so that it's available at runtime can be a non-trivial amount of glue code, but having this information easily available during incidents is invaluable. Attribute Examples Description However you track your service.version v123 version, ex: a 9731945429d3d083eb78666c565c61bcef39a48f version string or a hash of the built image If your build system gives you an ID, this service.build.id acd8bb57-fb9f-4b2d-a750-4315e99dac64 context allows you to audit the build if something goes wrong The git SHA of the deployed service.build.git_hash 6f6466b0e693470729b669f3745358df29f97e8d commit so you can know exactly which code was running The url of the https://github.com/your-company/ pull request service.build.pull_request_url api-service/pull/121 that was merged that triggered the deploy A url that compares the https://github.com/your-company/ previously service.build.diff_url api-service/compare/c9d9380..05e5736 deployed commit against the newly deployed commit Timestamp when service.build.deployment.at 2024-10-14T19:47:38Z the deployment process started Which authenticated service.build.deployment.user [email protected] user kicked off the build? Could be a bot What triggered merge-to-main the deploy? slack-bot Extremely service.build.deployment.trigger api-request valuable context config-change during an deploy-triggered incident How old is this deploy? Shortcuts the service.build.deployment.age_minutes 1 frequent 10230 incident question "Did something just go out?" Won't this be a lot of repetitive data? These values do not change except between deploys! See Frequent Objections What systems have recently been deployed? SELECT service.name, MIN(service.build.deployment.age_minutes) as age WHERE service.build.deployment.age_minutes < 20 GROUP BY service.name ORDER BY age ASC LIMIT 10 What's up with the spike of 500s when we did the last deploy? SELECT COUNT(*) WHERE service.name = "api-service" AND main = true GROUP BY http.status_code, service.version Graph showing requests grouped by http status code and version. There is a spike of 500s correlating to v1 shutting down. HTTP You should get most of these from your tracing library instrumentation, but there are usually more you can add if, for example, your organization uses non-standard headers. Don't settle for only what OpenTelemetry gives you by default! Attribute Examples Description Name of the HTTP server.address example.com server that localhost received the request /checkout URI path after the url.path /account/123/ domain features url.scheme http, https URI scheme url.query q=test, ref=#### URI query component Platform request http.request.id 79104EXAMPLEB723 id: ex: x-request-id, x-amz-request-id GET http.request.method PUT HTTP request method POST OPTIONS Size of the request http.request.body_size 3495 payload body in bytes Value of a specific request header, "content-type" in http.request.header.content-type application/json this case, but there are many more. Pick out any that are important for your service 200 HTTP response http.response.status_code 404 status code 500 1284 Size of the http.response.body_size 2202009 response payload body in bytes Value of a specific response header, "content-type" in http.request.header.content-type text/html this case, but there are many more. Pick out any that are important for your service SELECT HEATMAP(http.response.body_size), WHERE main = true AND service.name = "api-service" A heatmap of response sizes. Most are within a fixed band, but there are sharp outliers that warrant more investigation. User-Agent headers contain a wealth of info. Don't rely on regex queries to try and make sense of them down the road. Parse them into structured data from the beginning. Attribute Examples Description Mozilla/5.0 (Windows NT The value of 10.0; Win64; x64) the HTTP user_agent.original AppleWebKit/537.36 (KHTML, User-Agent like Gecko) Chrome/ header 129.0.0.0 Safari/537.3 computer Device type user_agent.device tablet derived from phone the User-Agent header OS derived user_agent.OS Windows from the MacOS User-Agent header Chrome Browser user_agent.browser Safari derived from Firefox the User-Agent header Browser 129 version user_agent.browser_version 18.0 derived from the User-Agent header What browsers are my users using? SELECT COUNT(*) GROUP BY user_agent.browser, user_agent.browser_version If you have any custom user agents or headers used as a convention within your org parse that out too. Attribute Examples Description If you have a distributed architecture, have each user_agent.service api-gateway service send auth-service a custom User-Agent header with its name and version If you have a distributed architecture, have each user_agent.service_version v123 service send 6f6466b0e693470729b669f3745358df29f97e8d a custom User-Agent header with its name and version If a request is coming from a mobile user_agent.app iOS app, make android sure it includes which app and its version If a request is coming from a mobile user_agent.app_version v123 app, make 6f6466b0e693470729b669f3745358df29f97e8d sure it includes which app and its version Route info We're not done with HTTP attributes yet! One of the most important bits is the API endpoint that the request matched. OpenTelemetry SDKs will usually give this to you automagically but not always. Consider extracting the route parameters and query parameters as additional attributes. Attribute Examples Description /team/ The route pattern that the http.route {team_id}/user url path is matched against /{user_id} 14739 The extracted segment of the http.route.param.team_id team-name-slug url path as it is parsed for each parameter The query parameters that http.route.query.sort_dir asc are relevant to the response of your service. Ex: ? sort_dir=asc&... SELECT P99(duration_ms) WHERE main = true AND service.name = "api-service" GROUP BY http.route A chart of p99's broken down by route. There is a spike on only some of them. We should break down by version now to check if this was caused by a deploy. User and customer info Once you get the basics down, this is the most important piece of metadata that you can add. No automagic SDK will be able to encode the particulars of your user model. It's common for a single user or account to be responsible for a 10%+ of a business' revenue, and frequently their usage patterns look significantly different than the average user. They probably have more users, store more data, and hit limits and edge-cases that will never show up for the user paying $10 / month. Be sure you can separate their traffic from others. Attribute Examples Description The primary ID for a user. If this 2147483647 is an email and you're using a user.id [email protected] vendor, consider your org's policy on putting PII in external services. How does the business see this free type of user? Individual accounts user.type premium are sometimes responsible for 10%+ enterprise of a business' income. Make sure vip you can separate their traffic from others! token user.auth_method basic-auth How did this user authenticate jwt into your system? sso-github 5387 If you have a team construct, user.team.id web-services which one does this user belong to? 278 If this user is part of an user.org.id enterprise-name organization with an enterprise contract, track that! Not the user's literal age, but how long ago was this account user.age_days 0 created? Is this an issue 637 experienced by someone new to your app, or only once they've saved a lot of data? Have an internal way of assuming a user.assumed true user's identity for debugging? Be sure to track this user.assumed_by [email protected] And track which actual user is assuming the user's identity SELECT P99(duration_ms) WHERE main = true AND service.name = "api-service" GROUP BY user.type Rate limits Whatever your rate limiting strategy, make sure the current rate limit info gets added too. Can you quickly find examples of users that are being rate-limited by your service? Attribute Examples Description You might not now, but you will likely have users with ratelimit.limit 200000 different rate limits in the future, note down what the actual limit is for this request ratelimit.remaining 130000 What is the budget remaining for this user? How many requests have been ratelimit.used 70000 used in the current rate window ratelimit.reset_at 2024-10-14T19:47:38Z When will the rate limit be reset next? if applicable This user has a support ticket open about being rate-limited. Let's see what they were doing SELECT COUNT(*) WHERE main = true AND service.name = "api-service" AND user.id = 5838 GROUP BY http.route A graph of one users activity. There is a big spike hitting the same route a lot at the end. This gives us a starting point for investigation What routes are users who have burned most of their rate limit hitting? Does this activity look suspicious? SELECT COUNT(*) WHERE main = true AND service.name = "api-service" AND ratelimit.remaining < 100 GROUP BY http.route Caching For every code path where we could shortcut with a cache response, add whether or not it was successful Attribute Examples Description cache.session_info true Was the session info cached or did it false need to be re-fetched? cache.feature_flags true Were the feature flags cached for this false user or did they need to be re-fetched? Localization info What localization options has the user chosen? This can be a frequent source of bugs Attribute Examples Description localization.language_dir rtl, ltr Which direction is text laid out in their language? localization.country mexico, Which country are they from? uk localization.currency USD, CAD Which currency have they chosen to work with? Uptime Tracking how long the service has been running when it serves a request can help you visualize several classes of bugs: * Issues that show up on a reboot * Memory leaks that only start to show up when the service has been running for a long time * Frequent crashes / restarts if you have automatically restart the service on failure I recommend also either adding the log10 of the uptime or having some way of visualizing this. When graphed this emphasizes the important first few minutes of a service without being squished into the bottom of the graph by instances with several days or more of uptime. Attribute Examples Description How long has this instance of your app uptime_sec 1533 been running? Can be useful to visualize to see restarts Grows sub-linearly which allows you to uptime_sec_log_10 3.185 visualize long-running services and brand new ones on the same graph SELECT HEATMAP(uptime_sec), HEATMAP(uptime_sec_log_10) WHERE main = true AND service.name = "api-service" Heatmaps of uptime when a service enters a crash loop. It's far easier to distinguish in log scale Metrics This one might be a bit controversial, but I've found it helpful to tag spans with context about what the system was experiencing while processing the request. We fetch this information every ~10 seconds, cache it, and add it to every main span produced during that time. Capturing metrics in this way is not mathematically sound. Since you only get data when traffic is flowing, you can't calculate a P90 for cpu load that would stand up to any rigorous scrutiny, but that's actually fine in practice. It's close enough to get some quick signal while you're debugging without switching to a different tool, especially if you can avoid calculations and visualize with a heatmap. I wouldn't recommend setting alerts on this data though. Plain ol' metrics are great for that. Jessica Kerr recently wrote about this approach on the Honeycomb Blog . Attribute Examples Description 153 How much memory is being used metrics.memory_mb 2593 by the system at the time its service this request 0.57 CPU load of the system service metrics.cpu_load 5.89 this request. Given as # of active cores Last observed number of garbage collections. Could be metrics.gc_count 5390 cumulative (total since service started) or delta (ex: number in the last minute) Time spent in garbage metrics.gc_pause_time_ms 14 collections. Could also be 325 cumulative or delta. Pick one and document which metrics.go_routines_count 3 Number of go routines running 3000 Cumulative time spent waiting metrics.event_loop_latency_ms 0 on the next event loop tick. 340 An important metric for Node apps Are these requests getting slow because we're running out of memory or CPU? SELECT HEATMAP(duration_ms), HEATMAP(metrics.memory_mb), HEATMAP(metrics.cpu_load) WHERE main = true AND service.name = "api-service" GROUP BY instance.id An example showing using the metrics data tagged on the span to get context for whats happening with the system Async request summaries When using a tracing system async requests should get their own spans, but it can still be useful to roll up some stats to identify outliers and quickly find interesting traces. Attribute Examples Description 1 How many http requests were stats.http_requests_count 140 triggered during the processing of this request? stats.http_requests_duration_ms 849 Cumulative time spent in these http requests 7 How many Postgres queries stats.postgres_query_count 742 were triggered during the processing of this request? stats.postgres_query_duration_ms 1254 Cumulative time spent in these Postgres queries 3 How many redis queries were stats.redis_query_count 240 triggered during the processing of this request? stats.redis_query_duration_ms 43 Cumulative time spent in these redis queries How many calls to this stats.twilio_calls_count 1 vendors api were triggered 4 during the processing of this request? stats.twilio_calls_duration_ms 2153 Cumulative time spent in these vendor calls Surely my service makes a reasonable number of calls to the database... right? SELECT HEATMAP(stats.postgres_query_count) WHERE main = true AND service.name = "api-service" A heatmap of db queries per request. There is a bi-modal distribution but also some outliers that make a lot of requests Instead of adding this explicitly, couldn't we aggregate this by querying the whole trace? See Frequent Objections Sampling Once you start collecting fine-grained telemetry from your systems at a significant scale you run head-on into the problem of sampling. Running systems can produce a lot of data! Engineers frequently want to store and query all of it. Exact answers always! Make it fast! Also cheap! But it's trade-offs all the way down. Telemetry data is fundamentally different from the transaction data you're storing for your users, and you should think about it differently. Luckily you only really need a statistically significant subset of the full dataset. Even sampling 1 out of every 1000 requests can provide a suprisingly detailed picture of the overall traffic patterns in a system. Sampling is a suprisingly deep topic. Keep it simple if you're starting and do uniform random head sampling, but track your sample rate per-span so you can be ready for more sophisticated approaches down-the-line. Good tooling will weight your calculations with a per-span, so you don't have to mentally multiple the COUNT call by the sample_rate to get an accurate answer. Here are some relevant articles: * I was first introduced to this idea in the Scuba paper * Honeycomb supports per-event sample rates * Cloudflare's Analytics Engine will automatically sample for you based on volume Attribute Examples Description 1 N where 1 in N events will be sampled and stored sample_rate 500 and the rest dropped. If you're sampling 1% of requests, the sample_rate would be 100 Timings I find it super useful to break up the work that gets done to respond to a request into a handful of important chunks and track how long each segment took on the main span. Wait, isn't that what child spans are for? Wrapping absolutely everything in its own span is the most common failure mode I see when engineers first get access to tracing tools. You have to design the structure of your data for the way you want to query it. Child spans are helpful for waterfall visualization for a single request, but can be difficult to query and visualize across all of your requests. Putting that information on a single span makes it easier to query and also helps with tools like Honeycomb's BubbleUp which can then immediately tell you that that group of requests was slow because authentication took 10 seconds for some reason. Attribute Examples Description 52.2 How long did we spend performing auth.duration_ms 0.2 authentication during this request? payload_parse.duration_ms 22.1 Identify the core workloads of the 0.1 service and add timings for them Errors If you encounter an error and need to fail the operation, tag the span with the error information: type, stacktrace, etc. One approach that I have found super-valuable is tagging each location where we throw an error with a unique slug describing the error. If this string is unique within your codebase, it is easily found with a quick search. This allows someone to jump straight from a spike in errors on a dashboard to the exact line of code that throwing the error. It also provides a convenient low-cardinality field to GROUP BY. You're unlikely to be able to wrap all possible errors, but any time a failed request doesn't have an exception.slug that is a good sign that you have places in your code where your error handling could be improved. It's now really easy to find examples of requests that failed in ways you didn't anticipate. if isNotRecoverable(err) { // note the use of a plain string, not a variable, not dynamically generated // consider enforcing this with custom lint rules setErrorAttributes(err, "err-stripe-call-failed-exhausted-retries"); throw err; } Attribute Examples Description true Special field for error false whether the request failed or not Can't convert 'int' The exception message exception.message object to str encoded in the undefined is not a exception function exception.type IOError The programmatic type java.net.ConnectException of the exception ReferenceError: user is Capture the stack not defined trace if its available exception.stacktrace at myFunction (/path/to/ to help pin-point file.js:12:2) where the error is ... being thrown Is this an expected exception like a bot trying to hit a url exception.expected true, false that doesn't exist? Allows filtering out of exceptions we can't prevent but don't need to worry about Create a unique grepp-able slug-value auth-error to identify the code exception.slug invalid-route location of an error github-api-unavailable if its predictable during development time Which of our enterprise users hit the most errors last week? And which one? SELECT COUNT_DISTINCT(user.id) WHERE main = true AND service.name = "api-service" AND user.type = "enterprise" GROUP BY exception.slug Show me traces where we likely need to improve our error handling SELECT trace.trace_id WHERE main = true AND service.name = "api-service" AND error = true AND exception.slug = NULL GROUP BY trace.trace_id Feature flags Fine-grained feature flags are a developer super power that allows you to test code changes in production with only a fraction of your users or traffic. Adding the flag information per-request allows you to compare how the new code is working as you opt more of your traffic into the new code path. Coupled with the broad visibility you can get with wide events, and this can make even tricky migrations vastly more manageable and allow you to ship code with confidence. Note that semantic conventions differ here and suggest adding feature flag information as events on the span. I would suggest following that standard since it will ultimately have the best support from vendors if it's moved to stable, but especially in the mean time, I'm also putting this info on the main span. Attribute Examples Description true The value of a feature_flag.auth_v2 false particular feature flag for this request true The value of a feature_flag.double_write_to_new_db false particular feature flag for this request What errors are the users in the new authentication flows hitting? How does it compare to the control group? SELECT COUNT WHERE main = true AND service.name = "api-service" AND GROUP BY feature_flag.auth_v2, exception.slug Versions of important things Runtimes, frameworks, and any major libraries you are using can be really helpful context. Attribute Examples Description go.version go1.23.2 What version of your language runtime are you using? rails.version 7.2.1.1 Pick out any core libraries like web frameworks and track their version too postgres.version 16.4 If you can add the versions of any datastores you're using, even better A security issue with Rails just got announced. What versions of the framework are our services using? SELECT COUNT_DISTINCT(service.name) WHERE service.environment = "production" GROUP BY rails.version Our memory usage seems higher that it used to be. Didn't we upgrade the runtime recently? Does that correlate? SELECT HEATMAP(metrics.memory_mb) WHERE main = true AND service.name = "api-service" GROUP BY go.version Your specific application Now we go off the map and get to the really valuable stuff. Your app likely does something unique or works in a particular domain. You might need to really care about which professional credentials a Dentist using your app has, or which particular storage warehouse a package is in, or which chip is in the embedded tracking device installed in the cat that your app exists to track. No framework is going to be able to understand what parts of your domain are important to track and automate this for you, you have to do that. Attribute Examples Description If you upload asset_upload.s3_bucket_path s3://bucket-name/path/to/asset.jpg something, add context about where If you interact with a vendor, track email_vendor.transaction_id 62449c60-b51e-4d5c-8464-49217d91c441 whatever transaction ID they give you in case you need to follow up with them If there are 3-4 types that something might fall into, be sure to add that context. Ex: github If 2% of vcs_integration.vendor gitlab requests bitbucket start failing because bitbucket is experiencing issues, this will help identify the source of the issue immediately. Any time you interact with a queue, see process_submission.queue_length 153 if you can 1 get the current length during submission Things to note You should probably add the thing If you find yourself asking "Am I ever really going to need this bit of data?", default to throwing the attribute on. The marginal cost of each extra attribute is very small. If the data volume does start to grow, prefer wider, more context-rich events and a higher sample rate vs smaller events with a lower sample rate. Heatmaps are your friend Honeycomb's heatmaps are amazing at helping you find outliers, seeing multi-modal distributions, and getting a feel for your data. I wish more tooling supported them. I am not sure I can build software without them any more. Embrace the feedback loop When you are modifying code, make a change to the telemetry so that you can see the impact of the new code running. Once the code is released, check to make sure that you see the outcome you expected. Don't hesitate to add specific fields for one release and them remove them after. Tighter feedback loops are like going faster on a bicycle. They make for more stable systems and let you move faster with confidence. Semantic conventions and naming consistency I've tried to embrace semantic conventions in my naming, but would not be surprised if I've made multiple errors. Naming is hard! It's also hard to get consistency right within an organization or even across multiple systems owned by the same team. I would recommend trying to use semantic conventions as a guide, but do prioritize getting data out of your system in some form and getting some early wins over exacting adherence to an evolving specification. Once this data has proven its value within your organization, then you will have the leverage to spend engineering cycles on making things consistent. In the long run semantic conventions should allow Observability vendors to build new value and understanding on top of the telemetry you emit, but this effort is only just getting started. Frequent Objections Does this really work?? I have done this for dozens of production systems. Every single time the data has been invaluable for digging in and understanding what the system is actually doing, and we've found something surprising, even for the engineers who had worked on the system for many years. Things like: * Oh, actually 90% of the traffic of this system comes from one user * Wait, one of our worker processes is actually running a month old version of the code somehow? * This API endpoint usually has payloads of 1-2kb, but there is an edge case affecting one user where it's 40+MB. This causes their page loads to be several minutes longer than the p99. * After instrumenting the authentication middleware, around 20% of requests still didn't have user info. There was a whole second authentication system for a different class of users that hadn't been touched in years. * This endpoint that we'd like to deprecate accepts data in the form of A, B, and C, but none of our traffic ever even uses C. We can just drop support for that now. I don't like it. This feels wrong For anyone feeling that way now, I ask you to give it five minutes. I find that when a log line wraps around your terminal window multiple times, most developers have a pretty visceral negative reaction. This feels right: [2024-09-18 22:48:32.990] Request started http_path=/v1/charges request_id=req_123 [2024-09-18 22:48:32.991] User authenticated auth_type=api_key key_id=mk_123 user_id=usr_123 [2024-09-18 22:48:32.992] Rate limiting ran rate_allowed=true rate_quota=100 rate_remaining=99 [2024-09-18 22:48:32.998] Charge created charge_id=ch_123 permissions_used=account_write request_id=req_123 [2024-09-18 22:48:32.999] Request finished http_status=200 request_id=req_123 But this feels wrong: [2024-10-20T14:43:36.851Z] duration_ms=1266.1819686777117 main=true http.ip_address=92.21.101.252 instance.id=api-1 instance.memory_mb=12336 instance.cpu_count=4 instance.type=t3.small http.request.method=GET http.request.path=/api/categories/substantia-trado http.route=/api/categories/:slug http.request.body.size=293364 http.request.header.content_type=application/xml user_agent.original="Mozilla/5.0 (X11; Linux i686 AppleWebKit/535.1.2 (KHTML, like Gecko) Chrome/39.0.826.0 Safari/535.1.2" user_agent.device=phone user_agent.os=Windows user_agent.browser=Edge user_agent.browser_version=3.0 url.scheme=https url.host=api-service.com service.name=api-service service.version=1.0.0 build.id=1234567890 go.version=go1.23.2 rails.version=7.2.1.1 service.environment=production service.team=api-team service.slack_channel=#api-alerts service.build.deployment.at=2024-10-14T19:47:38Z service.build.diff_url=https://github.com/your-company/api-service/compare/c9d9380..05e5736 service.build.pull_request_url=https://github.com/your-company/api-service/pull/123 service.build.git_hash=05e5736 service.build.deployment.user=keanu.reeves@your-company.com service.build.deployment.trigger=manual container.id=1234567890 container.name=api-service-1234567890 cloud.availability_zone=us-east-1 cloud.region=us-east-1 k8s.pod.name=api-service-1234567890 k8s.cluster.name=api-service-cluster feature_flag.auth_v2=true http.response.status_code=401 user.id=Samanta27@gmail.com user.type=vip user.auth_method=sso-google user.team_id=team-1 You are structuring data so that it can be read efficiently by machines, not humans. Our systems emit too much data to waste precious human lifetimes using our eyeballs to scan lines of text looking for patterns to jump out. Let the robots help. This seems like a lot of work If you want to implement everything I've talked about in this post that would be a ton of work. However, even implementing the easiest subset is going to provide a lot of value. Not doing this results in so much more work building a mental model of your system, trying to debug by thinking through the code and hoping your mental model matches reality. A lot of this logic can be put into shared libraries within your org, though getting them adopted, keeping them updated and in-sync, and getting engineers used to these tools presents a whole different set of challenges. Many of these things could be surfaced to you by opinionated platforms or frameworks. I would love to see things move in this direction. Isn't this a lot of data? Won't it cost a lot?? Hacker News comment: This isn't an unknown idea outside of Meta, it's just really expensive, especially if you're using a vendor and not building your own tooling. Prohibitively so, even with sampling. First, you should compare this to your current log volume per request. I have seen many systems where this approach would reduce overall log volume. However storing this data for every request against your system could be too expensive at scale. That's where sampling comes in. Sampling gives you the controls to determine what you want to spend vs the value you receive from storing and making that data available to query. Realtime OLAP systems are also getting cheaper all the time. Once upon a time Scuba held all data in memory to make these types of questions quick to answer. Now most OLAP systems are evolving to columnar files stored on cloud object storage with queries handled by ephemeral compute which is many orders of magnitude cheaper. In the next section I'll show just how much cheaper. Repeated data Many of these fields will be the same for every request. Isn't that really inefficient? This is where our intuitions can lie to us. Let's look at a concrete example. I wrote a script [1] to generate a newline-delimited JSON file with a lot of the above fields and at least somewhat reasonable fake values. Let's say our service is serving 1000 req/s all day and sampling 1% of that traffic. Rounding to a whole number, that's about a million events. Generating a million example wide events results in a 1.6GB file. http_logs.ndjson 1607.61 MB But we repeat the keys on every single line. Even just turning it into a CSV cuts the size by more than 50%. http_logs.csv 674.72 MB Gzipping the file shows an amazing amount of compression, hinting that this isn't as much data as we might think. http_logs.ndjson.gz 101.67 MB Column store formats like parquet and Duckdb's native format can do even better. http_logs.parquet 88.83 MB http_logs.duckdb 80.01 MB They store all of the data for a specific column contiguously, which lends itself to different compression approachs. In the simplest case, if the column is always the same value, it can store that fact only once. Values that are the same across an entire row group are incredibly cheap. DuckDB diagram showing how a constant value along a whole column gets compressed If there are 2-3 different values, it can use dictionary-encoding to bit-pack these values really tightly. This also speeds up queries against this column. DuckDB diagram showing how values get compressed using dictionary encoding DuckDB has a great writeup on this which goes into much more detail. All of the data remains available and is easily (and quickly!) queryable. This is hardly "big data". Storing this on Cloudflare's R2 for a month would cost US$ 0.0012. You could keep 60 days of retention for US$ 0.072 / month. duckdb http_logs.duckdb D SELECT COUNT(*) FROM http_logs; +--------------+ | count_star() | | int64 | +--------------+ | 1000000 | +--------------+ Run Time (s): real 0.002 user 0.002350 sys 0.000946 D SELECT SUM(duration_ms) FROM http_logs; +-------------------+ | sum(duration_ms) | | double | +-------------------+ | 999938387.7714149 | +-------------------+ Run Time (s): real 0.003 user 0.008020 sys 0.000415 There are even in-memory and transport formats to help reduce size in memory and on the wire. OpenTelemetry is adopting arrow for its payloads for this reason. I found this podcast on the FDAP stack particularly helpful in understanding this space. Couldn't we JOIN data from multiple spans together to get this information? Query the whole trace at once? This is certainly possible. Honeycomb has started allowing you to filter on fields on other spans in the same trace. However I'd qualify this as very advanced. You want to make the right thing the easiest thing, and if you make it harder to ask questions, people will simply ask fewer questions. There are already a million things competing for our attention. Keep it simple. Make it fast. Does this mean I don't need metrics? You should probably still generate high-level metrics, though I bet you will need far fewer. Metrics are great when you know you want an exact answer to a very specific question that you know ahead of time. Questions like "How many requests did a serve yesterday?" or "What was my CPU usage like last month?" --------------------------------------------------------------------- [1] Well... mostly Cursor wrote it (c) 2024 | Jeremy Morrell