https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/

 
Jeremy Morrell
blog about rss
Oct 22, 2024
A Practitioner's Guide to Wide Events

Adopting Wide Event-style instrumentation has been one of the
highest-leverage changes I've made in my engineering career. The
feedback loop on all my changes tightened and debugging systems
became so much easier. Systems that were scary to work on suddenly
seemed a lot more manageable.

Lately there have been a lot of good blog posts on what "Wide Events"
mean and why they are important. Here are some of my recent
favorites:

  * All you need is Wide Events, not "Metrics, Logs and Traces" by
    Ivan Burmistrov
  * Observability wide events 101 by Boris Tane
  * Is it time to version Observability? (Signs point to yes) by
    Charity Majors

The tl;dr is that for each unit-of-work in your system (usually, but
not always an HTTP request / response) you emit one "event" with all
of the information you can collect about that work. "Event" is an
over-loaded term in telemetry so replace that with "log line" or
"span" if you like. They are all effectively the same thing.

Charity Majors has been promoting this approach lately under the name
"Observability 2.0", creating some new momentum around the concept,
however, it is not a new idea. Brandur Leach wrote about "Canonical
Log Lines" both on his own blog in 2016 and as used by Stripe in 2019
. And AWS has recommended it as a best-practice for ages.

Okay... I think I get the idea... but how do I do "wide events"?

This is where I find a lot of developers get tripped up. The idea
sounds good in theory, and we should totally try that one day! But I
have this stack of features to ship, that bug that's been keeping me
up at night, and 30 new AI tools that came out yesterday to learn
about. And like... where do you even start? What data should I add?

Like anything in software, there are a lot of options for how to
approach this, but I'll talk through one approach that has worked for
me.

We'll cover how to approach this in tooling and code, an extensive
list of attributes to add, and I'll respond to some frequent
objections that come up when discussing this approach.

For this post we'll focus on web services, but you would apply a
similar approach to any workload.

Choose your tools

We will need some way to instrument your code (traces or structured
log lines) and somewhere to send the instrumentation to in order to
query and visualize it.

This approach is best paired with a tool that lets you query your
data in quick iterations. I like Honeycomb for this, but any
Observability tool backed by a modern OLAP database is likely going
to work in a pinch.

  * Honeycomb has Retriever
  * DataDog has Husky
  * New Relic has NRDB
  * Baselime uses ClickHouse
  * SigNoz uses ClickHouse

Honeycomb, New Relic, and DataDog built their own columnar OLAP data
stores, though now with the availability of ClickHouse, InfluxDB IOx,
Apache Pinot, and DuckDB there are new Observability tools popping up
all the time.

If you aren't constrained, I highly recommend defaulting to using
OpenTelemetry and Honeycomb. Your life will be easier.

However even if you are stuck in a corporate environment with a
strong allergy to technology built after 2010 you can leverage log
search tools like ElasticSearch in a pinch. Stripe's blog post goes
over how to use Splunk for this.

In any tool you want to focus on getting proficient at 3 core
techniques in order to sift through your events. The faster you are
able to apply these, iterate, and ask questions of your data, the
better you'll be able to debug issues and see what your system is
really doing. When observability folks refer to "slicing and dicing"
data, this is what they are generally referring to. I'll represent
queries using a made-up SQL dialect, but you should be able to find
equivalents in your tool's query language.

Visualizing

Existing in a human body comes with its fair share of downsides, but
the human visual cortex is really, really good at recognizing
patterns. Give it a fighting chance by getting really good at
summoning visualizations of the data your system is emitting. COUNT,
COUNT_DISTINCT, HEATMAP, P90, MAX, MIN, Histogram. Learn to leverage
whatever graphs your tool makes available to you. Practice it. Get
fast.

A Honeycomb screenshot of heatmap

A Splunk screenshot of histogram

Grouping

With each new annotation that we add to our wide events, we create
another dimension along which we can slice our data. GROUP BY allows
us to look along that dimension and see if the values along that
dimension match our expectations.

GROUP BY instance.id

GROUP BY client.OS, client.version

Filtering

Once we've narrowed in one dimension that is interesting, we usually
want to dig further into that data. Filtering down so that we're only
looking at data from one endpoint, or from one IP address, or sent by
the iOS app, or only from users with a specific feature flag turned
on allows us to narrow our focus to a very specific segment of
traffic.

WHERE http.route = "/user/account"

WHERE http.route != "/health"

WHERE http.user_agent_header contains "Android"

Write a middleware to help you

If you are using an OpenTelemetry SDK it is already creating a
wrapping span around the request and response. You can access it by
asking for the active span at any point during the processing of the
request.

let span = opentelemetry.trace.getActiveSpan();
span.setAttributes({
  "user_agent.original": c.req.header("User-Agent"),
});

However if anyone wraps any of your code in a child span the "active
span" will change to be that new wrapping span! There is no
first-class way of addressing this original "main" span in
OpenTelemetry. However, we can work around this by saving a reference
to this specific span in the context so we can always have access to
the "main" wrapping span.

// create a reference to store the span on the opentelemetry context object
const MAIN_SPAN_CONTEXT_KEY = createContextKey("main_span_context_key");

function mainSpanMiddleware(req, res, next) {
  // pull the active span created by the http instrumentation
  let span = trace.getActiveSpan();

  // get the current context
  let ctx = context.active();

  // set any attributes we always want on the main span
  span.setAttribute("main", true);

  // OpenTelemetry context is immutable, so to modify it we create
  // a new version with our span added
  let newCtx = ctx.setValue(MAIN_SPAN_CONTEXT_KEY, span);

  // set that new context as active for the duration of the request
  context.with(newCtx, () => {
    next();
  });
}

// create another function that allows you to annotate this saved span easily
function setMainSpanAttributes(attributes) {
  let mainSpan = context.active().getValue(MAIN_SPAN_CONTEXT_KEY);
  if (mainSpan) {
    mainSpan.setAttributes(attributes);
  }
}

Now our annotation code can look a little simpler, and we can always
know that we're setting these attributes on the wrapping span.

setMainSpanAttributes({
  "user.id": "123",
  "user.type": "enterprise",
  "user.auth_method": "oauth",
});

You can play around with a minimal running example here.

At Heroku we had internal OpenTelemetry Distributions that set this
up for you automatically and added as many automatic annotations as
possible to these spans.

If you are not using OpenTelemetry here's a gist that might help you
get started. My previous post may help you put this logic together.

What do I add to this "main" span?

    And how many dimensions do you plan to emit and pack into your
    wide events?

    MANY. Hundreds! The more you have, the better you can detect and
    correlate rare conditions with precision.

    As you adjust to the joys of debugging with rich context, you
    will itch for it everywhere. [?]

    -- Charity Majors (@mipsytipsy) January 9, 2024

We need to add attributes about the request, and there are likely far
more of these than you would expect. It's easy to come up with a
dozen or so, but in a well-instrumented code base there will be
hundreds of attributes.

Note that while this is a long list, it is definitely not exhaustive.
OpenTelemetry defines sets of attribute names as Semantic Conventions
that can also be used for inspiration. I have tried to follow these
in my naming where possible.

A convention to filter out everything else

Traces contain lots of spans, so it's helpful to have a convention
for identifying and searching for these "wide events". root and canon
were floated as options, but I've landed on calling them main spans.

Attribute Examples                    Description
                   Present only for spans designated as a "wide
main      true     event", usually wrapping a request / response, or
                   a background job

This convention allows you to quickly figure out "what does the
traffic to this service look like?" with a single query:

SELECT
  COUNT(*)
WHERE
  main = true
GROUP BY http.route

Graph of traffic grouped by route over a week. There is an anomally.

Service metadata

Of course we need to add some information about the service we're
running. Consider adding additional metadata about which team owns
the system, or which Slack channel the owning team hangs out in,
though note that this can be tedious to update if your workplace
experiences frequent re-orgs. Tying these to a service catalog like
Backstage is left as an exercise to the reader.

      Attribute         Examples              Description
service.name          api          What is the name of this service?
                      shoppingcart
                      production
service.environment   staging      Where is this service running?
                      development
                      web-services Which team owns this service.
service.team          dev-ex       Useful for knowing who to page in
                                   during incidents.
service.slack_channel web-services If I discover an issue with this
                      dev-ex       service, where should I reach out?

    How many services does each team run?

SELECT
  COUNT_DISTINCT(service.name)
WHERE
  service.environment = "production"
GROUP BY service.team

Ever look at the load on a system and then wonder "Is that
appropriate for the machine this is running on?", and now you have to
look through other tools or config files to get that information.
Throw that context on the wide event so that it's available when you
need it.

    Attribute                    Examples                Description
                                                        An ID that
                                                        maps to this
instance.id        656993bd-40e1-4c76-baff-0e50e158c6eb one instance
                                                        of the
                                                        service
                                                        How much RAM
instance.memory_mb 12336                                is available
                                                        to this
                                                        service?
                   4                                    How many
instance.cpu_count 8                                    cores are
                   196                                  available to
                                                        this service?
                                                        Does your
                                                        vendor have a
instance.type      m6i.xlarge                           name for this
                                                        type of
                                                        instance?

    What are the services with the most memory that we run? What
    instance types do they use?

SELECT
  service.name,
  instance.memory_mb,
  instance.type
ORDER BY instance.memory_mb DESC
GROUP BY service.name, instance.type
LIMIT 10

However you're orchestrating your systems make sure that all of the
relevant information is added. I've included some examples from the
Kubernetes semantic conventions for inspiration.

       Attribute               Examples             Description
container.id            a3bf90e006b2           An ID used to identify
                                               Docker containers
container.name          nginx-proxy            Container name used by
                        wordpress-app          container runtime
                                               Name of the kubernetes
k8s.cluster.name        api-cluster            cluster your service
                                               is running in
                                               Name of the kubernetes
k8s.pod.name            nginx-2723453542-065rx pod your service is
                                               running in
cloud.availability_zone us-east-1c             AZ where you're
                                               running your service
cloud.region            us-east-1              Region where you're
                                               running your service

But even if you're using a Platform-as-a-Service you can still pull
out a lot of useful information!

    Attribute         Examples                Description
heroku.dyno       web.1            The env var DYNO that is set on
                  worker.3         your app at runtime
                  web              The first part of the DYNO env var
heroku.dyno_type  worker           before the .. Separating this
                                   makes it easier to query
                  1                The second part of the DYNO env
heroku.dyno_index 3                var after the .. Separating this
                                   makes it easier to query
heroku.dyno_size  performance-m    The selected dyno size
heroku.space      my-private-space The name of the private space that
                                   your are deployed into
heroku.region     virginia         Which region is this app located
                  oregon           in?

    How many dynos are we running? What dyno types are they? For
    which services?

SELECT
  COUNT_DISTINCT(heroku.dyno_index)
GROUP BY service.name, heroku.dyno_type, instance.type

Build info

Inevitably some of the first questions asked in any incident are "Did
something just go out?" or "What changed?". Instead of jumping to
your deployment tool or looking through GitHub repositories, add that
data to your telemetry.

Threading this data from your build system through to your production
system so that it's available at runtime can be a non-trivial amount
of glue code, but having this information easily available during
incidents is invaluable.

             Attribute                               Examples                   Description
                                                                              However you
                                                                              track your
service.version                      v123                                     version, ex: a
                                     9731945429d3d083eb78666c565c61bcef39a48f version string
                                                                              or a hash of the
                                                                              built image
                                                                              If your build
                                                                              system gives you
                                                                              an ID, this
service.build.id                     acd8bb57-fb9f-4b2d-a750-4315e99dac64     context allows
                                                                              you to audit the
                                                                              build if
                                                                              something goes
                                                                              wrong
                                                                              The git SHA of
                                                                              the deployed
service.build.git_hash               6f6466b0e693470729b669f3745358df29f97e8d commit so you
                                                                              can know exactly
                                                                              which code was
                                                                              running
                                                                              The url of the
                                     https://github.com/your-company/         pull request
service.build.pull_request_url       api-service/pull/121                     that was merged
                                                                              that triggered
                                                                              the deploy
                                                                              A url that
                                                                              compares the
                                     https://github.com/your-company/         previously
service.build.diff_url               api-service/compare/c9d9380..05e5736     deployed commit
                                                                              against the
                                                                              newly deployed
                                                                              commit
                                                                              Timestamp when
service.build.deployment.at          2024-10-14T19:47:38Z                     the deployment
                                                                              process started
                                                                              Which
                                                                              authenticated
service.build.deployment.user        [email protected]                        user kicked off
                                                                              the build? Could
                                                                              be a bot
                                                                              What triggered
                                     merge-to-main                            the deploy?
                                     slack-bot                                Extremely
service.build.deployment.trigger     api-request                              valuable context
                                     config-change                            during an
                                                                              deploy-triggered
                                                                              incident
                                                                              How old is this
                                                                              deploy?
                                                                              Shortcuts the
service.build.deployment.age_minutes 1                                        frequent
                                     10230                                    incident
                                                                              question "Did
                                                                              something just
                                                                              go out?"

Won't this be a lot of repetitive data? These values do not change
except between deploys! See Frequent Objections

    What systems have recently been deployed?

SELECT
  service.name,
  MIN(service.build.deployment.age_minutes) as age
WHERE
  service.build.deployment.age_minutes < 20
GROUP BY service.name
ORDER BY age ASC
LIMIT 10

    What's up with the spike of 500s when we did the last deploy?

SELECT
  COUNT(*)
WHERE
  service.name = "api-service" AND
  main = true
GROUP BY http.status_code, service.version

Graph showing requests grouped by http status code and version. There
is a spike of 500s correlating to v1 shutting down.

HTTP

You should get most of these from your tracing library
instrumentation, but there are usually more you can add if, for
example, your organization uses non-standard headers. Don't settle
for only what OpenTelemetry gives you by default!

           Attribute                 Examples         Description
                                                  Name of the HTTP
server.address                   example.com      server that
                                 localhost        received the
                                                  request
                                 /checkout        URI path after the
url.path                         /account/123/    domain
                                 features
url.scheme                       http, https      URI scheme
url.query                        q=test, ref=#### URI query component
                                                  Platform request
http.request.id                  79104EXAMPLEB723 id: ex:
                                                  x-request-id,
                                                  x-amz-request-id
                                 GET
http.request.method              PUT              HTTP request method
                                 POST
                                 OPTIONS
                                                  Size of the request
http.request.body_size           3495             payload body in
                                                  bytes
                                                  Value of a specific
                                                  request header,
                                                  "content-type" in
http.request.header.content-type application/json this case, but
                                                  there are many
                                                  more. Pick out any
                                                  that are important
                                                  for your service
                                 200              HTTP response
http.response.status_code        404              status code
                                 500
                                 1284             Size of the
http.response.body_size          2202009          response payload
                                                  body in bytes
                                                  Value of a specific
                                                  response header,
                                                  "content-type" in
http.request.header.content-type text/html        this case, but
                                                  there are many
                                                  more. Pick out any
                                                  that are important
                                                  for your service

SELECT
  HEATMAP(http.response.body_size),
WHERE
  main = true AND
  service.name = "api-service"

A heatmap of response sizes. Most are within a fixed band, but there
are sharp outliers that warrant more investigation.

User-Agent headers contain a wealth of info. Don't rely on regex
queries to try and make sense of them down the road. Parse them into
structured data from the beginning.

        Attribute                   Examples            Description
                           Mozilla/5.0 (Windows NT     The value of
                           10.0; Win64; x64)           the HTTP
user_agent.original        AppleWebKit/537.36 (KHTML,  User-Agent
                           like Gecko) Chrome/         header
                           129.0.0.0 Safari/537.3
                           computer                    Device type
user_agent.device          tablet                      derived from
                           phone                       the User-Agent
                                                       header
                                                       OS derived
user_agent.OS              Windows                     from the
                           MacOS                       User-Agent
                                                       header
                           Chrome                      Browser
user_agent.browser         Safari                      derived from
                           Firefox                     the User-Agent
                                                       header
                                                       Browser
                           129                         version
user_agent.browser_version 18.0                        derived from
                                                       the User-Agent
                                                       header

    What browsers are my users using?

SELECT
  COUNT(*)
GROUP BY user_agent.browser, user_agent.browser_version

If you have any custom user agents or headers used as a convention
within your org parse that out too.

        Attribute                          Examples                  Description
                                                                    If you have a
                                                                    distributed
                                                                    architecture,
                                                                    have each
user_agent.service         api-gateway                              service send
                           auth-service                             a custom
                                                                    User-Agent
                                                                    header with
                                                                    its name and
                                                                    version
                                                                    If you have a
                                                                    distributed
                                                                    architecture,
                                                                    have each
user_agent.service_version v123                                     service send
                           6f6466b0e693470729b669f3745358df29f97e8d a custom
                                                                    User-Agent
                                                                    header with
                                                                    its name and
                                                                    version
                                                                    If a request
                                                                    is coming
                                                                    from a mobile
user_agent.app             iOS                                      app, make
                           android                                  sure it
                                                                    includes
                                                                    which app and
                                                                    its version
                                                                    If a request
                                                                    is coming
                                                                    from a mobile
user_agent.app_version     v123                                     app, make
                           6f6466b0e693470729b669f3745358df29f97e8d sure it
                                                                    includes
                                                                    which app and
                                                                    its version

Route info

We're not done with HTTP attributes yet! One of the most important
bits is the API endpoint that the request matched. OpenTelemetry SDKs
will usually give this to you automagically but not always. Consider
extracting the route parameters and query parameters as additional
attributes.

        Attribute            Examples            Description
                          /team/         The route pattern that the
http.route                {team_id}/user url path is matched against
                          /{user_id}
                          14739          The extracted segment of the
http.route.param.team_id  team-name-slug url path as it is parsed for
                                         each parameter
                                         The query parameters that
http.route.query.sort_dir asc            are relevant to the response
                                         of your service. Ex: ?
                                         sort_dir=asc&...

SELECT
  P99(duration_ms)
WHERE
  main = true AND
  service.name = "api-service"
GROUP BY http.route

A chart of p99&#x27;s broken down by route. There is a spike on only
some of them. We should break down by version now to check if this
was caused by a deploy.

User and customer info

Once you get the basics down, this is the most important piece of
metadata that you can add. No automagic SDK will be able to encode
the particulars of your user model.

It's common for a single user or account to be responsible for a 10%+
of a business' revenue, and frequently their usage patterns look
significantly different than the average user. They probably have
more users, store more data, and hit limits and edge-cases that will
never show up for the user paying $10 / month. Be sure you can
separate their traffic from others.

   Attribute         Examples                 Description
                                   The primary ID for a user. If this
                 2147483647        is an email and you're using a
user.id          [email protected] vendor, consider your org's policy
                                   on putting PII in external
                                   services.
                                   How does the business see this
                 free              type of user? Individual accounts
user.type        premium           are sometimes responsible for 10%+
                 enterprise        of a business' income. Make sure
                 vip               you can separate their traffic
                                   from others!
                 token
user.auth_method basic-auth        How did this user authenticate
                 jwt               into your system?
                 sso-github
                 5387              If you have a team construct,
user.team.id     web-services      which one does this user belong
                                   to?
                 278               If this user is part of an
user.org.id      enterprise-name   organization with an enterprise
                                   contract, track that!
                                   Not the user's literal age, but
                                   how long ago was this account
user.age_days    0                 created? Is this an issue
                 637               experienced by someone new to your
                                   app, or only once they've saved a
                                   lot of data?
                                   Have an internal way of assuming a
user.assumed     true              user's identity for debugging? Be
                                   sure to track this
user.assumed_by  [email protected] And track which actual user is
                                   assuming the user's identity

SELECT
  P99(duration_ms)
WHERE
  main = true AND
  service.name = "api-service"
GROUP BY user.type

Rate limits

Whatever your rate limiting strategy, make sure the current rate
limit info gets added too. Can you quickly find examples of users
that are being rate-limited by your service?

     Attribute            Examples               Description
                                         You might not now, but you
                                         will likely have users with
ratelimit.limit     200000               different rate limits in the
                                         future, note down what the
                                         actual limit is for this
                                         request
ratelimit.remaining 130000               What is the budget remaining
                                         for this user?
                                         How many requests have been
ratelimit.used      70000                used in the current rate
                                         window
ratelimit.reset_at  2024-10-14T19:47:38Z When will the rate limit be
                                         reset next? if applicable

    This user has a support ticket open about being rate-limited.
    Let's see what they were doing

SELECT
  COUNT(*)
WHERE
  main = true AND
  service.name = "api-service" AND
  user.id = 5838
GROUP BY http.route

A graph of one users activity. There is a big spike hitting the same
route a lot at the end. This gives us a starting point for
investigation

    What routes are users who have burned most of their rate limit
    hitting? Does this activity look suspicious?

SELECT
  COUNT(*)
WHERE
  main = true AND
  service.name = "api-service" AND
  ratelimit.remaining < 100
GROUP BY http.route

Caching

For every code path where we could shortcut with a cache response,
add whether or not it was successful

     Attribute      Examples               Description
cache.session_info  true     Was the session info cached or did it
                    false    need to be re-fetched?
cache.feature_flags true     Were the feature flags cached for this
                    false    user or did they need to be re-fetched?

Localization info

What localization options has the user chosen? This can be a frequent
source of bugs

        Attribute         Examples            Description
localization.language_dir rtl, ltr Which direction is text laid out
                                   in their language?
localization.country      mexico,  Which country are they from?
                          uk
localization.currency     USD, CAD Which currency have they chosen to
                                   work with?

Uptime

Tracking how long the service has been running when it serves a
request can help you visualize several classes of bugs:

  * Issues that show up on a reboot
  * Memory leaks that only start to show up when the service has been
    running for a long time
  * Frequent crashes / restarts if you have automatically restart the
    service on failure

I recommend also either adding the log10 of the uptime or having some
way of visualizing this. When graphed this emphasizes the important
first few minutes of a service without being squished into the bottom
of the graph by instances with several days or more of uptime.

    Attribute     Examples                Description
                           How long has this instance of your app
uptime_sec        1533     been running? Can be useful to visualize
                           to see restarts
                           Grows sub-linearly which allows you to
uptime_sec_log_10 3.185    visualize long-running services and brand
                           new ones on the same graph

SELECT
  HEATMAP(uptime_sec),
  HEATMAP(uptime_sec_log_10)
WHERE
  main = true AND
  service.name = "api-service"

Heatmaps of uptime when a service enters a crash loop. It&#x27;s far
easier to distinguish in log scale

Metrics

This one might be a bit controversial, but I've found it helpful to
tag spans with context about what the system was experiencing while
processing the request. We fetch this information every ~10 seconds,
cache it, and add it to every main span produced during that time.

Capturing metrics in this way is not mathematically sound. Since you
only get data when traffic is flowing, you can't calculate a P90 for
cpu load that would stand up to any rigorous scrutiny, but that's
actually fine in practice. It's close enough to get some quick signal
while you're debugging without switching to a different tool,
especially if you can avoid calculations and visualize with a
heatmap.

I wouldn't recommend setting alerts on this data though. Plain ol'
metrics are great for that.

Jessica Kerr recently wrote about this approach on the Honeycomb Blog
.

          Attribute           Examples          Description
                              153      How much memory is being used
metrics.memory_mb             2593     by the system at the time its
                                       service this request
                              0.57     CPU load of the system service
metrics.cpu_load              5.89     this request. Given as # of
                                       active cores
                                       Last observed number of
                                       garbage collections. Could be
metrics.gc_count              5390     cumulative (total since
                                       service started) or delta (ex:
                                       number in the last minute)
                                       Time spent in garbage
metrics.gc_pause_time_ms      14       collections. Could also be
                              325      cumulative or delta. Pick one
                                       and document which
metrics.go_routines_count     3        Number of go routines running
                              3000
                                       Cumulative time spent waiting
metrics.event_loop_latency_ms 0        on the next event loop tick.
                              340      An important metric for Node
                                       apps

    Are these requests getting slow because we're running out of
    memory or CPU?

SELECT
  HEATMAP(duration_ms),
  HEATMAP(metrics.memory_mb),
  HEATMAP(metrics.cpu_load)
WHERE
  main = true AND
  service.name = "api-service"
GROUP BY instance.id

An example showing using the metrics data tagged on the span to get
context for whats happening with the system

Async request summaries

When using a tracing system async requests should get their own
spans, but it can still be useful to roll up some stats to identify
outliers and quickly find interesting traces.

           Attribute             Examples         Description
                                 1        How many http requests were
stats.http_requests_count        140      triggered during the
                                          processing of this request?
stats.http_requests_duration_ms  849      Cumulative time spent in
                                          these http requests
                                 7        How many Postgres queries
stats.postgres_query_count       742      were triggered during the
                                          processing of this request?
stats.postgres_query_duration_ms 1254     Cumulative time spent in
                                          these Postgres queries
                                 3        How many redis queries were
stats.redis_query_count          240      triggered during the
                                          processing of this request?
stats.redis_query_duration_ms    43       Cumulative time spent in
                                          these redis queries
                                          How many calls to this
stats.twilio_calls_count         1        vendors api were triggered
                                 4        during the processing of
                                          this request?
stats.twilio_calls_duration_ms   2153     Cumulative time spent in
                                          these vendor calls

    Surely my service makes a reasonable number of calls to the
    database... right?

SELECT
  HEATMAP(stats.postgres_query_count)
WHERE
  main = true AND
  service.name = "api-service"

A heatmap of db queries per request. There is a bi-modal distribution
but also some outliers that make a lot of requests

Instead of adding this explicitly, couldn't we aggregate this by
querying the whole trace? See Frequent Objections

Sampling

Once you start collecting fine-grained telemetry from your systems at
a significant scale you run head-on into the problem of sampling.
Running systems can produce a lot of data! Engineers frequently want
to store and query all of it. Exact answers always! Make it fast!
Also cheap! But it's trade-offs all the way down. Telemetry data is
fundamentally different from the transaction data you're storing for
your users, and you should think about it differently.

Luckily you only really need a statistically significant subset of
the full dataset. Even sampling 1 out of every 1000 requests can
provide a suprisingly detailed picture of the overall traffic
patterns in a system.

Sampling is a suprisingly deep topic. Keep it simple if you're
starting and do uniform random head sampling, but track your sample
rate per-span so you can be ready for more sophisticated approaches
down-the-line.

Good tooling will weight your calculations with a per-span, so you
don't have to mentally multiple the COUNT call by the sample_rate to
get an accurate answer. Here are some relevant articles:

  * I was first introduced to this idea in the Scuba paper
  * Honeycomb supports per-event sample rates
  * Cloudflare's Analytics Engine will automatically sample for you
    based on volume

 Attribute  Examples                   Description
            1        N where 1 in N events will be sampled and stored
sample_rate 500      and the rest dropped. If you're sampling 1% of
                     requests, the sample_rate would be 100

Timings

I find it super useful to break up the work that gets done to respond
to a request into a handful of important chunks and track how long
each segment took on the main span.

    Wait, isn't that what child spans are for?

Wrapping absolutely everything in its own span is the most common
failure mode I see when engineers first get access to tracing tools.
You have to design the structure of your data for the way you want to
query it.

Child spans are helpful for waterfall visualization for a single
request, but can be difficult to query and visualize across all of
your requests. Putting that information on a single span makes it
easier to query and also helps with tools like Honeycomb's BubbleUp
which can then immediately tell you that that group of requests was
slow because authentication took 10 seconds for some reason.

        Attribute         Examples            Description
                          52.2     How long did we spend performing
auth.duration_ms          0.2      authentication during this
                                   request?
payload_parse.duration_ms 22.1     Identify the core workloads of the
                          0.1      service and add timings for them

Errors

If you encounter an error and need to fail the operation, tag the
span with the error information: type, stacktrace, etc.

One approach that I have found super-valuable is tagging each
location where we throw an error with a unique slug describing the
error. If this string is unique within your codebase, it is easily
found with a quick search. This allows someone to jump straight from
a spike in errors on a dashboard to the exact line of code that
throwing the error. It also provides a convenient low-cardinality
field to GROUP BY.

You're unlikely to be able to wrap all possible errors, but any time
a failed request doesn't have an exception.slug that is a good sign
that you have places in your code where your error handling could be
improved. It's now really easy to find examples of requests that
failed in ways you didn't anticipate.

if isNotRecoverable(err) {
  // note the use of a plain string, not a variable, not dynamically generated
  // consider enforcing this with custom lint rules
  setErrorAttributes(err, "err-stripe-call-failed-exhausted-retries");
  throw err;
}

     Attribute               Examples               Description
                     true                      Special field for
error                false                     whether the request
                                               failed or not
                     Can't convert 'int'       The exception message
exception.message    object to str             encoded in the
                     undefined is not a        exception
                     function
exception.type       IOError                   The programmatic type
                     java.net.ConnectException of the exception
                     ReferenceError: user is   Capture the stack
                     not defined               trace if its available
exception.stacktrace at myFunction (/path/to/  to help pin-point
                     file.js:12:2)             where the error is
                     ...                       being thrown
                                               Is this an expected
                                               exception like a bot
                                               trying to hit a url
exception.expected   true, false               that doesn't exist?
                                               Allows filtering out
                                               of exceptions we can't
                                               prevent but don't need
                                               to worry about
                                               Create a unique
                                               grepp-able slug-value
                     auth-error                to identify the code
exception.slug       invalid-route             location of an error
                     github-api-unavailable    if its predictable
                                               during development
                                               time

    Which of our enterprise users hit the most errors last week? And
    which one?

SELECT
  COUNT_DISTINCT(user.id)
WHERE
  main = true AND
  service.name = "api-service" AND
  user.type = "enterprise"
GROUP BY exception.slug

    Show me traces where we likely need to improve our error handling

SELECT
  trace.trace_id
WHERE
  main = true AND
  service.name = "api-service" AND
  error = true AND
  exception.slug = NULL
GROUP BY trace.trace_id

Feature flags

Fine-grained feature flags are a developer super power that allows
you to test code changes in production with only a fraction of your
users or traffic. Adding the flag information per-request allows you
to compare how the new code is working as you opt more of your
traffic into the new code path. Coupled with the broad visibility you
can get with wide events, and this can make even tricky migrations
vastly more manageable and allow you to ship code with confidence.

Note that semantic conventions differ here and suggest adding feature
flag information as events on the span. I would suggest following
that standard since it will ultimately have the best support from
vendors if it's moved to stable, but especially in the mean time, I'm
also putting this info on the main span.

             Attribute              Examples       Description
                                    true     The value of a
feature_flag.auth_v2                false    particular feature flag
                                             for this request
                                    true     The value of a
feature_flag.double_write_to_new_db false    particular feature flag
                                             for this request

    What errors are the users in the new authentication flows
    hitting? How does it compare to the control group?

SELECT
  COUNT
WHERE
  main = true AND
  service.name = "api-service" AND
GROUP BY feature_flag.auth_v2, exception.slug

Versions of important things

Runtimes, frameworks, and any major libraries you are using can be
really helpful context.

   Attribute     Examples                 Description
go.version       go1.23.2 What version of your language runtime are
                          you using?
rails.version    7.2.1.1  Pick out any core libraries like web
                          frameworks and track their version too
postgres.version 16.4     If you can add the versions of any
                          datastores you're using, even better

    A security issue with Rails just got announced. What versions of
    the framework are our services using?

SELECT
  COUNT_DISTINCT(service.name)
WHERE
  service.environment = "production"
GROUP BY rails.version

    Our memory usage seems higher that it used to be. Didn't we
    upgrade the runtime recently? Does that correlate?

SELECT
  HEATMAP(metrics.memory_mb)
WHERE
  main = true AND
  service.name = "api-service"
GROUP BY go.version

Your specific application

Now we go off the map and get to the really valuable stuff. Your app
likely does something unique or works in a particular domain. You
might need to really care about which professional credentials a
Dentist using your app has, or which particular storage warehouse a
package is in, or which chip is in the embedded tracking device
installed in the cat that your app exists to track.

No framework is going to be able to understand what parts of your
domain are important to track and automate this for you, you have to
do that.

           Attribute                          Examples               Description
                                                                     If you
                                                                     upload
asset_upload.s3_bucket_path     s3://bucket-name/path/to/asset.jpg   something,
                                                                     add context
                                                                     about where
                                                                     If you
                                                                     interact
                                                                     with a
                                                                     vendor,
                                                                     track
email_vendor.transaction_id     62449c60-b51e-4d5c-8464-49217d91c441 whatever
                                                                     transaction
                                                                     ID they give
                                                                     you in case
                                                                     you need to
                                                                     follow up
                                                                     with them
                                                                     If there are
                                                                     3-4 types
                                                                     that
                                                                     something
                                                                     might fall
                                                                     into, be
                                                                     sure to add
                                                                     that
                                                                     context. Ex:
                                github                               If 2% of
vcs_integration.vendor          gitlab                               requests
                                bitbucket                            start
                                                                     failing
                                                                     because
                                                                     bitbucket is
                                                                     experiencing
                                                                     issues, this
                                                                     will help
                                                                     identify the
                                                                     source of
                                                                     the issue
                                                                     immediately.
                                                                     Any time you
                                                                     interact
                                                                     with a
                                                                     queue, see
process_submission.queue_length 153                                  if you can
                                1                                    get the
                                                                     current
                                                                     length
                                                                     during
                                                                     submission

Things to note

You should probably add the thing

If you find yourself asking "Am I ever really going to need this bit
of data?", default to throwing the attribute on. The marginal cost of
each extra attribute is very small. If the data volume does start to
grow, prefer wider, more context-rich events and a higher sample rate
vs smaller events with a lower sample rate.

Heatmaps are your friend

Honeycomb's heatmaps are amazing at helping you find outliers, seeing
multi-modal distributions, and getting a feel for your data. I wish
more tooling supported them. I am not sure I can build software
without them any more.

Embrace the feedback loop

When you are modifying code, make a change to the telemetry so that
you can see the impact of the new code running. Once the code is
released, check to make sure that you see the outcome you expected.
Don't hesitate to add specific fields for one release and them remove
them after.

Tighter feedback loops are like going faster on a bicycle. They make
for more stable systems and let you move faster with confidence.

Semantic conventions and naming consistency

I've tried to embrace semantic conventions in my naming, but would
not be surprised if I've made multiple errors. Naming is hard!

It's also hard to get consistency right within an organization or
even across multiple systems owned by the same team. I would
recommend trying to use semantic conventions as a guide, but do
prioritize getting data out of your system in some form and getting
some early wins over exacting adherence to an evolving specification.
Once this data has proven its value within your organization, then
you will have the leverage to spend engineering cycles on making
things consistent.

In the long run semantic conventions should allow Observability
vendors to build new value and understanding on top of the telemetry
you emit, but this effort is only just getting started.

Frequent Objections

Does this really work??

I have done this for dozens of production systems. Every single time
the data has been invaluable for digging in and understanding what
the system is actually doing, and we've found something surprising,
even for the engineers who had worked on the system for many years.

Things like:

  * Oh, actually 90% of the traffic of this system comes from one
    user
  * Wait, one of our worker processes is actually running a month old
    version of the code somehow?
  * This API endpoint usually has payloads of 1-2kb, but there is an
    edge case affecting one user where it's 40+MB. This causes their
    page loads to be several minutes longer than the p99.
  * After instrumenting the authentication middleware, around 20% of
    requests still didn't have user info. There was a whole second
    authentication system for a different class of users that hadn't
    been touched in years.
  * This endpoint that we'd like to deprecate accepts data in the
    form of A, B, and C, but none of our traffic ever even uses C. We
    can just drop support for that now.

I don't like it. This feels wrong

For anyone feeling that way now, I ask you to give it five minutes.

I find that when a log line wraps around your terminal window
multiple times, most developers have a pretty visceral negative
reaction.

This feels right:

[2024-09-18 22:48:32.990] Request started http_path=/v1/charges request_id=req_123
[2024-09-18 22:48:32.991] User authenticated auth_type=api_key key_id=mk_123 user_id=usr_123
[2024-09-18 22:48:32.992] Rate limiting ran rate_allowed=true rate_quota=100 rate_remaining=99
[2024-09-18 22:48:32.998] Charge created charge_id=ch_123 permissions_used=account_write request_id=req_123
[2024-09-18 22:48:32.999] Request finished http_status=200 request_id=req_123

But this feels wrong:

[2024-10-20T14:43:36.851Z] duration_ms=1266.1819686777117 main=true http.ip_address=92.21.101.252 instance.id=api-1 instance.memory_mb=12336
instance.cpu_count=4 instance.type=t3.small http.request.method=GET http.request.path=/api/categories/substantia-trado
http.route=/api/categories/:slug http.request.body.size=293364 http.request.header.content_type=application/xml
user_agent.original="Mozilla/5.0 (X11; Linux i686 AppleWebKit/535.1.2 (KHTML, like Gecko) Chrome/39.0.826.0 Safari/535.1.2" user_agent.device=phone
user_agent.os=Windows user_agent.browser=Edge user_agent.browser_version=3.0 url.scheme=https url.host=api-service.com service.name=api-service
service.version=1.0.0 build.id=1234567890 go.version=go1.23.2 rails.version=7.2.1.1 service.environment=production service.team=api-team
service.slack_channel=#api-alerts service.build.deployment.at=2024-10-14T19:47:38Z
service.build.diff_url=https://github.com/your-company/api-service/compare/c9d9380..05e5736
service.build.pull_request_url=https://github.com/your-company/api-service/pull/123
service.build.git_hash=05e5736 service.build.deployment.user=keanu.reeves@your-company.com
service.build.deployment.trigger=manual container.id=1234567890 container.name=api-service-1234567890 cloud.availability_zone=us-east-1
cloud.region=us-east-1 k8s.pod.name=api-service-1234567890 k8s.cluster.name=api-service-cluster feature_flag.auth_v2=true
http.response.status_code=401 user.id=Samanta27@gmail.com user.type=vip user.auth_method=sso-google user.team_id=team-1

You are structuring data so that it can be read efficiently by
machines, not humans. Our systems emit too much data to waste
precious human lifetimes using our eyeballs to scan lines of text
looking for patterns to jump out. Let the robots help.

This seems like a lot of work

If you want to implement everything I've talked about in this post
that would be a ton of work. However, even implementing the easiest
subset is going to provide a lot of value. Not doing this results in
so much more work building a mental model of your system, trying to
debug by thinking through the code and hoping your mental model
matches reality.

A lot of this logic can be put into shared libraries within your org,
though getting them adopted, keeping them updated and in-sync, and
getting engineers used to these tools presents a whole different set
of challenges.

Many of these things could be surfaced to you by opinionated
platforms or frameworks. I would love to see things move in this
direction.

Isn't this a lot of data? Won't it cost a lot??

Hacker News comment: This isn&#x27;t an unknown idea outside of Meta,
it&#x27;s just really expensive, especially if you&#x27;re using a
vendor and not building your own tooling. Prohibitively so, even with
sampling.

First, you should compare this to your current log volume per
request. I have seen many systems where this approach would reduce
overall log volume.

However storing this data for every request against your system could
be too expensive at scale. That's where sampling comes in. Sampling
gives you the controls to determine what you want to spend vs the
value you receive from storing and making that data available to
query.

Realtime OLAP systems are also getting cheaper all the time. Once
upon a time Scuba held all data in memory to make these types of
questions quick to answer. Now most OLAP systems are evolving to
columnar files stored on cloud object storage with queries handled by
ephemeral compute which is many orders of magnitude cheaper.

In the next section I'll show just how much cheaper.

Repeated data

    Many of these fields will be the same for every request. Isn't
    that really inefficient?

This is where our intuitions can lie to us. Let's look at a concrete
example.

I wrote a script [1] to generate a newline-delimited JSON file with a
lot of the above fields and at least somewhat reasonable fake values.

Let's say our service is serving 1000 req/s all day and sampling 1%
of that traffic. Rounding to a whole number, that's about a million
events. Generating a million example wide events results in a 1.6GB
file.

http_logs.ndjson     1607.61 MB

But we repeat the keys on every single line. Even just turning it
into a CSV cuts the size by more than 50%.

http_logs.csv         674.72 MB

Gzipping the file shows an amazing amount of compression, hinting
that this isn't as much data as we might think.

http_logs.ndjson.gz   101.67 MB

Column store formats like parquet and Duckdb's native format can do
even better.

http_logs.parquet      88.83 MB
http_logs.duckdb       80.01 MB

They store all of the data for a specific column contiguously, which
lends itself to different compression approachs. In the simplest
case, if the column is always the same value, it can store that fact
only once. Values that are the same across an entire row group are
incredibly cheap.

DuckDB diagram showing how a constant value along a whole column gets
compressed

If there are 2-3 different values, it can use dictionary-encoding to
bit-pack these values really tightly. This also speeds up queries
against this column.

DuckDB diagram showing how values get compressed using dictionary
encoding

DuckDB has a great writeup on this which goes into much more detail.
All of the data remains available and is easily (and quickly!)
queryable.

This is hardly "big data". Storing this on Cloudflare's R2 for a
month would cost US$ 0.0012. You could keep 60 days of retention for
US$ 0.072 / month.

 duckdb http_logs.duckdb
D SELECT COUNT(*) FROM http_logs;
+--------------+
| count_star() |
|    int64     |
+--------------+
|      1000000 |
+--------------+
Run Time (s): real 0.002 user 0.002350 sys 0.000946
D SELECT SUM(duration_ms) FROM http_logs;
+-------------------+
| sum(duration_ms)  |
|      double       |
+-------------------+
| 999938387.7714149 |
+-------------------+
Run Time (s): real 0.003 user 0.008020 sys 0.000415

There are even in-memory and transport formats to help reduce size in
memory and on the wire. OpenTelemetry is adopting arrow for its
payloads for this reason.

I found this podcast on the FDAP stack particularly helpful in
understanding this space.

Couldn't we JOIN data from multiple spans together to get this
information? Query the whole trace at once?

This is certainly possible. Honeycomb has started allowing you to
filter on fields on other spans in the same trace. However I'd
qualify this as very advanced. You want to make the right thing the
easiest thing, and if you make it harder to ask questions, people
will simply ask fewer questions. There are already a million things
competing for our attention. Keep it simple. Make it fast.

Does this mean I don't need metrics?

You should probably still generate high-level metrics, though I bet
you will need far fewer.

Metrics are great when you know you want an exact answer to a very
specific question that you know ahead of time. Questions like "How
many requests did a serve yesterday?" or "What was my CPU usage like
last month?"

---------------------------------------------------------------------

[1] Well... mostly Cursor wrote it

(c) 2024 | Jeremy Morrell