[HN Gopher] Migrating to OpenTelemetry
___________________________________________________________________
Migrating to OpenTelemetry
Author : kkoppenhaver
Score : 127 points
Date : 2023-11-16 17:29 UTC (5 hours ago)
(HTM) web link (www.airplane.dev)
(TXT) w3m dump (www.airplane.dev)
| caust1c wrote:
| Curious about the code implemented for logs! Hopefully that's
| something that can be shared at some point. Also curious if it
| integrates with `log/slog` :-)
|
| Congrats too! As I understand it from stories I've heard from
| others, migrating to OTel is no easy undertaking.
| bhyolken wrote:
| Thanks! For logs, we actually use github.com/segmentio/events
| and just implemented a handler for that library that batches
| logs and periodically flushes them out to our collector using
| the underlying protocol buffer interface. We plan on migrating
| to log/slog soon, and once we do that we'll adapt our handler
| and can share the code.
| caust1c wrote:
| Awesome! Great work and thanks for sharing your experience!
| MajimasEyepatch wrote:
| It's interesting that you're using both Honeycomb and Datadog.
| With everything migrated to OTel, would there be advantages to
| consolidating on just Honeycomb (or Datadog)? Have you found
| they're useful for different things, or is there enough overlap
| that you could use just one or the other?
| bhyolken wrote:
| Author here, thanks for the question! The current split
| developed from the personal preferences of the engineers who
| initially set up our observability systems, based on what they
| had used (and liked) at previous jobs.
|
| We're definitely open to doing more consolidation in the
| future, especially if we can save money by doing that, but from
| a usability standpoint we've been pretty happy with Honeycomb
| for traces and Datadog for everything else so far. And, that
| seems to be aligned with what each vendor is best at at the
| moment.
| MuffinFlavored wrote:
| > from the personal preferences of the engineers
|
| https://www.honeycomb.io/pricing
|
| https://www.datadoghq.com/pricing/
|
| Am I wrong to say... having 2 is "expensive"? Maybe not if
| 50% of your stuff is going to Honeycomb and 50% going to
| DataDog. Could you save money/complexity (less places to look
| for things) having just DataDog or just Honeycomb?
| bhyolken wrote:
| Right now, there isn't much duplication of what we're
| sending to each vendor, so I don't think we'd save a ton by
| consolidating, at least based on list prices. We could
| maybe negotiate better prices based on higher volumes, but
| I'm not sure if Airplane is spending enough at this point
| to get massive discounts there.
|
| Another potential benefit would definitely be reduced
| complexity and better integration for the engineering team.
| So, for instance, you could look at a log and then more
| easily navigate to the UI for the associated trace.
| Currently, we do this by putting Honeycomb URLs in our
| Datadog log events, which works but isn't quite as
| seamless. But, given that our team is pretty small at this
| point and that we're not spending a ton of our time on
| performance optimizations, we don't feel an urgent need to
| consolidate (yet).
| MuffinFlavored wrote:
| When you say DataDog for everything else (as in not
| traces), besides logs, what else do you mean?
| claytonjy wrote:
| Metrics, probably? The article calls out logs, metrics,
| and traces as the 3 pillars of observability.
| bhyolken wrote:
| Yeah, metrics and logs, plus a few other things that
| depend on these (alerts, SLOs, metric-based dashboards,
| etc.).
| tapoxi wrote:
| I made this switch very recently. For our Java apps it was as
| simple as loading the otel agent in place of the Datadog SDK,
| basically "-javaagent:/opt/otel/opentelemetry-javaagent.jar" in
| our args.
|
| The collector (which processes and ships metrics) can be
| installed in K8S through Helm or an operator, and we just added a
| variable to our charts so the agent can be pointed at the
| collector. The collector speaks OTLP which is the fancy combined
| metrics/traces/logs protocol the OTEL SDKs/agents use, but it
| also speaks Prometheus, Zipkin, etc to give you an easy migration
| path. We currently ship to Datadog as well as an internal
| service, with the end goal being migrating off of Datadog
| gradually.
| andrewstuart2 wrote:
| We tried this about a year and a half ago and ended up going
| somewhat backwards into DD entrenchment, because they've
| decided that anything not an official DD metric (that is,
| collected by their agent typically) is custom and then becomes
| substantially more expensive. We wanted a nice migration path
| from any vendor to any other vendor but they have a fairly
| effective strategy for making gradual migrations more expensive
| for heavy telemetry users. At least our instrumentation these
| days is otel, but it's the metrics we expected to just scrape
| from prometheus that we had to dial back and start using more
| official DD agent metrics and configs to get, lest our bill
| balloon by 10x. It's a frustrating place to be. Especially
| since it's still not remotely cheap, just that it could be way
| worse.
|
| I know this isn't a DataDog post, and I'm a bit off topic, but
| I try to do my best to warn against DD these days.
| shawnb576 wrote:
| This has been a concern for me too. But the agent is just a
| statsd receiver with some extra magic, so this seems like a
| thing that could be solved with the collector sending traffic
| to an agent rather than the HTTP APIs?
|
| I looked at the OTel DD stuff and did not see any support for
| this, fwiw, maybe it doesn't work b/c the agent expects more
| context from the pod (e.g. app and label?)
| andrewstuart2 wrote:
| Yeah, the DD agent and the otel-collector DD exporter
| actually use the same code paths for the most part. The
| relevant difference tends to be in metrics, where the
| official path involves the DD agent doing collection
| directly, for example, collecting redis metrics by giving
| the agent your redis database hostname and creds. It can
| then pack those into the specific shape that DD knows about
| and they get sent with the right name, values, etc so that
| DD calls them regular metrics.
|
| If you instead went the more flexible route of using many
| of the de-facto standard prometheus exporters like the one
| for redis, or built-in prometheus metrics from something
| like istio, and forward those to your agent or configure
| your agent to poll those prometheus metrics, it won't do
| any reshaping (which I can see the arguments for, kinda,
| knowing a bit about their backend) and they just end up in
| the DD backend as custom metrics, and charge you at
| $0.10/mo per 100 time series. If you've used prometheus
| before for any realistic deployments with enrichment etc,
| you can probably see this gets expensive ridiculously fast.
|
| What I wish they'd do instead is have some form of adapter
| from those de facto standards, so I can still collect
| metrics 99% my own way, in a portable fashion, and then add
| DD as my backend without ending up as custom everything,
| costing significantly more.
| xyst wrote:
| > somewhat backwards into DD entrenchment, because they've
| decided that anything not an official DD metric (that is,
| collected by their agent typically) is custom and then
| becomes substantially more expensive.
|
| It a vendor pulled shit like this on me. That's when I would
| counsel them. Of course most big orgs would rather not do the
| leg work to actually become portable, migrate off vendor. So
| of course they will just pay the bill.
|
| Vendors love the custom shit they build because they know
| once it's infiltrated the stack then it's basically like
| gangrene (have to cut off the appendage to save the host)
| k__ wrote:
| I had the impression, logs and metrics are a pre-observability
| thing.
| SteveNuts wrote:
| I've never heard the term "pre-observability", what does that
| mean?
| renegade-otter wrote:
| The era when "debugging in production" wasn't standard.
| marcosdumay wrote:
| Observability is about logs and metrics, and pre-observability
| (I guess you mean the high-level-only records simpler
| environments keep) is also about logs and metrics.
|
| Anything you register to keep track of your environment has the
| form of either logs or metrics. The difference is about the
| contents of such logs and metrics.
| tsamba wrote:
| Interesting read. What did you find easier about using GCP's log
| tooling for your internal system logs, rather than the OTel
| collector?
| roskilli wrote:
| > Moreover, we encountered some rough edges in the metrics-
| related functionality of the Go SDK referenced above. Ultimately,
| we had to write a conversion layer on top of the OTel metrics API
| that allowed for simple, Prometheus-like counters, gauges, and
| histograms.
|
| Have encountered this a lot from teams attempting to use the
| metrics SDK.
|
| Are you open to comment on specifics here and also what kind of
| shim you had to put in front of the SDK? It would be great to
| continue to retrieve feedback so that we can as a community have
| a good idea of what remains before it's possible to use the SDK
| for real world production use cases in anger. Just wiring up the
| setup in your app used to be fairly painful but that has gotten
| somewhat better over the last 12-24 months, I'd love to also hear
| what is currently causing compatibility issues w/ the metric
| types themselves using the SDK which requires a shim and what the
| shim is doing to achieve compatibility.
| bhyolken wrote:
| Sure, happy to provide more specifics!
|
| Our main issue was the lack of a synchronous gauge. The
| officially supported asynchronous API of registering a callback
| function to report a gauge metric is very different from how we
| were doing things before, and would have required lots of
| refactoring of our code. Instead, we wrote a wrapper that
| exposes a synchronous-like API: https://gist.github.com/yolken-
| airplane/027867b753840f7d15d6....
|
| It seems like this is a common feature request across many of
| the SDKs, and it's in the process of being fixed in some of
| them (https://github.com/open-telemetry/opentelemetry-
| specificatio...)? I'm not sure what the plans are for the
| golang SDK specifically.
|
| Another, more minor issue, is the lack of support for
| "constant" attributes that are applied to all observations of a
| metric. We use these to identify the app, among other use
| cases, so we added wrappers around the various "Add", "Record",
| "Observe", etc. calls that automatically add these. (It's
| totally possible that this is supported and I missed it, in
| which case please let me know.)
|
| Overall, the SDK was generally well-written and well-
| documented, we just needed some extra work to make the
| interfaces more similar to the ones we were using before.
| arccy wrote:
| the official SDKs will only support an api once there's a
| spec that allows it.
|
| for const attributes, generally these should be defined at
| the resource / provider level: https://pkg.go.dev/go.opentele
| metry.io/otel/sdk/metric#WithR...
| CSMastermind wrote:
| > The data collected from these streams is sent to several
| vendors including Datadog (for application logs and metrics),
| Honeycomb (for traces), and Google Cloud Logging (for
| infrastructure logs).
|
| It sounds like they were in a place that a lot of companies are
| in where they don't have a single pane of glass for
| observability. One of if not the main benefit I've gotten out of
| Datadog is having everything in Datadog so that it's all
| connected and I can easily jump from a trace to logs for
| instance.
|
| One of the terrible mistakes I see companies make with this
| tooling is fragmenting like this. Everyone has their own personal
| preference for tool and ultimately the collective experience is
| significantly worse than the sum of its parts.
| devin wrote:
| Eh, personally I view honeycomb and datadog as different enough
| offerings that I can see why you'd choose to have both.
| dexterdog wrote:
| Depending on your usage it can be prohibitively expensive to
| use datadog for everything like that. We have it for just our
| prod env because it's just not worth what it brings to the
| table to put all of our logs into it.
| maccard wrote:
| I've spent a small amount of time in datadog, lots in grafana,
| and somewhere in between in honeycomb. Out applications are
| designed to emit traces, and comparing honeycomb with tracing
| to a traditional app with metrics and logs, I would choose
| tracing every time.
|
| It annoys me that logs are overlooked in honeycomb, (and
| metrics are... fine). But, given the choice between a single
| pane of glass in grafana or having to do logs (and metrics
| sometimes) in cloudwatch but spending 95% of my time in
| honeycomb - I'd pick honeycomb every time
| mdtusz wrote:
| Agreed - honeycomb has been a boon, however some improvements
| to metric displays and the ability to set the default "board"
| used in the home page would be very welcome. Also would be
| pretty happy if there was a way to drop events on the
| honeycomb side for a way to dynamically filter - e.g. "don't
| even bother storing this trace if it has a http.status_code <
| 400". This is surprisingly painful to implement on the
| application side (at least in rust).
|
| Hopefully someone that works there is reading this.
| masterj wrote:
| It sounds like you should look into their tail-sampling
| Refinery tool https://docs.honeycomb.io/manage-data-
| volume/refinery/
| viraptor wrote:
| Have you tried the traces in grafana/tempo yet?
| https://grafana.com/docs/grafana/latest/panels-
| visualization...
|
| It seems to miss some aggregation stuff, but also it's
| improving every time I check. I wonder if anyone's used it in
| anger yet and how far is it from replacing datadog or
| honeycomb.
| arccy wrote:
| tempo still feels very much: look at a trace that you found
| from elsewhere (like logs)
| rewmie wrote:
| > It sounds like they were in a place that a lot of companies
| are in where they don't have a single pane of glass for
| observability.
|
| One of the biggest features of AWS which is very easy to take
| for granted and go unnoticed is Amazon CloudWatch. It supports
| metrics, logging, alarms, metrics from alarms, alarms from
| alarms, querying historical logs, trigger actions, etc etc etc.
| and it covers each and every single service provided by AWS
| including metaservices like AWS Config and Cloudtrail.
|
| And you barely notice it. It's just there, and you can see
| everything.
|
| > One of the terrible mistakes I see companies make with this
| tooling is fragmenting like this.
|
| So much this. It's not fun at all to have to go through logs
| and metrics on any application,and much less so if for some
| reason their maintainers scattered their metrics emission to
| the four winds. However, with AWS all roads lead to Cloudwatch,
| and everything is so much better.
| badloginagain wrote:
| I feel we hold up single-observability-solution as the Holy
| Grail, and I can see the argument for it- one place to
| understand the health of your services.
|
| But I've also been in terrible vendor lock-in situations, being
| bent over the barrel because switching to a better solution is
| so damn expensive.
|
| At least now with OTel you have an open standard that allows
| you to switch easier, but even then I'd rather have 2 solutions
| that meet my exact observability requirements than a single
| solution that does everything OKish.
| nevon wrote:
| I would love to save a few hundred thousands a year by running
| Otel collector over Datadog agents, just on the cost-per-host
| alone. Unfortunately that would also mean giving up Datatog APM
| and NPM, as far as I can tell, which have been really valuable.
| Going back to just metrics and traces would feel like quite the
| step backwards and be a hard sell.
| arccy wrote:
| you can submit opentelemetry traces to datadog which should be
| the equivalent of apm/npm, though maybe with a less polished
| integration.
| throwaway084t95 wrote:
| What is the "first principles" argument that observability
| decomposes into logs, metrics, and tracing? I see this dogma
| accepted everywhere, but I'm inquisitive about it
| yannyu wrote:
| First you had logs. Everyone uses logs because it's easy. Logs
| are great, but suddenly you're spending a crapton of time or
| money maintaining terabytes or petabytes of storage and ingest
| of logs. And even worse, in some cases for these logs, you
| don't actually care about 99% of the log line and simply want a
| single number, such as CPU utilization or the value of the
| shopping cart or latency.
|
| So, someone says, "let's make something smaller and more
| portable than logs. We need to track numerical data over time
| more easily, so that we can see pretty charts of when these
| values are outside of where they should be." This ends up being
| metrics and a time-series database (TSDB), built to handle not
| arbitrary lines of text but instead meant to parse out metadata
| and append numerical data to existing time-series based on that
| metadata.
|
| Between metrics and logs, you end up with a good idea of what's
| going on with your infrastructure, but logs are still too
| verbose to understand what's happening with your applications
| past a certain point. If you have an application crashing
| repeatedly, or if you've got applications running slowly,
| metrics and logs can't really help you there. So companies
| built out Application Performance Monitoring, meant to tap
| directly into the processes running on the box and spit out all
| sorts of interesting runtime metrics and events about not just
| the applications, but the specific methods and calls those
| applications are utilizing within their stack/code.
|
| Initially, this works great if you're running these APM tools
| on a single box within monolithic stacks, but as the world
| moved toward Cloud Service Providers and
| containerized/ephemeral infrastructure, APM stopped being as
| effective. When a transaction starts to go through multiple
| machines and microservices, APM deployed on those boxes
| individually can't give you the context of how these disparate
| calls relate to a holistic transaction.
|
| So someone says, "hey, what if we include transaction IDs in
| these service calls, so that we can post-hoc stitch together
| these individual transaction lines into a whole transaction,
| end-to-end?" Which is how you end up with the concept of spans
| and traces, taking what worked well with Application
| Performance Monitoring and generalizing that out into the
| modern microservices architectures that are more common today.
| shoelessone wrote:
| I really really want to use OTel for a small project but have
| always had a really tough time finding a path that is cheap or
| free for a personal project.
|
| In theory you can send telemetry data with OTel to Cloud Watch,
| but I've struggle to connect the dots with the front end
| application (e.g. React/Next.js).
| arccy wrote:
| grafana cloud, honeycomb, etc have free tiers, though you'll
| have to watch how much data you send them. or you can self host
| something like signoz or the elastic stack. frontend will
| typically go to an instance of opentelemetry collector to
| filter/convert to the protocol for the storage backend.
___________________________________________________________________
(page generated 2023-11-16 23:00 UTC)