[HN Gopher] I got OpenTelemetry to work. But why was it so compl...
___________________________________________________________________
I got OpenTelemetry to work. But why was it so complicated?
Author : paltaie
Score : 150 points
Date : 2025-01-10 12:38 UTC (10 hours ago)
(HTM) web link (iconsolutions.com)
(TXT) w3m dump (iconsolutions.com)
| cglan wrote:
| I agree. I tried to get it to work recently with Datadog, but
| there were so many hiccups. I ended up having to use Datadog's
| solution mostly. The documentation across everything is also kind
| of confusing
| SomaticPirate wrote:
| imo Datadog is pretty hostile to OTel too. Ever since
| https://github.com/open-telemetry/opentelemetry-collector-co...
| was nearly killed by them I never felt like they fully
| supported the standard (perhaps for good reasons)
|
| OTel is a bear though. I think the biggest advantage it gives
| you is the ability to move across tracing providers
| bebop wrote:
| I worry that vision is not going to become reality if the
| large observability vendors don't want to support the
| standard.
| phillipcarter wrote:
| FWIW the "datadog doesn't like otel" thing is kind of old
| hat, and the story was a little more complicated at the
| time too.
|
| Nowadays they're contributing more to the project directly
| and have built some support to embed the collector into
| their DD agent. Other vendors (splunk, dynatrace, new
| relic, grafana, honeycomb, sumo logic, etc.) contribute to
| the project a bunch and typically recommend using OTel to
| start instead of some custom stuff from before.
| arccy wrote:
| They support ingesting via otel (ie competing with other
| vendors for their customers) but won't support ingesting
| via their SDKs (they still try very hard to lock you in
| to their tooling).
| rikthevik wrote:
| > the ability to move across tracing providers
|
| It's a nice dream. At Google Cloud Next last year, the
| vendors kind of came in two buckets: Datadog, and everyone
| trying to replace Datadog's outrageous bills.
| hangonhn wrote:
| Yeah their agent will accept traces from the standard Otel
| SDK but there is no way to change their SDK to send the
| traces to anyone other than Datadog when I last checked a
| couple(?) of years ago.
|
| I mean I understand why they did that but it really removes
| one of the most compelling parts about Otel. We ended up doing
| the hard work of using the standard Otel libraries. I had to
| contribute a PR or two to get it all to work with our
| services but am glad that's the route we went because now we
| can switch vendors if needed (which is likely in the not too
| distant future in our case).
| jensensbutton wrote:
| Pretty sure Datadog is literally one of the top contributors
| to OTel.
| ljm wrote:
| The biggest barrier to setting up oTel for me is the
| development experience. Having a single open specification is
| fantastic, especially for portability, but the SDKs are almost
| overwhelmingly abstract and therefore difficult to intuit.
|
| I used to really like Datadog for being a one-stop
| observability shop and even though the experience of
| integrating with it is still quite simple, I think product and
| pricing wise they've jumped the shark.
|
| I'm much happier these days using a collection of small time
| services and self-hosting other things, and the only part of
| that which isn't joyful is the boilerplate and not really
| understanding when and why you should, say, use gRPC over HTTP,
| and stuff like that.
| pranay01 wrote:
| part of the reason for that experience is also because DataDog
| is not open telemetry native and all their docs and
| instructions encourage use of their own agents. Using DataDog
| with Otel is like trying to hold your nose round over your head
|
| You should try Otel native observability platforms like SigNoz,
| Honeycomb, etc. your life will be much simpler
|
| Disclaimer : i am one of the maintainers at SigNoz
| nimish wrote:
| It's complicated because it's designed for the companies selling
| Otel compatible software, not the engineers implementing it
| convolvatron wrote:
| this is going to come off as being fussy, but 'implement' used
| to refer to the former activity, not the latter. which is fine,
| meanings change, it's just amusing that we no longer have a word
| we can use for 'sitting down and writing software to match a
| specification' and only 'taking existing software and deploying
| it on servers'
| hinkley wrote:
| Operational versus builder jargon.
| stronglikedan wrote:
| It's still implementing. Someone has taken the specifications
| and implemented the software, and then someone else has taken
| the software and implemented a solution with it.
| skrebbel wrote:
| This has been the case for ages. Sysadmins use "implement" to
| mean "install software on servers and keep it running",
| coders use "implement" to mean "code stuff that matches a
| spec/interface". It's just two worlds accidentally using the
| same term for a different thing. No meanings are changing.
| Two MS certified sysadmins in 1999 could talk about how they
| were "Implementing Exchange across the whole company".
| paulddraper wrote:
| That hasn't been what I've seen from the contributors.
|
| If anything I think the backends were kinda slow to adopt.
| hocuspocus wrote:
| Have you considered Kamon instead? From personal experience it's
| really the best tracing solution for Akka and other libraries
| using Scala Futures. I haven't tried it, but it does have built-
| in Spring support as well.
|
| https://kamon.io
|
| Edit: I wonder why suggesting JVM instrumentation that is much
| more polished than the OTel and Lightbend agents gets me
| downvoted?
| dimitar wrote:
| It is as complicated as you want or need it to be. You can avoid
| any magic and stick to a subset that is easy to reason about and
| brings the most value in your context.
|
| For our team, it is very simple:
|
| * we use a library to send traces and traces only[0]. They bring
| the most value for observing applications and can contain all the
| data the other types can contain. Basically hash-maps vs strings
| and floats.
|
| * we use manual instrumentation as opposed to automatic - we are
| deliberate in what we observe and have a great understanding of
| what emits the spans. We have naming conventions that match our
| code organization.
|
| * we use two different backends - an affordable 3rd party service
| and an all-in-one Jaeger install (just run 1 executable or docker
| container) that doesn't save the spans on disk for local
| development. The second is mostly for peace of mind of team
| members that they are not going to flood the third party service.
|
| [0] We have a previous setup to monitor infrastructure and in our
| case we don't see a lot of value of ingesting all the
| infrastructure logs and metrics. I think it is early days for
| OTEL metrics and logs, but the vendors don't tell you this.
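|
| A rough illustration of the "hash-maps vs strings and floats"
| point, as a stdlib-only Python sketch rather than the OTel SDK.
| The span() helper, the FINISHED list and the attribute names are
| all made up for the example; the idea is only that a span is a
| dict from which a metric (one float) or a log line (one string)
| can be derived, not the other way around.

```python
import time
from contextlib import contextmanager

FINISHED = []  # hypothetical in-memory sink standing in for an exporter


@contextmanager
def span(name, **attributes):
    """Record one unit of work as a dict with a name, timing and attributes."""
    record = {"name": name, "attributes": dict(attributes), "error": None}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        FINISHED.append(record)


def as_metric(record):
    """A metric is one float pulled out of the span."""
    return (record["name"] + ".duration_ms", record["duration_ms"])


def as_log_line(record):
    """A log line is the span flattened into a string."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(record["attributes"].items()))
    return f"{record['name']} {attrs}".strip()


# Span name follows a made-up "module.function" naming convention.
with span("billing.charge_card", customer_tier="pro", retries=0) as s:
    s["attributes"]["amount_cents"] = 1250
```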
| buzzdenver wrote:
| Mind sharing what that affordable 3rd party service is?
| hinkley wrote:
| I did not find that manual instrumentation made things simpler.
| You're trading a learning curve that now starts way before you
| can demonstrate results for a clearer understanding of the
| performance penalties of using this Rube Goldberg machine.
|
| Otel may be okay for a green field project but turning this
| thing on in a production service that already had telemetry
| felt like replacing a tire on a moving vehicle.
| dmoy wrote:
| I've not used otel for anything not greenfield, but I just
| wanted to say
|
| > felt like replacing a tire on a moving vehicle.
|
| Some people do this as a joke / dare. I mean literally
| replacing a car tire on a moving vehicle.
|
| You Saudi drift up onto one side, and have people climb out
| of the side in the air, and then swap the tire while the car
| is driving on two wheels.
|
| It's pretty insane stuff:
| https://youtu.be/Str7m8xV7W8?si=KkjBh6OvFoD0HGoh
| hinkley wrote:
| That was the image I had in my head.
|
| My whole career I've been watching people on greenfield
| projects looking down on devs on already successful
| products for not using some tool they've discovered,
| missing the fact that their tool only functions if you
| build your whole product around the exact mental model of
| the tool (green field).
|
| Wisdom is learning to watch for people obviously working on
| brownfield projects espousing a tool. Like moving from VMs
| to Docker. Ansible to Kubernetes (maybe not the best
| example). They can have a faster adoption cycle and more
| staying power.
| madeofpalk wrote:
| It's as complicated as you want, but it's not as easy as I
| want. The floor is pretty high.
|
| I'm still looking for an endpoint just to _send_ simple one-off
| metrics to from parts of infrastructure that's not scrapable.
| pat2man wrote:
| You can just send metrics via JSON to any otlphttp collector:
| https://github.com/open-telemetry/opentelemetry-proto/blob/v...
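|
| For a sense of what that looks like, here is a hedged Python
| sketch using only the standard library. The field names follow
| the OTLP/JSON mapping of the metrics proto as I understand it, so
| check them against the proto linked above. The service name, the
| metric name, the scope name and the localhost:4318 endpoint (the
| usual OTLP/HTTP default) are assumptions for the example.

```python
import json
import time
import urllib.request


def build_gauge_payload(service, metric_name, value, unit="1"):
    """Build an OTLP/JSON metrics export request holding one gauge point."""
    return {
        "resourceMetrics": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service}}
                ]
            },
            "scopeMetrics": [{
                "scope": {"name": "one-off-script"},  # hypothetical scope name
                "metrics": [{
                    "name": metric_name,
                    "unit": unit,
                    "gauge": {
                        "dataPoints": [{
                            "asDouble": float(value),
                            # 64-bit integers are written as strings in proto JSON
                            "timeUnixNano": str(time.time_ns()),
                        }]
                    },
                }],
            }],
        }]
    }


def send(payload, endpoint="http://localhost:4318/v1/metrics"):
    """POST the payload; the endpoint is an assumed local collector."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_gauge_payload("backup-cron", "backup.size_bytes", 1234, unit="By")
# send(payload)  # uncomment once a collector with an OTLP/HTTP receiver is listening
```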
| madeofpalk wrote:
| Shame none of this comes up whenever I search for it!
|
| Top google result, for me, for 'send metrics to otel' is
| https://opentelemetry.io/docs/specs/otel/metrics/. If I go
| through the Language APIs & SDKs it's a whole bunch more
| useless junk https://opentelemetry.io/docs/languages/js/
|
| Compare to the InfluxDB "send data" getting started
| https://docs.influxdata.com/influxdb/cloud/api-guide/client-...
| which gives you exactly it in a few lines.
| phillipcarter wrote:
| Maybe the confusion here is in comparing different
| things.
|
| The InfluxData docs you're linking to are similar to
| Observability vendor docs, which do indeed amount to
| "here's the endpoint, plug it in here, add this API key,
| tada".
|
| But OpenTelemetry isn't an observability vendor. You can
| send to an OpenTelemetry Collector (and the act of
| sending is simple), but you also need to stand that thing
| up and run it yourself. There's a lot of good reasons to
| do that, but if you don't need to run infrastructure
| right now then it's a lot simpler to just send directly
| to a backend.
|
| Would it be more helpful if the docs on OTel spelled this
| out more clearly?
| blue_pants wrote:
| There's an excellent article on how to implement
| OpenTelemetry Tracing in 200 lines of code.
|
| https://jeremymorrell.dev/blog/minimal-js-tracing/
|
| "It might help to go over a non-exhaustive list of things
| the official SDK handles that our little learning library
| doesn't:
|
| - Buffer and batch outgoing telemetry data in a more
| efficient format. Don't send one-span-per-http request in
| production. Your vendor will want to have words.
|
| - Gracefully handle errors, wrap this library around your
| core functionality at your own peril"
|
| You can solve them of course, if you can
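|
| The two quoted gaps (batching, and not letting telemetry errors
| reach the application) can be sketched in a few lines of Python.
| This is not taken from the article or from the official SDK; the
| BatchingExporter class, its on_end/flush methods and the send
| callable are hypothetical names for illustration.

```python
import threading


class BatchingExporter:
    """Buffer finished spans and hand them to `send` in batches."""

    def __init__(self, send, max_batch=3):
        self._send = send  # hypothetical callable that ships a list of spans
        self._max_batch = max_batch
        self._buffer = []
        self._lock = threading.Lock()
        self.dropped = 0

    def on_end(self, span):
        with self._lock:
            self._buffer.append(span)
            if len(self._buffer) < self._max_batch:
                return
            batch, self._buffer = self._buffer, []
        self._ship(batch)  # ship outside the lock so callers are not blocked

    def flush(self):
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._ship(batch)

    def _ship(self, batch):
        try:
            self._send(batch)
        except Exception:
            # Telemetry must never take the application down; count and move on.
            self.dropped += len(batch)


sent = []
exporter = BatchingExporter(sent.append, max_batch=3)
for i in range(7):
    exporter.on_end({"name": f"span-{i}"})
exporter.flush()  # ships the one span left in the buffer
```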
| Groxx wrote:
| > _You can avoid any magic and stick to a subset..._
|
| ... if (and only if) _all the libraries you use_ also stick to
| that subset, yea. That is _overwhelmingly_ not true in my
| experience. And the article shows a nice concrete example of
| why.
|
| For green-field projects which use nothing but otel and no non-
| otel frameworks, yea. I can believe it's nice. But I definitely
| do not live in that world yet.
| mikestorrent wrote:
| Very sane advice. Most folks will already have something for
| metrics and logs and unless there's ROI on changing it out, why
| bother?
| PeterZaitsev wrote:
| For those looking for tracing but less complexity check out eBPF
| based solutions such as Coroot or Odigos
| jensensbutton wrote:
| Isn't Odigos just OTel with simpler setup?
| hinkley wrote:
| The whole time I was learning/porting to Otel I felt like I was
| back in the Java world again. Every time I stepped through the
| code it felt like EnterpriseFizzBuzz. No discoverability. At all.
| And their own jargon that looks like it was made by people high
| on something.
|
| And in NodeJS, about four times the CPU usage of StatsD. We ended
| up doing our own aggregation to tamp this down and to reduce tag
| proliferation (StatsD is fine having multiple processes reporting
| the same tags, OTEL clobbers). At peak load we had 1 CPU running
| at 60-80% utilization. Until something changes we couldn't
| vertically scale. Other factors on that project mean that's now
| unlikely to happen but it grates.
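|
| The comment doesn't say how that aggregation was built, so the
| following is only a guess at the general shape, in Python rather
| than NodeJS: sum increments in-process, keep an allow-list of
| tags, and emit one point per tag set on each flush. The
| CounterAggregator class and the tag names are invented for the
| example.

```python
from collections import defaultdict


class CounterAggregator:
    """Sum increments in-process and emit one point per tag set per flush."""

    def __init__(self, allowed_tags):
        self._allowed = frozenset(allowed_tags)  # tags kept; the rest are dropped
        self._totals = defaultdict(float)

    def add(self, name, value=1.0, **tags):
        kept = tuple(sorted((k, v) for k, v in tags.items() if k in self._allowed))
        self._totals[(name, kept)] += value

    def flush(self):
        points = [
            {"name": name, "tags": dict(tags), "value": total}
            for (name, tags), total in sorted(self._totals.items())
        ]
        self._totals.clear()
        return points


agg = CounterAggregator(allowed_tags={"route"})
for request_id in range(1000):
    # request_id is high-cardinality and is dropped by the allow-list
    agg.add("http.requests", route="/checkout", request_id=request_id)
agg.add("http.requests", route="/health")
points = agg.flush()  # two points instead of 1001 separate reports
```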
|
| OTEL is actively hostile to any language that uses one process
| per core. What a joke.
|
| Just go with Prometheus. It's not like there are other contenders
| out there.
| paulddraper wrote:
| There are a lot of Java programmers working on it.
|
| (And some Go tbf.)
| hinkley wrote:
| Yeah and a blind man can see this, it's so loud.
| mkeedlinger wrote:
| This matches my experience. Very difficult to understand what I
| needed to get the effect I wanted.
| sethops1 wrote:
| This matches my conclusion as well. Just use Prometheus and
| whatever client library for your language of choice, it's 1000x
| simpler than the OTEL story.
| bushbaba wrote:
| Simpler near-term, but more painful long term when you want
| to switch vendors/stacks.
| hinkley wrote:
| And switching log implementations can be a pain in the
| butt. Ask me how I know.
|
| But I'd rather do that three more times before I want to
| see OpenTelemetry again.
|
| Also Prometheus is getting OTEL interop.
| pphysch wrote:
| Is this the same scam as "standard SQL"? Switching database
| products is never straightforward in practice, despite any
| marketing copy or wishful thinking.
|
| Prometheus ecosystem is very interoperable, by the way.
| kemitche wrote:
| Nine times out of ten, I've got more valuable problems to
| solve than a theoretical future change of our vendor/stack
| for telemetry. I'll gladly borrow from my future self's
| time if it means I can focus on something more important
| right now.
| hinkley wrote:
| I did our migration from StatsD to OTEL because our third
| party StatsD service was getting flaky. The first person
| from OPs to get to me pushed OTEL. The rest were fine
| with Prometheus and it was late in the process before
| they realized what had happened. I believe if we had gone
| straight to Prometheus I would have been done in half the
| time and solved half the problems I had to solve anyway
| for OTEL. If someone had to replace it again in the
| future I fully believe it would have taken cumulatively
| as much time to go StatsD->Prometheus->OTEL as it took to
| go StatsD->OTEL, especially when you consider that OTEL
| is not quite baked.
|
| Meanwhile functionality to retain and recruit new
| customers sat in the backlog.
| whalesalad wrote:
| Can you even achieve this with prometheus? Afaik it operates
| by exposing metrics that are scraped at some interval. High
| level stuff, not per-trace stuff.
|
| How would you build the "holy grail" map that shows a trace
| of every sub component in a transaction broken down by
| start/stop time etc... for instance show the load balancer
| see a request, the request get handled by middlewares etc,
| then go onto some kind of handler/controller, the sub-queries
| inside of that like database calls or cache calls. I don't
| think that is possible with prometheus?
| baby_souffle wrote:
| > Can you even achieve this with prometheus? Afaik it
| operates by exposing metrics that are scraped at some
| interval. High level stuff, not per-trace stuff.
|
| Correct. Prometheus is just metrics.
|
| The main argument for oTel is that instead of one
| proprietary vendor SDK or importing prometheus and jaeger
| and whatever you want to use for logging, just import oTel
| and all that will be done with a common / open data format.
|
| I still believe in that dream but it's clear that the whole
| project needs some time/resources to mature a bit more.
|
| If anybody remembers the Terraform/ToFu drama, it's been
| really wild to see how much support everybody pledged for
| ToFu but all the traditional observability providers have
| just kinda tolerated oTel :/
| niftaystory wrote:
| Code traces _are_ metrics. Run times per function calls
| metrics, count of specific function call metrics.
|
| Otel is an attempt to package such arithmetic.
|
| Web apps have added so many layers of syntax sugar and
| semantic wank, we've lost sight that it's all just the same old
| math operations relative to different math objects. Sets
| are not triangles but both are tested, quantified, and
| compared with the same old mathematical ops we learn by
| middle school.
| mikestorrent wrote:
| No, code traces are not just metrics; and while you can
| knit together something approximating traces from
| metrics, you'll quickly run into the reason why traces
| are a distinct thing. First, in a distributed system,
| you'll discover that you can't rely on clocks to get the
| timing of subsecond events correct. Second, you'll be
| contextless about code paths. So, you might independently
| reinvent the idea of passing along a context - and now
| you're just making your own tracing system but without
| any of the benefit of building on years of existing
| discoveries in this field.
|
| OTel does feel a little bit heavy, unless you're already
| used to e.g. New Relic, Dynatrace, etc. where you have to
| run an agent process and instrument your code to some
| extent; it's never going to be free to audit every
| function call! This is why (a) you sample down and don't
| keep every trace, and (b) unless your company is
| extremely flush with cash you probably don't run tracing
| in every environment. If you can get away with it just in
| a staging or perf test env you can reap most of the
| benefit without the production impact and cost.
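|
| "Sample down" usually means deciding per trace, not per span, so
| that every service keeps or drops the same traces. A simplified
| Python sketch of a ratio sampler keyed on the trace id is below.
| It is an illustration of the idea, not the algorithm any
| particular OTel SDK uses, and should_sample is a made-up name.

```python
def should_sample(trace_id_hex, ratio):
    """Keep a trace if the low 64 bits of its id fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_bits = int(trace_id_hex[-16:], 16)
    # Same trace id in, same answer out, on every host that sees the trace.
    return low_bits < int(ratio * (1 << 64))
```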
| niftaystory wrote:
| All those things you describe are computable metrics.
| They have to be or Otel itself would not be able to
| compute them for consumption. All you described are
| cherry picked semantic indirections to obfuscate it's all
| just a computer computing metrics of its own memory
| states.
|
| Sorry for knowing how computers actually work (EE grad
| not a CS grad). I know that can frustrate CS grads who
| think their preferred OS and favorite programming
| language is how a computer works. You're describing how
| contemporary SWEs view their day job.
|
| Edit: teleMETRY ...what's in a name? Oh right ...meaning.
| chupasaurus wrote:
| As a no grad to EE grad: traces mean a bundle of metrics
| that varies in structure hence you can't store and
| process them as effectively as a list of counters unless
| you have a distinct bin for each possible trace,
| combinatorial explosion y'know.
| chrisweekly wrote:
| huh? I've always heard and read and experienced that
| "logs, traces, metrics" are the 3 legs of the
| observability stool.
| niftaystory wrote:
| Open teleMETRY
|
| Any guesses as to etymology?
| ffsm8 wrote:
| By this logic, you can say that logging, metrics and
| tracing are all fundamentally just different kinds of
| data and we should be calling it just plain databases and
| CRUD.
|
| They're related, but people have a very specific idea and
| concept of what each is, you haven't actually provided a
| good argument why we should throw out these distinctions
| just because they somewhat resemble each other if you
| ignore a few details
| hinkley wrote:
| Yeah part of the problem is it's called Open _telemetry_
| and half of you are only talking about tracing, not
| metrics. Telemetry is metrics. It's been metrics since at
| least the Mercury Program.
|
| Metrics in OTEL is about three years old and it's _garbage_
| for something that's been in development for three years.
| paulddraper wrote:
| Prometheus is good, but let's be clear...you don't get
| tracing.
| Xeago wrote:
| I wonder what your experience is with Sentry? Not just for
| error reporting but especially also their support for traces.
|
| Also open-source & self-hostable.
| malkia wrote:
| Quota/pricing.
| malkia wrote:
| Using otel from C++ side... To have cumulative metrics from
| multiple applications (e.g. not "statsd/delta") I create a
| relatively low cardinality process.vpid integer (and somehow
| coordinate this number to be unique as long as the app emitting
| it is still alive) - you can use some global object to
| coordinate it.
|
| Then you can have something that sums, and removes the
| attribute.
|
| With statsd/delta if you lose sending a signal - then all data
| gets skewed, with cumulation - you only lose precision.
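|
| A toy Python illustration of that difference (not C++, and not
| the OTel API): one process counts 10 events in each of three
| intervals and the second report is lost in transit. The function
| names and the vpid values 7 and 8 are invented for the example.

```python
def total_from_deltas(reports):
    """Receiver adds up whatever deltas actually arrived."""
    return sum(reports)


def total_from_cumulative(latest_by_vpid):
    """Receiver keeps the latest cumulative value per process.vpid, then
    sums across processes, which effectively removes the attribute."""
    return sum(latest_by_vpid.values())


# Delta reporting: 10, (lost), 10 -> the lost 10 is gone for good.
deltas_received = [10, 10]

# Cumulative reporting: 10, (lost 20), 30 -> the latest value still carries it.
cumulative_received = {7: 30}

# A second process (vpid 8) reports its own cumulative total of 5.
cumulative_received[8] = 5
```

| With deltas the receiver ends up at 20 instead of 30; with
| cumulative values it only misses the intermediate reading.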
|
| edit... forgot to say - my use case is "push based" metrics as
| these are coming from "batch" tools, not long running processes
| that can be scraped.
| KronisLV wrote:
| > It's not like there are other contenders out there.
|
| Apache Skywalking _might_ be worth a look in some
| circumstances, doesn't eat too many resources, is fairly
| straightforward to set up and run, admittedly somewhat jank
| (not the most polished UI or docs), but works okay:
| https://skywalking.apache.org/
|
| Also I quite liked that a minimal setup is indeed pretty
| minimal: a web UI, a server instance and a DB that you already
| know
| https://skywalking.apache.org/docs/main/latest/en/setup/back...
|
| In some ways, it's a lot like Zabbix in the monitoring space -
| neither will necessarily impress anyone, but both have a nice
| amount of utility.
| to11mtm wrote:
| I'm fairly convinced that OTEL is in a form of 'vendor
| capture', i.e. because the only way to get a _standard_ was to
| compromise with various bigcorps and sloppy startups to glue-
| gun it all together.
|
| I tried doing a _simple_ otel setup in .NET and after a few
| hours of trying to grok the documentation of the vendor my org
| has chosen, hopped into a discord run by a colleague that has
| part of their business model around 'pay for the good otel on
| the OSS product' and immediately stated that whatever it cost,
| it was worth the money.
|
| I'd rather build another reliable event/pubsub library _without
| prior experience_ than try to implement OTEL.
| lexh wrote:
| Gee whiz, is this person in for a treat when they discover the
| joys of OpAMP
| https://github.com/open-telemetry/opamp-spec/blob/main/speci...
|
| Turtles all the way down.
| hinkley wrote:
| Blech.
|
| If you already have reloadable configuration infrastructure, or
| plan to add it in the future, this is just spreading out your
| configuration capture. No thank you (and by "no thank you" I
| mean fuck right off).
|
| If you want to improve your bus number for production triage,
| you have to make it so anyone (senior) can first identify and
| then reproduce the configuration and dependencies of the
| production system locally without interrupting any of the point
| people to do so. If you cannot see you cannot help.
|
| Just because you're one of k people who usually discover the
| problem quickly doesn't mean you'll always do it quickly. You
| have bad days. You have PTO. People release things or flip
| feature toggles that escape your notice. If you stop to
| entertain other people's queries or theories you are guaranteed
| to be in for a long triage window, and a potential SLA
| violation. But if you never accept other perspectives then your
| blind spots can also make for SLA violations.
|
| Let people putter on their own and they can help with the
| Pareto distributions. Encourage them to do so and you can build
| your bus number.
| deepsun wrote:
| Same thing. OpenTelemetry grew up from Traces, but Metrics and
| Logs are much better left to specialized solutions.
|
| Feels like a "leaky abstraction" (or "leaky framework") issue. If
| we wanted to put everything under one umbrella, then well, an SQL
| database can also do all these things at the same time! Doesn't
| mean it should.
| incangold wrote:
| I think giving metrics and logging a location in a trace is
| really useful.
|
| But I still dislike OTel every time I have to deal with it.
| hinkley wrote:
| You can't do fine grained tracing in OTEL because if you hit
| 500 spans in a single trace it starts dropping the trace.
| Basically a toy solution for brownfield work.
| phillipcarter wrote:
| ...huh? I work with customers who (through a mistake) have
| created literally multi-million span traces using OTel. Are
| you referring to a particular backend?
| hinkley wrote:
| AWS
| phillipcarter wrote:
| Well that's a shame, I'm going to ask some folks about
| that. 500 spans per trace is ridiculously small and I
| can't imagine any good reason to have that limitation
| since it's just not that big of a footprint.
|
| OTel doesn't define any limits on the # of spans in a
| trace (nor the # of attributes on a span!) but it will be
| bound by the limits of whatever backend you use. In the
| case of the one I work for, we do limit the total size of
| a span to be 1MB or less with 64KB per attribute before
| truncation. Other backends have different limitations.
| This is the first I've heard of such a small limitation
| on the total number of spans in a trace though. Traces
| are just (basically) collections of structured logs with
| in-built correlation IDs. I can't imagine why you'd limit
| them like this.
| hinkley wrote:
| That was two years ago (we tried spans before metrics),
| so it's fuzzy. I believe the collector sidecar was fine
| with it but the backend was not, which complicated
| debugging. There's not a clear feedback path in
| OpenTelemetry that we could find. I completely forgot to
| mention the tendency toward silent failures. That's a
| cardinal sin for telemetry. I would take it out back and
| shoot it for that fact alone.
|
| The other problem I noticed looking at the wire protocol
| was that the data for the parent trace doesn't seem to
| get sent until the trace closes. That seems like a
| bookkeeping nightmare to me. There should be a start of
| trace packet and an update at the end. I shouldn't have
| finished spans showing up before the parent trace has
| been registered. And that's what it looked like in the
| dumps my OPs people sent me to debug.
| pranay01 wrote:
| As mentioned by philip below, 500 spans is a very small
| amount. I have seen customers send 1000s of spans in a
| trace very easily
| IneffablePigeon wrote:
| This is just not true. We have traces with hundreds of
| thousands of spans. Those are not very readable but that's
| another problem.
| cedws wrote:
| I still don't understand what OTEL is. What problem is it
| solving? If it's a standard what is the change for the end user?
| Is it not just a matter of continuing to use whatever
| (Prometheus, Grafana, etc) with the option to swap components
| out?
| paulddraper wrote:
| The point of OTel is interoperability.
|
| For example the _author_ of the software instruments it with
| OTel -- either language interface or wire protocol -- and the
| _operator_ of the software uses the backend of choice.
|
| Otherwise, you have a combinatorial matrix of supported
| options.
|
| (Naturally, this problem is moot if the author and operator are
| the same.)
| hinkley wrote:
| Interoperability with _what_?
|
| Where are the three existing, successful solutions it is
| trying to abstract over?
|
| It doesn't know what it is because it's violating the Rule of
| Three.
| GauntletWizard wrote:
| Interoperability with the other things your Otel Vendor is
| selling you. No two implementations are even remotely
| compatible, but they can all mostly scrape data from your
| Prometheus endpoints, so it's easy to migrate from useful
| software to their walled garden.
| hinkley wrote:
| > to their walled garden
|
| Am I detecting sarcasm or did I just bring my own?
| GauntletWizard wrote:
| I don't think there's any sarcasm there; Perhaps a
| wistful hope that behind one of those walls somebody's
| actually got a garden instead of just a seedbed of false
| promises.
| hinkley wrote:
| Yarp.
| jiggawatts wrote:
| Application Insights, Data Dog, New Relic, etc...
|
| APM products in general.
| hinkley wrote:
| How to send Prometheus data to New Relic:
| https://docs.newrelic.com/docs/infrastructure/prometheus-int...
|
| How to send StatsD data to Datadog:
| https://docs.datadoghq.com/developers/dogstatsd/?tab=hostage...
|
| Places like datadog and posthog are selling you their
| ability to ingest your existing data. I call bullshit.
| It's a problem looking for a solution. It's an excuse for
| engineers to build moats around a moderately difficult
| problem by making it inscrutable.
| arccy wrote:
| interoperability between vendors, so your business isn't
| stuck with a vendor who can raise prices because their SDKs
| are deeply embedded in your codebase, so open source
| libraries / products have a common point to hook into
| without needing to integrate with each vendor.
| paulddraper wrote:
| > Interoperability with what?
|
| For the backend?
|
| Datadog, New Relic, Grafana, Sentry, Azure Monitor, Splunk,
| Dynatrace, Honeycomb
| hangonhn wrote:
| For the tracing part of Otel, neither Prometheus nor Grafana
| are capable of doing that. Tracing is the most mature part of
| Otel and the most compelling use case for it. For metrics,
| we've stayed with Prometheus and AWS Cloudwatch Metrics. The
| metrics part feels very under developed at the moment.
| hinkley wrote:
| When I last looked 9 months ago, there were libraries on the
| metrics side of the tree still marked as experimental, that
| you couldn't successfully send metrics without using. And a
| huge memory leak in the JS implementation that was only fixed
| 15 months ago:
| https://github.com/open-telemetry/opentelemetry-js/issues/41...
|
| Things, especially crosscutting concerns, you want to use in
| production should have stopped experiencing basic growing
| pains like this long before you touch them. It's not baked
| yet. Come back in a year. Or two.
| barake wrote:
| Everything is either in development or stable. There aren't
| statuses like alpha, beta, release candidate, etc. except
| for individual library releases. Metric clients will be
| marked as "development" until they go "stable" [0].
| Consequently it can be hard to determine the _actual_
| maturity level of any given implementation.
|
| Tracing is very mature, with metric and logging
| implementations stable for a number of popular languages
| [1].
|
| _the "experimental" status was renamed "development"_
|
| [0] https://opentelemetry.io/docs/specs/otel/versioning-and-stab...
|
| [1] https://opentelemetry.io/docs/languages/#status-and-releases
| hinkley wrote:
| > the "experimental" status was renamed "development"
|
| That doesn't really change things now does it. It's still
| a bunch of people sitting around saying "MMMM" loudly
| while eating half-raw cookies.
| dionian wrote:
| i can report the same traces to Jaeger if i want open source or
| i switch out the provider and it can go to aws x-ray (paid).
| without any code or config changes. pretty useful. yes, a tad
| clumsy to set up the first time.
| pat2man wrote:
| A lot of web frameworks etc do most of the instrumentation for
| you these days. For instance using opentelemetry-js and self
| hosting something like https://signoz.io should take less than an
| hour to get spun up and you get a ton of data without writing any
| custom code.
| pranay01 wrote:
| Agree. Here's the repo for SigNoz if you want to check it out -
| https://github.com/signoz/signoz
| hocuspocus wrote:
| Context propagation isn't trivial on a multi-threaded async
| runtime. There are several ways to do it, but JVM agents that
| instrument bytecode are popular because they work
| transparently.
| hinkley wrote:
| While that's true, if you've already solved punching
| correlation-IDs and A/B testing (feature flags per request)
| through then you can use the same solution for all three. In
| fact you really should.
|
| Ours was old so based on domain <dry heaving sounds>, but by
| the time I left the project there were just a few places left
| where anyone touched raw domains directly and you could
| switch to AsyncLocalStorage in a reasonable amount of time.
|
| The simplest thing that could work is to pass the original
| request or response context everywhere but that... has its
| own struggles. It's hell on your function signatures (so I
| sympathize with my predecessors not doing that but goddamn)
| and you really don't want an entire sequence diagram being
| able to fire the response. That's equivalent to having a
| function with 100 return statements in it.
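|
| A minimal sketch of the "same solution for all three" idea, in
| Python rather than Node: contextvars plays roughly the role that
| AsyncLocalStorage plays there. The names correlation_id,
| handle_request and query_database are hypothetical; the point is
| only that the ID is set once at the edge and read deep in the call
| stack without appearing in any function signature.

```python
import asyncio
import contextvars

# Hypothetical name: one context variable carries the per-request correlation ID.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


async def query_database():
    # Deep in the call stack: no request object in the signature,
    # yet the current request's ID is still visible.
    await asyncio.sleep(0)
    return f"[{correlation_id.get()}] db query"


async def handle_request(request_id):
    # Set once at the edge of the service.
    correlation_id.set(request_id)
    return await query_database()


async def main():
    # gather() wraps each coroutine in a task, and each task runs in its
    # own copy of the context, so concurrent requests do not overwrite
    # each other's ID.
    return await asyncio.gather(handle_request("req-1"), handle_request("req-2"))


results = asyncio.run(main())
print(results)
```

| A trace context or a per-request feature-flag set could ride along
| in the same way, as further context variables or as fields of one
| object stored in a single variable.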
| pranay01 wrote:
| I literally gave a lightning talk on this at KubeCon NA last
| year. Here's the YouTube video [1]; it might help you get some
| perspective.
|
| tl;dr
|
| while there are certainly many areas to improve for the project,
| some reasons why it could seem complicated
|
| Extensibility by Design: Flexibility in defining meters and
| signals ensures diverse use cases are supported.
|
| It's still a relatively new technology (~3 years old), growing
| pains are expected. OpenTelemetry is still the most advanced open
| standard handling all three signals together.
|
| [1] https://www.youtube.com/watch?v=xEu8_Aeo_-o
| antithesis-nl wrote:
| This was exactly my reaction to OpenTelemetry.
|
| Creating an HTTP endpoint that publishes metrics in a Prometheus-
| scrape-able format? Easy! Some boolean/float key-value-pairs with
| appropriate annotations (basically: is this a counter or a
| gauge?), and done! And that led (and leads!) to some very usable
| Grafana dashboards-created-by-actual-users and therefore much
| joy.
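|
| For illustration, a rough sketch of such an endpoint using only the
| Python standard library. The metric names and values are made up,
| and render_metrics, MetricsHandler and serve are hypothetical
| helpers; the output follows the Prometheus text exposition format
| (HELP and TYPE annotations, then "name value"), with booleans sent
| as 1.0 or 0.0.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up example metrics: (name, type, help text, value).
METRICS = [
    ("app_up", "gauge", "Whether the app is up.", True),
    ("app_requests_total", "counter", "Requests served so far.", 42),
]


def render_metrics(metrics):
    """Render (name, type, help, value) tuples as Prometheus text format."""
    lines = []
    for name, kind, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        # Prometheus samples are floats, so booleans become 1.0 / 0.0.
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    # Blocks forever; point a Prometheus scrape job at /metrics on this port.
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

| Calling serve() starts the endpoint. A real service would update
| the values as it runs rather than keep them in a constant list.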
|
| Then, I read up on how to do things The Proper Way, and was
| initially very much discouraged, but decided to ignore All that
| Noise due to the existing solutions working so well. No
| complaints so far!
| edenfed wrote:
| Definitely can relate, this is why I started an open-source
| project that focuses on making OpenTelemetry adoption as easy as
| running a single command line:
| https://github.com/odigos-io/odigos
| BugsJustFindMe wrote:
| If you get to the end you find that the pain was all self-
| inflicted. I found it to be very easy in Python with standard
| stacks (mysql, flask, redis, requests, etc), because you
| literally just do a few imports at the top of your service and it
| automatically hooks itself up to track everything without any
| fuss.
| etimberg wrote:
| Until you run your server behind something like gunicorn and
| all of the auto imports stop working and you have to do it all
| yourself.
| BugsJustFindMe wrote:
| It works with uwsgi just fine though.
| jdsleppy wrote:
| I found this to work fine https://opentelemetry-
| python.readthedocs.io/en/latest/exampl...
| verall wrote:
| So recently I needed to set this up for a very simple flask app.
| We're running otel-collector-contrib, jaeger-all-in-one, and
| prometheus on a single server with docker compose (has to be
| all within the corpo intranet for reasons..)
|
| Traces work, and I have the spanmetrics exporter set up, and I
| can actually see the spanmetrics in prometheus if I query
| directly, but they won't show up in the jaeger "monitor" tab,
| no matter what I do.
|
| I spent 3 days on this before my boss is like "why don't we
| just manually instrument and send everything to the SQL server
| and create a grafana dashboard from that" and agh I don't want
| to do that either.
|
| Any advice? It's literally the simplest use case but I can't get
| it to work. Should I just add grafana to the pile?
| BugsJustFindMe wrote:
| Yeah the biggest trouble really is on the dashboarding side
| of things, not the sending side, and is why there are popular
| SaaS products like datadog. If you're amenable to saas,
| datadog is probably the best way. Otherwise, look into SigNoz
| for a one-stop solution with minimal effort even if there are
| some rough edges still.
| verall wrote:
| We absolutely have to run it ourselves (...corporate
| reasons...), it's a lightweight service with only a few
| hundred users so we haven't had to worry much about perf
| (yet).
|
| SigNoz does look interesting, I may give this a shot, thank
| you. I'm a bit concerned about it conflicting with other
| things going on in our docker-compose but it doesn't look
| too bad..
| baby_souffle wrote:
| > I found it to be very easy in Python with standard stacks
| (mysql, flask, redis, requests, etc), because you literally
| just do a few imports at the top of your service and it
| automatically hooks itself up to track everything without any
| fuss.
|
| Yes, but only if everything in your stack is supported by their
| auto instrumentation. Take `aiohttp` for example. The latest
| version is 3.11.X and ... their auto instrumentation claims to
| support `3.X` [0] but results vary depending on how new your
| `aiohttp` is versus the auto instrumentation.
|
| It's _magical_ when it all just works, but that ends up being a
| pretty narrow needle to thread!
|
| [0]: https://github.com/open-telemetry/opentelemetry-python-
| contr...
| BugsJustFindMe wrote:
| > _their auto instrumentation claims to support `3.X`_
|
| Semver should never be treated as anything more than some
| tired programmer's shrug and prayer that nobody else notices
| the breakages they didn't notice themselves. Pin strict
| dependencies instead of loose ones, and upgrade only after
| integration testing.
|
| There are only two kinds of updates, ones that intend to
| break something and ones that don't intend to break
| something, and neither one guarantees that the intent matches
| the outcome.
| etimberg wrote:
| What otel really needs to succeed, at least in the python space,
| is something as easy and straightforward as DataDog's ddtrace
| command.
| rtuin wrote:
| Otel seems complicated because different observability vendors
| make implementing observability super easy with their proprietary
| SDK's, agents and API's. This is what Otel wants to solve and I
| think the people behind it are doing a great job. Also kudos to
| grafana for adopting OpenTelemetry as a first class citizen of
| their ecosystem.
|
| I've been pushing the use of Datadog for years but their pricing
| is out of control for anyone between mid size company and large
| enterprises. So as years passed and OpenTelemetry API's and SDK's
| stabilized it became our standard for application observability.
|
| To be honest the documentation could be better overall and the
| onboarding docs differ per programming language, which is not
| ideal.
|
| My current team is on a NodeJS/Typescript stack and we've created
| a set of packages and an example Grafana stack to get started
| with OpenTelemetry real quick. Maybe it's useful to anyone here:
| https://github.com/zonneplan/open-telemetry-js
| saurik wrote:
| > Otel seems complicated because different observability
| vendors make implementing observability super easy with their
| proprietary SDK's, agents and API's. This is what Otel wants to
| solve and I think the people behind it are doing a great job.
|
| Wait... so, the _problem_ is that everyone makes it super easy,
| and so this product solves that by being complicated?
| to11mtm wrote:
| > I've been pushing the use of Datadog for years but their
| pricing is out of control for anyone between mid size company
| and large enterprises
|
| Not a fan of datadog vs just good metric collection. OTOH while
| I see the value of OTEL vs what I prefer to do... in theory.
|
| My biggest problem with all of the APM vendors: once you have
| kernel hooks via your magical agent, all sorts of fun things
| come up that developers can't explain.
|
| My favorite example: At another shop we eventually adopted
| Dynatrace. _Thankfully_ our app already had enough built-in
| metrics that a lead SRE considered it a 'model' for how to do
| instrumentation... I say that because, as soon as Dynatrace
| agents got installed on the app hosts, we started having
| various 'heisenbugs' requiring node restarts as well as a
| directly measured drop in performance. [0]
|
| Ironically, the metrics saved us from grief, yet nobody had an
| idea how to fix it. ;_;
|
| [0] - Curiously, the 'worst' one was MSSQL failovers on update
| somehow polluting our ADO.NET connection pools in a bad way...
| 6r17 wrote:
| I have implemented OTEL over numerous projects to retrieve
| traces. It's just a total pain and I'd 500% skip it for anything
| else.
| gpi wrote:
| OpenTelemessy
| Groxx wrote:
| Yeah... this is about how well every OTel migration goes, from
| what I've seen.
|
| Docs are an absolute monstrosity that rival Bazel's for utility,
| but are far less complete. Implementations are _extremely_ widely
| varied in support for basics. Getting X to work with OTel often
| requires exactly what they did here: reverse-engineering X to
| figure out where it does something slightly abnormal... which is
| normal, almost _every_ library does something similar, because
| it's so hard to push custom data through these systems in a type-
| safe way, and many decent systems want type safety and will spend
| a lot of effort to get it.
|
| It feels kinda like OAuth 2 tbh. Lots of promise, obvious
| desirable goals, but completely failing at everything involving
| consistent and standardized implementation.
| pnathan wrote:
| I spent altogether too much time trying to get the Rust otel libs
| working in a useful and concise way. After a few hours I junked
| it and went back to a direct use of a jaeger client sending off
| to the otel collector.
|
| there's some gold here, but most of it is over in the
| consultant/vendor space today, I fear.
| jensensbutton wrote:
| It's getting close to k8s in terms of activity so at least there
| are a lot of people working on it.
| junto wrote:
| One of my biggest problems was the local development story. I
| wanted logs, traces and metrics support locally but didn't want
| to spin up a multitude of Docker images just to get that to work.
| I wanted logs to be able to check what my metrics, traces,
| baggage and activity spans look like before I deploy.
|
| Recently, the .NET team launched .NET Aspire and it's awesome.
| Super easy to visualize everything in one place in my local
| development stack and it acts as an orchestrator as code.
|
| Then when we deploy to k8s we just point the OTEL endpoint at the
| DataDog Agent and everything just works.
|
| We just avoid the DataDog custom trace libraries and SDK and
| stick with OTEL.
|
| Now it's a really nice development experience.
|
| https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals...
|
| https://docs.datadoghq.com/opentelemetry/#overview
| ejs wrote:
| Glad I'm not the only one that feels this way. For a small
| application when you just want some metrics and observability,
| it's a big burden to get it all working.
|
| On my own projects, I send the metrics I care about out through
| the logs and have another project I run collect and aggregate
| them from the logs. Probably "wrong" but it works and it's easy
| to set up.
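|
| A minimal sketch of what a metrics-through-logs setup could look
| like, assuming a made-up "METRIC name=value" line convention. The
| helper names format_metric and aggregate are hypothetical, and the
| commenter's actual collector is not shown in the thread.

```python
import re
from collections import defaultdict

# Assumed convention: a log line contains "METRIC <name>=<number>" somewhere.
METRIC_LINE = re.compile(r"METRIC (?P<name>[\w.]+)=(?P<value>-?\d+(?:\.\d+)?)")


def format_metric(name, value):
    """Build the fragment the application writes into its normal log output."""
    return f"METRIC {name}={value}"


def aggregate(log_lines):
    """Scan log lines and return a count and sum per metric name."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in log_lines:
        match = METRIC_LINE.search(line)
        if match is None:
            continue  # ordinary log line, not a metric
        totals[match["name"]] += float(match["value"])
        counts[match["name"]] += 1
    return {name: {"count": counts[name], "sum": totals[name]} for name in totals}


log_lines = [
    "2025-01-10 12:00:01 INFO " + format_metric("checkout.latency_ms", 120),
    "2025-01-10 12:00:02 INFO user logged in",
    "2025-01-10 12:00:03 INFO " + format_metric("checkout.latency_ms", 80.5),
]
summary = aggregate(log_lines)
print(summary)
```

| The trade-off is that the log format becomes an informal contract
| between the application and the aggregator.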
| dboreham wrote:
| Author is trying to do something difficult with a non-batteries-
| included open source (free to them) product. Seems quite
| uncomplicated given the circumstances. The whole point of OTel is
| to not get bent over backwards by one of the SaaS
| "logging/tracing/telemetry" companies, and as such it's going to
| incur some cost/pain of its own, but typically the bargain is
| worth taking.
| shireboy wrote:
| I'm literally porting some code to Otel now and here is what I
| landed on, even before this article: It is confusing because it's
| a topic that uses vague terminology that means different things
| in different domains. For example, I'm looking at one OTel ui and
| "Traces" are the individual http requests to a service. In
| another UI, against the same data, "Traces" are the log messages
| from code in the service, and "Requests" are the individual http
| requests. To wire up in code, there's yet other terminology.
|
| I haven't decided exactly what to blame for this. In some ways,
| it's necessary to have vague, inconsistent terminology to cover
| various use cases. And, to be fair, some of the UIs predate OTel.
___________________________________________________________________
(page generated 2025-01-10 23:00 UTC)