hngopher.com

       [HN Gopher] I got OpenTelemetry to work. But why was it so compl...
       ___________________________________________________________________
        
       I got OpenTelemetry to work. But why was it so complicated?
        
       Author : paltaie
       Score  : 150 points
       Date   : 2025-01-10 12:38 UTC (10 hours ago)
        
 (HTM) web link (iconsolutions.com)
 (TXT) w3m dump (iconsolutions.com)
        
       | cglan wrote:
       | I agree. I tried to get it to work recently with Datadog, but
       | there were so many hiccups. I ended up having to use Datadog's
       | solution mostly. The documentation across everything is also kind
       | of confusing
        
         | SomaticPirate wrote:
         | imo Datadog is pretty hostile to OTel too. Ever since
         | https://github.com/open-telemetry/opentelemetry-collector-co...
         | was nearly killed by them I never felt like they fully
         | supported the standard (perhaps for good reasons)
         | 
         | OTel is a bear though. I think the biggest advantage it gives
         | you is the ability to move across tracing providers
        
           | bebop wrote:
           | I worry that vision is not going to become reality if the
           | large observability vendors don't want to support the
           | standard.
        
             | phillipcarter wrote:
             | FWIW the "datadog doesn't like otel" thing is kind of old
             | hat, and the story was a little more complicated at the
             | time too.
             | 
             | Nowadays they're contributing more to the project directly
             | and have built some support to embed the collector into
             | their DD agent. Other vendors (splunk, dynatrace, new
             | relic, grafana, honeycomb, sumo logic, etc.) contribute to
             | the project a bunch and typically recommend using OTel to
             | start instead of some custom stuff from before.
        
               | arccy wrote:
               | They support ingesting via otel (ie competing with other
               | vendors for their customers) but won't support ingesting
               | via their SDKs (they still try very hard to lock you in
               | to their tooling).
        
           | rikthevik wrote:
           | > the ability to move across tracing providers
           | 
           | It's a nice dream. At Google Cloud Next last year, the
           | vendors kind of came in two buckets. Datadog, and everyone
           | trying to replace Datadog's outrageous bills.
        
           | hangonhn wrote:
           | Yeah their agent will accept traces from the standard Otel
           | SDK but there is no way to change their SDK to send the
           | traces to anyone other than Datadog when I last checked a
           | couple(?) of years ago.
           | 
           | I mean I understand why they did that but it really removes
           | one of the most compelling parts about Otel. We ended up doing
           | the hard work of using the standard Otel libraries. I had to
           | contribute a PR or two to get it all to work with our
           | services but am glad that's the route we went because now we
           | can switch vendors if needed (which is likely in the not too
           | distant future in our case).
        
           | jensensbutton wrote:
           | Pretty sure Datadog is literally one of the top contributors
           | to OTel.
        
         | ljm wrote:
         | The biggest barrier to setting up oTel for me is the
         | development experience. Having a single open specification is
         | fantastic, especially for portability, but the SDKs are almost
         | overwhelmingly abstract and therefore difficult to intuit.
         | 
         | I used to really like Datadog for being a one-stop
         | observability shop and even though the experience of
         | integrating with it is still quite simple, I think product and
         | pricing wise they've jumped the shark.
         | 
         | I'm much happier these days using a collection of small time
         | services and self-hosting other things, and the only part of
         | that which isn't joyful is the boilerplate and not really
         | understanding when and why you should, say, use gRPC over HTTP,
         | and stuff like that.
        
         | pranay01 wrote:
         | part of the reason for that experience is also because DataDog
         | is not open telemetry native and all their docs and
         | instructions encourage use of their own agents. Using DataDog
         | with Otel is like trying to hold your nose round over your head
         | 
         | You should try Otel native observability platforms like SigNoz,
         | Honeycomb, etc. your life will be much simpler
         | 
         | Disclaimer : i am one of the maintainers at SigNoz
        
       | nimish wrote:
       | It's complicated because it's designed for the companies selling
       | Otel compatible software, not the engineers implementing it
        
         | convolvatron wrote:
         | this is going to come off as being fussy, but 'implement' used
         | to refer to the former activity, not the latter. which is fine,
         | meanings change, it's just amusing that we no longer have a word
         | we can use for 'sitting down and writing software to match a
         | specification' and only 'taking existing software and deploying
         | it on servers'
        
           | hinkley wrote:
           | Operational versus builder jargon.
        
           | stronglikedan wrote:
           | It's still implementing. Someone has taken the specifications
           | and implemented the software, and then someone else has taken
           | the software and implemented a solution with it.
        
           | skrebbel wrote:
           | This has been the case for ages. Sysadmins use "implement" to
           | mean "install software on servers and keep it running",
           | coders use "implement" to mean "code stuff that matches a
           | spec/interface". It's just two worlds accidentally using the
           | same term for a different thing. No meanings are changing.
           | Two MS certified sysadmins in 1999 could talk about how they
           | were "Implementing Exchange across the whole company".
        
         | paulddraper wrote:
         | That hasn't been what I've seen from the contributors.
         | 
         | If anything I think the backends were kinda slow to adopt.
        
       | hocuspocus wrote:
       | Have you considered Kamon instead? From personal experience it's
       | really the best tracing solution for Akka and other libraries
       | using Scala Futures. I haven't tried it, but it does have
       | built-in Spring support as well.
       | 
       | https://kamon.io
       | 
       | Edit: I wonder why suggesting JVM instrumentation that is much
       | more polished than the OTel and Lightbend agents gets me
       | downvoted?
        
       | dimitar wrote:
       | It is as complicated as you want or need it to be. You can avoid
       | any magic and stick to a subset that is easy to reason about and
       | brings the most value in your context.
       | 
       | For our team, it is very simple:
       | 
       | * we use a library to send traces and traces only[0]. They bring
       | the most value for observing applications and can contain all the
       | data the other types can contain. Basically hash-maps vs strings
       | and floats.
       | 
       | * we use manual instrumentation as opposed to automatic - we are
       | deliberate in what we observe and have a great understanding of
       | what emits the spans. We have naming conventions that match our
       | code organization.
       | 
       | * we use two different backends - an affordable 3rd party service
       | and an all-in-one Jaeger install (just run 1 executable or docker
       | container) that doesn't save the spans on disk for local
       | development. The second is mostly for peace of mind of team
       | members that they are not going to flood the third party service.
       | 
       | [0] We have a previous setup to monitor infrastructure and in our
       | case we don't see a lot of value of ingesting all the
       | infrastructure logs and metrics. I think it is early days for
       | OTEL metrics and logs, but the vendors don't tell you this.
        
         | buzzdenver wrote:
         | Mind sharing what that affordable 3rd party service is?
        
         | hinkley wrote:
         | I did not find that manual instrumentation made things simpler.
         | You're trading a learning curve that now starts way before you
         | can demonstrate results for a clearer understanding of the
         | performance penalties of using this Rube Goldberg machine.
         | 
         | Otel may be okay for a green field project but turning this
         | thing on in a production service that already had telemetry
         | felt like replacing a tire on a moving vehicle.
        
           | dmoy wrote:
           | I've not used otel for anything not greenfield, but I just
           | wanted to say
           | 
           | > felt like replacing a tire on a moving vehicle.
           | 
           | Some people do this as a joke / dare. I mean literally
           | replacing a car tire on a moving vehicle.
           | 
           | You Saudi drift up onto one side, and have people climb out
           | of the side in the air, and then swap the tire while the car
           | is driving on two wheels.
           | 
           | It's pretty insane stuff:
           | https://youtu.be/Str7m8xV7W8?si=KkjBh6OvFoD0HGoh
        
             | hinkley wrote:
             | That was the image I had in my head.
             | 
             | My whole career I've been watching people on greenfield
             | projects looking down on devs on already successful
             | products for not using some tool they've discovered,
             | missing the fact that their tool only functions if you
             | build your whole product around the exact mental model of
             | the tool (green field).
             | 
             | Wisdom is learning to watch for people obviously working on
             | brownfield projects espousing a tool. Like moving from VMs
             | to Docker. Ansible to Kubernetes (maybe not the best
             | example). They can have a faster adoption cycle and more
             | staying power.
        
         | madeofpalk wrote:
         | It's as complicated as you want, but it's not as easy as I
         | want. The floor is pretty high.
         | 
         | I'm still looking for an endpoint just to _send_ simple one-off
         | metrics to from parts of infrastructure that's not scrapable.
        
           | pat2man wrote:
           | You can just send metrics via JSON to any otlphttp collector:
           | https://github.com/open-telemetry/opentelemetry-proto/blob/v...
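
A minimal sketch of what pat2man describes, assuming a collector with the OTLP/HTTP receiver listening on its conventional default, `localhost:4318`, and accepting JSON at `/v1/metrics`. The metric name, service name, scope name and attributes are made up for illustration; only the Python standard library is used, so nothing here depends on an OTel SDK.

```python
import json
import time
import urllib.request


def build_gauge_payload(name, value, service_name, attributes=None, now_ns=None):
    """Build an OTLP/JSON metrics export request holding one gauge data point."""
    now_ns = time.time_ns() if now_ns is None else now_ns

    def kv(key, val):
        # OTLP attributes are a list of {key, value: {<type>Value: ...}} pairs.
        return {"key": key, "value": {"stringValue": str(val)}}

    return {
        "resourceMetrics": [{
            "resource": {"attributes": [kv("service.name", service_name)]},
            "scopeMetrics": [{
                "scope": {"name": "oneoff-script"},  # hypothetical scope name
                "metrics": [{
                    "name": name,
                    "unit": "1",
                    "gauge": {"dataPoints": [{
                        "asDouble": float(value),
                        # 64-bit integers travel as decimal strings in OTLP/JSON.
                        "timeUnixNano": str(now_ns),
                        "attributes": [kv(k, v) for k, v in (attributes or {}).items()],
                    }]},
                }],
            }],
        }]
    }


def send(payload, endpoint="http://localhost:4318/v1/metrics", timeout=5):
    """POST the payload to a collector's OTLP/HTTP receiver (assumed address)."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Something like `send(build_gauge_payload("backup.duration_seconds", 42.5, "nightly-backup", {"host": "nas-01"}))` would then push a single point from a cron job or batch script. A vendor endpoint would typically also need an auth header, which is not shown here.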
        
             | madeofpalk wrote:
             | Shame none of this comes up whenever I search for it!
             | 
             | Top google result, for me, for 'send metrics to otel' is
             | https://opentelemetry.io/docs/specs/otel/metrics/. If I go
             | through to the Language APIs & SDK, there's a whole bunch
             | more useless junk https://opentelemetry.io/docs/languages/js/
             | 
             | Compare to the InfluxDB "send data" getting started
             | https://docs.influxdata.com/influxdb/cloud/api-guide/client-...
             | which gives you exactly it in a few lines.
        
               | phillipcarter wrote:
               | Maybe the confusion here is in comparing different
               | things.
               | 
               | The InfluxData docs you're linking to are similar to
               | Observability vendor docs, which do indeed amount to
               | "here's the endpoint, plug it in here, add this API key,
               | tada".
               | 
               | But OpenTelemetry isn't an observability vendor. You can
               | send to an OpenTelemetry Collector (and the act of
               | sending is simple), but you also need to stand that thing
               | up and run it yourself. There's a lot of good reasons to
               | do that, but if you don't need to run infrastructure
               | right now then it's a lot simpler to just send directly
               | to a backend.
               | 
               | Would it be more helpful if the docs on OTel spelled this
               | out more clearly?
        
               | blue_pants wrote:
               | There's an excellent article on how to implement
               | OpenTelemetry Tracing in 200 lines of code.
               | 
               | https://jeremymorrell.dev/blog/minimal-js-tracing/
               | 
               | "It might help to go over a non-exhaustive list of things
               | the official SDK handles that our little learning library
               | doesn't:
               | 
               | - Buffer and batch outgoing telemetry data in a more
               | efficient format. Don't send one-span-per-http request in
               | production. Your vendor will want to have words.
               | 
               | - Gracefully handle errors, wrap this library around your
               | core functionality at your own peril"
               | 
               | You can solve them of course, if you can
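
The two gaps quoted above (batching and error handling) can be sketched roughly as follows. This is not the official SDK's batch processor; `export_fn` is a hypothetical callable that would ship a list of span dicts somewhere, and the size limits are illustrative. A real implementation would also flush on a timer and on shutdown.

```python
import threading


class BatchingExporter:
    """Buffer finished spans and hand them to `export_fn` in batches.

    `export_fn` is a hypothetical callable taking a list of span dicts; in a
    real setup it would POST them to a collector or vendor endpoint.
    """

    def __init__(self, export_fn, max_batch_size=512, max_queue_size=2048):
        self._export_fn = export_fn
        self._max_batch_size = max_batch_size
        self._max_queue_size = max_queue_size
        self._queue = []
        self._lock = threading.Lock()
        self.dropped = 0

    def on_span_end(self, span):
        """Queue a finished span and flush once a full batch is waiting."""
        with self._lock:
            if len(self._queue) >= self._max_queue_size:
                # Shed load rather than let the buffer grow without bound.
                self.dropped += 1
                return
            self._queue.append(span)
            ready = len(self._queue) >= self._max_batch_size
        if ready:
            self.flush()

    def flush(self):
        """Export everything queued so far, one batch at a time."""
        while True:
            with self._lock:
                batch = self._queue[: self._max_batch_size]
                del self._queue[: self._max_batch_size]
            if not batch:
                return
            try:
                self._export_fn(batch)
            except Exception:
                # A telemetry failure must not take down the host application.
                self.dropped += len(batch)
```

Counting drops instead of raising keeps the instrumented code path safe, but the counter should itself be surfaced somewhere, otherwise this becomes another silent failure.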
        
         | Groxx wrote:
         | > _You can avoid any magic and stick to a subset..._
         | 
         | ... if (and only if) _all the libraries you use_ also stick to
         | that subset, yea. That is _overwhelmingly_ not true in my
         | experience. And the article shows a nice concrete example of
         | why.
         | 
         | For green-field projects which use nothing but otel and no
         | non-otel frameworks, yea. I can believe it's nice. But I
         | definitely do not live in that world yet.
        
         | mikestorrent wrote:
         | Very sane advice. Most folks will already have something for
         | metrics and logs and unless there's ROI on changing it out, why
         | bother?
        
       | PeterZaitsev wrote:
       | For those looking for tracing but less complexity check out eBPF
       | based solutions such as Coroot or Odigos
        
         | jensensbutton wrote:
         | Isn't Odigos just OTel with simpler setup?
        
       | hinkley wrote:
       | The whole time I was learning/porting to Otel I felt like I was
       | back in the Java world again. Every time I stepped through the
       | code it felt like EnterpriseFizzBuzz. No discoverability. At all.
       | And their own jargon that looks like it was made by people high
       | on something.
       | 
       | And in NodeJS, about four times the CPU usage of StatsD. We ended
       | up doing our own aggregation to tamp this down and to reduce tag
       | proliferation (StatsD is fine having multiple processes reporting
       | the same tags, OTEL clobbers). At peak load we had 1 CPU running
       | at 60-80% utilization. Until something changes we couldn't
       | vertically scale. Other factors on that project mean that's now
       | unlikely to happen but it grates.
       | 
       | OTEL is actively hostile to any language that uses one process
       | per core. What a joke.
       | 
       | Just go with Prometheus. It's not like there are other contenders
       | out there.
        
         | paulddraper wrote:
         | There are a lot of Java programmers working on it.
         | 
         | (And some Go tbf.)
        
           | hinkley wrote:
           | Yeah and a blind man can see this, it's so loud.
        
         | mkeedlinger wrote:
         | This matches my experience. Very difficult to understand what I
         | needed to get the effect I wanted.
        
         | sethops1 wrote:
         | This matches my conclusion as well. Just use Prometheus and
         | whatever client library for your language of choice, it's 1000x
         | simpler than the OTEL story.
        
           | bushbaba wrote:
           | Simpler near-term, but more painful long term when you want
           | to switch vendors/stacks.
        
             | hinkley wrote:
             | And switching log implementations can be a pain in the
             | butt. Ask me how I know.
             | 
             | But I'd rather do that three more times before I want to
             | see OpenTelemetry again.
             | 
             | Also Prometheus is getting OTEL interop.
        
             | pphysch wrote:
             | Is this the same scam as "standard SQL"? Switching database
             | products is never straightforward in practice, despite any
             | marketing copy or wishful thinking.
             | 
             | Prometheus ecosystem is very interoperable, by the way.
        
             | kemitche wrote:
             | Nine times out of ten, I've got more valuable problems to
             | solve than a theoretical future change of our vendor/stack
             | for telemetry. I'll gladly borrow from my future self's
             | time if it means I can focus on something more important
             | right now.
        
               | hinkley wrote:
               | I did our migration from StatsD to OTEL because our third
               | party StatsD service was getting flaky. The first person
               | from Ops to get to me pushed OTEL. The rest were fine
               | with Prometheus and it was late in the process before
               | they realized what had happened. I believe if we had gone
               | straight to Prometheus I would have been done in half the
               | time and solved half the problems I had to solve anyway
               | for OTEL. If someone had to replace it again in the
               | future I fully believe it would have taken cumulatively
               | as much time to go StatsD->Prometheus->OTEL as it took to
               | go StatsD->OTEL, especially when you consider that OTEL
               | is not quite baked.
               | 
               | Meanwhile functionality to retain and recruit new
               | customers sat in the backlog.
        
           | whalesalad wrote:
           | Can you even achieve this with prometheus? Afaik it operates
           | by exposing metrics that are scraped at some interval. High
           | level stuff, not per-trace stuff.
           | 
           | How would you build the "holy grail" map that shows a trace
           | of every sub component in a transaction broken down by
           | start/stop time etc... for instance show the load balancer
           | seeing a request, the request getting handled by middlewares
           | etc, then going on to some kind of handler/controller, and
           | the sub-queries inside of that like database calls or cache
           | calls. I don't think that is possible with prometheus?
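
For a sense of what that map needs as input, here is a rough sketch: each span carries its own id, its parent's id, and start/stop times, and the waterfall is just a tree walk over those records. The field names and sample spans are illustrative, not the OTLP schema.

```python
from collections import defaultdict


def render_waterfall(spans):
    """Return indented lines showing each span under its parent.

    Each span is a dict with span_id, parent_id (None for the root), name,
    start_ms and end_ms. The field names are illustrative, not the OTLP schema.
    """
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Siblings are listed in start order, children indented under parents.
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            duration = span["end_ms"] - span["start_ms"]
            lines.append(
                f"{'  ' * depth}{span['name']} "
                f"[{span['start_ms']}-{span['end_ms']} ms, {duration} ms]"
            )
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


# Made-up spans for one request, deliberately out of order.
SPANS = [
    {"span_id": "a", "parent_id": None, "name": "load-balancer", "start_ms": 0, "end_ms": 120},
    {"span_id": "b", "parent_id": "a", "name": "auth-middleware", "start_ms": 5, "end_ms": 15},
    {"span_id": "c", "parent_id": "a", "name": "GET /orders handler", "start_ms": 16, "end_ms": 118},
    {"span_id": "e", "parent_id": "c", "name": "db: SELECT orders", "start_ms": 40, "end_ms": 110},
    {"span_id": "d", "parent_id": "c", "name": "cache: get user", "start_ms": 18, "end_ms": 22},
]
```

`print("\n".join(render_waterfall(SPANS)))` gives the nested view. The parent/child links are what a scraped counter or histogram does not carry, which is why this view needs per-request span data rather than aggregated metrics.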
        
             | baby_souffle wrote:
             | > Can you even achieve this with prometheus? Afaik it
             | operates by exposing metrics that are scraped at some
             | interval. High level stuff, not per-trace stuff.
             | 
             | Correct. Prometheus is just metrics.
             | 
             | The main argument for oTel is that instead of one
             | proprietary vendor SDK or importing prometheus and jaeger
             | and whatever you want to use for logging, just import oTel
             | and all that will be done with a common / open data format.
             | 
             | I still believe in that dream but it's clear that the whole
             | project needs some time/resources to mature a bit more.
             | 
             | If anybody remembers the Terraform/ToFu drama, it's been
             | really wild to see how much support everybody pledged for
             | ToFu but all the traditional observability providers have
             | just kinda tolerated oTel :/
        
             | niftaystory wrote:
             | Code traces _are_ metrics. Run times per function calls
             | metrics, count of specific function call metrics.
             | 
             | Otel is an attempt to package such arithmetic.
             | 
             | Web apps have added so many layers of syntax sugar and
             | semantic wank, we've lost sight that it's all just the same
             | old
             | math operations relative to different math objects. Sets
             | are not triangles but both are tested, quantified, and
             | compared with the same old mathematical ops we learn by
             | middle school.
        
               | mikestorrent wrote:
               | No, code traces are not just metrics; and while you can
               | knit together something approximating traces from
               | metrics, you'll quickly run into the reason why traces
               | are a distinct thing. First, in a distributed system,
               | you'll discover that you can't rely on clocks to get the
               | timing of subsecond events correct. Second, you'll be
               | contextless about code paths. So, you might independently
               | reinvent the idea of passing along a context - and now
               | you're just making your own tracing system but without
               | any of the benefit of building on years of existing
               | discoveries in this field.
               | 
               | OTel does feel a little bit heavy, unless you're already
               | used to e.g. New Relic, Dynatrace, etc. where you have to
               | run an agent process and instrument your code to some
               | extent; it's never going to be free to audit every
               | function call! This is why (a) you sample down and don't
               | keep every trace, and (b) unless your company is
               | extremely flush with cash you probably don't run tracing
               | in every environment. If you can get away with it just in
               | a staging or perf test env you can reap most of the
               | benefit without the production impact and cost.
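
Point (a), sampling down, is often done by deciding from the trace id itself, so every service that sees the same trace reaches the same keep/drop decision. A minimal sketch of that idea follows; it is not claimed to match any particular SDK's sampler bit for bit, and `should_sample` is a made-up name.

```python
def should_sample(trace_id_hex, ratio):
    """Keep roughly `ratio` of traces, deciding from the trace id alone.

    Because the decision is a pure function of the trace id, every service
    that handles the same trace makes the same keep/drop choice.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace id as a uniform random number.
    low_64 = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_64 < int(ratio * 2**64)
```

This assumes trace ids are generated randomly. With a ratio of 0.1, about one trace in ten is kept in full, rather than one span in ten scattered across broken traces.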
        
               | niftaystory wrote:
               | All those things you describe are computable metrics.
               | They have to be or Otel itself would not be able to
               | compute them for consumption. All you described are
               | cherry picked semantic indirections to obfuscate it's all
               | just a computer computing metrics of its own memory
               | states.
               | 
               | Sorry for knowing how computers actually work (EE grad
               | not a CS grad). I know that can frustrate CS grads who
               | think their preferred OS and favorite programming
               | language is how a computer works. You're describing how
               | contemporary SWEs view their day job.
               | 
               | Edit: teleMETRY ...what's in a name? Oh right ...meaning.
        
               | chupasaurus wrote:
               | As a no grad to EE grad: traces mean a bundle of metrics
               | that varies in structure hence you can't store and
               | process them as effectively as a list of counters unless
               | you have a distinct bin for each possible trace,
               | combinatorial explosion y'know.
        
               | chrisweekly wrote:
               | huh? I've always heard and read and experienced that
               | "logs, traces, metrics" are the 3 legs of the
               | observability stool.
        
               | niftaystory wrote:
               | Open teleMETRY
               | 
               | Any guesses as to etymology?
        
               | ffsm8 wrote:
               | By this logic, you can say that logging, metrics and
               | tracing are all fundamentally just different kinds of
               | data and we should be calling it just plain databases and
               | CRUD.
               | 
               | They're related, but people have a very specific idea and
               | concept of what each is, you haven't actually provided a
               | good argument why we should throw out these distinctions
               | just because they somewhat resemble each other if you
               | ignore a few details
        
             | hinkley wrote:
             | Yeah part of the problem is it's called Open _telemetry_
             | and half of you are only talking about tracing, not
             | metrics. Telemetry is metrics. It's been metrics since at
             | least the Mercury Program.
             | 
             | Metrics in OTEL is about three years old and it's _garbage_
             | for something that's been in development for three years.
        
           | paulddraper wrote:
           | Prometheus is good, but let's be clear...you don't get
           | tracing.
        
         | Xeago wrote:
         | I wonder what your experience is with Sentry? Not just for
         | error reporting but especially also their support for traces.
         | 
         | Also open-source & self-hostable.
        
           | malkia wrote:
           | Quota/pricing.
        
         | malkia wrote:
         | Using otel from C++ side... To have cumulative metrics from
         | multiple applications (e.g. not "statsd/delta") I create a
         | relatively low cardinality process.vpid integer (and somehow
         | coordinate this number to be unique as long as the app emitting
         | it is still alive) - you can use some global object to
         | coordinate it.
         | 
         | Then you can have something that sums, and removes the
         | attribute.
         | 
         | With statsd/delta if you lose sending a signal - then all data
         | gets skewed, with cumulation - you only lose precision.
         | 
         | edit... forgot to say - my use case is "push based" metrics as
         | these are coming from "batch" tools, not long running processes
         | that can be scraped.
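
The delta-versus-cumulative point can be sketched with plain numbers. `process.vpid` is the commenter's own attribute; the function names and sample values are made up, and the "sum, then remove the attribute" step is shown as a simple dictionary fold rather than any specific collector processor.

```python
def total_from_deltas(received_deltas):
    """Delta temporality: the backend adds up whatever increments arrive."""
    return sum(received_deltas)


def total_from_cumulative(latest_by_series):
    """Cumulative temporality: sum the latest running total of each series.

    `latest_by_series` maps an attribute tuple such as
    (("job", "resize"), ("process.vpid", 3)) to the last total seen for it.
    Grouping by the remaining attributes effectively drops process.vpid.
    """
    totals = {}
    for attrs, value in latest_by_series.items():
        key = tuple(kv for kv in attrs if kv[0] != "process.vpid")
        totals[key] = totals.get(key, 0) + value
    return totals


# One process counts 10 items per report; the third report is lost in transit.
deltas_received = [10, 10, 10, 10]      # five were sent, one never arrived
cumulative_received = [10, 20, 40, 50]  # the "30" report never arrived
```

With deltas the total stays at 40 forever. With cumulative values the last report still says 50, so only the shape of the curve around the lost point is affected.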
        
         | KronisLV wrote:
         | > It's not like there are other contenders out there.
         | 
         | Apache Skywalking _might_ be worth a look in some
         | circumstances, doesn't eat too many resources, is fairly
         | straightforward to set up and run, admittedly somewhat jank
         | (not the most polished UI or docs), but works okay:
         | https://skywalking.apache.org/
         | 
         | Also I quite liked that a minimal setup is indeed pretty
         | minimal: a web UI, a server instance and a DB that you already
         | know
         | https://skywalking.apache.org/docs/main/latest/en/setup/back...
         | 
         | In some ways, it's a lot like Zabbix in the monitoring space -
         | neither will necessarily impress anyone, but both have a nice
         | amount of utility.
        
         | to11mtm wrote:
         | I'm fairly convinced that OTEL is in a form of 'vendor
         | capture', i.e. because the only way to get a _standard_ was to
         | compromise with various bigcorps and sloppy startups to
         | glue-gun it all together.
         | 
         | I tried doing a _simple_ otel setup in .NET and after a few
         | hours of trying to grok the documentation of the vendor my org
         | has chosen, hopped into a discord run by a colleague that has
         | part of their business model around 'pay for the good otel on
         | the OSS product' and immediately stated that whatever it cost,
         | it was worth the money.
         | 
         | I'd rather build another reliable event/pubsub library _without
         | prior experience_ than try to implement OTEL.
        
       | lexh wrote:
       | Gee whiz is this person in for a treat when they discover the
       | joys of OpAMP
       | https://github.com/open-telemetry/opamp-spec/blob/main/speci...
       | 
       | Turtles all the way down.
        
         | hinkley wrote:
         | Blech.
         | 
         | If you already have reloadable configuration infrastructure, or
         | plan to add it in the future, this is just spreading out your
         | configuration capture. No thank you (and by "no thank you" I
         | mean fuck right off).
         | 
         | If you want to improve your bus number for production triage,
         | you have to make it so anyone (senior) can first identify and
         | then reproduce the configuration and dependencies of the
         | production system locally without interrupting any of the point
         | people to do so. If you cannot see you cannot help.
         | 
         | Just because you're one of k people who usually discover the
         | problem quickly doesn't mean you'll always do it quickly. You
         | have bad days. You have PTO. People release things or flip
         | feature toggles that escape your notice. If you stop to
         | entertain other people's queries or theories you are guaranteed
         | to be in for a long triage window, and a potential SLA
         | violation. But if you never accept other perspectives then your
         | blind spots can also make for SLA violations.
         | 
         | Let people putter on their own and they can help with the
         | Pareto distributions. Encourage them to do so and you can build
         | your bus number.
        
       | deepsun wrote:
       | Same thing. OpenTelemetry grew up from Traces, but Metrics and
       | Logs are much better left to specialized solutions.
       | 
       | Feels like a "leaky abstraction" (or "leaky framework") issue. If
       | we wanted to put everything under one umbrella, then well, an SQL
       | database can also do all these things at the same time! Doesn't
       | mean it should.
        
         | incangold wrote:
         | I think giving metrics and logging a location in a trace is
         | really useful.
         | 
         | But I still dislike OTel every time I have to deal with it.
        
           | hinkley wrote:
           | You can't do fine grained tracing in OTEL because if you hit
           | 500 spans in a single trace it starts dropping the trace.
           | Basically a toy solution for brownfield work.
        
             | phillipcarter wrote:
             | ...huh? I work with customers who (through a mistake) have
             | created literally multi-million span traces using OTel. Are
             | you referring to a particular backend?
        
               | hinkley wrote:
               | AWS
        
               | phillipcarter wrote:
               | Well that's a shame, I'm going to ask some folks about
               | that. 500 spans per trace is ridiculously small and I
               | can't imagine any good reason to have that limitation
               | since it's just not that big of a footprint.
               | 
               | OTel doesn't define any limits on the # of spans in a
               | trace (nor the # of attributes on a span!) but it will be
               | bound by the limits of whatever backend you use. In the
               | case of the one I work for, we do limit the total size of
               | a span to be 1MB or less with 64KB per attribute before
               | truncation. Other backends have different limitations.
               | This is the first I've heard of such a small limitation
               | on the total number of spans in a trace though. Traces
               | are just (basically) collections of structured logs with
               | in-built correlation IDs. I can't imagine why you'd limit
               | them like this.
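
To make the "structured logs with in-built correlation IDs" description concrete, here is a minimal sketch in plain Python. The field names (`trace_id`, `span_id`, `parent_id`) and the sample records are illustrative assumptions, not the OTel wire format.

```python
from collections import defaultdict

# Each span is just a structured record; trace_id acts as the correlation ID.
# All values here are made up for illustration.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "GET /checkout"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "SELECT cart"},
    {"trace_id": "t2", "span_id": "c", "parent_id": None, "name": "GET /health"},
    {"trace_id": "t1", "span_id": "d", "parent_id": "a", "name": "POST /payments"},
]


def group_by_trace(records):
    """Rebuild traces by grouping span records on their trace_id."""
    traces = defaultdict(list)
    for record in records:
        traces[record["trace_id"]].append(record)
    return dict(traces)


def children_of(trace, span_id):
    """Return the names of the spans whose parent is span_id."""
    return [s["name"] for s in trace if s["parent_id"] == span_id]


traces = group_by_trace(spans)
print(len(traces["t1"]))
print(children_of(traces["t1"], "a"))
```

Nothing in this shape caps how many records can share a `trace_id`; any limit would come from whatever stores and indexes them.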
        
               | hinkley wrote:
               | That was two years ago (we tried spans before metrics),
               | so it's fuzzy. I believe the collector sidecar was fine
               | with it but the backend was not, which complicated
               | debugging. There's not a clear feedback path in
               | OpenTelemetry that we could find. I completely forgot to
               | mention the tendency toward silent failures. That's a
               | cardinal sin for telemetry. I would take it out back and
               | shoot it for that fact alone.
               | 
               | The other problem I noticed looking at the wire protocol
               | was that the data for the parent trace doesn't seem to
               | get sent until the trace closes. That seems like a
               | bookkeeping nightmare to me. There should be a start of
               | trace packet and an update at the end. I shouldn't have
               | finished spans showing up before the parent trace has
               | been registered. And that's what it looked like in the
               | dumps my OPs people sent me to debug.
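
The ordering described above can be reproduced with a small simulation. This is a sketch under the assumption that a span is emitted only when it ends; the `span` helper and the `exported` list are hypothetical stand-ins, not any SDK's API.

```python
import contextlib
import itertools

exported = []              # stands in for what goes over the wire, in order
_ids = itertools.count(1)  # simple id generator for the sketch


@contextlib.contextmanager
def span(name, parent=None):
    """Hypothetical span helper: the record is emitted only when the span ends."""
    record = {
        "id": next(_ids),
        "name": name,
        "parent": parent["id"] if parent else None,
    }
    try:
        yield record
    finally:
        exported.append(record)


with span("handle_request") as root:
    with span("db_query", parent=root):
        pass
    with span("render", parent=root):
        pass

order = [r["name"] for r in exported]

# A receiver sees children whose parent id it has not seen yet.
seen = set()
orphans_on_arrival = []
for r in exported:
    if r["parent"] is not None and r["parent"] not in seen:
        orphans_on_arrival.append(r["name"])
    seen.add(r["id"])

print(order)
print(orphans_on_arrival)
```

Under that assumption the root arrives last, so a receiver has to hold or index children by a parent id it has not yet seen.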
        
             | pranay01 wrote:
             | As mentioned by philip below, 500 spans is a very small
             | amount. I have seen customers send 1000s of spans in a
             | trace very easily
        
             | IneffablePigeon wrote:
             | This is just not true. We have traces with hundreds of
             | thousands of spans. Those are not very readable but that's
             | another problem.
        
       | cedws wrote:
       | I still don't understand what OTEL is. What problem is it
       | solving? If it's a standard what is the change for the end user?
       | Is it not just a matter of continuing to use whatever
       | (Prometheus, Grafana, etc) with the option to swap components
       | out?
        
         | paulddraper wrote:
         | The point of OTel is interoperability.
         | 
         | For example the _author_ of the software instruments it with
         | OTel -- either language interface or wire protocol -- and the
         | _operator_ of the software uses the backend of choice.
         | 
         | Otherwise, you have a combinatorial matrix of supported
         | options.
         | 
         | (Naturally, this problem is moot if the author and operator are
         | the same.)
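
The "combinatorial matrix" point is mostly arithmetic. A rough sketch with made-up library and backend names:

```python
# Hypothetical instrumented libraries and telemetry backends.
libraries = ["web framework", "db driver", "http client", "queue client"]
backends = ["backend A", "backend B", "backend C"]

# Without a shared standard: every library needs an integration per backend.
pairwise = [(lib, be) for lib in libraries for be in backends]

# With a shared standard: each library and each backend integrates once.
via_standard = [(lib, "standard") for lib in libraries] + [
    ("standard", be) for be in backends
]

print(len(pairwise), len(via_standard))
```

The gap widens as either list grows: N x M integrations versus N + M.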
        
           | hinkley wrote:
           | Interoperability with _what_?
           | 
           | Where are the three existing, successful solutions it is
           | trying to abstract over?
           | 
           | It doesn't know what it is because it's violating the Rule of
           | Three.
        
             | GauntletWizard wrote:
             | Interoperability with the other things your Otel Vendor is
             | selling you. No two implementations are even remotely
             | compatible, but they can all mostly scrape data from your
             | Prometheus endpoints, so it's easy to migrate from useful
             | software to their walled garden.
        
               | hinkley wrote:
               | > to their walled garden
               | 
               | Am I detecting sarcasm or did I just bring my own?
        
               | GauntletWizard wrote:
               | I don't think there's any sarcasm there; Perhaps a
               | wistful hope that behind one of those walls somebody's
               | actually got a garden instead of just a seedbed of false
               | promises.
        
               | hinkley wrote:
               | Yarp.
        
             | jiggawatts wrote:
             | Application Insights, Data Dog, New Relic, etc...
             | 
             | APM products in general.
        
               | hinkley wrote:
               | How to send Prometheus data to New Relic:
               | https://docs.newrelic.com/docs/infrastructure/prometheus-int...
               | 
               | How to send StatsD data to Datadog:
               | https://docs.datadoghq.com/developers/dogstatsd/?tab=hostage...
               | 
               | Places like datadog and posthog are selling you their
               | ability to ingest your existing data. I call bullshit.
               | It's a problem looking for a solution. It's an excuse for
               | engineers to build moats around a moderately difficult
               | problem by making it inscrutable.
        
             | arccy wrote:
             | interoperability between vendors, so your business isn't
             | stuck with a vendor who can raise prices because their SDKs
             | are deeply embedded in your codebase, so open source
             | libraries / products have a common point to hook into
             | without needing to integrate with each vendor.
        
             | paulddraper wrote:
             | > Interoperability with what?
             | 
             | For the backend?
             | 
             | Datadog, New Relic, Grafana, Sentry, Azure Monitor, Splunk,
             | Dynatrace, Honeycomb
        
         | hangonhn wrote:
         | For the tracing part of Otel, neither Prometheus nor Grafana
         | are capable of doing that. Tracing is the most mature part of
         | Otel and the most compelling use case for it. For metrics,
         | we've stayed with Prometheus and AWS Cloudwatch Metrics. The
         | metrics part feels very under developed at the moment.
        
           | hinkley wrote:
           | When I last looked 9 months ago, there were libraries of the
           | metrics side of the tree still marked as experimental, that
           | you couldn't successfully send metrics without using. And a
           | huge memory leak in the JS implementation that was only fixed
           | 15 months ago:
           | https://github.com/open-telemetry/opentelemetry-js/issues/41...
           | 
           | Things, especially crosscutting concerns, you want to use in
           | production should have stopped experiencing basic growing
           | pains like this long before you touch them. It's not baked
           | yet. Come back in a year. Or two.
        
             | barake wrote:
             | Everything is either in development or stable. There aren't
             | statuses like alpha, beta, release candidate, etc. except
             | for individual library releases. Metric clients will be
             | marked as "development" until it goes "stable" [0].
             | Consequently it can be hard to determine the _actual_
             | maturity level of any given implementation.
             | 
             | Tracing is very mature, with metric and logging
             | implementations stable for a number of popular languages
             | [1].
             | 
             |  _the "experimental" status was renamed "development"_
             | 
             | [0] https://opentelemetry.io/docs/specs/otel/versioning-and-stab...
             | 
             | [1] https://opentelemetry.io/docs/languages/#status-and-releases
        
               | hinkley wrote:
               | > the "experimental" status was renamed "development"
               | 
               | That doesn't really change things now does it. It's still
               | a bunch of people sitting around saying "MMMM" loudly
               | while eating half-raw cookies.
        
         | dionian wrote:
         | i can report the same traces to jaeger if i want open source or
         | i switch out the provider and it can go to aws x-ray (paid).
         | without any code or config changes. pretty useful. yes, a tad
         | clumsy to set up the first time.
        
       | pat2man wrote:
       | A lot of web frameworks etc do most of the instrumentation for
       | you these days. For instance using opentelemetry-js and self
       | hosting something like https://signoz.io should take less than an
       | hour to get spun up and you get a ton of data without writing any
       | custom code.
        
         | pranay01 wrote:
         | Agree. Here's the repo for SigNoz if you want to check it out -
         | https://github.com/signoz/signoz
        
         | hocuspocus wrote:
         | Context propagation isn't trivial on a multi-threaded async
         | runtime. There are several ways to do it, but JVM agents that
         | instrument bytecode are popular because they work
         | transparently.
        
           | hinkley wrote:
           | While that's true, if you've already solved punching
           | correlation-IDs and A/B testing (feature flags per request)
           | through then you can use the same solution for all three. In
           | fact you really should.
           | 
           | Ours was old so based on domain <dry heaving sounds>, but by
           | the time I left the project there were just a few places left
           | where anyone touched raw domains directly and you could
           | switch to AsyncLocalStorage in a reasonable amount of time.
           | 
           | The simplest thing that could work is to pass the original
           | request or response context everywhere but that... has its
           | own struggles. It's hell on your function signatures (so I
           | sympathize with my predecessors not doing that but goddamn)
           | and you really don't want an entire sequence diagram being
           | able to fire the response. That's equivalent to having a
           | function with 100 return statements in it.
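
For comparison, here is the same idea outside Node: a minimal sketch using Python's `contextvars`, which is swapped in here as an analog of AsyncLocalStorage rather than the setup described above. The function names and request IDs are hypothetical.

```python
import asyncio
import contextvars

# Ambient per-request state; no function below takes it as a parameter.
request_id = contextvars.ContextVar("request_id", default=None)


async def query_database():
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"db call for {request_id.get()}"


async def handle_request(rid):
    request_id.set(rid)  # set once at the edge of the request
    return await query_database()


async def main():
    # gather() wraps each coroutine in a task, and each task runs in its
    # own copy of the context, so concurrent requests do not see each other.
    return await asyncio.gather(handle_request("req-1"), handle_request("req-2"))


results = asyncio.run(main())
print(results)
```

The same variable could carry a correlation ID, feature flags, or the current span, which is the "one solution for all three" point, without widening every function signature.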
        
       | pranay01 wrote:
       | I literally gave a lightning talk on this in Kubecon NA last
       | year. Here's the youtube video, might help you get some
       | perspective
       | 
       | tl;dr
       | 
       | while there are certainly many areas to improve for the project,
       | some reasons why it could seem complicated
       | 
       | Extensibility by Design: Flexibility in defining meters and
       | signals ensures diverse use cases are supported.
       | 
       | It's still a relatively new technology (~3 years old), growing
       | pains are expected. OpenTelemetry is still the most advanced open
       | standard handling all three signals together.
       | 
       | [1]https://www.youtube.com/watch?v=xEu8_Aeo_-o
        
       | antithesis-nl wrote:
       | This was exactly my reaction to OpenTelemetry.
       | 
       | Creating an HTTP endpoint that publishes metrics in a Prometheus-
       | scrape-able format? Easy! Some boolean/float key-value-pairs with
       | appropriate annotations (basically: is this a counter or a
       | gauge?), and done! And that led (and leads!) to some very usable
       | Grafana dashboards-created-by-actual-users and therefore much
       | joy.
       | 
       | Then, I read up on how to do things The Proper Way, and was
       | initially very much discouraged, but decided to ignore All that
       | Noise due to the existing solutions working so well. No
       | complaints so far!
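
A rough sketch of what such a scrape-able payload looks like, rendered with the standard library only. The metric names and values are invented for illustration, and serving the string over HTTP is left to whatever stack is already in place.

```python
def render_metrics(metrics):
    """Render (name, type, help, value) tuples in the Prometheus text format."""
    lines = []
    for name, kind, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")  # counter or gauge
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics: one counter and one gauge.
body = render_metrics([
    ("app_requests_total", "counter", "Requests served.", 1027),
    ("app_queue_depth", "gauge", "Jobs waiting in the queue.", 3),
])
print(body)
```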
        
       | edenfed wrote:
       | Definitely can relate, this is why I started an open-source
       | project that focuses on making OpenTelemetry adoption as easy as
       | running a single command line:
       | https://github.com/odigos-io/odigos
        
       | BugsJustFindMe wrote:
       | If you get to the end you find that the pain was all self-
       | inflicted. I found it to be very easy in Python with standard
       | stacks (mysql, flask, redis, requests, etc), because you
       | literally just do a few imports at the top of your service and it
       | automatically hooks itself up to track everything without any
       | fuss.
        
         | etimberg wrote:
         | Until you run your server behind something like gunicorn and
         | all of the auto imports stop working and you have to do it all
         | yourself.
        
           | BugsJustFindMe wrote:
           | It works with uwsgi just fine though.
        
           | jdsleppy wrote:
           | I found this to work fine:
           | https://opentelemetry-python.readthedocs.io/en/latest/exampl...
        
         | verall wrote:
         | So recently I needed to set this up for a very simple flask app.
         | We're running otel-collector-contrib, jaeger-all-in-one, and
         | prometheus on a single server with docker compose (has to be
         | all within the corpo intranet for reasons..)
         | 
         | Traces work, and I have the spanmetrics exporter set up, and I
         | can actually see the spanmetrics in prometheus if I query
         | directly, but they won't show up in the jaeger "monitor" tab,
         | no matter what I do.
         | 
         | I spent 3 days on this before my boss is like "why don't we
         | just manually instrument and send everything to the SQL server
         | and create a grafana dashboard from that" and agh I don't want
         | to do that either.
         | 
         | Any advice? It's literally the simplest usecase but I can't get
         | it to work. Should I just add grafana to the pile?
        
           | BugsJustFindMe wrote:
           | Yeah the biggest trouble really is on the dashboarding side
           | of things, not the sending side, and is why there are popular
           | SaaS products like datadog. If you're amenable to saas,
           | datadog is probably the best way. Otherwise, look into SigNoz
           | for a one-stop solution with minimal effort even if there are
           | some rough edges still.
        
             | verall wrote:
             | We absolutely have to run it ourselves (...corporate
             | reasons...), it's a lightweight service with only a few
             | hundred users so we haven't had to worry much about perf
             | (yet).
             | 
             | SigNoz does look interesting, I may give this a shot, thank
             | you. I'm a bit concerned about it conflicting with other
             | things going on in our docker-compose but it doesn't look
             | too bad..
        
         | baby_souffle wrote:
         | > I found it to be very easy in Python with standard stacks
         | (mysql, flask, redis, requests, etc), because you literally
         | just do a few imports at the top of your service and it
         | automatically hooks itself up to track everything without any
         | fuss.
         | 
         | Yes, but only if everything in your stack is supported by their
         | auto instrumentation. Take `aiohttp` for example. The latest
         | version is 3.11.X and ... their auto instrumentation claims to
         | support `3.X` [0] but results vary depending on how new your
         | `aiohttp` is versus the auto instrumentation.
         | 
         | It's _magical_ when it all just works, but that ends up being a
         | pretty narrow needle to thread!
         | 
           | [0]: https://github.com/open-telemetry/opentelemetry-python-contr...
        
           | BugsJustFindMe wrote:
           | > _their auto instrumentation claims to support `3.X`_
           | 
           | Semver should never be treated as anything more than some
           | tired programmer's shrug and prayer that nobody else notices
           | the breakages they didn't notice themselves. Pin strict
           | dependencies instead of loose ones, and upgrade only after
           | integration testing.
           | 
           | There are only two kinds of updates, ones that intend to
           | break something and ones that don't intend to break
           | something, and neither one guarantees that the intent matches
           | the outcome.
        
       | etimberg wrote:
       | What otel really needs to succeed, at least in the python space,
       | is something as easy and straightforward as DataDog's ddtrace
       | command.
        
       | rtuin wrote:
       | Otel seems complicated because different observability vendors
       | make implementing observability super easy with their proprietary
       | SDK's, agents and API's. This is what Otel wants to solve and I
       | think the people behind it are doing a great job. Also kudos to
       | grafana for adopting OpenTelemetry as a first class citizen of
       | their ecosystem.
       | 
       | I've been pushing the use of Datadog for years but their pricing
       | is out of control for anyone between mid size company and large
       | enterprises. So as years passed and OpenTelemetry API's and SDK's
       | stabilized it became our standard for application observability.
       | 
       | To be honest the documentation could be better overall and the
       | onboarding docs differ per programming language, which is not
       | ideal.
       | 
       | My current team is on a NodeJS/Typescript stack and we've created
       | a set of packages and an example Grafana stack to get started
       | with OpenTelemetry real quick. Maybe it's useful to anyone here:
       | https://github.com/zonneplan/open-telemetry-js
        
         | saurik wrote:
         | > Otel seems complicated because different observability
         | vendors make implementing observability super easy with their
         | proprietary SDK's, agents and API's. This is what Otel wants to
         | solve and I think the people behind it are doing a great job.
         | 
         | Wait... so, the _problem_ is that everyone makes it super easy,
         | and so this product solves that by being complicated?
        
         | to11mtm wrote:
         | > I've been pushing the use of Datadog for years but their
         | pricing is out of control for anyone between mid size company
         | and large enterprises
         | 
         | Not a fan of datadog vs just good metric collection. OTOH while
         | I see the value of OTEL vs what I prefer to do... in theory.
         | 
         | My biggest problem with all of the APM vendors, once you have
         | kernel hooks via your magical agent all sorts of fun things
         | come up that developers can't explain.
         | 
         | My favorite example: At another shop we eventually adopted
         | Dynatrace. _Thankfully_ our app already had enough built-in
         | metrics that a lead SRE considered it a 'model' for how to do
         | instrumentation... I say that because, as soon as Dynatrace
         | agents got installed on the app hosts, we started having
         | various 'heisenbugs' requiring node restarts as well as a
         | directly measured drop in performance. [0]
         | 
         | Ironically, the metrics saved us from grief, yet nobody had an
         | idea how to fix it. ;_;
         | 
         | [0] - Curiously, the 'worst' one was MSSQL failovers on update
         | somehow polluting our ADO.NET connection pools in a bad way...
        
       | 6r17 wrote:
       | I have implemented OTEL over numerous projects to retrieve
       | traces. It's just a total pain and I'd 500% skip it for anything
       | else.
        
       | gpi wrote:
       | OpenTelemessy
        
       | Groxx wrote:
       | Yeah... this is about how well every OTel migration goes, from
       | what I've seen.
       | 
       | Docs are an absolute monstrosity that rival Bazel's for utility,
       | but are far less complete. Implementations are _extremely_ widely
       | varied in support for basics. Getting X to work with OTel often
       | requires exactly what they did here: reverse-engineering X to
       | figure out where it does something slightly abnormal... which is
       | normal, almost _every_ library does something similar, because
       | it's so hard to push custom data through these systems in a
       | type-safe way, and many decent systems want type safety and will
       | spend a lot of effort to get it.
       | 
       | It feels kinda like OAuth 2 tbh. Lots of promise, obvious
       | desirable goals, but completely failing at everything involving
       | consistent and standardized implementation.
        
       | pnathan wrote:
       | I spent altogether too much time trying to get the Rust otel libs
       | working in a useful and concise way. After a few hours I junked
       | it and went back to a direct use of a jaeger client sending off
       | to the otel collector.
       | 
       | there's some gold here, but most of it is over in the
       | consultant/vendor space today, I fear.
        
       | jensensbutton wrote:
       | It's getting close to k8s in terms of activity so at least there
       | are a lot of people working on it.
        
       | junto wrote:
       | One of my biggest problems was the local development story. I
       | wanted logs, traces and metrics support locally but didn't want
       | to spin up a multitude of Docker images just to get that to work.
       | I wanted logs to be able to check what my metrics, traces,
       | baggage and activity spans look like before I deploy.
       | 
       | Recently, the .NET team launched .NET Aspire and it's awesome.
       | Super easy to visualize everything in one place in my local
       | development stack and it acts as an orchestrator as code.
       | 
       | Then when we deploy to k8s we just point the OTEL endpoint at the
       | DataDog Agent and everything just works.
       | 
       | We just avoid the DataDog custom trace libraries and SDK and
       | stick with OTEL.
       | 
       | Now it's a really nice development experience.
       | 
       | https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals...
       | 
       | https://docs.datadoghq.com/opentelemetry/#overview
        
       | ejs wrote:
       | Glad I'm not the only one that feels this way. For a small
       | application when you just want some metrics and observability,
       | it's a big burden to get it all working.
       | 
       | On my own projects, I send the metrics I care about out through
       | the logs and have another project I run collect and aggregate
       | them from the logs. Probably "wrong" but it works and it's easy
       | to set up.
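
One way the metrics-through-logs approach could look, as a minimal sketch. The `METRIC` prefix, the JSON payload, and the metric names are all assumptions made for illustration, not the setup described above.

```python
import json
from collections import Counter


def metric_line(name, value):
    """Format one metric as a log line; the METRIC prefix is an arbitrary marker."""
    return "METRIC " + json.dumps({"name": name, "value": value})


def aggregate(log_lines):
    """Sum metric values per name, ignoring ordinary log lines."""
    totals = Counter()
    for line in log_lines:
        if line.startswith("METRIC "):
            record = json.loads(line[len("METRIC "):])
            totals[record["name"]] += record["value"]
    return dict(totals)


# A hypothetical log stream mixing ordinary lines and metric lines.
log = [
    "INFO starting worker",
    metric_line("jobs_done", 1),
    metric_line("jobs_done", 2),
    "WARN slow response",
    metric_line("emails_sent", 5),
]
totals = aggregate(log)
print(totals)
```

The trade-off is that the collector has to parse every log line, but the application needs nothing beyond its existing logger.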
        
       | dboreham wrote:
       | Author is trying to do something difficult with a non-batteries-
       | included open source (free to them) product. Seems quite
       | uncomplicated given the circumstances. The whole point of OTel is
       | to not get bent over backwards by one of the SaaS
       | "logging/tracing/telemetry" companies, and as such it's going to
       | incur some cost/pain of its own, but typically the bargain is
       | worth taking.
        
       | shireboy wrote:
       | I'm literally porting some code to Otel now and here is what I
       | landed on, even before this article: It is confusing because it's
       | a topic that uses vague terminology that means different things
       | in different domains. For example, I'm looking at one OTel ui and
       | "Traces" are the individual http requests to a service. In
       | another UI, against the same data, "Traces" are the log messages
       | from code in the service, and "Requests" are the individual http
       | requests. To wire up in code, there's yet other terminology.
       | 
       | I haven't decided exactly what to blame for this. In some ways,
       | it's necessary to have vague, inconsistent terminology to cover
       | various use cases. And, to be fair some of the UIs predate OTel.
        
       ___________________________________________________________________
       (page generated 2025-01-10 23:00 UTC)