[HN Gopher] The problem with OpenTelemetry
       ___________________________________________________________________
        
       The problem with OpenTelemetry
        
       Author : robgering
       Score  : 134 points
       Date   : 2024-06-14 12:36 UTC (10 hours ago)
        
 (HTM) web link (cra.mr)
        
       | NeutralForest wrote:
       | It resonates. As an intern I had to add OTEL to a Python project
       | and I had to spend a lot of time in the docs to understand the
       | concepts and implementation. Also, the Python impl has a lot of
       | global state that makes it hard to use properly imo.
        
         | zaphar wrote:
         | Tracing requires keeping mappings for tracing identifiers per
         | request. I don't know how you do that without global state unless
         | you want the tracing identifiers to pollute your own internal
         | apis everywhere.
        
           | bigblind wrote:
           | Many frameworks have the idea of a "context" for this, that
           | holds per-request state, following your request through the
           | system. Functions that don't care about the context just pass
           | it on to whatever they call.
           | 
           | I think Go was smart to make this concept part of the
           | standard library, as it encouraged frameworks to adopt it as
           | well.
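In Python, the standard library's `contextvars` module is one way to carry that per-request state without threading an argument through every call; as far as I know, the OpenTelemetry Python SDK's default runtime context is built on it. A minimal stdlib-only sketch, in which `current_trace_id`, `handle_request` and `do_work` are hypothetical names chosen for illustration:

```python
import asyncio
import contextvars
import uuid

# Hypothetical per-request trace identifier; each asyncio task sees its own copy.
current_trace_id = contextvars.ContextVar("current_trace_id", default="none")


def log(message: str) -> str:
    # Deep in the call stack, the ID is read without being passed as a parameter.
    line = f"[trace={current_trace_id.get()}] {message}"
    print(line)
    return line


async def do_work(step: str) -> str:
    await asyncio.sleep(0)  # yield so the concurrent requests interleave
    return log(f"doing {step}")


async def handle_request(name: str) -> list[str]:
    # Set the ID once at the edge of the request; it follows the task from here.
    current_trace_id.set(uuid.uuid4().hex[:8])
    return [await do_work(f"{name}-a"), await do_work(f"{name}-b")]


async def main() -> list[list[str]]:
    # gather() wraps each coroutine in its own task with a copied context.
    return await asyncio.gather(handle_request("r1"), handle_request("r2"))


if __name__ == "__main__":
    asyncio.run(main())
```

Each task started by `asyncio.gather` gets a copy of the context, so the two concurrent requests log different IDs even though `log()` never receives one as an argument.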
        
           | NeutralForest wrote:
           | I understand that but if you look at the Python
           | implementation (or at least as it was 1-2 years ago), you
           | have a lot of god objects that hack __new__ which leads to
           | hidden flows when you create new instances of tracers for
           | example. I'm not saying I have a better idea but when you put
           | that together with the docs and the (at the time) very bare
           | examples, it's just annoying.
        
         | chipdart wrote:
         | > As an intern I had to ${DO_SOME_PROJECT} and I had to spend a
         | lot of time in the docs to understand the concepts and
         | implementation
         | 
         | That sounds like every single run-of-the-mill internship.
        
           | NeutralForest wrote:
           | That's fair, but I'll say that the time and the number of
           | concepts you have to deal with, per the docs, before getting
           | into the code are quite big, and I think the critique in the
           | article is warranted.
        
       | BiteCode_dev wrote:
       | 100% agree.
       | 
       | Every time I tried to use OT I was reading the doc and whispering
       | "but, why? I only need...".
        
         | Karrot_Kream wrote:
         | Yeah I was going down this path for a side project I was
         | getting going and spent a couple days of after-work time
         | exploring how to get just some basic traces in OT and realized
         | it was much more than I needed or cared about.
        
       | antonyt wrote:
       | OTel is flawed for sure, but I don't understand the stance
       | against metrics and logs. Traces are inherently sampled unless
       | you're lighting all your money on fire, or operating at so small
       | a scale that these decisions have no real impact. There are kinds
       | of metrics and logs which you always want to emit because they're
       | mission-critical in some way. Is this a Sentry-specific thing?
       | Does it just collapse these three kinds of information into a
       | single thing called a "trace"?
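On "traces are inherently sampled": a common approach is head sampling where the keep/drop decision is derived from the trace ID itself, so every service that sees the same trace makes the same choice. My understanding is that the Python SDK's `TraceIdRatioBased` sampler compares the low 64 bits of the ID against a threshold in roughly this way, but `should_sample` below is a hypothetical stdlib-only sketch, not SDK code:

```python
import random

TRACE_ID_LOW_MASK = (1 << 64) - 1  # use the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of traces, keyed on the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_LOW_MASK) < bound


if __name__ == "__main__":
    rng = random.Random(42)
    ids = [rng.getrandbits(128) for _ in range(100_000)]
    kept = sum(should_sample(t, 0.1) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")  # close to 10%, not exact
```

Because the decision is a pure function of the ID, services need no coordination; the cost is that anything you want counted exactly (the "mission-critical" metrics and logs above) still needs its own unsampled path.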
        
         | aleph_minus_one wrote:
         | > OTel is flawed for sure, but I don't understand the stance
         | against metrics and logs.
         | 
         | Even if you don't want to consider the privacy concerns:
         | _telemetry wastes quite a lot of your internet connection's data_.
        
           | spenczar5 wrote:
           | Client-side transport is pretty unusual with OTel. I think
           | almost everybody is sending things from the server side, so I
           | don't think your concern is usually relevant.
        
           | cogman10 wrote:
           | Hey, this isn't the sort of telemetry we are talking about
           | with OTel.
           | 
           | About the only "privacy concern" with otel is that you are
           | probably shipping traces/metrics to a cloud provider for your
           | internal applications. This isn't the sort of telemetry
           | getting baked into Microsoft or Google products to try and
           | identify personal aspects of individuals; this is data that
           | tells you "Foo app is taking 300ms serving /bar which is
           | unusual".
        
             | cdelsolar wrote:
             | After I added OTel to an open source project I run, I spent
             | a bit of time arguing with someone about telemetry - they
             | kept saying they didn't opt in and that we need to inform
             | our users about it, etc., and I kept saying no, that's not
             | the same type of telemetry. I wonder how common this
             | misconception is.
        
               | cogman10 wrote:
               | This is the second time I've seen this misconception come
               | up on HN, and I've definitely seen it on Reddit at least
               | once.
        
               | remram wrote:
               | OpenTracing was a much clearer name, especially for those
               | of us who really don't care about doing logging or
               | metrics through OTel.
        
           | wdb wrote:
           | I think you are talking more about RUM (real user monitoring),
           | which isn't yet supported by OpenTelemetry. I think they are
           | working on it.
           | 
           | I am not sure if it will support session replays like some
           | vendors such as Sentry or New Relic offer. Technically, I think
           | session replays (rrweb etc.) are pretty cool, but as a web
           | visitor I am not a fan.
        
         | Dextro wrote:
         | I mean, when you're the one selling the gas to light that money
         | on fire you have a vested interest in keeping it that way
         | right?
         | 
         | I do agree that logging and spans are very similar, but I
         | disagree that logs are just spans because they aren't exactly
         | the same.
         | 
         | I also agree that you can collect all metrics from spans and,
         | in fact, it might be a better way to tackle it. But it's just
         | not financially feasible, so you do need to have some sort of
         | collection step closer to the metric producers.
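As a rough illustration of deriving metrics from spans: count, error rate and tail latency per operation can all be computed from finished span records. `SpanRecord` and `red_metrics` are made-up names for this stdlib-only sketch, which assumes every span reaches the aggregator; that assumption is exactly where the cost comes from.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class SpanRecord:
    # Hypothetical minimal span shape; real spans carry far more fields.
    name: str
    duration_ms: float
    is_error: bool


def red_metrics(spans: list[SpanRecord]) -> dict[str, dict[str, float]]:
    """Aggregate count, error rate and p95 latency per span name."""
    by_name: dict[str, list[SpanRecord]] = defaultdict(list)
    for span in spans:
        by_name[span.name].append(span)

    metrics = {}
    for name, group in by_name.items():
        durations = sorted(s.duration_ms for s in group)
        errors = sum(s.is_error for s in group)
        # quantiles() needs at least two data points; fall back to the only value.
        p95 = quantiles(durations, n=20)[-1] if len(durations) > 1 else durations[0]
        metrics[name] = {
            "count": float(len(group)),
            "error_rate": errors / len(group),
            "p95_ms": p95,
        }
    return metrics
```

With sampled spans the counts would have to be scaled by the inverse sampling ratio, which is one reason collecting some metrics closer to the producers is attractive.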
         | 
         | What I do agree with is that the terminology and the
         | implementation of OTEL's SDK is incredibly confusing and hard
         | to implement/keep up to date. I spent way too many hours of my
         | career struggling with conflicting versions of OTEL so I know
         | the pain and I desperately wish they would at least take to
         | heart the idea of separating implementation from API.
        
           | zeeg wrote:
           | Food for thought: the subjective nature of both of those is
           | exactly why they shouldn't be bundled.
        
         | the_mitsuhiko wrote:
         | > Traces are inherently sampled unless you're lighting all your
         | money on fire
         | 
         | You can burn a lot of money with logs and metrics too. The
         | question is how much value you get for the money you throw on
         | the burning pile of monitoring. My personal belief is that well
         | instrumented distributed tracing is more actionable than logs
         | and metrics. Even if sampled.
         | 
         | (Disclaimer: I work at sentry)
        
           | Jemaclus wrote:
           | I actually take the opposite approach. In my experience, well
           | instrumented metrics and finely tuned logs are more
           | actionable than distributed traces! Interesting how that
           | works out.
        
             | the_mitsuhiko wrote:
             | I believe on the infrastructure side that might be correct.
             | Within applications that doesn't match my experience. In
             | many cases the concurrent nature of servers makes it
             | impossible to repro issues and narrow down the problem
             | without tracing or trace aware logs.
        
               | aserafini wrote:
               | With only sampled traces, though, it's very hard to
               | understand the impact of the problem. You can see some bad
               | traces, but is the problem affecting 5%, 10%, or 90% of your
               | customers? Metrics shine there.
        
               | remram wrote:
               | Whether it is affecting 5% or 10% of your customers, if
               | it is erroring at that rate you are going to want to find
               | the root cause ASAP. Traces let you do that, whereas the
               | precise number does nothing. I am a big supporter of
               | metrics but I don't see this as the use case at all.
        
       | dboreham wrote:
       | I've used Otel quite a bit (in JVM systems) and honestly didn't
       | know it did more than tracing.
       | 
       | That said, I think this rot comes from the commercial side of the
       | sector -- if you're a successful startup with one product (e.g.
       | graphing counters), then your investors are going to start
       | beating you up about why don't you expand into other adjacent
       | product areas (e.g. tracing). The same pressure applies in reverse
       | (a tracing company gets pushed into metrics). And so you get
       | Grafana, New Relic, et al. OpenTelemetry is just mirroring that
       | arrangement.
        
       | cogman10 wrote:
       | Perhaps the real problem with OTel (IMO) is that it's trying to be
       | everything for everyone and every language. It's trying to have a
       | common interface so that you can write OTel in Java or
       | JavaScript, Python or Rust, and you basically have the exact same
       | API.
       | 
       | I suspect OP is seeing this directly when talking about the
       | kludginess of the JavaScript API.
        
       | syngrog66 wrote:
       | Up my alley. I'm the author of a FOSS Golang span instrumentation
       | library for latency (LatLearn in my GitHub.) And part of the team
       | that back in 2006/2007 made an in-house distributed tracing
       | solution for Orbitz.
        
       | doctorpangloss wrote:
       | I don't know what the Sentry guy is really saying - I mean you
       | can write whatever code you want, go for it man.
       | 
       | But I do have to "pip uninstall sentry-sdk" in my Dockerfile
       | because it clashes with something I didn't author. And anyway,
       | because it is completely open source, the flaws in OpenTelemetry
       | for my particular use case took an hour to surmount, and vitally,
       | I didn't have to pay the brain damage cost most developers hate:
       | relationships with yet another vendor.
       | 
       | That said I appreciate all the innovation in this space, from
       | both Sentry and OpenTelemetry. The metrics will become the
       | standard, and that's great.
       | 
       | The problem with Not OpenTelemetry: eventually everyone is going
       | to learn how to use Kubernetes, and the USP of many startup
       | offerings will vanish. OpenTelemetry and its feature scope creep
       | make perfect sense for people who know Kubernetes. Then it makes
       | sense why you have a wire protocol, why abstraction for vendors
       | is redundant or meaningless toil, and why PostHog and others stop
       | supporting Kubernetes: it competes with their paid offering.
        
         | MapleWalnut wrote:
         | The Sentry SDK is open source and easy to contribute to in my
         | experience.
        
           | hahn-kev wrote:
           | Yeah, but who wants to contribute to an SDK for a service that
           | you need to pay for? That would be like Oracle DB being open
           | to contributions.
        
             | MapleWalnut wrote:
             | Sentry provides a great hosted service. You can self host
             | if you like, but it's nicer to let them do it
        
             | tnolet wrote:
             | Sentry is self-hostable: https://develop.sentry.dev/self-hosted/
        
               | brunoqc wrote:
               | But not foss. It's using the BSL or FSL or whatever.
        
               | riedel wrote:
               | Although I do not like those licences, I would not care
               | that much about the two years until it becomes FOSS.
               | Before all this rapid development, RRDTool and OpenTSDB
               | were so slow; this whole thing seems more ideological than
               | substantive criticism. Going down the licence rabbit hole
               | to criticise the original argument seems like a classic
               | strawman.
        
               | brunoqc wrote:
               | In my head, I was supporting a variation of the "Yeah but
               | who wants to contribute to an SDK for a service that you
               | need to pay for?" claim.
               | 
               | You can self-host for free, so maybe @hahn-kev doesn't mind
               | contributing to the SDK now.
               | 
               | For me, I refuse to contribute to an open-source SDK for
               | a non-foss product. And I refuse to self-host a non-foss
               | product.
               | 
               | Personally, I don't care if non-foss licenses speed
               | development. So yeah, in my case it's ideological.
        
               | ensignavenger wrote:
               | https://glitchtip.com/ is an Open Source form of Sentry
               | created after they went closed source, if you are
               | interested in something like that.
        
               | zeeg wrote:
               | Just want to say I appreciate your stance.
               | 
               | (also no one should feel like they have to contribute to
               | our SDKs, but please file a ticket if something's fucked
               | up and we'll deal w/ it)
        
               | nimih wrote:
               | Sentry is _technically_ self-hostable, but they provide
               | no deployment guidance beyond running the giant blob of
               | services/microservices (including instances of postgres,
               | redis, memcache, clickhouse, and kafka) as a single
               | docker-compose thing. I get why they do this and think
               | it's totally reasonable of them, but Sentry is a very
               | complicated piece of software and takes substantially
               | more work IME to both get up and running and maintain
               | compared to other open-source self-hosted
               | observability/monitoring/telemetry software I've had the
               | pleasure of working with.
        
               | miohtama wrote:
               | Our Linux devops engineer, who had not used Sentry
               | before, set up a self-hosted Sentry in a day.
        
               | tapoxi wrote:
               | Yeah, it works for a time, but they don't support on-premise
               | versions and they don't offer a Helm chart install; it's all
               | community-based.
               | 
               | I tried it for well over a year, and there are so many
               | moving parts and so many "best guesses" from the
               | community that we had to rip it out. There's a lot of
               | components, sentry, sentry-relay, snuba, celery, redis,
               | clickhouse, zookeeper (for clickhouse), kafka, zookeeper
               | (for kafka), maybe even elasticsearch for good measure.
               | It did work for a time, but there are so many moving
               | parts that required care and feeding it would inevitably
               | break down at some point.
               | 
               | Problem is I can't ship data to their SaaS version
               | because we have PHI and our contracts forbid it, even if
               | scrubbed, so I had to settle on OTEL.
        
               | tnolet wrote:
               | Day 1 vs day 2. That's why the SaaS version exists.
        
             | Spivak wrote:
             | I've been using GlitchTip https://glitchtip.com with the
             | Sentry SDKs and I couldn't be happier. Completely self-
             | hosted, literally just the container and a db, requires
             | zero attention.
        
         | zitterbewegung wrote:
         | Why have I heard only bad things about k8s? To the point where
         | it's a meme to understand k8s...
        
         | marcosdumay wrote:
         | > eventually everyone is going to learn how to use Kubernetes
         | 
         | That seems obviously true... yet, there are so many people out
         | there that seem unable to learn it that I don't think it's a
         | reliable prediction.
        
           | politelemon wrote:
           | > unable
           | 
           | I wouldn't equate unwillingness or not needing it to
           | inability to learn
        
       | drewbug01 wrote:
       | As a contributor to (and consumer of) OpenTelemetry, I think
       | critique and feedback are most welcome - and sorely needed.
       | 
       | But this ain't it. In the opening paragraphs the author dismisses
       | the hardest parts of the problem (presumably because they are
       | _human_ problems, which engineers tend to ignore), and betrays a
       | complete lack of interest in understanding why things ended up
       | this way. It also seems they've completely misunderstood the
       | API/SDK split in its entirety, because they argue for having such
       | a split. It's there - that's exactly what exists!
       | 
       | And it goes on and on. I think it's fair to critique
       | OpenTelemetry; it can be really confusing. The blog post is
       | evidence of that, certainly. But really it just reads like
       | someone who got frustrated that they didn't understand how
       | something worked - and so instead of figuring it out, they've
       | decided that it's just hot garbage. I wish I could say this was
       | unusual amongst engineers, but it isn't.
        
         | arccy wrote:
         | indeed, it just sounds like they're complaining they don't have
         | a seat at the table...
        
         | klabb3 wrote:
         | No dog in the fight here, but... if you're saying that one of
         | the top guys at a major observability shop didn't understand
         | OpenTelemetry, then that says much more about OT than it does
         | about his skills or efforts to understand. After all, his main
         | point is that it's complex and overengineered, which is the key
         | takeaway for curious bystanders like me, whether every detail
         | is technically correct or not.
         | 
         | > it just reads like someone who [...] didn't understand how
         | something worked - and so instead of figuring it out, they've
         | decided that it's just hot garbage.
         | 
         | And what about average developers asked to "add telemetry" to
         | their apps and libraries? Their patience will be much lower
         | than that.
         | 
         | Not necessarily defending the content (frankly it should have
         | had more examples), but I relate to the sentiment. As a
         | developer, I _need_ framework providers to make sane design
         | decisions with minimal api surface, otherwise I'd rather build
         | something bespoke or just not care.
        
           | cdelsolar wrote:
           | OTel is very easy to add. I've added it to several Go
           | projects. For some frameworks like .NET you can do it
           | automatically. The harder/more annoying part is setting up a
           | viewer/collector like Jaeger. I've done that too but just in
           | memory and it fills up quick.
        
             | bbkane wrote:
             | For my small-scale projects, Openobserve.ai has been super
             | helpful. It ships as a single binary and (in a non-HA setup)
             | saves traces/logs/metrics to disk. I just set it up as a
             | system service and start sending telemetry via localhost.
             | Code at https://github.com/bbkane/shovel_ansible/
        
         | zeeg wrote:
         | Author here.
         | 
         | That's kind of making my point for me fwiw. It's too
         | complicated. I consider myself a product person so this is my
         | version of that lens on the problem.
         | 
         | I'm not dismissing the people problem at all - I actually am
         | trying to suggest the technology problem is the easier part (eg
         | a basic spec). Getting it implemented, making it easy to
         | understand, etc is where I see it struggling right now.
         | 
         | As an aside, this is not just my feedback; it's a synthesis of
         | what I'm hearing (but also what I believe).
        
       | shaqbert wrote:
       | Otel is indeed quite complex. And the docs are not meant for
       | quick wins...
       | 
       | Otelbin [0] has helped me quite a bit in configuring and making
       | sense of it, and getting stuff done.
       | 
       | [0]: https://www.otelbin.io/
        
         | wdb wrote:
         | That looks pretty cool! OpenTelemetry Collector configuration
         | files are pretty confusing. I do like the collector, though; it
         | makes it easy to send a subset of your telemetry to trusted
         | partners.
        
       | wdb wrote:
       | Personally, I like OpenTelemetry; it's a nice standardised
       | approach. I just wish the vendors had better support for the
       | semantic conventions defined for a wide variety of traces.
       | 
       | I quite like the idea of only needing to change one small piece
       | of the code to switch OTel exporters instead of swapping out a
       | vendor trace SDK.
       | 
       | My main gripe with OpenTelemetry is that I don't fully understand
       | what the exact difference is between (trace) events and log
       | records.
        
         | tnolet wrote:
         | Can you give an example of the missing semantic conventions?
        
         | yunwal wrote:
         | > My main gripe with OpenTelemetry is that I don't fully
         | understand what the exact difference is between (trace) events
         | and log records.
         | 
         | This is my main gripe too. I don't understand why {traces,
         | logs, metrics} are not just different abstractions built on top
         | of "events" (blobs of data your application ships off to some
         | set of central locations). I don't understand why the
         | opentelemetry collector forces me to re-implement the same
         | settings for all of them and import separate libraries that all
         | seem to do the same thing by default. Besides sdks and
         | processors, I don't understand the need for these abstractions
         | to persist throughout the pipeline. I'm running one collector,
         | so why do I need to specify where my collector endpoint is 3
         | different times? Why do I need to specify that I want my blobs
         | batched 3 different times? What's the point of having
         | opentelemetry be one project at all?
         | 
         | My guess is this is just because opentelemetry started as a
         | tracing project, and then became a logs and metrics project
         | later. If it had started as a logging project, things would
         | probably make more sense.
        
           | serverlessmom wrote:
           | Something I mention any time I'm introducing OpenTelemetry is
           | that it's an unfinished project, a huge piece being the
           | unifying abstractions between those signals.
           | 
           | In part this is a very practical decision: most people
           | already have pretty good tools for their logs, and have
           | struggled to get tracing working. So it's better to work on
           | tools for measuring and sending traces, and just let people
           | export their current log stream via the OpenTelemetry
           | collector.
           | 
           | Notably the OTel docs acknowledge this mismatch between
           | current implementation and design goals:
           | https://opentelemetry.io/docs/specs/otel/logs/#limitations-o...
        
           | chipdart wrote:
           | > This is my main gripe too. I don't understand why {traces,
           | logs, metrics} are not just different abstractions built on
           | top of "events" (blobs of data your application ships off to
           | some set of central locations).
           | 
           | By design, they cannot be abstractions of the single concept.
           | For example, logs have a hard requirement on preserving
           | sequential order and session and emitting strings, whereas
           | metrics are aggregated and sampled and dropped arbitrarily
           | and consist of single discrete values. Logs can store open-
           | ended data, and thus need to comply with tighter data
           | protection regulations. Traces often track a very specific
           | set of generic events, whereas there are whole classes of
           | metrics that serve entirely different purposes.
           | 
           | Just because you can squint hard enough to only see events
           | being emitted, that does not mean all event types can or
           | should be treated the same.
        
           | arccy wrote:
           | If you're using OTLP, SDKs only require you to specify the
           | endpoint once; the signal-specific settings are for when you
           | want to send them to different places.
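For reference, my reading of the OTLP exporter configuration spec: `OTEL_EXPORTER_OTLP_ENDPOINT` is shared by all signals, and for OTLP/HTTP the SDK appends `v1/traces`, `v1/metrics` or `v1/logs` to it, while a signal-specific variable such as `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` overrides it and is used as-is (gRPC does not append per-signal paths). The resolution rule can be sketched like this; `otlp_http_endpoint` is an illustrative helper, not SDK code, so check the spec before relying on the details:

```python
import os
from typing import Mapping

SIGNAL_PATHS = {"traces": "v1/traces", "metrics": "v1/metrics", "logs": "v1/logs"}
DEFAULT_HTTP_ENDPOINT = "http://localhost:4318"  # conventional OTLP/HTTP port


def otlp_http_endpoint(signal: str, env: Mapping[str, str] = os.environ) -> str:
    """Resolve the OTLP/HTTP URL for one signal from environment variables."""
    specific = env.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if specific:
        return specific  # signal-specific URLs are used exactly as given
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_HTTP_ENDPOINT)
    return f"{base.rstrip('/')}/{SIGNAL_PATHS[signal]}"
```

So a single `OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318` covers all three signals, and the per-signal variables only matter when one of them should go somewhere else.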
           | 
           | The way you process/modify metrics vs logs vs traces are
           | usually sufficiently different that there's not much point in
           | having a unified event model if you're going to need a bunch
           | of conditions to separate and process them differently. Of
           | course, you can still use only one source (logs or events)
           | and derive the other 2 from that, though that rarely scales
           | well.
           | 
           | Plus, the backends that you can use to store/visualize the
           | data usually are optimized for specific signals anyways.
        
       | prymitive wrote:
       | I only learned about OT after Prometheus announced some deeper
       | integration with it. Reading OT docs about metrics feels like
       | every little problem has a dedicated solution in the OT world,
       | even if a more generalised one already covers it. Which is quite
       | striking coming from the Prometheus world.
        
       | tnolet wrote:
       | A recent example of OTel confusion.
       | 
       | I could not for the life of me get the Python integration to send
       | traces to a collector. Same URL, same setup, same API key as for
       | Node.js and Go.
       | 
       | Turns out the Python SDK expects a URL-encoded header, e.g.
       | "Bearer%20somekey", whereas all other SDKs just accept a string
       | with a space in it.
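If I read the OTLP exporter spec correctly, exporter headers set through `OTEL_EXPORTER_OTLP_HEADERS` are comma-separated `key=value` pairs in a W3C-Baggage-like format with percent-encoded values, which is why the space in `Bearer somekey` has to be written as `%20`. A stdlib sketch of both directions; `encode_otlp_headers` and `parse_otlp_headers` are hypothetical helpers and the key is a placeholder:

```python
from urllib.parse import quote, unquote


def encode_otlp_headers(headers: dict[str, str]) -> str:
    """Build an OTEL_EXPORTER_OTLP_HEADERS value with percent-encoded values."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in headers.items())


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Parse the variable back into a dict, percent-decoding each value."""
    headers = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate empty entries such as a trailing comma
        key, _, value = pair.partition("=")
        headers[key.strip()] = unquote(value.strip())
    return headers
```

Encoding `{"Authorization": "Bearer somekey"}` yields `Authorization=Bearer%20somekey`, the form that ended up working with the Python SDK.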
       | 
       | The whole split between HTTP, protobuf over HTTP and GRPC is also
       | massively confusing.
        
         | hahn-kev wrote:
         | Sounds like a problem with the Python sdk
        
           | tnolet wrote:
           | Well, actually, they (the Python SDK maintainers) argue their
           | implementation is the correct one according to the spec. See
           | this issue thread for example.
           | 
           | https://github.com/open-telemetry/opentelemetry-specificatio...
           | 
           | There are more. This is a symptom of how hard it is to dive
           | into Otel due to its surface area being so big.
        
             | chipdart wrote:
             | > Well actually. They (python SDK maintainers) argue their
             | implementation is the correct one according to the spec.
             | See this issue thread for example.
             | 
             | The comment section of that issue gives out contrarian
             | vibes. Apparently the problem is that the Python SDK
             | maintainers refuse to support a use case that virtually all
             | other SDKs support. There are some weasel words that try to
             | convey the idea that half the SDKs are with Python while in
             | reality the ones that support the choices followed by the
             | Python SDK actually support all scenarios.
             | 
             | From the looks of it, the Python SDK maintainers are
             | purposely making a mountain out of a molehill that could be
             | levelled with a single commit with a single line of code.
        
               | tnolet wrote:
               | I guess you worded it better than I did.
               | 
               | As a user it feels very weird to wade into threads like
               | this to find a solution to your problem.
               | 
               | The power of Otel is it being an open standard. But in
               | practice, the implementation of that standard / spec
               | leads to all kinds of issues and fiefdoms.
        
         | hinkley wrote:
         | The silent failure policy of OTEL makes flames shoot out of the
         | top of my head.
         | 
         | We had to use wireshark to identify a super nasty bug in the
         | "JavaScript" (but actually typescript despite being called
         | opentelemetryjs) implementation.
         | 
         | And OTEL is largely unsuitable for short-lived processes like
         | CLIs and CI/CD. And I would wager the same holds for FaaS
         | (Lambda).
         | 
         | In the end I prefer the network topology of StatsD, which is
         | what we were migrating from. Let the collector do ALL of the
         | bookkeeping instead of faffing about. OTEL is _actively_
         | hostile to process-per-thread programming languages. If I had
         | it to do over again, I'd look at the StatsD -> Prometheus
         | integrations, and the StatsD extensions that support tagging.
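For readers who haven't used it: a StatsD client just fires plain-text lines such as `api.requests:1|c` over UDP and the StatsD daemon does the aggregation and periodic flushing, so the application keeps no buffers of its own. The `|#key:value` suffix below is the DogStatsD-style tagging extension, one of several variants; the metric names are placeholders and 8125 is the conventional StatsD port. A stdlib sketch:

```python
import socket
from typing import Optional


def statsd_line(name: str, value: float, metric_type: str = "c",
                tags: Optional[dict[str, str]] = None) -> str:
    """Format one StatsD datagram, e.g. 'api.requests:1|c|#env:prod'."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # DogStatsD-style tags; other StatsD tagging extensions use other syntaxes.
        line += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    return line


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    # Fire-and-forget UDP: no connection and no client-side batching.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Because each datagram leaves immediately, a CLI or CI job that exits right after `send_metric` has nothing pending to flush, which is the topology argument being made here.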
        
           | tnolet wrote:
           | Yeah. And Otel actually has pretty nice debugging. You just
           | need to set the right environment variable. But in prod it
           | will blow up your logs.
        
       | hobofan wrote:
       | This seems to be more of a branding problem than anything.
       | 
       | OP (rightfully) complains that there is a mismatch between what
       | they (can) advertise ("We support OTEL") and what they are
       | actually providing to the user. I have the same pain point from
       | the consumer side, where I have to trial multiple tools and
       | services to figure out which of them actually supports the OTEL
       | feature set I care about.
       | 
       | I feel like this could be solved by introducing better branding
       | that has a clearly defined scope of features inside the project
       | (like e.g. "OTEL Tracing") which can serve as a direct signifier
       | to customers about what feature set can be expected.
        
         | zeeg wrote:
         | Yes! It's a bit deeper than that, but it's fundamentally a
         | packaging issue.
        
       | no_circuit wrote:
       | IMO this boils down to how one gets paid to understand or
       | misunderstand something. A telemetry provider/founder is being
       | commoditized by an open specification whose development they do
       | not participate in (as implied by the post saying the author
       | doesn't know anyone on the spec committees). No surprise here.
       | 
       | Of course, implementing a spec can be difficult from the
       | provider's point of view. Also, take a look at the names in the
       | OTEL community and notice that Sentry is not there:
       | https://github.com/open-telemetry/community/blob/86941073816....
       | This really isn't news. I'd guess that a Sentry customer should
       | be able to use the OTEL API and simply configure a proprietary
       | Sentry exporter for all their compute nodes, if Sentry has some
       | superior way of collecting and managing telemetry.
       | 
       | IMO most library authors do not have to worry about annotation
       | naming or anything like that mentioned in the post. Just use the
       | OTEL API for logs, or use a logging API that has an OTEL
       | exporter, and whoever is integrating your code will take care of
       | annotating spans. Propagating span IDs is the job of "RPC"
       | libraries, not general code authors. Your URL fetch library
       | should know how to propagate the Span ID, provided that it also
       | uses the OTEL API.
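To make the propagation point concrete: by default, OTel SDKs carry span context between processes in the W3C Trace Context `traceparent` HTTP header (`version-traceid-parentid-flags`). The stdlib-only sketch below shows what a fetch/RPC library does on each side of a call; it handles only version `00` and ignores `tracestate`. In real code you would use the OTel propagation API (for example `opentelemetry.propagate.inject` / `extract` in Python) rather than hand-rolling this; `PropagatedContext`, `inject_traceparent`, `extract_traceparent`, and `child_context` are hypothetical names.

```python
from __future__ import annotations

import re
import secrets
from dataclasses import dataclass

# W3C Trace Context header, version 00: "00-<trace-id>-<parent-id>-<flags>"
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class PropagatedContext:
    trace_id: str        # 32 lowercase hex chars, shared by the whole trace
    span_id: str         # 16 lowercase hex chars, the caller's current span
    sampled: bool = True


def inject_traceparent(ctx: PropagatedContext, headers: dict[str, str]) -> None:
    """Outgoing side: write the caller's span context into request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract_traceparent(headers: dict[str, str]) -> PropagatedContext | None:
    """Incoming side: parse the header; return None if absent or invalid.

    Assumes header names were already lower-cased (HTTP names are
    case-insensitive, plain dicts are not).
    """
    match = _TRACEPARENT_RE.match(headers.get("traceparent", "").strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid in the spec
    return PropagatedContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_context(parent: PropagatedContext) -> PropagatedContext:
    """New span in the same trace: keep the trace ID, mint a fresh span ID."""
    return PropagatedContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

The application code in between never touches the header; only the client and server halves of the RPC layer do, which is the division of labor the comment describes.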
       | 
       | It is the same as using something like Docker containers on a
       | serverless platform. You really don't need to know that your code
       | is actually being deployed in Kubernetes. Using the common Docker
       | interface is what matters.
        
         | chipdart wrote:
         | > IMO this boils down to how one gets paid to understand or
         | misunderstand something.
         | 
         | I completely agree. The most charitable interpretation of this
         | blog post is that the blogger genuinely fails to understand the
         | basics of the problem domain; in the worst case, they are trying
         | to shitpost away the need for features that are well supported
         | by a community-driven standard like OpenTelemetry.
        
         | serverlessmom wrote:
         | I think that a number of Observability providers are looking at
         | how they can add features and value to the parts of monitoring
         | that OTel effectively commoditizes. I'm thinking of the
         | tail-based sampling Honeycomb implemented for APM, or the
         | synthetic monitoring built by my own team at Checkly.
         | 
         | "In 2015 Armin and I built a spec for Distributed Tracing. Its
         | not a hard problem, it just requires an immense amount of
         | coordination and effort." This to me feels like a nice glass of
         | orange juice after brushing my teeth. The spec on DT is very
         | easy, but the implementation is very, very hard. The fact that
         | OTel has nurtured a vast array of libraries to aid in context
         | propagation is a huge achievement, and saying 'This would all
         | work fine if everyone everywhere adopted Sentry' is...
         | laughable.
         | 
         | Totally outside the O11y space, OTel context propagation is an
         | intensely useful feature because of how widespread it is. See
         | Signadot implementing their smart test routing with
         | OpenTelemetry:
         | https://www.signadot.com/blog/scaling-environments-with-open...
        
         | zeeg wrote:
         | Author here.
         | 
         | Y'all realize we'd just make more money if everyone had better
         | instrumentation and we could spend less time on it and more
         | time on the product, right?
         | 
         | There is no conspiracy. It's simple math and reasoning. We
         | don't compete with most OTel consumers.
         | 
         | I don't know how you could read what I posted and think Sentry
         | believes OTel is a threat, especially given that we just
         | migrated our JS SDK to run on it.
        
       | epgui wrote:
       | Anyone else finding this very difficult to read? I'd really
       | recommend feeding this through a grammar checker, because poor
       | grammar betrays unclear thinking.
        
         | zeeg wrote:
         | So you're saying it makes my thinking more clear? :)
         | 
         | This is what happens when you use a tool designed for authoring
         | code to also author content.
        
           | kaashif wrote:
           | "betrays" means to expose, to be evidence of, particularly
           | unintentionally.
           | 
           | i.e. "poor grammar unintentionally exposed unclear thinking"
        
       | markl42 wrote:
       | At the risk of hijacking the comments, I've been trying to use
       | OTel recently to debug the performance of a complex webpage with
       | lots of async sibling spans, and I'm finding it very, very
       | difficult to identify the critical path / bottlenecks.
       | 
       | There are no causal relationships between sibling spans. I think
       | in theory "span links" solve this, but afaict this is not a
       | widely used feature in SDKs or UI viewers.
       | 
       | (I wrote about this here: https://github.com/open-telemetry/opentelemetry-specificatio...)
        
         | diurnalist wrote:
         | I don't believe this is a solved problem; it's been around
         | since the OpenTracing days[0]. I don't think Span links, as
         | they are currently defined, would be the best place to do
         | this, but maybe Span links will be extended to support this in
         | the future. Right now Span links are mostly used to correlate
         | spans causally _across different traces_, whereas, as you point
         | out, there are cases where you want correlation _within a trace_.
         | 
         | [0]: https://github.com/opentracing/specification/issues/142
        
         | hinkley wrote:
         | I was underwhelmed by the max size for spans before they get
         | rejected. Our app was about an order of magnitude too complex
         | for OTEL to handle.
         | 
         | Reworking our code to support spans made our stack traces
         | harder to read and in the end we turned the whole thing off
         | anyway. Worse than doing nothing.
        
       | noname120 wrote:
       | tl;dr OpenTelemetry eats Sentry's cake by commoditizing what they
       | do and the reaction of the founder of Sentry is to be very upset
       | about it rather than innovating.
        
       | wvh wrote:
       | I have surveyed this landscape for a number of years, though I'm
       | not involved enough to have strong opinions. We're running a lot
       | of the Prometheus ecosystem and even some OpenTelemetry stacks
       | across customers. OpenTelemetry does seem like one of those
       | projects with an ever-expanding scope. That makes it hard to
       | integrate only the parts you like and keep things lightweight,
       | both computationally and mentally, without going all-in.
       | 
       | It's no longer about "hey, we'll include this little library or
       | protocol instead of rolling our own, so we can hope to be
       | compatible with a bunch of other industry-standard software." It's
       | a large stack with an ever-evolving spec. You have to develop
       | your applications and infrastructure around it. It's very
       | seductive to roll your own simpler solution.
       | 
       | I appreciate it's not easy to build industry-wide consensus
       | across vendors, platforms and programming languages. But be
       | careful with projects that fail to capture developer mindshare.
        
       | EdSchouten wrote:
       | > Its not a hard problem, [...]. At its core its structured
       | events that carry two GUIDs along with them: a trace ID and a
       | parent event ID. It is just building a tree.
       | 
       | I've always wondered, what's the point of the trace ID? What even
       | is a trace?
       | 
       | - It could be a single database query that's invoked on a
       | distributed database, giving you information about everything
       | that went on inside the cluster processing that query.
       | 
       | - Or it could be all database calls made by a single page request
       | on a web server.
       | 
       | - Or it could be a collection of page requests made by a single
       | user as part of a shopping checkout process. Each page request
       | could make many outgoing database calls.
       | 
       | Which of these three you should choose merely depends on what you
       | want to visualize at a given point in time. My hope is that at
       | some point we get a standard for tracing that does away with the
       | notion of trace IDs. Just treat everything going on in the
       | universe as a graph of inter-connected events.
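The "two GUIDs, just building a tree" description quoted above can be sketched as follows, assuming Python and a hypothetical `Event` record carrying its own ID, a trace ID, and a parent ID. The trace ID serves only as the grouping key; an event whose parent never arrived is promoted to a root, so a partially lost trace still renders.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    trace_id: str
    parent_id: str | None  # None marks a root event
    name: str


def build_trees(events: list[Event]) -> dict[str, list[tuple[str, int]]]:
    """Group events by trace ID, then walk parent links depth-first.

    Returns {trace_id: [(name, depth), ...]} in pre-order. An event whose
    parent is missing from its trace (e.g. lost data) is treated as a root.
    """
    by_trace: dict[str, list[Event]] = defaultdict(list)
    for ev in events:
        by_trace[ev.trace_id].append(ev)

    trees: dict[str, list[tuple[str, int]]] = {}
    for trace_id, evs in by_trace.items():
        known = {ev.event_id for ev in evs}
        children: dict[str | None, list[Event]] = defaultdict(list)
        for ev in evs:
            parent = ev.parent_id if ev.parent_id in known else None
            children[parent].append(ev)

        out: list[tuple[str, int]] = []
        # Iterative DFS; reversed() keeps siblings in arrival order.
        stack = [(ev, 0) for ev in reversed(children[None])]
        while stack:
            ev, depth = stack.pop()
            out.append((ev.name, depth))
            stack.extend((c, depth + 1) for c in reversed(children[ev.event_id]))
        trees[trace_id] = out
    return trees
```

Dropping `trace_id` and grouping by connected components instead is possible, but every query would then have to walk the graph first, which is the aggregation/query cost the reply below alludes to.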
        
         | remram wrote:
         | I think they meant "an event ID and a parent event ID".
        
           | zeeg wrote:
           | I actually meant trace ID and parent event ID (the event's own
           | ID was implied). The parent comment is correct in that the
           | trace ID isn't technically needed, and it is in fact quite
           | controversial. It's an implementation-level protocol
           | optimization, though, and unfortunately not an objective one. It
           | creates an arbitrary grouping of these annotations, which is
           | entirely subjective and which the spec struggles to reconcile,
           | but it exists primarily because aggregating and/or querying
           | them would be far more difficult if you didn't keep that simple
           | GUID.
           | 
           | It does have one positive benefit beyond that. If you lose
           | data, or have disparate systems, it's pretty easy to keep the
           | trace ID intact and still have better instrumentation than
           | otherwise.
        
       | serverlessmom wrote:
       | An argument that OpenTelemetry is somehow 'too big' is an example
       | of motivated reasoning. I can understand that A Guy Who Makes
       | Money If You Use Sentry dislikes that people are using OTel
       | libraries to solve similar problems.
       | 
       | Context propagation and distributed tracing are cool OTel
       | features! But they are not the only thing OTel should be doing.
       | OpenTelemetry instrumentation libraries can do a lot on their
       | own; a friend of mine made massive savings in compute efficiency
       | with the NodeJS OTel library:
       | https://www.checklyhq.com/blog/coralogix-and-opentelemetry-o...
        
         | zeeg wrote:
         | Author here.
         | 
         | OpenTelemetry is not a competitor to us (for the most part it
         | doesn't do what we do), and we specifically want to see the open
         | tracing goals succeed.
         | 
         | I was pretty clear about that in the post though.
        
           | serverlessmom wrote:
           | I think that it's disingenuous to say OpenTelemetry and
           | Sentry aren't in competition. I think it would be good news
           | for Sentry if DT were split from the project, and
           | instrumentation and performance monitoring weren't
           | commoditized by broad adoption of those parts of the
           | OpenTelemetry project.
           | 
           | I think you, the author, stand to benefit directly from a
           | breakup of OpenTelemetry, and a refusal to acknowledge your
           | own bias is problematic when your piece starts with a request
           | to 'look objectively.'
        
             | zeeg wrote:
             | We just rewrote our most heavily used SDK to run on top of
             | OTel. What do we gain from it failing?
             | 
             | We also make most of our revenue from errors which don't
             | have an open protocol implementation outside of our own.
        
       | codereflection wrote:
       | I understand what the author is saying, but vendor lock-in with
       | closed-source observability platforms is a significant challenge,
       | especially for large organizations. When you instrument hundreds
       | or thousands of applications with a specific tool, like the
       | Datadog Agent, disentangling from that tool becomes nearly
       | impossible without a massive investment of engineering time. In
       | the Platform Engineering professional services space, we see this
       | problem frequently. Enterprises are growing tired of big
       | observability platform lock-in, for example the opacity of
       | what they are spending on Datadog's products.
       | 
       | One of the promises of OTEL is that it allows organizations to
       | replace vendor-specific agents with OTEL collectors, keeping
       | flexibility in the choice of end observability platform. When
       | used with an observability pipeline (such as EdgeDelta or Cribl),
       | you can reprocess collected telemetry data and send it to another
       | platform, like Splunk, if needed. Consequently, switching from one
       | observability platform to another becomes a bit less of a
       | headache. Ironically, even Splunk recognizes this and has put
       | substantial support behind the OTEL standard.
       | 
       | OTEL is far from perfect, and maybe some of these goals are a bit
       | lofty, but I can say that many large organizations are adopting
       | OTEL for these reasons.
        
         | zeeg wrote:
         | I totally agree; I just wish we could do it in a way that
         | doesn't try to lump every problem into the same bucket. I don't
         | see what that achieves, personally, and I think it's limiting
         | the ability of the project's original goals to be as
         | successful as they could be.
        
         | andrewmcwatters wrote:
         | Yeah, it's the primary reason we used it. If OpenTelemetry's
         | raison d'etre were simply to give Datadog a reason not to
         | bullshit their customers on pricing, it would fulfill a major
         | need in platform services.
        
       | zellyn wrote:
       | Are they basically just saying that the OpenTelemetry client APIs
       | should be split from the rest of the pieces of the project, and
       | versioned super conservatively?
       | 
       | The simple API they describe is basically there in OTel. The API
       | is larger, because it also does quite a few other things
       | (personally, I think (W3C) Baggage is important too), but as a
       | library author I should need only the client APIs to write to.
       | 
       | When implementing, you're free to plug in Providers that use
       | OpenTelemetry-provided plumbing, but you can equally well plug in
       | Providers from Datadog or Sentry or whatever.
       | 
       | Unless I'm missing something, any further complaints could be
       | solved by making sure the Client APIs (almost) never have
       | backward-incompatible changes, and are versioned separately.
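A rough sketch of the API/implementation split described above, in Python with hypothetical names (`Tracer`, `span`, `install_tracer`, `NoOpTracer`, `RecordingTracer`): libraries call a tiny stable API that defaults to a no-op, and only the application installs a concrete backend. This mirrors the general shape of OTel's API/SDK separation, but it is not OTel's actual interface.

```python
from __future__ import annotations

from contextlib import contextmanager
from typing import Iterator, Protocol


class Tracer(Protocol):
    """The small, stable surface libraries code against (hypothetical)."""

    def start_span(self, name: str) -> None: ...
    def end_span(self, name: str) -> None: ...


class NoOpTracer:
    """Default used until the application installs a real backend."""

    def start_span(self, name: str) -> None:
        pass

    def end_span(self, name: str) -> None:
        pass


_tracer: Tracer = NoOpTracer()


def install_tracer(tracer: Tracer) -> None:
    """Called once by the application, never by libraries."""
    global _tracer
    _tracer = tracer


@contextmanager
def span(name: str) -> Iterator[None]:
    _tracer.start_span(name)
    try:
        yield
    finally:
        _tracer.end_span(name)


# Library code: depends only on span(), not on any vendor.
def fetch_user(user_id: int) -> dict:
    with span("fetch_user"):
        return {"id": user_id}


# Application code: a stand-in for a vendor- or OTel-backed implementation.
class RecordingTracer:
    def __init__(self) -> None:
        self.calls: list[str] = []

    def start_span(self, name: str) -> None:
        self.calls.append(f"start {name}")

    def end_span(self, name: str) -> None:
        self.calls.append(f"end {name}")
```

Because the default is a no-op, library code works unchanged whether or not the application ever installs a backend, and only the `span`/`Tracer` surface needs the conservative versioning the comment asks for.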
        
         | zeeg wrote:
           | It's a bit deeper than that. The SDKs that library authors
           | implement need to be extremely minimal. The collection
           | libraries that vendors implement on top of them should, imo,
           | also be minimal.
           | 
           | OTLP imo doesn't even need to be part of the spec.
           | 
           | But minimal would also mean focusing on solving fewer problems
           | as a whole, e.g. OpenTracing plus OpenMetrics plus OpenLogs. I
           | only need one of those things.
        
           | arccy wrote:
           | that just sounds like a branding problem though...
           | 
           | OTLP has been quite useful especially in metrics to get a
           | format that doesn't really have any sacrifices/limitations
           | compared to all the other protocols.
        
             | zeeg wrote:
             | It is! But to prove your point, OTLP is actually just the
             | transport protocol (the OpenTelemetry Protocol). It's one of
             | _so many things_ it's trying to address. All of those things
             | might be problems, but not everyone has those same problems
             | (vendors, customers, and lib authors), and bundling them all
             | into one umbrella just screams for me.
             | 
             | I actually have no need for a standard metrics
             | implementation, just as an example. I never have, and I'd
             | argue Sentry (as a tech company) never has. We built our
             | own abstraction and/or used a library. That doesn't mean
             | others don't, and it doesn't mean it shouldn't be something
             | people solve, but bundling "all telemetry problems" into
             | one giant design committee is a fundamental misstep imo.
        
       | PeterZaitsev wrote:
       | OpenTelemetry is interesting. On one side it is designed as the
       | "commodity feeder" to a number of proprietary backends such as
       | Datadog; on the other hand, we see good development of open source
       | solutions such as SigNoz and Coroot with good OTel support.
        
       | spullara wrote:
       | There is a huge hole in using spans as they are specified.
       | Without separating the start of a span from the end of a span, you
       | can never see things that never complete, fail hard enough to not
       | close the span, or travel through queues. This is a compromise
       | they made because typical storage systems for tracing aren't
       | really good enough to stitch them all back together quickly.
       | Everyone should be sending events and stitching them all together
       | to create the view. But instead we get a lowest-common-denominator
       | solution.
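A minimal sketch of the event-based model suggested above, assuming Python and a hypothetical `SpanEvent` shape in which the start and end of a span are sent as separate events (in OTel, a span is typically exported as a single record once it ends). The stitcher pairs events by span ID regardless of arrival order and reports spans that started but never finished, instead of losing them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SpanEvent:
    span_id: str
    kind: str        # "start" or "end" (hypothetical event shape)
    ts: float        # seconds since some epoch
    name: str = ""   # only carried on "start" events here


def stitch(events: list[SpanEvent], now: float) -> dict[str, dict]:
    """Pair start/end events by span ID, whatever order they arrive in.

    Spans that started but never ended are kept and marked "open" with
    their age so far, instead of silently disappearing.
    """
    spans: dict[str, dict] = {}
    for ev in events:
        s = spans.setdefault(ev.span_id, {"name": "", "start": None, "end": None})
        if ev.kind == "start":
            s["start"], s["name"] = ev.ts, ev.name
        elif ev.kind == "end":
            s["end"] = ev.ts

    for s in spans.values():
        if s["start"] is None:
            s["status"] = "end-only"   # start event lost or not yet arrived
        elif s["end"] is None:
            s["status"] = "open"       # hung, crashed, or still in a queue
            s["age"] = now - s["start"]
        else:
            s["status"] = "closed"
            s["duration"] = s["end"] - s["start"]
    return spans
```

The cost this shifts onto the backend is exactly the one the comment names: storage has to hold unmatched halves and join them later, rather than receiving one complete record per span.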
        
       | fractalwrench wrote:
       | The main interest I've seen in OTel from Android engineers has
       | been driven by concerns around vendor lock-in. Backend/devops in
       | their organisations are typically using OTel tooling already &
       | want to see all telemetry in one place.
       | 
       | From this perspective it doesn't matter if the OTel SDK comes
       | bundled with a bunch of unnecessary code or version conflicts as
       | is suggested in the article. The whole point is to regain control
       | over telemetry & avoid paying $$$ to an ambivalent vendor.
       | 
       | FWIW, I don't think the OTel implementation for mobile is perfect
       | - a lot of the code was originally written with backend JVM apps
       | in mind & that can cause friction. However, I'm fairly optimistic
       | those pain points will get fixed as more folks converge on this
       | standard.
       | 
       | Disclaimer: I work at a Sentry competitor
        
       | AndreasBackx wrote:
       | I have been trying to find an equivalent of Rust's `tracing`,
       | first in Python and this week in TypeScript/JavaScript. At work I
       | created an internal post called "Better Python Logging? Tracing
       | for Python?" that basically asks this question. OpenTelemetry was
       | one of the things I looked at, and since then I have looked at
       | other tooling.
       | 
       | It is hard to explain how convenient `tracing` is in Rust and why
       | I sorely miss it elsewhere. The simple part of adding context to
       | logs can be solved in a myriad of ways, yet all boil down to a
       | similar "span-like" approach. I'm very interested in helping
       | bring what `tracing` offers to other programming communities.
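For readers unfamiliar with Rust's `tracing`, the core idea is that entering a span attaches structured fields to everything logged inside it. Below is a rough Python approximation using only the standard library (`contextvars` plus a `logging.Filter`); `span` and `SpanFieldsFilter` are hypothetical helpers, not an existing package, and this covers only the log-context part of what `tracing` does.

```python
from __future__ import annotations

import contextvars
import logging
from contextlib import contextmanager
from typing import Any, Iterator

# Fields of every span currently entered in this context, outermost first.
_span_fields: contextvars.ContextVar[tuple[tuple[str, Any], ...]] = (
    contextvars.ContextVar("span_fields", default=())
)


@contextmanager
def span(**fields: Any) -> Iterator[None]:
    """Enter a 'span': its fields apply to every log record inside it."""
    token = _span_fields.set(_span_fields.get() + tuple(fields.items()))
    try:
        yield
    finally:
        _span_fields.reset(token)


class SpanFieldsFilter(logging.Filter):
    """Attach the current span fields to each record as `record.span`."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.span = " ".join(f"{k}={v}" for k, v in _span_fields.get())
        return True
```

With a handler whose format string includes `%(span)s` and which has `SpanFieldsFilter()` attached, a record logged inside `with span(request_id="r1"):` renders with `request_id=r1`. Because `contextvars` values are per-context, fields set in one asyncio task do not leak into concurrently running tasks.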
       | 
       | It very likely is worth having some people from the space
       | involved, possibly from the tracing crate itself.
        
         | zeeg wrote:
         | We'll fund solving this as long as the committees agree with
         | the goal. We just want standard tracing implementations.
         | 
         | (Speaking on behalf of Sentry)
        
       | crabbone wrote:
       | I've heard about OpenTelemetry before, but I could never
       | understand what it's for.
       | 
       | Can anyone with more knowledge enlighten me? Why is Prometheus
       | not enough? From reading OpenTelemetry's website, I can see no
       | obvious benefits of using it (if I already use Prometheus).
       | 
       | Is it trying to be somehow more generic than Prometheus'
       | instrumentation? Sort of like an ORM might try to be more
       | generic than a particular database?
       | 
       | Also, being certified as "cloud native", in my experience, has
       | always been a sort of scam. So, when I see that, I tend to think
       | negatively about a project. Maybe that's a distraction though.
       | 
       | Also, in their documentation, they use "tracing" in some weird
       | way I cannot quite reconcile with the way I've learned to use
       | this word (e.g. by using "strace" on Linux). They must mean
       | something else, or do they?
       | 
       | ----
       | 
       | OP reads as a parody of Donald Trump for some reason. It's not
       | the most pleasant style to read. Of course, I'm not an authority
       | on writing styles. Just mentioning this because it was quite a
       | distraction.
        
         | remram wrote:
         | https://opentelemetry.io/docs/concepts/signals/traces/
        
       | ris wrote:
       | 1. The main reason I want to use otel is so I can have one
       | sidecar for my observability, not three, each with subtly
       | different quirks and expectations. (also the associated
       | collection/aggregation infrastructure)
       | 
       | 2. I honestly think the main reason otel appears so complex is
       | the existing resources that attempt to explain the various
       | concepts around it do a poor job and are very hand-wavey. You
       | know the main thing that made otel "click" for me? Reading the
       | protobuf specs. Literally nothing else explained succinctly the
       | relationships between the different types of structure and what
       | the possibilities with each were.
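For anyone taking the same route, the nesting in the OTLP trace protobuf is roughly `TracesData -> ResourceSpans -> ScopeSpans -> Span`: a resource (such as a service) owns instrumentation scopes (such as a library), which own spans. The Python dataclasses below mirror that shape with most fields omitted and attributes simplified to string maps (the real protocol uses typed `KeyValue` lists), and `spans_per_service` is a hypothetical helper; treat this as an illustration, not a binding.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: bytes          # 16 bytes
    span_id: bytes           # 8 bytes
    parent_span_id: bytes    # empty for a root span
    name: str
    start_time_unix_nano: int
    end_time_unix_nano: int
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class InstrumentationScope:
    name: str                # e.g. the instrumenting library
    version: str = ""


@dataclass
class ScopeSpans:
    scope: InstrumentationScope
    spans: list[Span] = field(default_factory=list)


@dataclass
class Resource:
    attributes: dict[str, str] = field(default_factory=dict)  # e.g. service.name


@dataclass
class ResourceSpans:
    resource: Resource
    scope_spans: list[ScopeSpans] = field(default_factory=list)


@dataclass
class TracesData:
    resource_spans: list[ResourceSpans] = field(default_factory=list)


def spans_per_service(data: TracesData) -> dict[str, int]:
    """Walk the nesting: resource -> scopes -> spans, keyed by service.name."""
    counts: dict[str, int] = {}
    for rs in data.resource_spans:
        service = rs.resource.attributes.get("service.name", "unknown")
        n = sum(len(ss.spans) for ss in rs.scope_spans)
        counts[service] = counts.get(service, 0) + n
    return counts
```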
        
       | esafak wrote:
       | This caught my eye:
       | 
       | > Logs are just events - which is exactly what a span is, btw -
       | and metrics are just abstractions out of those event properties.
       | That is, you want to know the response time of an API endpoint?
       | You don't rewind 20 years and increment a counter, you instead
       | aggregate the duration of the relevant span segment. Somehow
       | though, Logs and Metrics are still front and center.
       | 
       | Is anyone replacing logs and metrics with traces?
        
         | zeeg wrote:
         | imo Honeycomb pioneered this, and it's the right baseline. There
         | are limitations to it, of course, and it's certainly been done
         | before at BigCos that can afford to build the tech, but it's
         | extremely powerful.
         | 
         | The main argument for metrics beyond traces is simply a
         | technology implementation detail: it's aggregation because you
         | can't store the raw events. That doesn't mean, though, that you
         | need a new abstraction on top of those metrics. They're still
         | just questions you're asking of the events in the system, and
         | most systems are debuggable by aggregating data points of spans
         | or other telemetry.
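A minimal sketch of "metrics as aggregations over spans", assuming Python and that raw span durations are retained (the hypothetical `SpanRecord` type stands in for whatever your store returns). Response-time percentiles per endpoint are computed on demand with the nearest-rank method instead of being pre-counted; pre-aggregated metrics remain the fallback when storing every raw event is too expensive, which is the trade-off described above.

```python
from __future__ import annotations

import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanRecord:
    name: str          # e.g. "GET /api/users"
    duration_ms: float


def percentile(sorted_vals: list[float], p: float) -> float:
    """Nearest-rank percentile (p in (0, 100]) of an ascending, non-empty list."""
    rank = math.ceil(p / 100 * len(sorted_vals))
    return sorted_vals[max(rank, 1) - 1]


def latency_by_endpoint(spans: list[SpanRecord]) -> dict[str, dict[str, float]]:
    """Derive 'metrics' on demand by aggregating raw span durations."""
    groups: dict[str, list[float]] = defaultdict(list)
    for s in spans:
        groups[s.name].append(s.duration_ms)

    out: dict[str, dict[str, float]] = {}
    for name, vals in groups.items():
        vals.sort()
        out[name] = {
            "count": len(vals),
            "p50": percentile(vals, 50),
            "p95": percentile(vals, 95),
            "max": vals[-1],
        }
    return out
```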
         | 
         | As for logs, they're important for some kinds of workloads, but
         | for the majority of companies I don't think they're the best
         | solution to the problem. You might need them for auditability,
         | but it's quite difficult to find a case where logs are the
         | solution to debugging a problem if you had span annotations.
        
       | dan-allen wrote:
       | I keep checking in on OpenTelemetry every few months to see if
       | the bits we need are stable yet. There's been very little
       | progress on the things we're waiting for.
       | 
       | I don't follow closely enough to comment on possible causes.
       | 
       | What I do know is that the surface area of code and
       | infrastructure that telemetry touches means adopting something
       | unfinished is a big leap of faith.
        
       ___________________________________________________________________
       (page generated 2024-06-14 23:02 UTC)