[HN Gopher] The problem with OpenTelemetry
___________________________________________________________________
The problem with OpenTelemetry
Author : robgering
Score : 134 points
Date : 2024-06-14 12:36 UTC (10 hours ago)
(HTM) web link (cra.mr)
(TXT) w3m dump (cra.mr)
| NeutralForest wrote:
| It resonates. As an intern I had to add OTEL to a Python project
| and I had to spend a lot of time in the docs to understand the
| concepts and implementation. Also, the Python impl has a lot of
| global state that makes it hard to use properly imo.
| zaphar wrote:
| Tracing requires keeping mappings for tracing identifiers per
| request. I don't know how you do that without global state unless
| you want the tracing identifiers to pollute your own internal
| apis everywhere.
| bigblind wrote:
| Many frameworks have the idea of a "context" for this, which
| holds per-request state, following your request through the
| system. Functions that don't care about the context just pass
| it on to whatever they call.
|
| I think Go was smart to make this concept part of the
| standard library, as it encouraged frameworks to adopt it as
| well.
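[Editor's note: Python's stdlib has an analogous mechanism in contextvars, which the OTel Python SDK itself builds on for the current span. A minimal, dependency-free sketch of per-request context, with illustrative names, assuming a single logical request per call:]

```python
# Sketch: per-request context propagation with stdlib contextvars,
# analogous to Go's context.Context. Names here are illustrative.
from contextvars import ContextVar

current_trace_id = ContextVar("current_trace_id", default=None)

def handle_request(trace_id):
    token = current_trace_id.set(trace_id)  # bind for this logical request
    try:
        return do_work()  # nothing below needs the id passed explicitly
    finally:
        current_trace_id.reset(token)       # restore on the way out

def do_work():
    # any function on the call path can read the ambient trace id
    return f"handled under trace {current_trace_id.get()}"

print(handle_request("abc123"))  # -> handled under trace abc123
```

Because contextvars are copied per asyncio task, this stays correct under concurrent requests without threading the id through every signature.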
| NeutralForest wrote:
| I understand that but if you look at the Python
| implementation (or at least as it was 1-2 years ago), you
| have a lot of god objects that hack __new__ which leads to
| hidden flows when you create new instances of tracers for
| example. I'm not saying I have a better idea but when you put
| that together with the docs and the (at the time) very bare
| examples, it's just annoying.
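[Editor's note: the "global state" being criticized is the process-wide provider pattern: the API installs one global tracer provider and everything reaches for it implicitly. A dependency-free sketch of that shape; the class and function names here are made up for the example, not the real SDK:]

```python
# Illustrative sketch of the global-provider pattern the OTel Python
# SDK uses; names are invented for the example.
class NoOpTracer:
    def start_span(self, name):
        return f"noop:{name}"

class SdkTracer:
    def start_span(self, name):
        return f"recorded:{name}"

_provider = NoOpTracer()          # module-level global: the API default

def set_tracer_provider(p):
    global _provider              # mutating hidden global state
    _provider = p

def get_tracer():
    return _provider              # every caller implicitly reads the global

# Library code calls get_tracer() and silently no-ops...
assert get_tracer().start_span("load-user") == "noop:load-user"
# ...until an application entry point swaps in the real implementation.
set_tracer_provider(SdkTracer())
assert get_tracer().start_span("load-user") == "recorded:load-user"
```

The upside is that libraries instrument without knowing which SDK is installed; the downside is exactly the hidden flow the parent comment describes.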
| chipdart wrote:
| > As an intern I had to ${DO_SOME_PROJECT} and I had to spend a
| lot of time in the docs to understand the concepts and
| implementation
|
| That sounds like every single run-of-the-mill internship.
| NeutralForest wrote:
| That's fair, but I'll say that the time and number of concepts
| you have to deal with before getting into the code, per the
| docs, is quite large, and I think the critique in the article
| is warranted.
| BiteCode_dev wrote:
| 100% agree.
|
| Every time I tried to use OT I was reading the doc and whispering
| "but, why? I only need...".
| Karrot_Kream wrote:
| Yeah, I was going down this path for a side project I was
| getting going, and spent a couple of days of after-work time
| exploring how to get just some basic traces in OT before
| realizing it was much more than I needed or cared about.
| antonyt wrote:
| OTel is flawed for sure, but I don't understand the stance
| against metrics and logs. Traces are inherently sampled unless
| you're lighting all your money on fire, or operating at so small
| a scale that these decisions have no real impact. There are kinds
| of metrics and logs which you always want to emit because they're
| mission-critical in some way. Is this a Sentry-specific thing?
| Does it just collapse these three kinds of information into a
| single thing called a "trace"?
| aleph_minus_one wrote:
| > OTel is flawed for sure, but I don't understand the stance
| against metrics and logs.
|
| Even if you don't want to consider the privacy concerns:
| _telemetry wastes quite a bit of your internet connection's
| data_.
| spenczar5 wrote:
| Client-side transport is pretty unusual with OTel. I think
| almost everybody is sending things from the server side, so I
| don't think your concern is usually relevant.
| cogman10 wrote:
| Hey, this isn't the sort of telemetry we are talking about
| with OTel.
|
| About the only "privacy concern" with otel is that you are
| probably shipping traces/metrics to a cloud provider for your
| internal applications. This isn't the sort of telemetry
| getting baked into Microsoft or Google products to try to
| identify personal aspects of individuals; this is data that
| tells you "Foo app is taking 300ms serving /bar which is
| unusual".
| cdelsolar wrote:
| After I added OTel to an open source project I run, I spent
| a bit of time arguing with someone about telemetry - they
| kept saying they didn't opt in and that we need to inform
| our users about it, etc., and I kept saying no, that's not
| the same type of telemetry. I wonder how common this
| misconception is.
| cogman10 wrote:
| This is the second time I've seen this misconception come
| up in HN and I've definitely seen it in Reddit at least
| once.
| remram wrote:
| OpenTracing was a much clearer name, especially for those
| of us who really don't care about doing logging or
| metrics through OTel.
| wdb wrote:
| I think you are more talking about RUM which isn't yet
| supported by OpenTelemetry. I think they are working on it.
|
| I am not sure if it will support session replays like some
| vendors such as Sentry or New Relic offer. Technically, I
| think session replays (rrweb etc.) are pretty cool, but as a
| web visitor I am not a fan.
| Dextro wrote:
| I mean, when you're the one selling the gas to light that
| money on fire, you have a vested interest in keeping it that
| way, right?
|
| I do agree that logging and spans are very similar, but I
| disagree that logs are just spans because they aren't exactly
| the same.
|
| I also agree that you can collect all metrics from spans and,
| in fact, it might be a better way to tackle it. But it's just
| not feasible monetarily, so you do need to have some sort of
| collection step closer to the metric producers.
|
| What I do agree with is that the terminology and the
| implementation of OTEL's SDK is incredibly confusing and hard
| to implement/keep up to date. I spent way too many hours of my
| career struggling with conflicting versions of OTEL so I know
| the pain and I desperately wish they would at least take to
| heart the idea of separating implementation from API.
| zeeg wrote:
| Food for thought: the subjective nature of both of those is
| exactly why it shouldn't be bundled.
| the_mitsuhiko wrote:
| > Traces are inherently sampled unless you're lighting all your
| money on fire
|
| You can burn a lot of money with logs and metrics too. The
| question is how much value you get for the money you throw on
| the burning pile of monitoring. My personal belief is that well
| instrumented distributed tracing is more actionable than logs
| and metrics. Even if sampled.
|
| (Disclaimer: I work at sentry)
| Jemaclus wrote:
| I actually take the opposite approach. In my experience, well
| instrumented metrics and finely tuned logs are more
| actionable than distributed traces! Interesting how that
| works out.
| the_mitsuhiko wrote:
| I believe on the infrastructure side that might be correct.
| Within applications that doesn't match my experience. In
| many cases the concurrent nature of servers makes it
| impossible to repro issues and narrow down the problem
| without tracing or trace aware logs.
| aserafini wrote:
| With only sampled traces, though, it's very hard to understand
| the impact of the problem. There are some bad traces, but is
| it affecting 5%, 10%, or 90% of your customers? Metrics shine
| there.
| remram wrote:
| Whether it is affecting 5% or 10% of your customers, if
| it is erroring at that rate you are going to want to find
| the root cause ASAP. Traces let you do that, whereas the
| precise number does nothing. I am a big supporter of
| metrics but I don't see this as the use case at all.
| dboreham wrote:
| I've used Otel quite a bit (in JVM systems) and honestly didn't
| know it did more than tracing.
|
| That said, I think this rot comes from the commercial side of the
| sector -- if you're a successful startup with one product (e.g.
| graphing counters), then your investors are going to start
| beating you up about why don't you expand into other adjacent
| product areas (e.g. tracing). Repeat previous sentence reversed.
| And so you get Grafana, New Relic, et al. OpenTelemetry is just
| mirroring that arrangement.
| cogman10 wrote:
| Perhaps the real problem with OTel (IMO) is it's trying to be
| everything for everyone and every language. It's trying to have a
| common interface so that you can write OTel in Java or
| Javascript, python or rust, and you basically have the exact same
| API.
|
| I suspect OP is seeing this directly when talking about the
| kludginess of the Javascript API.
| syngrog66 wrote:
| Up my alley. I'm the author of a FOSS Golang span instrumentation
| library for latency (LatLearn in my GitHub.) And part of the team
| that back in 2006/2007 made an in-house distributed tracing
| solution for Orbitz.
| doctorpangloss wrote:
| I don't know what the Sentry guy is really saying - I mean you
| can write whatever code you want, go for it man.
|
| But I do have to "pip uninstall sentry-sdk" in my Dockerfile
| because it clashes with something I didn't author. And anyway,
| because it is completely open source, the flaws in OpenTelemetry
| for my particular use case took an hour to surmount, and vitally,
| I didn't have to pay the brain damage cost most developers hate:
| relationships with yet another vendor.
|
| That said I appreciate all the innovation in this space, from
| both Sentry and OpenTelemetry. The metrics will become the
| standard, and that's great.
|
| The problem with Not OpenTelemetry: eventually everyone is going
| to learn how to use Kubernetes, and the USP of many startup
| offerings will vanish. OpenTelemetry and its feature scope creep
| make perfect sense for people who know Kubernetes. Then it makes
| sense why you have a wire protocol, why abstraction for vendors
| is redundant or meaningless toil, and why PostHog and others stop
| supporting Kubernetes: it competes with their paid offering.
| MapleWalnut wrote:
| The Sentry SDK is open source and easy to contribute to in my
| experience.
| hahn-kev wrote:
| Yeah but who wants to contribute to an SDK for a service that
| you need to pay for? That would be like if Oracle DB was open
| to contribution
| MapleWalnut wrote:
| Sentry provides a great hosted service. You can self host
| if you like, but it's nicer to let them do it
| tnolet wrote:
| Sentry is self hostable. https://develop.sentry.dev/self-hosted/
| brunoqc wrote:
| But not foss. It's using the BSL or FSL or whatever.
| riedel wrote:
| Although I do not like those licences, I would not care so
| much about the 2 yrs until it goes FOSS. Before all this rush
| of development, RRDTool and OpenTSDB were so slow; this whole
| thing seems rather ideological than substantial criticism.
| Going down the licence rabbit hole to criticise the original
| argument seems like a classic strawman.
| brunoqc wrote:
| I was supporting a variation in my head of the "Yeah but
| who wants to contribute to an SDK for a service that you
| need to pay for?" claim.
|
| You can self-host for free, so maybe @hahn-kev doesn't mind
| contributing to the SDK now.
|
| For me, I refuse to contribute to an open-source SDK for
| a non-foss product. And I refuse to self-host a non-foss
| product.
|
| Personally, I don't care if non-foss licenses speed up
| development. So yeah, in my case it's ideological.
| ensignavenger wrote:
| https://glitchtip.com/ is an open-source fork of Sentry,
| created after they went closed source, if you are
| interested in something like that.
| zeeg wrote:
| Just want to say I appreciate your stance.
|
| (also no one should feel like they have to contribute to
| our SDKs, but please file a ticket if something's fucked
| up and we'll deal w/ it)
| nimih wrote:
| Sentry is _technically_ self-hostable, but they provide
| no deployment guidance beyond running the giant blob of
| services /microservices (including instances of postgres,
| redis, memcache, clickhouse, and kafka) as a single
| docker-compose thing. I get why they do this and think
| it's totally reasonable of them, but Sentry is a very
| complicated piece of software and takes substantially
| more work IME to both get up and running and maintain
| compared to other open-source self-hosted
| observability/monitoring/telemetry software I've had the
| pleasure of working with.
| miohtama wrote:
| Our Linux devops engineer, who had not used Sentry
| before, set up a self-hosted Sentry in a day.
| tapoxi wrote:
| Yeah, it works for a time, but they don't support on-
| premise versions and they don't offer a Helm chart
| install; it's all community based.
|
| I tried it for well over a year, and there are so many
| moving parts and so many "best guesses" from the
| community that we had to rip it out. There are a lot of
| components: sentry, sentry-relay, snuba, celery, redis,
| clickhouse, zookeeper (for clickhouse), kafka, zookeeper
| (for kafka), maybe even elasticsearch for good measure.
| It did work for a while, but there were so many moving
| parts requiring care and feeding that it would inevitably
| break down at some point.
|
| Problem is I can't ship data to their SaaS version
| because we have PHI and our contracts forbid it, even if
| scrubbed, so I had to settle on OTEL.
| tnolet wrote:
| Day 1 vs day 2. That's why the SaaS version exists.
| Spivak wrote:
| I've been using GlitchTip https://glitchtip.com with the
| Sentry SDKs and I couldn't be happier. Completely self-
| hosted, literally just the container and a db, requires
| zero attention.
| zitterbewegung wrote:
| Why have I heard only bad things about k8s? To the point where
| it's a meme to understand k8s...
| marcosdumay wrote:
| > eventually everyone is going to learn how to use Kubernetes
|
| That seems obviously true... yet, there are so many people out
| there that seem unable to learn it that I don't think it's a
| reliable prediction.
| politelemon wrote:
| > unable
|
| I wouldn't equate unwillingness or not needing it to
| inability to learn
| drewbug01 wrote:
| As a contributor to (and consumer of) OpenTelemetry, I think
| critique and feedback are most welcome - and sorely needed.
|
| But this ain't it. In the opening paragraphs the author dismisses
| the hardest parts of the problem (presumably because they are
| _human_ problems, which engineers tend to ignore), and betrays a
| complete lack of interest in understanding why things ended up
| this way. It also seems they've completely misunderstood the API
| /SDK split in its entirety - because they argue for having such a
| split. It's there - that's exactly what exists!
|
| And it goes on and on. I think it's fair to critique
| OpenTelemetry; it can be really confusing. The blog post is
| evidence of that, certainly. But really it just reads like
| someone who got frustrated that they didn't understand how
| something worked - and so instead of figuring it out, they've
| decided that it's just hot garbage. I wish I could say this was
| unusual amongst engineers, but it isn't.
| arccy wrote:
| indeed, it just sounds like they're complaining they don't have
| a seat at the table...
| klabb3 wrote:
| No dog in the fight here, but... if you're saying that one of the
| top guys at a major observability shop didn't understand Open
| Telemetry, then that's saying much more about OT than it does
| about his skills or efforts to understand. After all, his main
| point is that it's complex and overengineered, which is the key
| takeaway for curious bystanders like me, whether every detail
| is technically correct or not.
|
| > it just reads like someone who [...] didn't understand how
| something worked - and so instead of figuring it out, they've
| decided that it's just hot garbage.
|
| And what about average developers asked to "add telemetry" to
| their apps and libraries? Their patience will be much lower
| than that.
|
| Not necessarily defending the content (frankly it should have
| had more examples), but I relate to the sentiment. As a
| developer, I _need_ framework providers to make sane design
| decisions with minimal api surface, otherwise I'd rather build
| something bespoke or just not care.
| cdelsolar wrote:
| OTel is very easy to add. I've added it to several Go
| projects. For some frameworks like .NET you can do it
| automatically. The harder/more annoying part is setting up a
| viewer/collector like Jaeger. I've done that too but just in
| memory and it fills up quick.
| bbkane wrote:
| For my small scale projects, Openobserve.ai has been super
| helpful. It ships as a single binary and (in non h/a setup)
| saves traces/logs/metrics to disk. I just set it up as a
| systemd service and start sending telemetry via localhost.
| Code at https://github.com/bbkane/shovel_ansible/
| zeeg wrote:
| Author here.
|
| That's kind of making my point for me fwiw. It's too
| complicated. I consider myself a product person so this is my
| version of that lens on the problem.
|
| I'm not dismissing the people problem at all - I actually am
| trying to suggest the technology problem is the easier part (eg
| a basic spec). Getting it implemented, making it easy to
| understand, etc is where I see it struggling right now.
|
| As an aside, this is not just my feedback; it's a synthesis of
| what I'm hearing (but also what I believe).
| shaqbert wrote:
| Otel is indeed quite complex. And the docs are not meant for
| quick wins...
|
| Otelbin [0] has helped me quite a bit in configuring and making
| sense of it, and getting stuff done.
|
| [0]: https://www.otelbin.io/
| wdb wrote:
| That looks pretty cool! OpenTelemetry Collector configuration
| files are pretty confusing. Do like the collector, though.
| Makes it easy to send a subset of your telemetry to trusted
| partners.
| wdb wrote:
| Personally, I like OpenTelemetry: a nice, standardised
| approach. I just wish the vendors had better support for the
| semantic conventions defined for a wide variety of traces.
|
| I quite like the idea of only needing to change one small piece
| of the code to switch otel exporters instead of swapping out a
| vendor trace sdk.
|
| My main gripe with OpenTelemetry is that I don't fully
| understand the exact difference between (trace) events and log
| records.
| tnolet wrote:
| Can you give an example of the missing semantic conventions?
| yunwal wrote:
| > My main gripe with OpenTelemetry I don't fully understand
| what the exact difference is between (trace) events and log
| records.
|
| This is my main gripe too. I don't understand why {traces,
| logs, metrics} are not just different abstractions built on top
| of "events" (blobs of data your application ships off to some
| set of central locations). I don't understand why the
| opentelemetry collector forces me to re-implement the same
| settings for all of them and import separate libraries that all
| seem to do the same thing by default. Besides sdks and
| processors, I don't understand the need for these abstractions
| to persist throughout the pipeline. I'm running one collector,
| so why do I need to specify where my collector endpoint is 3
| different times? Why do I need to specify that I want my blobs
| batched 3 different times? What's the point of having
| opentelemetry be one project at all?
|
| My guess is this is just because opentelemetry started as a
| tracing project, and then became a logs and metrics project
| later. If it had started as a logging project, things would
| probably make more sense.
| serverlessmom wrote:
| Something I mention any time I'm introducing OpenTelemetry is
| that it's an unfinished project, a huge piece being the
| unifying abstractions between those signals.
|
| In part this is a very practical decision: most people
| already have pretty good tools for their logs, and have
| struggled to get tracing working. So it's better to work on
| tools for measuring and sending traces, and just let people
| export their current log stream via the OpenTelemetry
| collector.
|
| Notably the OTel docs acknowledge this mismatch between
| current implementation and design goals:
| https://opentelemetry.io/docs/specs/otel/logs/#limitations-o...
| chipdart wrote:
| > This is my main gripe too. I don't understand why {traces,
| logs, metrics} are not just different abstractions built on
| top of "events" (blobs of data your application ships off to
| some set of central locations).
|
| By design, they cannot be abstractions of a single concept.
| For example, logs have hard requirements on preserving
| sequential order and session, and on emitting strings, whereas
| metrics are aggregated and sampled and dropped arbitrarily
| and consist of single discrete values. Logs can store open-
| ended data, and thus need to comply with tighter data
| protection regulations. Traces often track a very specific
| set of generic events, whereas there are whole classes of
| metrics that serve entirely different purposes.
|
| Just because you can squint hard enough to only see events
| being emitted, that does not mean all event types can or
| should be treated the same.
| arccy wrote:
| If you're using OTLP, SDKs only require you to specify the
| endpoint once; the signal-specific settings are for when you
| want to send them to different places.
|
| The way you process/modify metrics vs logs vs traces are
| usually sufficiently different that there's not much point in
| having a unified event model if you're going to need a bunch
| of conditions to separate and process them differently. Of
| course, you can still use only one source (logs or events)
| and derive the other 2 from that, though that rarely scales
| well.
|
| Plus, the backends that you can use to store/visualize the
| data usually are optimized for specific signals anyways.
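[Editor's note: the single-endpoint path the parent describes is spelled out in the OTLP exporter spec: the generic OTEL_EXPORTER_OTLP_ENDPOINT covers all three signals, with per-signal variables acting only as overrides. A rough sketch of that documented precedence for OTLP/HTTP; the helper name is made up:]

```python
import os

def resolve_otlp_endpoint(signal):
    """Mimic the spec's env-var precedence for OTLP/HTTP exporters."""
    specific = os.environ.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if specific:
        return specific                       # used verbatim when set
    base = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT",
                          "http://localhost:4318")
    return f"{base.rstrip('/')}/v1/{signal}"  # generic base gets a signal path

os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://collector:4318"
print(resolve_otlp_endpoint("traces"))   # -> http://collector:4318/v1/traces
print(resolve_otlp_endpoint("metrics"))  # -> http://collector:4318/v1/metrics
```

So in the one-collector case, one variable is enough; the per-signal knobs only matter when signals fan out to different backends.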
| prymitive wrote:
| I only learned about OT after Prometheus announced some deeper
| integration with it. Reading OT docs about metrics feels like
| every little problem has a dedicated solution in the OT world,
| even if a more generalised one already covers it. Which is quite
| striking coming from the Prometheus world.
| tnolet wrote:
| A recent example of OTel confusion.
|
| For the life of me, I could not get the Python integration to
| send traces to a collector. Same URL, same setup, same API key
| as for Nodejs and Go.
|
| Turns out the Python SDK expects a URL-encoded header, e.g.
| "Bearer%20somekey" whereas all other SDKs just accept a string
| with a whitespace.
|
| The whole split between HTTP, protobuf over HTTP and GRPC is also
| massively confusing.
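[Editor's note: the quirk described above can be reproduced with the stdlib. A sketch, assuming the value is set via OTEL_EXPORTER_OTLP_HEADERS, of percent-encoding a bearer token the way the Python SDK expects versus the raw form other SDKs accept:]

```python
# Sketch: the Python SDK wants the header value percent-encoded
# (W3C-baggage-style), while other SDKs accept the raw string.
from urllib.parse import quote, unquote

raw = "Bearer somekey"
encoded = quote(raw)               # -> "Bearer%20somekey"
print(encoded)

# e.g. OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer%20somekey"
assert unquote(encoded) == raw     # decoding recovers the raw form
```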
| hahn-kev wrote:
| Sounds like a problem with the Python sdk
| tnolet wrote:
| Well actually. They (python SDK maintainers) argue their
| implementation is the correct one according to the spec. See
| this issue thread for example.
|
| https://github.com/open-telemetry/opentelemetry-specificatio...
|
| There are more. This is a symptom of how hard it is to dive
| into Otel due to its surface area being so big.
| chipdart wrote:
| > Well actually. They (python SDK maintainers) argue their
| implementation is the correct one according to the spec.
| See this issue thread for example.
|
| The comment section of that issue gives out contrarian
| vibes. Apparently the problem is that the Python SDK
| maintainers refuse to support a use case that virtually all
| other SDKs support. There are some weasel words that try to
| convey the idea that half the SDKs are with Python while in
| reality the ones that support the choices followed by the
| Python SDK actually support all scenarios.
|
| From the looks of it, the Python SDK maintainers are
| purposely making a mountain out of a molehill that could be
| levelled with a single commit with a single line of code.
| tnolet wrote:
| I guess you word it better than I did.
|
| As a user it feels very weird to wade into threads like
| this to find a solution to your problem.
|
| The power of Otel is it being an open standard. But in
| practice, the implementation of that standard/spec leads to
| all kinds of issues and fiefdoms.
| hinkley wrote:
| The silent failure policy of OTEL makes flames shoot out of the
| top of my head.
|
| We had to use wireshark to identify a super nasty bug in the
| "JavaScript" (but actually typescript despite being called
| opentelemetryjs) implementation.
|
| And OTEL is largely unsuitable for short lived processes like
| CLIs, CI/CD. And I would wager the same holds for FaaS
| (Lambda).
|
| In the end I prefer the network topology of StatsD, which is
| what we were migrating from. Let the collector do ALL of the
| bookkeeping instead of faffing about. OTEL is _actively_
| hostile to process-per-thread programming languages. If I had
| it to do over again I'd look at the StatsD->Prometheus
| integrations, and the StatsD extensions that support tagging.
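[Editor's note: for context, the StatsD model preferred above is a fire-and-forget UDP datagram per measurement, with all aggregation done in the collector. A minimal sketch of the classic wire format; the metric name, host, and port here are assumptions:]

```python
# Sketch: emit one StatsD timing datagram; "<name>:<value>|ms" is the
# classic wire format (tags are a common extension, omitted here).
import socket

def statsd_timing(metric, ms, host="localhost", port=8125):
    payload = f"{metric}:{ms}|ms".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))  # fire-and-forget: no ack, no state
    sock.close()
    return payload

statsd_timing("foo.bar.latency", 300)
```

The process keeps no bookkeeping at all, which is why this shape suits short-lived processes (CLIs, CI jobs) that OTel SDK batching handles poorly.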
| tnolet wrote:
| Yeah. And Otel actually has pretty nice debugging. You just
| need to set the right environment variable. But on prod it
| will blow up your logs.
| hobofan wrote:
| This seems to be more of a branding problem than anything.
|
| OP (rightfully) complains that there is a mismatch between what
| they (can) advertise ("We support OTEL") and what they are
| actually providing to the user. I have the same pain point from
| the consumer side, where I have to trial multiple tools and
| services to figure out which of them actually supports the
| OTEL feature set I care about.
|
| I feel like this could be solved by introducing better branding
| that has a clearly defined scope of features inside the project
| (like e.g. "OTEL Tracing") which can serve as a direct signifier
| to customers about what feature set can be expected.
| zeeg wrote:
| Yes! It's a bit deeper than that, but it's fundamentally a
| packaging issue.
| no_circuit wrote:
| IMO this boils down to how one gets paid to understand or
| misunderstand something. A telemetry provider/founder is being
| commoditized by an open specification whose development they do
| not participate in -- implied by the post saying the
| author doesn't know anyone on the spec committee(s). No surprise
| here.
|
| Of course implementing a spec from the provider point of view can
| be difficult. And also take a look at all the names of the OTEL
| community and notice that Sentry is not there:
| https://github.com/open-telemetry/community/blob/86941073816....
| This really isn't news. I'd guess that a Sentry customer should
| just be able to use the OTEL API and could just configure a
| proprietary Sentry exporter, for all their compute nodes, if
| Sentry has some superior way of collecting and managing
| telemetry.
|
| IMO most library authors do not have to worry about annotation
| naming or anything like that mentioned in the post. Just use the
| OTEL API for logs, or use a logging API where there is an OTEL
| exporter, and whomever is integrating your code will take care of
| annotating spans. Propagating span IDs is the job of "RPC"
| libraries, not general code authors. Your URL fetch library
| should know how to propagate the Span ID provided that it also
| uses the OTEL API.
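[Editor's note: the propagation contract mentioned above is standardized as the W3C Trace Context "traceparent" HTTP header, which is what an instrumented URL-fetch library forwards. A sketch of its shape; the helper name is made up, and the example IDs come from the W3C spec:]

```python
# Sketch: the W3C "traceparent" header an RPC/HTTP client propagates.
# Format: version(2 hex)-trace_id(32 hex)-parent_span_id(16 hex)-flags(2 hex)
import secrets

def make_traceparent(trace_id=None, span_id=None):
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"          # 01 = sampled flag

hdr = make_traceparent("0af7651916cd43dd8448eb211c80319c",
                       "b7ad6b7169203331")
print(hdr)  # -> 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```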
|
| It is the same as using something like Docker containers on a
| serverless platform. You really don't need to know that your code
| is actually being deployed in Kubernetes. Using the common
| Docker interface is what matters.
| chipdart wrote:
| > IMO this boils down how one gets paid to understand or
| misunderstand something.
|
| I completely agree. The most charitable interpretation of this
| blog post is that the blogger genuinely fails to understand the
| basics of the problem domain, or worst case scenario they are
| trying to shitpost away the need for features that are well
| supported by a community-driven standard like OpenTelemetry.
| serverlessmom wrote:
| I think that a number of Observability providers are looking at
| how they can add features and value to parts of monitoring that
| OTel effectively commoditizes. Thinking of the tail-based
| sampling implemented at Honeycomb for APM, or synthetic
| monitoring by my own team at Checkly.
|
| "In 2015 Armin and I built a spec for Distributed Tracing. Its
| not a hard problem, it just requires an immense amount of
| coordination and effort." This to me feels like a nice glass of
| orange juice after brushing my teeth. The spec on DT is very
| easy, but the implementation is very very hard. The fact that
| OTel has nurtured a vast array of libraries to aid in context
| propagation is a huge achievement, and saying 'This would all
| work fine if everyone everywhere adopted Sentry' is...
| laughable.
|
| Totally outside the O11y space, OTel context propagation is an
| intensely useful feature because of how widespread it is. See
| Signadot implementing their smart test routing with
| OpenTelemetry:
| https://www.signadot.com/blog/scaling-environments-with-open...
| zeeg wrote:
| Author here.
|
| Y'all realize we'd just make more money if everyone has better
| instrumentation and we could spend less time on it, and more
| time on the product, right?
|
| There is no conspiracy. It's simple math and reasoning. We
| don't compete with most otel consumers.
|
| I don't know how you could read what I posted and think Sentry
| believes otel is a threat, especially given that we just
| migrated our JS SDK to run on it.
| epgui wrote:
| Anyone else finding this very difficult to read? I'd really
| recommend feeding this through a grammar checker, because poor
| grammar betrays unclear thinking.
| zeeg wrote:
| So you're saying it makes my thinking more clear? :)
|
| This is what happens when you use a tool designed for authoring
| code to also author content.
| kaashif wrote:
| "betrays" means to expose, to be evidence of, particularly
| unintentionally.
|
| i.e. "poor grammar unintentionally exposed unclear thinking"
| markl42 wrote:
| At the risk of hijacking the comments, I've been trying to use
| OTel recently to debug performance of a complex webpage with lots
| of async sibling spans, and finding it very very difficult to
| identify the critical path / bottlenecks.
|
| There are no causal relationships between sibling spans. I
| think in theory "span links" solve this, but afaict this is
| not a widely used feature in SDKs or UI viewers.
|
| (I wrote about this here:
| https://github.com/open-telemetry/opentelemetry-specificatio...)
| diurnalist wrote:
| I don't believe this is a solved problem, and it's been around
| since OpenTracing days[0]. I do not think that the Span links,
| as they are currently defined, would be the best place to do
| this, but maybe Span links are extended to support this in the
| future. Right now Span links are mostly used to correlate spans
| causally _across different traces_ whereas as you point out
| there are cases where you want correlation _within a trace_.
|
| [0]: https://github.com/opentracing/specification/issues/142
| hinkley wrote:
| I was underwhelmed by the max size for spans before they get
| rejected. Our app was about an order of magnitude too complex
| for OTEL to handle.
|
| Reworking our code to support spans made our stack traces
| harder to read and in the end we turned the whole thing off
| anyway. Worse than doing nothing.
| noname120 wrote:
| tl;dr OpenTelemetry eats Sentry's lunch by commoditizing what
| they do, and the reaction of the founder of Sentry is to be
| very upset about it rather than innovating.
| wvh wrote:
| I have surveyed this landscape for a number of years, though I'm
| not involved enough to have strong opinions. We're running a lot
| of Prometheus ecosystem and even some OpenTelemetry stacks across
| customers. OpenTelemetry does seem like one of these projects
| with an ever expanding scope. It makes it hard to integrate parts
| you like and keep things both computing-wise and mentally
| lightweight without having to go all-in.
|
| It's no longer about hey, we'll include this little library or
| protocol instead of rolling our own, so we can hope to be
| compatible with a bunch of other industry-standard software. It's
| a large stack with an ever evolving spec. You have to develop
| your applications and infrastructure around it. It's very
| seductive to roll your own simpler solution.
|
| I appreciate it's not easy to build industry-wide consensus
| across vendors, platforms and programming languages. But be
| careful with projects that fail to capture developer mindshare.
| EdSchouten wrote:
| > It's not a hard problem, [...]. At its core it's structured
| events that carry two GUIDs along with them: a trace ID and a
| parent event ID. It is just building a tree.
|
| I've always wondered, what's the point of the trace ID? What even
| is a trace?
|
| - It could be a single database query that's invoked on a
| distributed database, giving you information about everything
| that went on inside the cluster processing that query.
|
| - Or it could be all database calls made by a single page request
| on a web server.
|
| - Or it could be a collection of page requests made by a single
| user as part of a shopping checkout process. Each page request
| could make many outgoing database calls.
|
| Which of these three you choose depends entirely on what you
| want to visualize at a given point in time. My hope is that at
| some point we get a standard for tracing that does away with the
| notion of trace IDs. Just treat everything going on in the
| universe as a graph of inter-connected events.
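The "two GUIDs and a tree" model from the quoted article can be sketched in a few lines of Python. This is an illustration, not OTel's actual data model; all names and fields here are made up:

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid

@dataclass
class SpanEvent:
    name: str
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    trace_id: Optional[str] = None         # groups events into one trace
    parent_event_id: Optional[str] = None  # the edge that builds the tree

def child_of(parent: SpanEvent, name: str) -> SpanEvent:
    # A child inherits the trace ID; the parent pointer is the edge.
    return SpanEvent(name, trace_id=parent.trace_id,
                     parent_event_id=parent.event_id)

# A tiny tree: one page request fanning out into two database calls.
root = SpanEvent("GET /checkout", trace_id=uuid.uuid4().hex)
q1 = child_of(root, "SELECT cart")
q2 = child_of(root, "SELECT user")

# Reassembling the tree needs nothing but the two IDs.
children = {}
for e in (root, q1, q2):
    children.setdefault(e.parent_event_id, []).append(e)

assert [c.name for c in children[root.event_id]] == ["SELECT cart",
                                                     "SELECT user"]
```

Note that the trace ID carries no structural information here: the tree is fully determined by the parent pointers, which is the parent comment's point that a trace is an arbitrary grouping.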
| remram wrote:
| I think they meant "an event ID and a parent event ID".
| zeeg wrote:
| I actually meant trace ID and parent event ID ("event" was
| inferred). The parent comment is correct that a trace ID isn't
| technically needed, and it is in fact quite controversial. It's
| an implementation-level protocol optimization, though, and
| unfortunately not an objective one. It creates an arbitrary
| grouping of these annotations - one that is entirely subjective,
| and that the spec struggles to reconcile - but it exists
| primarily because the technology to aggregate and/or query them
| would be far more difficult if you didn't keep that simple GUID.
|
| It does have one positive benefit beyond that. If you lose
| data, or have disparate systems, it's pretty easy to keep the
| trace ID intact and still have better instrumentation than you
| would otherwise.
| serverlessmom wrote:
| An argument that OpenTelemetry is somehow 'too big' is an example
| of motivated reasoning. I can understand that A Guy Who Makes
| Money If You Use Sentry dislikes that people are using OTel
| libraries to solve similar problems.
|
| Context propagation and distributed tracing are cool OTel
| features! But they are not the only thing OTel should be doing.
| OpenTelemetry instrumentation libraries can do a lot on their
| own, a friend of mine made massive savings in compute efficiency
| with the NodeJS OTel library:
| https://www.checklyhq.com/blog/coralogix-and-opentelemetry-o...
| zeeg wrote:
| Author here.
|
| OpenTelemetry is not competitive with us (it doesn't do the
| majority of what we do), and we specifically want to see the
| open tracing goals succeed.
|
| I was pretty clear about that in the post though.
| serverlessmom wrote:
| I think it's disingenuous to say OpenTelemetry and Sentry
| aren't in competition. It would be good news for Sentry if
| distributed tracing were split from the project, and if
| instrumentation and performance monitoring weren't commoditized
| by broad adoption of those parts of the OpenTelemetry project.
|
| I think you, the author, stand to benefit directly from a
| breakup of OpenTelemetry, and a refusal to acknowledge your
| own bias is problematic when your piece starts with a request
| to 'look objectively.'
| zeeg wrote:
| We just rewrote our most heavily used SDK to run on top of
| OTel. What do we gain from it failing?
|
| We also make most of our revenue from errors which don't
| have an open protocol implementation outside of our own.
| codereflection wrote:
| I understand what the author is saying, but vendor lock-in with
| closed-source observability platforms is a significant challenge,
| especially for large organizations. When you instrument hundreds
| or thousands of applications with a specific tool, like the
| Datadog Agent, disentangling from that tool becomes nearly
| impossible without a massive investment of engineering time. In
| the Platform Engineering professional services space, we see this
| problem frequently. Enterprises are growing tired of lock-in to
| the big observability platforms, especially given, for example,
| the opaque nature of your spend on Datadog's products.
|
| One of the promises of OTEL is that it allows organizations to
| replace vendor-specific agents with OTEL collectors, preserving
| flexibility in the choice of observability platform. When used
| with an
| observability pipeline (such as EdgeDelta or Cribl), you can re-
| process collected telemetry data and send it to another platform,
| like Splunk, if needed. Consequently, switching from one
| observability platform to another becomes a bit less of a
| headache. Ironically, even Splunk recognizes this and has put
| substantial support behind the OTEL standard.
|
| OTEL is far from perfect, and maybe some of these goals are a bit
| lofty, but I can say that many large organizations are adopting
| OTEL for these reasons.
| zeeg wrote:
| I totally agree. I just wish we could do it in a way that
| doesn't try to lump every problem into the same bucket. I don't
| see what that achieves, personally, and I think it limits the
| ability of the project's original goals to be as successful as
| they could be.
| andrewmcwatters wrote:
| Yeah, it's the primary reason we used it. If OpenTelemetry's
| raison d'etre was simply to give Datadog a reason to not
| bullshit their customers on pricing, it would fulfill a major
| need in platform services.
| zellyn wrote:
| Are they basically just saying that the OpenTelemetry client APIs
| should be split from the rest of the pieces of the project, and
| versioned super conservatively?
|
| The simple API they describe is basically there in OTel. The API
| is larger, because it also does quite a few other things
| (personally, I think (W3C) Baggage is important too), but as a
| library author I should need only the client APIs to write to.
|
| When implementing, you're free to plug in Providers that use
| OTel-provided plumbing, but you can equally well plug in
| Providers from Datadog or Sentry or whatever.
|
| Unless I'm missing something, any further complaints could be
| solved by making sure the Client APIs (almost) never have
| backward-incompatible changes, and are versioned separately.
| zeeg wrote:
| It's a bit deeper than that. The SDKs that library authors
| implement against need to be extremely minimal. The collection
| libraries that vendors build on top of them should, imo, also
| be minimal.
|
| OTLP, imo, doesn't even need to be part of the spec.
|
| But minimal would also mean focusing on solving fewer problems
| as a whole - e.g. OpenTracing plus OpenMetrics plus OpenLogs. I
| only need one of those things.
| arccy wrote:
| that just sounds like a branding problem though...
|
| OTLP has been quite useful especially in metrics to get a
| format that doesn't really have any sacrifices/limitations
| compared to all the other protocols.
| zeeg wrote:
| It is! But to prove your point, OTLP is actually just the
| transport protocol (the OpenTelemetry Protocol). It's one of
| _so many things_ the project is trying to address. All of those
| things might be problems, but not everyone (vendors, customers,
| and lib authors) has those same problems, and bundling them all
| under one umbrella just screams "scope creep" to me.
|
| I actually have no need for a standard metrics implementation,
| just as an example. I never have, and I'd argue Sentry (as a
| tech company) never has. We built our own abstraction and/or
| used a library. That doesn't mean others don't, and it doesn't
| mean it shouldn't be something people solve, but bundling "all
| telemetry problems" into one giant design committee is a
| fundamental misstep imo.
| PeterZaitsev wrote:
| OpenTelemetry is interesting. On the one hand, it is designed
| as the "commodity feeder" to a number of proprietary backends
| such as Datadog; on the other hand, we are seeing good
| development of open-source solutions such as SigNoz and Coroot
| with good OTel support.
| spullara wrote:
| There is a huge hole in using spans as they are specified.
| Without separating the start of a span from the end of a span,
| you can never see things that never complete, that fail hard
| enough not to close the span, or that travel through queues.
| This is a compromise they made because typical storage systems
| for tracing aren't really good enough to stitch it all back
| together quickly. Everyone should be sending events and
| stitching them together to create the view. Instead, we get a
| least-common-denominator solution.
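The two-event model this comment argues for can be sketched as follows. Everything here is hypothetical (the pipeline is just a list), but it shows why unmatched start events make crashed or queued work visible:

```python
import time
import uuid

log = []  # stand-in for wherever events actually get shipped

def start_span(name, trace_id):
    span_id = uuid.uuid4().hex
    log.append({"type": "span_start", "span_id": span_id,
                "trace_id": trace_id, "name": name, "ts": time.time()})
    return span_id

def end_span(span_id):
    log.append({"type": "span_end", "span_id": span_id,
                "ts": time.time()})

# One unit of work completes; another dies before closing its span.
t = uuid.uuid4().hex
ok = start_span("handle_request", t)
end_span(ok)
crashed = start_span("consume_queue_message", t)  # process dies here

# The backend stitches starts to ends; unmatched starts surface as
# "never completed" instead of vanishing entirely.
started = {e["span_id"] for e in log if e["type"] == "span_start"}
ended = {e["span_id"] for e in log if e["type"] == "span_end"}
assert started - ended == {crashed}
```

With a single span object buffered in memory until completion, the crashed span would never have been emitted at all, which is the compromise the comment describes.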
| fractalwrench wrote:
| The main interest I've seen in OTel from Android engineers has
| been driven by concerns around vendor lock-in. Backend/devops in
| their organisations are typically using OTel tooling already &
| want to see all telemetry in one place.
|
| From this perspective it doesn't matter if the OTel SDK comes
| bundled with a bunch of unnecessary code or version conflicts as
| is suggested in the article. The whole point is to regain control
| over telemetry & avoid paying $$$ to an ambivalent vendor.
|
| FWIW, I don't think the OTel implementation for mobile is perfect
| - a lot of the code was originally written with backend JVM apps
| in mind & that can cause friction. However, I'm fairly optimistic
| those pain points will get fixed as more folks converge on this
| standard.
|
| Disclaimer: I work at a Sentry competitor
| AndreasBackx wrote:
| I have been trying to find an equivalent of Rust's `tracing`,
| first in Python and, this week, in TypeScript/JavaScript. At my
| work I created an internal post called "Better Python Logging?
| Tracing for Python?" that basically asks this question.
| OpenTelemetry was one of the things I looked at, and since then
| I have looked at other tooling.
|
| It is hard to explain how convenient `tracing` is in Rust and why
| I sorely miss it elsewhere. The simple part of adding context to
| logs can be solved in a myriad of ways, yet all boil down to a
| similar "span-like" approach. I'm very interested in helping
| bring what `tracing` offers to other programming communities.
|
| It's very likely worth having some people from that space
| involved, possibly from the `tracing` crate itself.
| zeeg wrote:
| We'll fund solving this as long as the committees agree with
| the goal. We just want standard tracing implementations.
|
| (Speaking on behalf of Sentry)
| crabbone wrote:
| I've heard about OpenTelemetry before, but I could never
| understand what it's for.
|
| Can anyone with more knowledge enlighten me? Why is Prometheus
| not enough? From reading OpenTelemetry's Web site, I can see no
| obvious benefits to using it (if I already use Prometheus).
|
| Is it trying to be somehow more generic than Prometheus'
| instrumentation? Sort of like ORM might try to be more generic
| than a particular database?
|
| Also, being certified as "cloud native" has, in my experience,
| always been a sort of scam. So when I see that, I tend to think
| negatively about a project. Maybe that's a distraction, though.
|
| Also, in their documentation, they use "tracing" in some weird
| way that I cannot quite reconcile with how I've learned to use
| the word (e.g. from using "strace" on Linux). They must mean
| something else, or do they?
|
| ----
|
| OP reads as a parody of Donald Trump for some reason. It's not
| the most pleasant style to read. Of course, I'm not an authority
| on writing styles. Just mentioning this, as it was quite a
| distraction.
| remram wrote:
| https://opentelemetry.io/docs/concepts/signals/traces/
| ris wrote:
| 1. The main reason I want to use otel is so I can have one
| sidecar for my observability, not three, each with subtly
| different quirks and expectations. (also the associated
| collection/aggregation infrastructure)
|
| 2. I honestly think the main reason otel appears so complex is
| the existing resources that attempt to explain the various
| concepts around it do a poor job and are very hand-wavey. You
| know the main thing that made otel "click" for me? Reading the
| protobuf specs. Literally nothing else explained succinctly the
| relationships between the different types of structure and what
| the possibilities with each were.
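For reference, the nesting those protobuf specs describe looks roughly like this, rendered as plain dicts. Field names follow opentelemetry-proto's trace.proto but are heavily simplified (attributes, for instance, are really lists of KeyValue messages), and the values are made up:

```python
# Rough shape of one OTLP trace export: Resource -> scope -> spans.
export = {
    "resource_spans": [{               # one entry per Resource
        "resource": {"attributes": {"service.name": "checkout"}},
        "scope_spans": [{              # one per instrumentation scope
            "scope": {"name": "my.instrumentation.lib"},
            "spans": [{
                "trace_id": "4bf9...",   # shared by the whole trace
                "span_id": "00f0...",
                "parent_span_id": "",    # empty => this is the root
                "name": "GET /checkout",
                "start_time_unix_nano": 1718368000000000000,
                "end_time_unix_nano":   1718368000250000000,
            }],
        }],
    }],
}

span = export["resource_spans"][0]["scope_spans"][0]["spans"][0]
assert span["parent_span_id"] == ""  # root span of the trace
```

Seeing that a span is just IDs, a name, and two timestamps, grouped under the resource and scope that produced it, is arguably the succinct explanation the comment says the prose docs lack.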
| esafak wrote:
| This caught my eye:
|
| > Logs are just events - which is exactly what a span is, btw -
| and metrics are just abstractions out of those event properties.
| That is, you want to know the response time of an API endpoint?
| You don't rewind 20 years and increment a counter, you instead
| aggregate the duration of the relevant span segment. Somehow
| though, Logs and Metrics are still front and center.
|
| Is anyone replacing logs and metrics with traces?
| zeeg wrote:
| imo Honeycomb pioneered this, and it's the right baseline.
| There are limitations to it, of course, and certainly it's been
| done before at BigCos that can afford to build the tech, but
| it's extremely powerful.
|
| The main argument for metrics beyond traces is simply a
| technology constraint - it's aggregation because you can't
| store the raw events. That doesn't mean, though, that you need
| a new abstraction on top of those metrics. They're still just
| questions you're asking of the events in the system, and most
| systems are debuggable by aggregating data points from spans or
| other telemetry.
|
| As for logs, they're important for some kinds of workloads, but
| for the majority of companies I don't think they're the best
| solution to the problem. You might need them for auditability,
| but it's quite difficult to find a case where logs are the
| solution for debugging a problem if you have span annotations.
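The "metrics are just aggregations of span properties" idea can be sketched like this. A toy example with made-up data; real systems run such queries in a columnar store rather than in application code:

```python
from statistics import quantiles

# Raw span records - what you store instead of pre-aggregated counters.
spans = [{"name": "GET /api/users", "duration_ms": d}
         for d in (12, 15, 11, 250, 14, 13, 400, 12)]

def endpoint_latency(spans, name):
    # Derive the "metric" on demand by aggregating span durations.
    durations = sorted(s["duration_ms"] for s in spans
                       if s["name"] == name)
    cuts = quantiles(durations, n=100)  # 99 percentile cut points
    return {"count": len(durations),
            "p50": cuts[49],
            "p95": cuts[94]}

stats = endpoint_latency(spans, "GET /api/users")
assert stats["count"] == 8 and stats["p50"] < stats["p95"]
```

Because the raw events survive, you can ask a question you didn't anticipate (latency per user, per region) without having pre-registered a counter for it, which is the argument being made here.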
| dan-allen wrote:
| I keep checking in on OpenTelemetry every few months to see if
| the bits we need are stable yet. There's been very little
| progress on the things we're waiting for.
|
| I don't follow closely enough to comment on possible causes.
|
| What I do know is that the surface area of code and
| infrastructure that telemetry touches means adopting something
| unfinished is a big leap of faith.
___________________________________________________________________
(page generated 2024-06-14 23:02 UTC)