[HN Gopher] Migrating to OpenTelemetry
___________________________________________________________________
Migrating to OpenTelemetry
Author : kkoppenhaver
Score : 242 points
Date : 2023-11-16 17:29 UTC (1 days ago)
(HTM) web link (www.airplane.dev)
(TXT) w3m dump (www.airplane.dev)
| caust1c wrote:
| Curious about the code implemented for logs! Hopefully that's
| something that can be shared at some point. Also curious if it
| integrates with `log/slog` :-)
|
| Congrats too! As I understand it from stories I've heard from
| others, migrating to OTel is no easy undertaking.
| bhyolken wrote:
| Thanks! For logs, we actually use github.com/segmentio/events
| and just implemented a handler for that library that batches
| logs and periodically flushes them out to our collector using
| the underlying protocol buffer interface. We plan on migrating
| to log/slog soon, and once we do that we'll adapt our handler
| and can share the code.
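|
| In case it's useful, here's a rough sketch of the general
| pattern (illustrative only - the type names and wiring below
| are assumptions, not the actual handler):
|
|     package otellog
|
|     import (
|         "context"
|         "sync"
|         "time"
|
|         collogspb "go.opentelemetry.io/proto/otlp/collector/logs/v1"
|         logspb "go.opentelemetry.io/proto/otlp/logs/v1"
|     )
|
|     // BatchingHandler buffers OTLP log records and periodically
|     // flushes them to a collector over gRPC.
|     type BatchingHandler struct {
|         mu     sync.Mutex
|         buf    []*logspb.LogRecord
|         client collogspb.LogsServiceClient // points at the collector
|     }
|
|     func NewBatchingHandler(client collogspb.LogsServiceClient,
|         interval time.Duration) *BatchingHandler {
|         h := &BatchingHandler{client: client}
|         go func() {
|             ticker := time.NewTicker(interval)
|             for range ticker.C {
|                 h.Flush(context.Background())
|             }
|         }()
|         return h
|     }
|
|     // Append is called from the logging library's handler hook
|     // for each event.
|     func (h *BatchingHandler) Append(rec *logspb.LogRecord) {
|         h.mu.Lock()
|         h.buf = append(h.buf, rec)
|         h.mu.Unlock()
|     }
|
|     // Flush sends any buffered records to the collector as a
|     // single OTLP export request.
|     func (h *BatchingHandler) Flush(ctx context.Context) {
|         h.mu.Lock()
|         records := h.buf
|         h.buf = nil
|         h.mu.Unlock()
|         if len(records) == 0 {
|             return
|         }
|         _, _ = h.client.Export(ctx, &collogspb.ExportLogsServiceRequest{
|             ResourceLogs: []*logspb.ResourceLogs{{
|                 ScopeLogs: []*logspb.ScopeLogs{{LogRecords: records}},
|             }},
|         })
|     }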
| caust1c wrote:
| Awesome! Great work and thanks for sharing your experience!
| MajimasEyepatch wrote:
| It's interesting that you're using both Honeycomb and Datadog.
| With everything migrated to OTel, would there be advantages to
| consolidating on just Honeycomb (or Datadog)? Have you found
| they're useful for different things, or is there enough overlap
| that you could use just one or the other?
| bhyolken wrote:
| Author here, thanks for the question! The current split
| developed from the personal preferences of the engineers who
| initially set up our observability systems, based on what they
| had used (and liked) at previous jobs.
|
| We're definitely open to doing more consolidation in the
| future, especially if we can save money by doing that, but from
| a usability standpoint we've been pretty happy with Honeycomb
| for traces and Datadog for everything else so far. And, that
| seems to be aligned with what each vendor is best at at the
| moment.
| MuffinFlavored wrote:
| > from the personal preferences of the engineers
|
| https://www.honeycomb.io/pricing
|
| https://www.datadoghq.com/pricing/
|
| Am I wrong to say... having 2 is "expensive"? Maybe not if
| 50% of your stuff is going to Honeycomb and 50% going to
| DataDog. Could you save money/complexity (less places to look
| for things) having just DataDog or just Honeycomb?
| bhyolken wrote:
| Right now, there isn't much duplication of what we're
| sending to each vendor, so I don't think we'd save a ton by
| consolidating, at least based on list prices. We could
| maybe negotiate better prices based on higher volumes, but
| I'm not sure if Airplane is spending enough at this point
| to get massive discounts there.
|
| Another potential benefit would definitely be reduced
| complexity and better integration for the engineering team.
| So, for instance, you could look at a log and then more
| easily navigate to the UI for the associated trace.
| Currently, we do this by putting Honeycomb URLs in our
| Datadog log events, which works but isn't quite as
| seamless. But, given that our team is pretty small at this
| point and that we're not spending a ton of our time on
| performance optimizations, we don't feel an urgent need to
| consolidate (yet).
| MuffinFlavored wrote:
| When you say DataDog for everything else (as in not
| traces), besides logs, what else do you mean?
| claytonjy wrote:
| Metrics, probably? The article calls out logs, metrics,
| and traces as the 3 pillars of observability.
| bhyolken wrote:
| Yeah, metrics and logs, plus a few other things that
| depend on these (alerts, SLOs, metric-based dashboards,
| etc.).
| tapoxi wrote:
| I made this switch very recently. For our Java apps it was as
| simple as loading the otel agent in place of the Datadog SDK,
| basically "-javaagent:/opt/otel/opentelemetry-javaagent.jar" in
| our args.
|
| The collector (which processes and ships metrics) can be
| installed in K8S through Helm or an operator, and we just added a
| variable to our charts so the agent can be pointed at the
| collector. The collector speaks OTLP, which is the fancy
| combined metrics/traces/logs protocol the OTel SDKs/agents
| use, but it also speaks Prometheus, Zipkin, etc. to give you
| an easy migration path. We currently ship to Datadog as well
| as an internal
| service, with the end goal being migrating off of Datadog
| gradually.
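|
| For reference, a minimal collector config along those lines
| might look roughly like this (illustrative only - the exporter
| names, endpoint, and pipeline layout here are assumptions, not
| our actual setup; the datadog exporter ships in the collector-
| contrib distribution):
|
|     receivers:
|       otlp:
|         protocols:
|           grpc:
|           http:
|     exporters:
|       datadog:
|         api:
|           key: ${env:DD_API_KEY}
|       otlphttp:
|         endpoint: https://telemetry.internal.example.com
|     service:
|       pipelines:
|         traces:
|           receivers: [otlp]
|           exporters: [datadog, otlphttp]
|         metrics:
|           receivers: [otlp]
|           exporters: [datadog, otlphttp]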
| andrewstuart2 wrote:
| We tried this about a year and a half ago and ended up going
| somewhat backwards into DD entrenchment, because they've
| decided that anything not an official DD metric (that is,
| collected by their agent typically) is custom and then becomes
| substantially more expensive. We wanted a nice migration path
| from any vendor to any other vendor but they have a fairly
| effective strategy for making gradual migrations more expensive
| for heavy telemetry users. At least our instrumentation these
| days is otel, but it's the metrics we expected to just scrape
| from prometheus that we had to dial back and start using more
| official DD agent metrics and configs to get, lest our bill
| balloon by 10x. It's a frustrating place to be. Especially
| since it's still not remotely cheap, just that it could be way
| worse.
|
| I know this isn't a DataDog post, and I'm a bit off topic, but
| I try to do my best to warn against DD these days.
| shawnb576 wrote:
| This has been a concern for me too. But the agent is just a
| statsd receiver with some extra magic, so this seems like a
| thing that could be solved with the collector sending traffic
| to an agent rather than the HTTP APIs?
|
| I looked at the OTel DD stuff and did not see any support for
| this, fwiw, maybe it doesn't work b/c the agent expects more
| context from the pod (e.g. app and label?)
| andrewstuart2 wrote:
| Yeah, the DD agent and the otel-collector DD exporter
| actually use the same code paths for the most part. The
| relevant difference tends to be in metrics, where the
| official path involves the DD agent doing collection
| directly, for example, collecting redis metrics by giving
| the agent your redis database hostname and creds. It can
| then pack those into the specific shape that DD knows about
| and they get sent with the right name, values, etc so that
| DD calls them regular metrics.
|
| If you instead go the more flexible route of using many
| of the de-facto standard prometheus exporters, like the one
| for redis, or built-in prometheus metrics from something
| like istio, and forward those to your agent or configure
| your agent to poll those prometheus metrics, it won't do
| any reshaping (which I can see the arguments for, kinda,
| knowing a bit about their backend). They just end up in
| the DD backend as custom metrics, and you get charged
| $0.10/mo per 100 time series. If you've used prometheus
| before for any realistic deployments with enrichment etc,
| you can probably see this gets expensive ridiculously fast.
|
| What I wish they'd do instead is have some form of adapter
| from those de facto standards, so I can still collect
| metrics 99% my own way, in a portable fashion, and then add
| DD as my backend without ending up as custom everything,
| costing significantly more.
| xyst wrote:
| > somewhat backwards into DD entrenchment, because they've
| decided that anything not an official DD metric (that is,
| collected by their agent typically) is custom and then
| becomes substantially more expensive.
|
| If a vendor pulled shit like this on me, that's when I would
| cancel them. Of course most big orgs would rather not do the
| leg work to actually become portable and migrate off the
| vendor. So of course they will just pay the bill.
|
| Vendors love the custom shit they build because they know
| once it's infiltrated the stack then it's basically like
| gangrene (have to cut off the appendage to save the host)
| k__ wrote:
| I had the impression that logs and metrics are a
| pre-observability thing.
| SteveNuts wrote:
| I've never heard the term "pre-observability", what does that
| mean?
| renegade-otter wrote:
| The era when "debugging in production" wasn't standard.
| marcosdumay wrote:
| Observability is about logs and metrics, and pre-observability
| (I guess you mean the high-level-only records simpler
| environments keep) is also about logs and metrics.
|
| Anything you register to keep track of your environment has the
| form of either logs or metrics. The difference is about the
| contents of such logs and metrics.
| k__ wrote:
| When I read Observability Engineering, I got the impression
| it was about wide events and tracing, and metrics and logs
| were a thing of the past people gave up on since the rise of
| Microservices.
| jwestbury wrote:
| > metrics and logs were a thing of the past people gave up
| on since the rise of Microservices
|
| Definitely not the case, and, in fact, probably the
| opposite is true. In the era of microservices, metrics are
| absolutely critical to understand the health of your
| system. Distributed tracing is also only beneficial if you
| have the associated logs - so that you can understand what
| each piece of the system was doing for a single unit of
| work.
| phillipcarter wrote:
| > Distributed tracing is also only beneficial if you have
| the associated logs - so that you can understand what
| each piece of the system was doing for a single unit of
| work.
|
| Ehhh, that's only if you view tracing as "the thing that
| tells me that service A talks to service B". Spans in a
| trace are just structured logs. They are your application
| logging vehicle, especially if you don't have a legacy of
| good in-app instrumentation via logs.
|
| But even then the worlds are blurring a bit. OTel logs
| burn in a span and trace ID, and depending on the backend
| that correlated log may well just be treated as if it's a
| part of the trace.
| sofixa wrote:
| > Authors Charity Majors, Liz Fong-Jones, and George
| Miranda from Honeycomb explain what constitutes good
| observability, show you how to improve upon what you're
| doing today, and provide practical dos and don'ts for
| migrating from legacy tooling, such as metrics, monitoring,
| and log management. You'll also learn the impact
| observability has on organizational culture (and vice
| versa).
|
| No wonder, it's either strong bias from people working at a
| tracing vendor, or outright a sales pitch.
|
| It's totally false though. Each pillar - metrics, logs, and
| traces - has its place and serves a different purpose. You
| won't use traces to measure the number of requests hitting
| your load balancer, or the number of objects in the async
| queue, or CPU utilisation, or network latency, or any
| number of things. Logs can be richer than traces, and a
| nice pattern I've used with Grafana is linking the two,
| having the option to jump from a trace to the corresponding
| log lines, which can describe the different actions
| performed during that span.
| phillipcarter wrote:
| You can sorta measure some of this with traces. For
| example, sampled traces that contain the sampling rate in
| their metadata let you re-weight counts, thus allowing
| you to accurately measure "number of requests to x".
| Similarly, a good sampling of network latency can
| absolutely be measured by trace data. Metrics will always
| have their place, though, for reasons you mention -
| measuring cpu utilization, # of objects in something etc.
| Logs vs. traces is more nuanced I think. A trace is
| nothing more than a collection of structured logs. I
| would wager that nearly all use cases for structured
| logging could be wholesale replaced by tracing. Security
| logging and big-object logging are exceptions, although
| that's also dependent on your vendor or backend.
| tsamba wrote:
| Interesting read. What did you find easier about using GCP's log
| tooling for your internal system logs, rather than the OTel
| collector?
| clintonb wrote:
| Their collector is used to send infrastructure logs to GCP
| (instead of Datadog).
|
| My guess is this is to save on costs. GCP logging is probably
| cheaper than Datadog, and infrastructure logs may not be needed
| as frequently as application logs.
| bhyolken wrote:
| Author here. This decision was more about ease of
| implementation than anything else. Our internal application
| logs were already being scooped up by GCP because we run our
| services in GKE, and we already had a GCP->Datadog log syncer
| [1] for some other GCP infra logs, so re-using the GCP-based
| pipeline was the easiest way to handle our application logs
| once we removed the Datadog agent.
|
| In the future, we'll probably switch these logs to also go
| through our collector, and it shouldn't be super hard (because
| we already implemented a golang OTel log handler for the
| external case), but we just haven't gotten around to it yet.
|
| [1]
| https://docs.datadoghq.com/integrations/google_cloud_platfor...
| roskilli wrote:
| > Moreover, we encountered some rough edges in the metrics-
| related functionality of the Go SDK referenced above. Ultimately,
| we had to write a conversion layer on top of the OTel metrics API
| that allowed for simple, Prometheus-like counters, gauges, and
| histograms.
|
| Have encountered this a lot from teams attempting to use the
| metrics SDK.
|
| Are you open to commenting on the specifics here, and on what
| kind of shim you had to put in front of the SDK? It would be
| great to continue gathering feedback so that we as a community
| have a good idea of what remains before it's possible to use
| the SDK for real-world production use cases in anger. Just
| wiring up the setup in your app used to be fairly painful, but
| that has gotten somewhat better over the last 12-24 months. I'd
| also love to hear what is currently causing compatibility
| issues with the metric types themselves in the SDK that
| requires a shim, and what the shim is doing to achieve
| compatibility.
| bhyolken wrote:
| Sure, happy to provide more specifics!
|
| Our main issue was the lack of a synchronous gauge. The
| officially supported asynchronous API of registering a callback
| function to report a gauge metric is very different from how we
| were doing things before, and would have required lots of
| refactoring of our code. Instead, we wrote a wrapper that
| exposes a synchronous-like API: https://gist.github.com/yolken-
| airplane/027867b753840f7d15d6....
|
| It seems like this is a common feature request across many of
| the SDKs, and it's in the process of being fixed in some of
| them (https://github.com/open-telemetry/opentelemetry-
| specificatio...)? I'm not sure what the plans are for the
| golang SDK specifically.
|
| Another, more minor issue is the lack of support for
| "constant" attributes that are applied to all observations of a
| metric. We use these to identify the app, among other use
| cases, so we added wrappers around the various "Add", "Record",
| "Observe", etc. calls that automatically add these. (It's
| totally possible that this is supported and I missed it, in
| which case please let me know.)
|
| Overall, the SDK was generally well-written and well-
| documented; we just needed some extra work to make the
| interfaces more similar to the ones we were using before.
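|
| For anyone curious, here's a minimal sketch of the shape of
| such a wrapper on top of the Go metric API (illustrative only,
| not the code from the gist above):
|
|     package metrics
|
|     import (
|         "context"
|         "sync/atomic"
|
|         "go.opentelemetry.io/otel/metric"
|     )
|
|     // SyncGauge exposes a synchronous Set() on top of the
|     // asynchronous observable-gauge API.
|     type SyncGauge struct {
|         val atomic.Value // holds a float64
|     }
|
|     func NewSyncGauge(meter metric.Meter, name string) (*SyncGauge, error) {
|         g := &SyncGauge{}
|         g.val.Store(0.0)
|         // The callback simply reports whatever was last Set.
|         _, err := meter.Float64ObservableGauge(name,
|             metric.WithFloat64Callback(
|                 func(_ context.Context, o metric.Float64Observer) error {
|                     o.Observe(g.val.Load().(float64))
|                     return nil
|                 }))
|         return g, err
|     }
|
|     // Set records the latest value; the SDK reads it at each
|     // collection cycle.
|     func (g *SyncGauge) Set(v float64) { g.val.Store(v) }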
| arccy wrote:
| the official SDKs will only support an api once there's a
| spec that allows it.
|
| for const attributes, generally these should be defined at
| the resource / provider level: https://pkg.go.dev/go.opentele
| metry.io/otel/sdk/metric#WithR...
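|
| for example, something roughly like this at provider setup
| (attribute values are placeholders):
|
|     package main
|
|     import (
|         "go.opentelemetry.io/otel/attribute"
|         sdkmetric "go.opentelemetry.io/otel/sdk/metric"
|         "go.opentelemetry.io/otel/sdk/resource"
|     )
|
|     func newMeterProvider(exp sdkmetric.Exporter) *sdkmetric.MeterProvider {
|         // Resource attributes ride along with every metric
|         // recorded through this provider.
|         res := resource.NewSchemaless(
|             attribute.String("service.name", "my-app"),
|             attribute.String("deployment.environment", "prod"),
|         )
|         return sdkmetric.NewMeterProvider(
|             sdkmetric.WithResource(res),
|             sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
|         )
|     }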
| roskilli wrote:
| Thanks for the detailed response.
|
| I am surprised there is no gauge update API yet (instead of
| callback only); this is a common use case and I don't think
| folks should be expected to implement their own. Especially
| since it will lead to potentially allocation-heavy bespoke
| implementations, depending on the use case, given the
| mutex+callback+other structures that likely need to be heap
| allocated (vs a simple int64 wrapper with atomic update/load
| APIs).
|
| Also, the fact that the APIs differ a lot from the more common
| and popular Prometheus client libraries does beg the question
| of whether we need more complicated APIs that folks have a
| harder time using. Now is the time to modernize these, before
| everyone is instrumented with some generation of a client
| library that would need to change/evolve. The whole idea of an
| OTel SDK is to instrument once and then avoid needing to
| re-instrument when making changes to your observability
| pipeline and where it's pointed. That becomes a hard sell if
| the OTel SDK needs to shift fairly significantly to support
| more popular and common use cases with more typical APIs, and
| by doing so leaves a whole bunch of OTel-instrumented code
| that needs to be modernized to a different-looking API.
| CSMastermind wrote:
| > The data collected from these streams is sent to several
| vendors including Datadog (for application logs and metrics),
| Honeycomb (for traces), and Google Cloud Logging (for
| infrastructure logs).
|
| It sounds like they were in a place that a lot of companies are
| in where they don't have a single pane of glass for
| observability. One of the main benefits I've gotten out of
| Datadog, if not the main one, is having everything in Datadog
| so that it's all connected and I can easily jump from a trace
| to logs, for instance.
|
| One of the terrible mistakes I see companies make with this
| tooling is fragmenting like this. Everyone has their own
| personal preference for tools, and ultimately the collective
| experience is significantly worse than the sum of its parts.
| devin wrote:
| Eh, personally I view honeycomb and datadog as different enough
| offerings that I can see why you'd choose to have both.
| dexterdog wrote:
| Depending on your usage it can be prohibitively expensive to
| use datadog for everything like that. We have it for just our
| prod env because it's just not worth what it brings to the
| table to put all of our logs into it.
| dabeeeenster wrote:
| Is prod not 99% of your logs?
| shric wrote:
| I once worked out what it would cost to send our company's
| prod logs to datadog. It was 1.5x our total AWS cost. The
| company ran entirely on AWS.
| maccard wrote:
| I've spent a small amount of time in datadog, lots in grafana,
| and somewhere in between in honeycomb. Our applications are
| designed to emit traces, and comparing honeycomb with tracing
| to a traditional app with metrics and logs, I would choose
| tracing every time.
|
| It annoys me that logs are overlooked in honeycomb (and
| metrics are... fine). But, given the choice between a single
| pane of glass in grafana or having to do logs (and metrics
| sometimes) in cloudwatch but spending 95% of my time in
| honeycomb - I'd pick honeycomb every time
| mdtusz wrote:
| Agreed - honeycomb has been a boon; however, some improvements
| to metric displays and the ability to set the default "board"
| used on the home page would be very welcome. I'd also be
| pretty happy if there were a way to drop events on the
| honeycomb side as a way to dynamically filter - e.g. "don't
| even bother storing this trace if it has a http.status_code <
| 400". This is surprisingly painful to implement on the
| application side (at least in rust).
|
| Hopefully someone that works there is reading this.
| masterj wrote:
| It sounds like you should look into their tail-sampling
| Refinery tool https://docs.honeycomb.io/manage-data-
| volume/refinery/
| phillipcarter wrote:
| Yep, this is the one to use. Refinery handles exactly
| this scenario (and more).
| viraptor wrote:
| Have you tried the traces in grafana/tempo yet?
| https://grafana.com/docs/grafana/latest/panels-
| visualization...
|
| It seems to miss some aggregation stuff, but also it's
| improving every time I check. I wonder if anyone's used it in
| anger yet and how far is it from replacing datadog or
| honeycomb.
| arccy wrote:
| tempo still feels very much: look at a trace that you found
| from elsewhere (like logs).
|
| with so much information in traces, and the sheer volume of
| them, aggregation really is the key to getting actionable info
| out of a tracing setup if it's going to be the primary entry
| point.
| maccard wrote:
| I've not. Honestly, I'm not in the market for tool shopping
| at the moment; I need another honeycomb-style moment of
| "this is incredible" to start looking again. I think it
| would take "Honeycomb, but we handle metric rollups and do
| logs" right now.
| ankit01-oss wrote:
| You can also check out SigNoz -
| https://github.com/SigNoz/signoz. It has logs, metrics, and
| traces under a single pane. If you're using otel libraries
| and otel collector you can do a lot of correlation between
| your logs and traces. I am a maintainer, and we have seen a
| lot of our users adopt SigNoz for the ease of having all
| three signals in a single pane.
| serverlessmom wrote:
| I think Honeycomb is perfect for one kind of user, who's
| entirely concerned with traces and very long retention. For a
| more general OpenTelemetry-native solution, check out Signoz.
| rewmie wrote:
| > It sounds like they were in a place that a lot of companies
| are in where they don't have a single pane of glass for
| observability.
|
| One of the biggest features of AWS which is very easy to take
| for granted and go unnoticed is Amazon CloudWatch. It supports
| metrics, logging, alarms, metrics from alarms, alarms from
| alarms, querying historical logs, triggering actions, etc.,
| and it covers every single service provided by AWS, including
| metaservices like AWS Config and CloudTrail.
|
| And you barely notice it. It's just there, and you can see
| everything.
|
| > One of the terrible mistakes I see companies make with this
| tooling is fragmenting like this.
|
| So much this. It's not fun at all to have to go through logs
| and metrics on any application, and much less so if for some
| reason their maintainers scattered their metrics emission to
| the four winds. However, with AWS all roads lead to CloudWatch,
| and everything is so much better.
| yourapostasy wrote:
| _> ...with AWS all roads lead to CloudWatch, and everything
| is so much better._
|
| Most of my clients are not in the product-market fit for AWS
| CloudWatch, because most of their developers don't have the
| development, testing and operational maturity/discipline to
| use CloudWatch cost-effectively (this is at root an
| organization problem, but let's not go off onto that giant
| tangent). So the only realistic tracing strategy we converged
| upon to recommend for them is "grab everything, and retain it
| up to the point in time we won't be blamed for not knowing
| root cause" (which in some specific cases can be up to
| years!), while we undertake the long journey with them to
| upskill their teams.
|
| This would make using CloudWatch everywhere rapidly climb into
| the top three largest line items in the AWS bill, easily
| justifying spinning up that tracing functionality in-house. So
| we wind up opting into self-managed tooling like Elastic
| Observability or Honeycomb where the pricing is friendlier to
| teams in unfortunate situations that need to start with
| everything for CYA, much as I would like to stay within
| CloudWatch.
|
| Has anyone found a better solution for these use cases where
| the development maturity level is more prosaic, or is this
| really the best local maximum at the industry's current SOTA?
| everfrustrated wrote:
| In addition, one of the largest limitations of CloudWatch is
| that it doesn't work well with a many-AWS-account strategy.
|
| Some part of the value of Datadog etc. is having a single pane
| of glass over many AWS accounts.
| badloginagain wrote:
| I feel we hold up a single observability solution as the Holy
| Grail, and I can see the argument for it: one place to
| understand the health of your services.
|
| But I've also been in terrible vendor lock-in situations, being
| bent over the barrel because switching to a better solution is
| so damn expensive.
|
| At least now with OTel you have an open standard that allows
| you to switch more easily, but even then I'd rather have 2 solutions
| that meet my exact observability requirements than a single
| solution that does everything OKish.
| mikeshi42 wrote:
| Biased as a founder in the space [1] but I think with
| OpenTelemetry + OSS extensible observability tooling, the
| holy grail of one tool is more realizable than ever.
|
| Vendor lock-in with OTel is hopefully now a thing of the past.
| And now that more obs solutions are going open source,
| hopefully it's no longer true that one tool has to be mediocre
| across all use cases: DD and the like are inherently limited
| by their own engineering teams, whereas OSS products can take
| community/customer contributions that improve the surface area
| over time on top of the core maintainers' work.
|
| [1] https://github.com/hyperdxio/hyperdx
| pranay01 wrote:
| I think that OpenTelemetry will solve this problem of vendor
| lock-in. I am a founder building in this space[1] and we see
| many of our users switching to OpenTelemetry as it provides
| an easy way to switch vendors if needed in the future.
|
| At SigNoz, we have metrics, traces, and logs in a single
| application, which helps you correlate across signals much
| more easily - and being natively based on OpenTelemetry makes
| this correlation easier still, as it leverages the standard
| data format.
|
| Though this might take some time, as many teams have
| proprietary SDKs in their code, which are not easy to rip out.
| OpenTelemetry auto-instrumentation[2] makes it much easier,
| and I think that's the path people will follow to get started.
|
| [1]https://github.com/SigNoz/signoz [2]https://opentelemetry.
| io/docs/instrumentation/java/automatic...
| sofixa wrote:
| Switch the backend destination of metrics/traces/logs, but
| all your dashboards, alerts, and potentially legacy data
| still need to be migrated. Drastically better than before
| where instrumentation and agents were custom for each
| backend, but there are still hurdles.
| nevon wrote:
| I would love to save a few hundred thousand a year by running
| the OTel collector over Datadog agents, just on the
| cost-per-host alone. Unfortunately that would also mean giving
| up Datadog APM
| and NPM, as far as I can tell, which have been really valuable.
| Going back to just metrics and traces would feel like quite the
| step backwards and be a hard sell.
| arccy wrote:
| you can submit opentelemetry traces to datadog which should be
| the equivalent of apm/npm, though maybe with a less polished
| integration.
| nevon wrote:
| Just traces are a long way off from APM and NPM. APM gives me
| the ability to debug memory leaks from continuous heap
| snapshots, or performance issues through CPU profiling. NPM
| is almost like having tcpdump running constantly, showing me
| where there's packet loss or other forms of connectivity
| issues.
| porker wrote:
| Thank you for sharing this. I've had "look at tracing" on
| my to-do list for months and assumed it was identical to
| APM. It seems it won't be a direct substitute, which helps
| explain the cost difference.
| throwaway084t95 wrote:
| What is the "first principles" argument that observability
| decomposes into logs, metrics, and tracing? I see this dogma
| accepted everywhere, but I'm inquisitive about it.
| yannyu wrote:
| First you had logs. Everyone uses logs because it's easy. Logs
| are great, but suddenly you're spending a crapton of time or
| money maintaining terabytes or petabytes of storage and ingest
| of logs. And even worse, in some cases for these logs, you
| don't actually care about 99% of the log line and simply want a
| single number, such as CPU utilization or the value of the
| shopping cart or latency.
|
| So, someone says, "let's make something smaller and more
| portable than logs. We need to track numerical data over time
| more easily, so that we can see pretty charts of when these
| values are outside of where they should be." This ends up being
| metrics and a time-series database (TSDB), built to handle not
| arbitrary lines of text but instead meant to parse out metadata
| and append numerical data to existing time-series based on that
| metadata.
|
| Between metrics and logs, you end up with a good idea of what's
| going on with your infrastructure, but logs are still too
| verbose to understand what's happening with your applications
| past a certain point. If you have an application crashing
| repeatedly, or if you've got applications running slowly,
| metrics and logs can't really help you there. So companies
| built out Application Performance Monitoring, meant to tap
| directly into the processes running on the box and spit out all
| sorts of interesting runtime metrics and events about not just
| the applications, but the specific methods and calls those
| applications are utilizing within their stack/code.
|
| Initially, this works great if you're running these APM tools
| on a single box within monolithic stacks, but as the world
| moved toward Cloud Service Providers and
| containerized/ephemeral infrastructure, APM stopped being as
| effective. When a transaction starts to go through multiple
| machines and microservices, APM deployed on those boxes
| individually can't give you the context of how these disparate
| calls relate to a holistic transaction.
|
| So someone says, "hey, what if we include transaction IDs in
| these service calls, so that we can post-hoc stitch together
| these individual transaction lines into a whole transaction,
| end-to-end?" Which is how you end up with the concept of spans
| and traces, taking what worked well with Application
| Performance Monitoring and generalizing that out into the
| modern microservices architectures that are more common today.
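|
| In OpenTelemetry terms that looks roughly like this (a sketch
| with illustrative names, assuming a tracer provider and
| propagator are already configured):
|
|     package main
|
|     import (
|         "context"
|         "net/http"
|
|         "go.opentelemetry.io/otel"
|         "go.opentelemetry.io/otel/propagation"
|     )
|
|     func callDownstream(ctx context.Context, url string) (*http.Response, error) {
|         // Start a span for this unit of work.
|         ctx, span := otel.Tracer("checkout").Start(ctx, "charge-card")
|         defer span.End()
|
|         req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
|         if err != nil {
|             return nil, err
|         }
|         // Inject the trace/span IDs (W3C traceparent header) so the
|         // downstream service's spans join the same trace.
|         otel.GetTextMapPropagator().Inject(ctx,
|             propagation.HeaderCarrier(req.Header))
|         return http.DefaultClient.Do(req)
|     }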
| shoelessone wrote:
| I really really want to use OTel for a small project but have
| always had a really tough time finding a path that is cheap or
| free for a personal project.
|
| In theory you can send telemetry data with OTel to CloudWatch,
| but I've struggled to connect the dots with the front-end
| application (e.g. React/Next.js).
| arccy wrote:
| grafana cloud, honeycomb, etc have free tiers, though you'll
| have to watch how much data you send them. or you can self host
| something like signoz or the elastic stack. frontend will
| typically go to an instance of opentelemetry collector to
| filter/convert to the protocol for the storage backend.
| yourapostasy wrote:
| Have you checked out Jaeger [1]? It is lightweight enough for a
| personal project, open source, and featureful enough to really
| help "turn on the lightbulb" with other engineers to show them
| the difference between logging/monitoring and tracing.
|
| [1] https://www.jaegertracing.io/
| Jedd wrote:
| The killer feature of OpenTelemetry for us is brokering (with
| ETL).
|
| Partly this lets us easily re-route & duplicate telemetry, partly
| it means changes to backend products in the future won't be a big
| disruption.
|
| For metrics we're a mostly telegraf->prometheus->grafana mimir
| shop - telegraf because it's rock solid and feature-rich,
| prometheus because there's no real competition in that tier, and
| mimir because of scale & self-host options.
|
| Our scale problem means most online pricing calculators generate
| overflow errors.
|
| Our non-security log destination preference is Loki - for similar
| reasons to Mimir - though a SIEM it definitely is not.
|
| Tracing to a vendor, but looking to bring that back to grafana
| Tempo. Product maturity is a long way off commercial APM
| offerings, but it feels like the feature-set is about 70% there
| and converging rapidly. Off-the-shelf tracing products have an
| appealingly low cost of entry, which only briefly defers lock-in
| & pricing shocks.
| pranay01 wrote:
| Yeah, the ability to send to multiple destinations is quite
| powerful, and most of this comes from the configurability of
| the OTel Collector [1].
|
| If you are looking for an open source backend for
| OpenTelemetry, you can explore SigNoz[2] (I am one of the
| founders). We have quite a decent product for APM/tracing,
| leveraging the OpenTelemetry-native data format and semantic
| conventions.
|
| [1]https://opentelemetry.io/docs/collector/
| [2]https://github.com/SigNoz/signoz
| Jedd wrote:
| Hi Pranay - actually I've had a signoz tab open for about 5
| weeks - once I find time I'm meaning to run it up in my lab.
| pranay01 wrote:
| Awesome! Do reach out to us in our slack community[1] if
| you have any questions or need any help setting things
| up.
|
| [1] https://signoz.io/slack
| nullify88 wrote:
| One thing that's slightly off-putting about OpenTelemetry is how
| resource attributes don't get included as prometheus labels for
| metrics; instead they are on an info metric, which requires a
| join to enrich the metric you are interested in.
|
| Luckily the prometheus exporters have a switch to enable this
| behaviour, but there's talk of removing this functionality
| because it breaks the spec. If you were to use the OpenTelemetry
| protocol into something like Mimir, you don't have the option of
| enabling that behaviour unless you use prometheus remote write.
|
| Our developers aren't a fan of that.
|
| https://opentelemetry.io/docs/specs/otel/compatibility/prome...
| jon-wood wrote:
| At the risk of being downvoted (probably justly) for having a
| moan, can we please have a moratorium on every blog post needing
| to have a generally irrelevant picture attached to it? On opening
| this page I can see 28 words that are actually relevant because
| almost the entire view is consumed by a huge picture of a graph
| and the padding around it.
|
| This is endemic now. It doesn't matter what someone is writing
| about, there'll be some pointless stock photo taking up half
| the page.
| There'll probably be some more throughout the page. Stop it
| please.
___________________________________________________________________
(page generated 2023-11-17 23:02 UTC)