[HN Gopher] Migrating to OpenTelemetry
       ___________________________________________________________________
        
       Migrating to OpenTelemetry
        
       Author : kkoppenhaver
       Score  : 242 points
       Date   : 2023-11-16 17:29 UTC (1 day ago)
        
 (HTM) web link (www.airplane.dev)
 (TXT) w3m dump (www.airplane.dev)
        
       | caust1c wrote:
       | Curious about the code implemented for logs! Hopefully that's
       | something that can be shared at some point. Also curious if it
       | integrates with `log/slog` :-)
       | 
       | Congrats too! As I understand it from stories I've heard from
       | others, migrating to OTel is no easy undertaking.
        
         | bhyolken wrote:
         | Thanks! For logs, we actually use github.com/segmentio/events
         | and just implemented a handler for that library that batches
         | logs and periodically flushes them out to our collector using
         | the underlying protocol buffer interface. We plan on migrating
         | to log/slog soon, and once we do that we'll adapt our handler
         | and can share the code.
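          | 
          | In case it's useful, here's a rough sketch of the batching and
          | flushing part (simplified and illustrative, not our production
          | code; the real handler adapts github.com/segmentio/events onto
          | something like Append), using the generated OTLP protobuf types
          | and a gRPC connection to the collector:
          | 
          |   package otellog
          | 
          |   import (
          |       "context"
          |       "sync"
          |       "time"
          | 
          |       collogspb "go.opentelemetry.io/proto/otlp/collector/logs/v1"
          |       commonpb "go.opentelemetry.io/proto/otlp/common/v1"
          |       logspb "go.opentelemetry.io/proto/otlp/logs/v1"
          |       "google.golang.org/grpc"
          |       "google.golang.org/grpc/credentials/insecure"
          |   )
          | 
          |   // batchingExporter buffers log records and periodically
          |   // exports them to the collector over OTLP/gRPC.
          |   type batchingExporter struct {
          |       mu     sync.Mutex
          |       buf    []*logspb.LogRecord
          |       client collogspb.LogsServiceClient
          |   }
          | 
          |   func newBatchingExporter(addr string, every time.Duration) (*batchingExporter, error) {
          |       conn, err := grpc.Dial(addr,
          |           grpc.WithTransportCredentials(insecure.NewCredentials()))
          |       if err != nil {
          |           return nil, err
          |       }
          |       e := &batchingExporter{client: collogspb.NewLogsServiceClient(conn)}
          |       go func() {
          |           for range time.Tick(every) {
          |               e.flush(context.Background())
          |           }
          |       }()
          |       return e, nil
          |   }
          | 
          |   // Append is called once per log event by the handler adapter.
          |   func (e *batchingExporter) Append(severity, msg string) {
          |       e.mu.Lock()
          |       defer e.mu.Unlock()
          |       e.buf = append(e.buf, &logspb.LogRecord{
          |           TimeUnixNano: uint64(time.Now().UnixNano()),
          |           SeverityText: severity,
          |           Body: &commonpb.AnyValue{
          |               Value: &commonpb.AnyValue_StringValue{StringValue: msg},
          |           },
          |       })
          |   }
          | 
          |   // flush swaps out the buffer and exports it as one request.
          |   func (e *batchingExporter) flush(ctx context.Context) {
          |       e.mu.Lock()
          |       batch := e.buf
          |       e.buf = nil
          |       e.mu.Unlock()
          |       if len(batch) == 0 {
          |           return
          |       }
          |       _, _ = e.client.Export(ctx, &collogspb.ExportLogsServiceRequest{
          |           ResourceLogs: []*logspb.ResourceLogs{{
          |               ScopeLogs: []*logspb.ScopeLogs{{LogRecords: batch}},
          |           }},
          |       })
          |   }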
        
           | caust1c wrote:
           | Awesome! Great work and thanks for sharing your experience!
        
       | MajimasEyepatch wrote:
       | It's interesting that you're using both Honeycomb and Datadog.
       | With everything migrated to OTel, would there be advantages to
       | consolidating on just Honeycomb (or Datadog)? Have you found
       | they're useful for different things, or is there enough overlap
       | that you could use just one or the other?
        
         | bhyolken wrote:
         | Author here, thanks for the question! The current split
         | developed from the personal preferences of the engineers who
         | initially set up our observability systems, based on what they
         | had used (and liked) at previous jobs.
         | 
         | We're definitely open to doing more consolidation in the
         | future, especially if we can save money by doing that, but from
         | a usability standpoint we've been pretty happy with Honeycomb
          | for traces and Datadog for everything else so far. And that
          | seems to be aligned with what each vendor does best at the
          | moment.
        
           | MuffinFlavored wrote:
           | > from the personal preferences of the engineers
           | 
           | https://www.honeycomb.io/pricing
           | 
           | https://www.datadoghq.com/pricing/
           | 
            | Am I wrong to say... having 2 is "expensive"? Maybe not if
            | 50% of your stuff is going to Honeycomb and 50% to DataDog.
            | Could you save money/complexity (fewer places to look for
            | things) by having just DataDog or just Honeycomb?
        
             | bhyolken wrote:
             | Right now, there isn't much duplication of what we're
             | sending to each vendor, so I don't think we'd save a ton by
             | consolidating, at least based on list prices. We could
             | maybe negotiate better prices based on higher volumes, but
             | I'm not sure if Airplane is spending enough at this point
             | to get massive discounts there.
             | 
             | Another potential benefit would definitely be reduced
             | complexity and better integration for the engineering team.
             | So, for instance, you could look at a log and then more
             | easily navigate to the UI for the associated trace.
             | Currently, we do this by putting Honeycomb URLs in our
             | Datadog log events, which works but isn't quite as
             | seamless. But, given that our team is pretty small at this
             | point and that we're not spending a ton of our time on
             | performance optimizations, we don't feel an urgent need to
             | consolidate (yet).
        
               | MuffinFlavored wrote:
               | When you say DataDog for everything else (as in not
               | traces), besides logs, what else do you mean?
        
               | claytonjy wrote:
               | Metrics, probably? The article calls out logs, metrics,
               | and traces as the 3 pillars of observability.
        
               | bhyolken wrote:
               | Yeah, metrics and logs, plus a few other things that
               | depend on these (alerts, SLOs, metric-based dashboards,
               | etc.).
        
       | tapoxi wrote:
       | I made this switch very recently. For our Java apps it was as
       | simple as loading the otel agent in place of the Datadog SDK,
       | basically "-javaagent:/opt/otel/opentelemetry-javaagent.jar" in
       | our args.
       | 
        | The collector (which processes and ships telemetry) can be
        | installed in K8s through Helm or an operator, and we just added a
        | variable to our charts so the agent can be pointed at the
        | collector. The collector speaks OTLP, which is the fancy combined
        | metrics/traces/logs protocol the OTel SDKs/agents use, but it
        | also speaks Prometheus, Zipkin, etc. to give you an easy
        | migration path. We currently ship to Datadog as well as an
        | internal service, with the end goal of gradually migrating off of
        | Datadog.
        
         | andrewstuart2 wrote:
         | We tried this about a year and a half ago and ended up going
         | somewhat backwards into DD entrenchment, because they've
         | decided that anything not an official DD metric (that is,
         | collected by their agent typically) is custom and then becomes
         | substantially more expensive. We wanted a nice migration path
         | from any vendor to any other vendor but they have a fairly
         | effective strategy for making gradual migrations more expensive
          | for heavy telemetry users. At least our instrumentation these
          | days is otel, but for the metrics we expected to just scrape
          | from prometheus, we had to dial back and get them through the
          | more official DD agent metrics and configs instead, lest our
          | bill balloon by 10x. It's a frustrating place to be, especially
          | since it's still not remotely cheap - just that it could be way
          | worse.
         | 
         | I know this isn't a DataDog post, and I'm a bit off topic, but
         | I try to do my best to warn against DD these days.
        
           | shawnb576 wrote:
           | This has been a concern for me too. But the agent is just a
           | statsd receiver with some extra magic, so this seems like a
           | thing that could be solved with the collector sending traffic
           | to an agent rather than the HTTP APIs?
           | 
            | I looked at the OTel DD stuff and did not see any support for
            | this, fwiw. Maybe it doesn't work b/c the agent expects more
            | context from the pod (e.g. app and label?).
        
             | andrewstuart2 wrote:
             | Yeah, the DD agent and the otel-collector DD exporter
             | actually use the same code paths for the most part. The
             | relevant difference tends to be in metrics, where the
             | official path involves the DD agent doing collection
             | directly, for example, collecting redis metrics by giving
             | the agent your redis database hostname and creds. It can
             | then pack those into the specific shape that DD knows about
             | and they get sent with the right name, values, etc so that
             | DD calls them regular metrics.
             | 
              | If you instead go the more flexible route of using the
              | de-facto standard prometheus exporters like the one for
              | redis, or built-in prometheus metrics from something like
              | istio, and forward those to your agent (or configure your
              | agent to poll them), it won't do any reshaping (which I can
              | see the arguments for, kinda, knowing a bit about their
              | backend). They just end up in the DD backend as custom
              | metrics, billed at $0.10/mo per 100 time series. If you've
              | used prometheus before for any realistic deployment with
              | enrichment etc, you can probably see this gets expensive
              | ridiculously fast.
             | 
              | What I wish they'd do instead is have some form of adapter
              | from those de facto standards, so I can still collect
              | metrics 99% my own way, in a portable fashion, and then add
              | DD as my backend without everything ending up as custom and
              | costing significantly more.
        
           | xyst wrote:
           | > somewhat backwards into DD entrenchment, because they've
           | decided that anything not an official DD metric (that is,
           | collected by their agent typically) is custom and then
           | becomes substantially more expensive.
           | 
            | If a vendor pulled shit like this on me, that's when I would
            | cancel them. Of course, most big orgs would rather not do the
            | leg work to actually become portable and migrate off the
            | vendor. So of course they will just pay the bill.
           | 
           | Vendors love the custom shit they build because they know
           | once it's infiltrated the stack then it's basically like
           | gangrene (have to cut off the appendage to save the host)
        
       | k__ wrote:
        | I had the impression that logs and metrics are a
        | pre-observability thing.
        
         | SteveNuts wrote:
         | I've never heard the term "pre-observability", what does that
         | mean?
        
           | renegade-otter wrote:
           | The era when "debugging in production" wasn't standard.
        
         | marcosdumay wrote:
         | Observability is about logs and metrics, and pre-observability
         | (I guess you mean the high-level-only records simpler
         | environments keep) is also about logs and metrics.
         | 
          | Anything you record to keep track of your environment takes the
          | form of either logs or metrics. The difference is in the
          | contents of those logs and metrics.
        
           | k__ wrote:
           | When I read Observability Engineering, I got the impression
            | it was about wide events and tracing, and metrics and logs
           | were a thing of the past people gave up on since the rise of
           | Microservices.
        
             | jwestbury wrote:
             | > metrics and logs were a thing of the past people gave up
             | on since the rise of Microservices
             | 
             | Definitely not the case, and, in fact, probably the
             | opposite is true. In the era of microservices, metrics are
             | absolutely critical to understand the health of your
             | system. Distributed tracing is also only beneficial if you
             | have the associated logs - so that you can understand what
             | each piece of the system was doing for a single unit of
             | work.
        
               | phillipcarter wrote:
               | > Distributed tracing is also only beneficial if you have
               | the associated logs - so that you can understand what
               | each piece of the system was doing for a single unit of
               | work.
               | 
               | Ehhh, that's only if you view tracing as "the thing that
               | tells me that service A talks to service B". Spans in a
               | trace are just structured logs. They are your application
               | logging vehicle, especially if you don't have a legacy of
               | good in-app instrumentation via logs.
               | 
               | But even then the worlds are blurring a bit. OTel logs
               | burn in a span and trace ID, and depending on the backend
               | that correlated log may well just be treated as if it's a
               | part of the trace.
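                | 
                | To make that concrete, a minimal sketch (field names
                | illustrative) of stamping the active trace/span IDs onto
                | a log/slog line so the backend can correlate it with the
                | trace:
                | 
                |   package main
                | 
                |   import (
                |       "context"
                |       "log/slog"
                |       "os"
                | 
                |       "go.opentelemetry.io/otel/trace"
                |   )
                | 
                |   // logWithTrace attaches the IDs of whatever span is
                |   // active in ctx to the log record.
                |   func logWithTrace(ctx context.Context, l *slog.Logger, msg string) {
                |       sc := trace.SpanContextFromContext(ctx)
                |       l.InfoContext(ctx, msg,
                |           slog.String("trace_id", sc.TraceID().String()),
                |           slog.String("span_id", sc.SpanID().String()),
                |       )
                |   }
                | 
                |   func main() {
                |       l := slog.New(slog.NewJSONHandler(os.Stdout, nil))
                |       // In real code ctx would carry a span started by a tracer.
                |       logWithTrace(context.Background(), l, "charge completed")
                |   }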
        
             | sofixa wrote:
             | > Authors Charity Majors, Liz Fong-Jones, and George
             | Miranda from Honeycomb explain what constitutes good
             | observability, show you how to improve upon what you're
             | doing today, and provide practical dos and don'ts for
             | migrating from legacy tooling, such as metrics, monitoring,
             | and log management. You'll also learn the impact
             | observability has on organizational culture (and vice
             | versa).
             | 
              | No wonder - it's either strong bias from people working at
              | a tracing vendor, or an outright sales pitch.
             | 
              | It's totally false though. Each pillar - metrics, logs and
              | traces - has its place and serves a different purpose. You
              | won't use traces to measure the number of requests hitting
              | your load balancer, or the number of objects in the async
              | queue, or CPU utilisation, or network latency, or any
              | number of things. Logs can be richer than traces, and a
              | nice pattern I've used with Grafana is linking the two and
              | having the option to jump from a trace to the corresponding
              | log lines, which can describe the different actions
              | performed during that span.
        
               | phillipcarter wrote:
               | You can sorta measure some of this with traces. For
               | example, sampled traces that contain the sampling rate in
               | their metadata let you re-weight counts, thus allowing
               | you to accurately measure "number of requests to x".
               | Similarly, a good sampling of network latency can
               | absolutely be measured by trace data. Metrics will always
               | have their place, though, for reasons you mention -
               | measuring cpu utilization, # of objects in something etc.
               | Logs vs. traces is more nuanced I think. A trace is
               | nothing more than a collection of structured logs. I
               | would wager that nearly all use cases for structured
                | logging could be wholesale replaced by tracing. Security
                | logging and big-object logging are exceptions, although
                | that's also dependent on your vendor or backend.
        
       | tsamba wrote:
       | Interesting read. What did you find easier about using GCP's log
       | tooling for your internal system logs, rather than the OTel
       | collector?
        
         | clintonb wrote:
         | Their collector is used to send infrastructure logs to GCP
         | (instead of Datadog).
         | 
         | My guess is this is to save on costs. GCP logging is probably
         | cheaper than Datadog, and infrastructure logs may not be needed
         | as frequently as application logs.
        
         | bhyolken wrote:
         | Author here. This decision was more about ease of
         | implementation than anything else. Our internal application
         | logs were already being scooped up by GCP because we run our
         | services in GKE, and we already had a GCP->Datadog log syncer
         | [1] for some other GCP infra logs, so re-using the GCP-based
         | pipeline was the easiest way to handle our application logs
         | once we removed the Datadog agent.
         | 
         | In the future, we'll probably switch these logs to also go
         | through our collector, and it shouldn't be super hard (because
         | we already implemented a golang OTel log handler for the
         | external case), but we just haven't gotten around to it yet.
         | 
         | [1]
         | https://docs.datadoghq.com/integrations/google_cloud_platfor...
        
       | roskilli wrote:
       | > Moreover, we encountered some rough edges in the metrics-
       | related functionality of the Go SDK referenced above. Ultimately,
       | we had to write a conversion layer on top of the OTel metrics API
       | that allowed for simple, Prometheus-like counters, gauges, and
       | histograms.
       | 
       | Have encountered this a lot from teams attempting to use the
       | metrics SDK.
       | 
        | Are you open to commenting on the specifics here, and on what
        | kind of shim you had to put in front of the SDK? It would be
        | great to keep gathering feedback so that we as a community have a
        | good idea of what remains before it's possible to use the SDK in
        | anger for real-world production use cases. Just wiring up the
        | setup in your app used to be fairly painful, but that has gotten
        | somewhat better over the last 12-24 months. I'd also love to hear
        | what is currently causing compatibility issues with the metric
        | types themselves that requires a shim, and what the shim is doing
        | to achieve compatibility.
        
         | bhyolken wrote:
         | Sure, happy to provide more specifics!
         | 
         | Our main issue was the lack of a synchronous gauge. The
         | officially supported asynchronous API of registering a callback
         | function to report a gauge metric is very different from how we
         | were doing things before, and would have required lots of
         | refactoring of our code. Instead, we wrote a wrapper that
         | exposes a synchronous-like API: https://gist.github.com/yolken-
         | airplane/027867b753840f7d15d6....
         | 
         | It seems like this is a common feature request across many of
         | the SDKs, and it's in the process of being fixed in some of
         | them (https://github.com/open-telemetry/opentelemetry-
         | specificatio...)? I'm not sure what the plans are for the
         | golang SDK specifically.
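          | 
          | For reference, a simplified sketch of that kind of wrapper (the
          | general shape, not the actual gist): the caller gets a
          | synchronous Set, and a registered callback reports the latest
          | value at collection time.
          | 
          |   package syncgauge
          | 
          |   import (
          |       "context"
          |       "math"
          |       "sync/atomic"
          | 
          |       "go.opentelemetry.io/otel/metric"
          |   )
          | 
          |   // Gauge stores the last Set value; the SDK's observable-
          |   // gauge callback reads it whenever metrics are collected.
          |   type Gauge struct {
          |       bits atomic.Uint64 // float64 bits of the latest value
          |   }
          | 
          |   func New(meter metric.Meter, name string) (*Gauge, error) {
          |       g := &Gauge{}
          |       _, err := meter.Float64ObservableGauge(name,
          |           metric.WithFloat64Callback(
          |               func(_ context.Context, o metric.Float64Observer) error {
          |                   o.Observe(math.Float64frombits(g.bits.Load()))
          |                   return nil
          |               }))
          |       return g, err
          |   }
          | 
          |   // Set behaves like a synchronous gauge update.
          |   func (g *Gauge) Set(v float64) { g.bits.Store(math.Float64bits(v)) }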
         | 
         | Another, more minor issue, is the lack of support for
         | "constant" attributes that are applied to all observations of a
         | metric. We use these to identify the app, among other use
         | cases, so we added wrappers around the various "Add", "Record",
         | "Observe", etc. calls that automatically add these. (It's
         | totally possible that this is supported and I missed it, in
         | which case please let me know.)
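          | 
          | The attribute wrappers are similarly thin. Roughly, for a
          | counter (simplified, names illustrative):
          | 
          |   import (
          |       "context"
          | 
          |       "go.opentelemetry.io/otel/attribute"
          |       "go.opentelemetry.io/otel/metric"
          |   )
          | 
          |   // Counter stamps a fixed attribute set onto every Add call;
          |   // we have similar wrappers for Record, Observe, etc.
          |   type Counter struct {
          |       inner metric.Float64Counter
          |       attrs metric.MeasurementOption
          |   }
          | 
          |   func WrapCounter(c metric.Float64Counter, constant ...attribute.KeyValue) *Counter {
          |       return &Counter{inner: c, attrs: metric.WithAttributes(constant...)}
          |   }
          | 
          |   func (c *Counter) Add(ctx context.Context, v float64, extra ...attribute.KeyValue) {
          |       c.inner.Add(ctx, v, c.attrs, metric.WithAttributes(extra...))
          |   }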
         | 
          | Overall, the SDK was generally well-written and well-
          | documented; we just needed some extra work to make the
          | interfaces more similar to the ones we were using before.
        
           | arccy wrote:
           | the official SDKs will only support an api once there's a
           | spec that allows it.
           | 
           | for const attributes, generally these should be defined at
           | the resource / provider level: https://pkg.go.dev/go.opentele
           | metry.io/otel/sdk/metric#WithR...
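            | 
            | e.g. something along these lines when building the provider
            | (service/env values are just examples):
            | 
            |   import (
            |       "go.opentelemetry.io/otel/attribute"
            |       sdkmetric "go.opentelemetry.io/otel/sdk/metric"
            |       "go.opentelemetry.io/otel/sdk/resource"
            |   )
            | 
            |   // every metric from meters created by this provider
            |   // carries these resource attributes, so no per-measurement
            |   // wrappers are needed.
            |   func newMeterProvider(exp sdkmetric.Exporter) *sdkmetric.MeterProvider {
            |       res := resource.NewSchemaless(
            |           attribute.String("service.name", "api"),
            |           attribute.String("deployment.environment", "prod"),
            |       )
            |       return sdkmetric.NewMeterProvider(
            |           sdkmetric.WithResource(res),
            |           sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
            |       )
            |   }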
        
           | roskilli wrote:
           | Thanks for the detailed response.
           | 
            | I am surprised there is no gauge update API yet (instead of
            | callback-only); this is a common use case and I don't think
            | folks should be expected to implement their own. Especially
            | since it will lead to potentially allocation-heavy bespoke
            | implementations, depending on the use case, given the
            | mutex+callback+other structures that likely need to be heap
            | allocated (vs. a simple int64 wrapper with atomic update/load
            | APIs).
           | 
            | Also, the fact that the APIs differ a lot from the more
            | popular Prometheus client libraries raises the question of
            | whether we need more complicated APIs that folks have a
            | harder time using. Now is the time to modernize these, before
            | everyone is instrumented with some generation of a client
            | library that would then need to change/evolve. The whole idea
            | of an OTel SDK is to instrument once and avoid needing to
            | re-instrument when you change your observability pipeline and
            | where it's pointed. That becomes a hard sell if the OTel SDK
            | needs to shift significantly to support more popular & common
            | use cases with more typical APIs, and in doing so leaves a
            | whole bunch of OTel-instrumented code that needs to be
            | modernized to a different-looking API.
        
       | CSMastermind wrote:
       | > The data collected from these streams is sent to several
       | vendors including Datadog (for application logs and metrics),
       | Honeycomb (for traces), and Google Cloud Logging (for
       | infrastructure logs).
       | 
        | It sounds like they were in a place a lot of companies are in,
        | where they don't have a single pane of glass for observability.
        | One of, if not the, main benefits I've gotten out of Datadog is
        | having everything in Datadog so that it's all connected and I can
        | easily jump from a trace to its logs, for instance.
        | 
        | One of the terrible mistakes I see companies make with this
        | tooling is fragmenting like this. Everyone has their own personal
        | preference of tool, and ultimately the collective experience ends
        | up worse than the sum of its parts.
        
         | devin wrote:
         | Eh, personally I view honeycomb and datadog as different enough
         | offerings that I can see why you'd choose to have both.
        
         | dexterdog wrote:
         | Depending on your usage it can be prohibitively expensive to
         | use datadog for everything like that. We have it for just our
         | prod env because it's just not worth what it brings to the
         | table to put all of our logs into it.
        
           | dabeeeenster wrote:
           | Is prod not 99% of your logs?
        
           | shric wrote:
           | I once worked out what it would cost to send our company's
           | prod logs to datadog. It was 1.5x our total AWS cost. The
           | company ran entirely on AWS
        
         | maccard wrote:
         | I've spent a small amount of time in datadog, lots in grafana,
          | and somewhere in between in honeycomb. Our applications are
          | designed to emit traces, and comparing honeycomb with tracing
          | to a traditional app with metrics and logs, I would choose
          | tracing every time.
         | 
          | It annoys me that logs are overlooked in honeycomb (and metrics
          | are... fine). But given the choice between a single pane of
          | glass in grafana, or doing logs (and sometimes metrics) in
          | cloudwatch while spending 95% of my time in honeycomb - I'd
          | pick honeycomb every time.
        
           | mdtusz wrote:
            | Agreed - honeycomb has been a boon, but some improvements to
            | metric displays and the ability to set the default "board"
            | used on the home page would be very welcome. I'd also be
            | pretty happy if there was a way to drop events on the
            | honeycomb side to filter dynamically - e.g. "don't even
            | bother storing this trace if it has an http.status_code <
            | 400". This is surprisingly painful to implement on the
            | application side (at least in rust).
           | 
           | Hopefully someone that works there is reading this.
        
             | masterj wrote:
             | It sounds like you should look into their tail-sampling
             | Refinery tool https://docs.honeycomb.io/manage-data-
             | volume/refinery/
        
               | phillipcarter wrote:
               | Yep, this is the one to use. Refinery handles exactly
               | this scenario (and more).
        
           | viraptor wrote:
           | Have you tried the traces in grafana/tempo yet?
           | https://grafana.com/docs/grafana/latest/panels-
           | visualization...
           | 
            | It seems to be missing some aggregation stuff, but it's also
            | improving every time I check. I wonder if anyone's used it in
            | anger yet, and how far it is from replacing datadog or
            | honeycomb.
        
             | arccy wrote:
              | tempo still feels very much like: look at a trace that you
              | found somewhere else (like logs).
              | 
              | with so much information in traces, and the sheer volume,
              | aggregation really is the key to getting actionable info
              | out of a tracing setup if it's going to be the primary
              | entry point.
        
             | maccard wrote:
              | I've not. Honestly, I'm not in the market for tool shopping
              | at the moment; I'd need another honeycomb-style "this is
              | incredible" moment to start looking again. I think it would
              | take "Honeycomb, but we handle metric rollups and do logs"
              | right now.
        
           | ankit01-oss wrote:
           | You can also check out SigNoz -
           | https://github.com/SigNoz/signoz. It has logs, metrics, and
            | traces under a single pane. If you're using otel libraries
            | and the otel collector you can do a lot of correlation
            | between your logs and traces. I am a maintainer, and we have
            | seen a lot of users choose signoz for the ease of having all
            | three signals in a single pane.
        
           | serverlessmom wrote:
           | I think Honeycomb is perfect for one kind of user, who's
           | entirely concerned with traces and very long retention. For a
           | more general OpenTelemetry-native solution, check out Signoz.
        
         | rewmie wrote:
         | > It sounds like they were in a place that a lot of companies
         | are in where they don't have a single pane of glass for
         | observability.
         | 
         | One of the biggest features of AWS which is very easy to take
          | for granted and go unnoticed is Amazon CloudWatch. It supports
          | metrics, logging, alarms, metrics from alarms, alarms from
          | alarms, querying historical logs, triggering actions, etc.
          | etc., and it covers every single service provided by AWS,
          | including metaservices like AWS Config and CloudTrail.
         | 
         | And you barely notice it. It's just there, and you can see
         | everything.
         | 
         | > One of the terrible mistakes I see companies make with this
         | tooling is fragmenting like this.
         | 
          | So much this. It's not fun at all to have to go through logs
          | and metrics on any application, and much less so if for some
          | reason their maintainers scattered their metrics emission to
          | the four winds. However, with AWS all roads lead to CloudWatch,
          | and everything is so much better.
        
           | yourapostasy wrote:
           | _> ...with AWS all roads lead to Cloudwatch, and everything
           | is so much better._
           | 
           | Most of my clients are not in the product-market fit for AWS
           | CloudWatch, because most of their developers don't have the
           | development, testing and operational maturity/discipline to
           | use CloudWatch cost-effectively (this is at root an
           | organization problem, but let's not go off onto that giant
           | tangent). So the only realistic tracing strategy we converged
           | upon to recommend for them is "grab everything, and retain it
           | up to the point in time we won't be blamed for not knowing
           | root cause" (which in some specific cases can be up to
           | years!), while we undertake the long journey with them to
           | upskill their teams.
           | 
            | This would make CloudWatch rapidly climb into the top three
            | largest line items in the AWS bill, easily justifying
            | spinning up that tracing functionality in-house. So we wind
            | up opting for self-managed tooling like Elastic Observability
            | or Honeycomb, where the pricing is friendlier to teams in
            | unfortunate situations that need to start by capturing
            | everything for CYA, much as I would like to stay within
            | CloudWatch.
           | 
           | Has anyone found a better solution to these use cases where
           | the development maturity level is more prosaic, or is this
           | really the best local maxima at the industry's current SOTA?
        
           | everfrustrated wrote:
           | In addition, one of the largest limitations of CloudWatch is
           | it doesn't work well with a many-aws-account strategy.
           | 
           | Some part of the value of Datadog etc is having a single pane
           | of glass over many aws accounts.
        
         | badloginagain wrote:
          | I feel we hold up the single observability solution as the Holy
          | Grail, and I can see the argument for it - one place to
          | understand the health of your services.
         | 
         | But I've also been in terrible vendor lock-in situations, being
         | bent over the barrel because switching to a better solution is
         | so damn expensive.
         | 
          | At least now with OTel you have an open standard that allows
          | you to switch more easily, but even then I'd rather have 2
          | solutions that meet my exact observability requirements than a
          | single solution that does everything OK-ish.
        
           | mikeshi42 wrote:
           | Biased as a founder in the space [1] but I think with
           | OpenTelemetry + OSS extensible observability tooling, the
           | holy grail of one tool is more realizable than ever.
           | 
            | Vendor lock-in is hopefully a thing of the past with OTel -
            | and now that more obs solutions are going open source, it's
            | no longer a given that one tool will be mediocre across all
            | use cases (DD and the like are inherently limited by their
            | own engineering teams, whereas OSS products can take
            | community/customer contributions to improve the surface area
            | over time on top of the core maintainers' work).
           | 
           | [1] https://github.com/hyperdxio/hyperdx
        
           | pranay01 wrote:
            | I think OpenTelemetry will solve this problem of vendor
            | lock-in. I am a founder building in this space[1], and we see
            | many of our users switching to opentelemetry as it provides
            | an easy way to switch backends if needed in the future.
            | 
            | At SigNoz, we have metrics, traces and logs in a single
            | application, which helps you correlate across signals much
            | more easily - and being natively based on opentelemetry makes
            | this correlation easier still, as it leverages the standard
            | data format.
            | 
            | This might take some time, though, as many teams have
            | proprietary SDKs in their code, which are not easy to rip
            | out. Opentelemetry auto-instrumentation[2] makes it much
            | easier, and I think that's the path people will follow to get
            | started.
           | 
           | [1]https://github.com/SigNoz/signoz [2]https://opentelemetry.
           | io/docs/instrumentation/java/automatic...
        
             | sofixa wrote:
              | It lets you switch the backend destination of
              | metrics/traces/logs, but all your dashboards, alerts, and
              | potentially legacy data still need to be migrated.
              | Drastically better than before, when instrumentation and
              | agents were custom for each backend, but there are still
              | hurdles.
        
       | nevon wrote:
        | I would love to save a few hundred thousand a year by running
        | the OTel collector instead of Datadog agents, on the
        | cost-per-host alone. Unfortunately that would also mean giving
        | up Datadog APM and NPM, as far as I can tell, which have been
        | really valuable. Going back to just metrics and traces would
        | feel like quite a step backwards and be a hard sell.
        
         | arccy wrote:
         | you can submit opentelemetry traces to datadog which should be
         | the equivalent of apm/npm, though maybe with a less polished
         | integration.
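          | 
          | roughly like this from go, for example (endpoint is
          | illustrative - a datadog agent with OTLP ingest enabled, or an
          | otel collector with a datadog exporter):
          | 
          |   package main
          | 
          |   import (
          |       "context"
          | 
          |       "go.opentelemetry.io/otel"
          |       "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
          |       sdktrace "go.opentelemetry.io/otel/sdk/trace"
          |   )
          | 
          |   func setupTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
          |       // standard OTLP exporter; only the destination is DD-specific
          |       exp, err := otlptracegrpc.New(ctx,
          |           otlptracegrpc.WithEndpoint("datadog-agent.monitoring:4317"),
          |           otlptracegrpc.WithInsecure(),
          |       )
          |       if err != nil {
          |           return nil, err
          |       }
          |       tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
          |       otel.SetTracerProvider(tp)
          |       return tp, nil
          |   }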
        
           | nevon wrote:
           | Just traces are a long way off from APM and NPM. APM gives me
           | the ability to debug memory leaks from continuous heap
           | snapshots, or performance issues through CPU profiling. NPM
           | is almost like having tcpdump running constantly, showing me
           | where there's packet loss or other forms of connectivity
           | issues.
        
             | porker wrote:
             | Thank you for sharing this, I've had "look at tracing" on
             | my to do list for months and assumed it was identical to
             | APM. It seems it won't be a direct substitute, which helps
             | explain the cost difference.
        
       | throwaway084t95 wrote:
        | What is the "first principles" argument that observability
        | decomposes into logs, metrics, and tracing? I see this dogma
        | accepted everywhere, but I'm curious where it comes from.
        
         | yannyu wrote:
         | First you had logs. Everyone uses logs because it's easy. Logs
         | are great, but suddenly you're spending a crapton of time or
          | money maintaining terabytes or petabytes of log storage and
          | ingest. And even worse, in some cases you don't actually care
          | about 99% of the log line and simply want a single number, such
          | as CPU utilization, the value of the shopping cart, or latency.
         | 
         | So, someone says, "let's make something smaller and more
         | portable than logs. We need to track numerical data over time
         | more easily, so that we can see pretty charts of when these
          | values are outside of where they should be." This ends up being
          | metrics and a time-series database (TSDB), built not to handle
          | arbitrary lines of text but to parse out metadata and append
          | numerical data to existing time series based on that metadata.
         | 
         | Between metrics and logs, you end up with a good idea of what's
         | going on with your infrastructure, but logs are still too
         | verbose to understand what's happening with your applications
         | past a certain point. If you have an application crashing
         | repeatedly, or if you've got applications running slowly,
         | metrics and logs can't really help you there. So companies
         | built out Application Performance Monitoring, meant to tap
         | directly into the processes running on the box and spit out all
         | sorts of interesting runtime metrics and events about not just
         | the applications, but the specific methods and calls those
         | applications are utilizing within their stack/code.
         | 
         | Initially, this works great if you're running these APM tools
         | on a single box within monolithic stacks, but as the world
         | moved toward Cloud Service Providers and
         | containerized/ephemeral infrastructure, APM stopped being as
         | effective. When a transaction starts to go through multiple
         | machines and microservices, APM deployed on those boxes
         | individually can't give you the context of how these disparate
         | calls relate to a holistic transaction.
         | 
         | So someone says, "hey, what if we include transaction IDs in
         | these service calls, so that we can post-hoc stitch together
         | these individual transaction lines into a whole transaction,
         | end-to-end?" Which is how you end up with the concept of spans
         | and traces, taking what worked well with Application
         | Performance Monitoring and generalizing that out into the
         | modern microservices architectures that are more common today.
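          | 
          | In OTel terms, that "transaction ID" idea looks roughly like
          | this (a minimal sketch; in-process here, with the context
          | propagated via headers between services):
          | 
          |   package main
          | 
          |   import (
          |       "context"
          | 
          |       "go.opentelemetry.io/otel"
          |   )
          | 
          |   func handleCheckout(ctx context.Context) {
          |       // root span: allocates the trace ID for the whole transaction
          |       ctx, span := otel.Tracer("shop").Start(ctx, "checkout")
          |       defer span.End()
          |       chargeCard(ctx)
          |   }
          | 
          |   func chargeCard(ctx context.Context) {
          |       // child span: same trace ID, so the backend can stitch
          |       // the pieces back into one end-to-end trace
          |       _, span := otel.Tracer("shop").Start(ctx, "charge-card")
          |       defer span.End()
          |       // ...call the payment service, propagating ctx in headers...
          |   }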
        
       | shoelessone wrote:
       | I really really want to use OTel for a small project but have
       | always had a really tough time finding a path that is cheap or
       | free for a personal project.
       | 
        | In theory you can send telemetry data with OTel to CloudWatch,
        | but I've struggled to connect the dots with the front-end
        | application (e.g. React/Next.js).
        
         | arccy wrote:
         | grafana cloud, honeycomb, etc have free tiers, though you'll
         | have to watch how much data you send them. or you can self host
         | something like signoz or the elastic stack. frontend will
         | typically go to an instance of opentelemetry collector to
         | filter/convert to the protocol for the storage backend.
        
         | yourapostasy wrote:
         | Have you checked out Jaeger [1]? It is lightweight enough for a
         | personal project, open source, and featureful enough to really
         | help "turn on the lightbulb" with other engineers to show them
         | the difference between logging/monitoring and tracing.
         | 
         | [1] https://www.jaegertracing.io/
        
       | Jedd wrote:
       | The killer feature of OpenTelemetry for us is brokering (with
       | ETL).
       | 
       | Partly this lets us easily re-route & duplicate telemetry, partly
       | it means changes to backend products in the future won't be a big
       | disruption.
       | 
        | For metrics we're mostly a telegraf->prometheus->grafana mimir
        | shop - telegraf because it's rock solid and feature-rich,
        | prometheus because there's no real competition in that tier, and
        | mimir because of scale & self-host options.
       | 
       | Our scale problem means most online pricing calculators generate
       | overflow errors.
       | 
       | Our non-security log destination preference is Loki - for similar
       | reasons to Mimir - though a SIEM it definitely is not.
       | 
       | Tracing to a vendor, but looking to bring that back to grafana
        | Tempo. Product maturity is a long way off from commercial APM
       | offerings, but it feels like the feature-set is about 70% there
       | and converging rapidly. Off-the-shelf tracing products have an
       | appealingly low cost of entry, which only briefly defers lock-in
       | & pricing shocks.
        
         | pranay01 wrote:
          | Yeah, the ability to send to multiple destinations is quite
          | powerful, and most of this comes from the configurability of
          | the OTel Collector [1].
          | 
          | If you are looking for an open source backend for
          | OpenTelemetry, you can explore SigNoz[2] (I am one of the
          | founders). We have quite a decent product for APM/tracing,
          | leveraging the OpenTelemetry-native data format and semantic
          | conventions.
         | 
         | [1]https://opentelemetry.io/docs/collector/
         | [2]https://github.com/SigNoz/signoz
        
           | Jedd wrote:
           | Hi Pranay - actually I've had a signoz tab open for about 5
           | weeks - once I find time I'm meaning to run it up in my lab.
        
             | pranay01 wrote:
             | Awesome! Do reach out to us in our slack community[1] if
             | you have any questions or need any help on setting things
             | up
             | 
             | [1] https://signoz.io/slack
        
       | nullify88 wrote:
        | One thing that's slightly off-putting about OpenTelemetry is how
        | resource attributes don't get included as prometheus labels on
        | metrics; instead they land on an info metric, which requires a
        | join to enrich the metric you are interested in.
        | 
        | Luckily the prometheus exporters have a switch to enable this
        | behaviour, but there's talk of removing it because it breaks the
        | spec. If you were to send the OpenTelemetry protocol into
        | something like Mimir, you don't have the option of enabling that
        | behaviour unless you use prometheus remote write.
        | 
        | Our developers aren't fans of that.
       | 
       | https://opentelemetry.io/docs/specs/otel/compatibility/prome...
        
       | jon-wood wrote:
       | At the risk of being downvoted (probably justly) for having a
       | moan, can we please have a moratorium on every blog post needing
       | to have a generally irrelevant picture attached to it? On opening
       | this page I can see 28 words that are actually relevant because
       | almost the entire view is consumed by a huge picture of a graph
       | and the padding around it.
       | 
        | This is endemic now. It doesn't matter what someone is writing
        | about, there'll be some pointless stock photo taking up half the
        | page, and probably some more throughout. Stop it, please.
        
       ___________________________________________________________________
       (page generated 2023-11-17 23:02 UTC)