[HN Gopher] GitHub CI/CD observability with OpenTelemetry step b...
       ___________________________________________________________________
        
       GitHub CI/CD observability with OpenTelemetry step by step guide
        
       Author : ankit01-oss
       Score  : 124 points
       Date   : 2025-06-11 12:42 UTC (4 days ago)
        
 (HTM) web link (signoz.io)
 (TXT) w3m dump (signoz.io)
        
       | reactordev wrote:
        | As someone with some experience in observability at scale: the
        | issue with SigNoz, Prometheus, etc. is that they can only
        | operate on the data exposed by the underlying infrastructure,
        | whereas the IaaS provider has all the information needed to
        | provide a better experience. Hence CloudWatch.
        | 
        | That said, if you own your infrastructure, I'd build out a
        | SigNoz cluster in a heartbeat. OTel is awesome, but once you set
        | down a path for your org, it's going to be extremely painful to
        | switch. Choose OTel if you're hybrid cloud or have on-premises
        | stuff. If you're on AWS, CloudWatch is a better option simply
        | because they have the data. Dead simple tracing.
        
         | 6r17 wrote:
          | I did have some bad experiences with OTel, and I have a lot of
          | freedom on deployment. I had never heard of SigNoz; I will
          | definitely check it out. SigNoz works with OTel, I suppose?
          | 
          | I wonder if there are any other adapters for trace ingest
          | instead of OTel?
        
           | darkstar_16 wrote:
            | The Jaeger collector, perhaps, but then you'd have to use
            | the Jaeger UI. SigNoz has a much nicer UI that feels more
            | integrated, but last I checked it had annoying bugs, like
            | not keeping the time selection when I navigated between
            | screens.
        
             | 6r17 wrote:
              | Definitely should look up the tech more; I lazily
              | commented, as SigNoz clearly states that it ingests more
              | than 50 different sources.
        
           | elza_1111 wrote:
            | Yep, SigNoz is OpenTelemetry native. You can instrument your
            | application with OpenTelemetry and send telemetry data
            | directly to SigNoz.
        
           | bbkane wrote:
           | There are a few: I've played with https://uptrace.dev and
           | https://openobserve.ai/ . OpenObserve is a single binary, so
           | easy to set up
        
             | mdaniel wrote:
             | be cognizant of their licenses (AGPLv3), it matters in some
             | shops
             | 
             | https://github.com/uptrace/uptrace/blob/v1.7.6/LICENSE
             | 
             | https://github.com/openobserve/openobserve/blob/v0.14.7/LIC
             | E...
        
         | FunnyLookinHat wrote:
         | I think you're looking at OTel from a strictly infrastructure
         | perspective - which Cloudwatch does effectively solve without
         | any added effort. But OTel really begins to shine when you
         | instrument your backends. Some languages (Node.js) have a whole
         | slew of auto-instrumentation, giving you rich traces with spans
         | detailing each step of the http request, every SQL query, and
         | even usage of AWS services. Making those traces even more
         | valuable is that they're linked across services.
         | 
         | We've frequently seen a slowdown or error at the top of our
         | stack, and the teams are able to immediately pinpoint the
         | problem as a downstream service. Not only that, they can see
         | the specific issue in the downstream service almost
         | immediately!
         | 
         | Once you get to that level of detail, having your
         | infrastructure metrics pulled into your Otel provider does
         | start to make some sense. If you observe a slowdown in a
         | service, being able to see that the DB CPU is pegged at the
         | same time is meaningful, etc.
         | 
         | [Edit - Typo!]
        
           | makeavish wrote:
            | Agree with you on this. OTel agents allow exporting all
            | host/k8s metrics correlated with your logs and traces,
            | though exporting AWS service-specific metrics with OTel is
            | not easy. To solve this, SigNoz has 1-Click AWS
            | Integrations: https://signoz.io/blog/native-aws-integrations-with-
            | autodisc...
           | 
           | Also SigNoz has native correlation between different signals
           | out of the box.
           | 
           | PS: I am SigNoz Maintainer
        
           | elza_1111 wrote:
           | FYI for anyone reading, OTel does have great auto-
           | instrumentation for Python, Java and .NET also
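            | 
            | For example, a minimal sketch of the Python route (there is
            | also a zero-code opentelemetry-instrument wrapper; this
            | assumes a Flask app, the opentelemetry-instrumentation-flask
            | and -requests packages, and an OTLP collector on
            | localhost:4317):
            | 
            |   from flask import Flask
            |   from opentelemetry import trace
            |   from opentelemetry.sdk.trace import TracerProvider
            |   from opentelemetry.sdk.trace.export import (
            |       BatchSpanProcessor)
            |   from opentelemetry.exporter.otlp.proto.grpc.trace_exporter \
            |       import OTLPSpanExporter
            |   from opentelemetry.instrumentation.flask import (
            |       FlaskInstrumentor)
            |   from opentelemetry.instrumentation.requests import (
            |       RequestsInstrumentor)
            | 
            |   # send spans to the local OTLP/gRPC endpoint (port 4317)
            |   provider = TracerProvider()
            |   provider.add_span_processor(
            |       BatchSpanProcessor(OTLPSpanExporter()))
            |   trace.set_tracer_provider(provider)
            | 
            |   app = Flask(__name__)
            |   FlaskInstrumentor().instrument_app(app)  # inbound HTTP spans
            |   RequestsInstrumentor().instrument()      # outbound HTTP spans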
        
           | reactordev wrote:
           | Not confusing anything. Yes you can meter your own
           | applications, generate your own metrics, but most
           | organizations start their observability journey with the
           | hardware and latency metrics.
           | 
           | Otel provides a means to sugar any metric with labels and
           | attributes which is great (until you have high cardinality)
           | but there are still things that are at the infrastructure
           | level that only CloudWatch knows of (on AWS). If you're
           | running K8s on your own hardware - Otel would be my first
           | choice.
        
         | elza_1111 wrote:
          | There are integrations that let you monitor your AWS resources
          | on SigNoz as well. That said, I personally think CloudWatch is
          | painful in so many other ways.
         | 
         | Check this out, https://signoz.io/blog/6-silent-traps-inside-
         | cloudWatch-that...
        
         | mdaniel wrote:
         | A child comment mentioned k8s but I also have been chomping at
         | the bit to try out the eBPF hooks in https://github.com/pixie-
         | io/pixie (or even https://github.com/coroot/coroot or
         | https://github.com/parca-dev/parca ) all of which are Apache 2
         | licensed
         | 
         | The demo for https://github.com/draios/sysdig was also just
         | amazing, but I don't have any idea what the storage
         | requirements would be for leaving it running
        
       | bravesoul2 wrote:
       | That's a genius idea. So obvious in retrospect.
        
       | hrpnk wrote:
       | Has anyone seen OTel being used well for long-running batch/async
       | processes? Wonder how the suggestions stack up to monolith builds
       | for Apps that take about an hour.
        
         | zdc1 wrote:
          | I've tried and failed at tracing transactions that span
          | multiple queues (with different backends). In the end I just
          | published some custom metrics for the transaction's success
          | count / failure count / duration and moved on with my life.
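          | 
          | For anyone curious, that fallback is only a few lines with the
          | OTel metrics API (a sketch; instrument names are illustrative
          | and meter/exporter setup is omitted):
          | 
          |   from opentelemetry import metrics
          | 
          |   meter = metrics.get_meter("batch-pipeline")
          |   ok = meter.create_counter("transactions.succeeded")
          |   failed = meter.create_counter("transactions.failed")
          |   duration = meter.create_histogram(
          |       "transaction.duration", unit="s")
          | 
          |   def record(outcome, seconds, queue):
          |       attrs = {"queue": queue}
          |       (ok if outcome == "ok" else failed).add(1, attrs)
          |       duration.record(seconds, attrs)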
        
         | makeavish wrote:
          | You can use SpanLinks to analyse your async processes. This
          | guide might be a helpful introduction:
          | https://dev.to/clericcoder/mastering-trace-analysis-with-spa...
          | 
          | Also, SigNoz supports rendering a practically unlimited number
          | of spans in the trace detail UI and allows filtering them as
          | well, which has been really useful in analyzing batch
          | processes: https://signoz.io/blog/traces-without-limits/
         | 
         | You can further run aggregation on spans to monitor failures
         | and latency.
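          | 
          | In Python that looks roughly like this (a sketch; message
          | fields and span names are illustrative, SDK setup omitted):
          | 
          |   from opentelemetry import trace
          |   from opentelemetry.trace import (
          |       Link, SpanContext, TraceFlags)
          | 
          |   tracer = trace.get_tracer("worker")
          | 
          |   # producer: stash the span context on the message
          |   with tracer.start_as_current_span("enqueue-job") as span:
          |       ctx = span.get_span_context()
          |       msg = {"payload": "...",
          |              "trace_id": ctx.trace_id,
          |              "span_id": ctx.span_id}
          | 
          |   # consumer (possibly hours later): link, don't parent
          |   link = Link(SpanContext(
          |       trace_id=msg["trace_id"], span_id=msg["span_id"],
          |       is_remote=True,
          |       trace_flags=TraceFlags(TraceFlags.SAMPLED)))
          |   with tracer.start_as_current_span("process-job",
          |                                     links=[link]):
          |       pass  # do the work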
         | 
         | PS: I am SigNoz maintainer
        
           | ai-christianson wrote:
           | Is this better than Honeycomb?
        
             | mdaniel wrote:
             | "Better" is always "for what metric" but if nothing else
             | having the source code to the stack is always "better" IMHO
             | even if one doesn't choose to self-host, and that goes
             | double for SigNoz choosing a permissive license, so one
             | doesn't have to get lawyers involved to run it
             | 
             | ---
             | 
             | While digging into Honeycomb's open source story, I did
             | find these two awesome toys, one relevant to the otel
             | discussion and one just neato
             | 
             | https://github.com/honeycombio/refinery _(Apache 2)_ --
             | Refinery is a tail-based sampling proxy and operates at the
             | level of an entire trace. Refinery examines whole traces
             | and intelligently applies sampling decisions to each trace.
             | These decisions determine whether to keep or drop the trace
             | data in the sampled data forwarded to Honeycomb.
             | 
             | https://github.com/honeycombio/gritql _(MIT)_ -- GritQL is
             | a declarative query language for searching and modifying
             | source code
        
         | madduci wrote:
          | I use OTel running in a GKE cluster to track Jenkins jobs; the
          | spans/traces handle long-running jobs pretty well.
        
         | dboreham wrote:
         | It doesn't matter how long things take. The best way to
         | understand this is to realize that OTel tracing (and all other
         | similar things) are really "fancy logging systems". Some agent
         | code emits a log message every time something happens (e.g.
         | batch job begins, batch job ends). Something aggregates those
         | log messages into some place they can be coherently scanned.
         | Then something scans those messages generating some
         | visualization you view. Everything could be done with text
         | messages in text files and some awk script. A tracing system is
         | just that with batteries included and a pretty UI. Understood
         | this way it should now be clear why the duration of a monitored
         | task is not relevant -- once the "begin task" message has been
         | generated all that has to happen is the sampling agent
         | remembers the span ID. Then when the "end task" message is
         | emitted it has the same span ID. That way the two can be
         | correlated and rendered as a task with some duration. There's
         | always a way to propagate the span ID from place to place (e.g.
         | in a http header so correlation can be done between
         | processes/machines). This explains sibling comments about not
         | being able to track tasks between workflows: the span ID wasn't
         | propagated.
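          | 
          | In OTel terms, "propagating the span ID" is just copying a
          | header along. A minimal sketch (service names and URLs are
          | made up; SDK/exporter setup omitted):
          | 
          |   import requests
          |   from opentelemetry import trace, propagate
          | 
          |   tracer = trace.get_tracer("batch")
          | 
          |   # service A: "begin task", then hand the IDs downstream
          |   with tracer.start_as_current_span("nightly-build"):
          |       headers = {}
          |       propagate.inject(headers)  # adds 'traceparent'
          |       requests.post("http://worker/run", headers=headers)
          | 
          |   # service B: pick the IDs back up so the spans correlate
          |   def handle(request_headers):
          |       ctx = propagate.extract(request_headers)
          |       with tracer.start_as_current_span("run-step",
          |                                         context=ctx):
          |           pass  # this span joins the same trace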
        
           | imiric wrote:
           | That's a good way of looking at it, but it assumes that both
           | start and end events will be emitted and will successfully
           | reach the backend. What happens if one of them doesn't?
        
             | lijok wrote:
             | Depends on the visualization system. It can either not
             | display the entire trace or communicate to the user that
             | the start of the trace hasn't been received or the trace
             | hasn't yet concluded. It really is just a bunch of
             | structured log lines with a common attribute to tie them
             | together.
        
             | candiddevmike wrote:
             | AIUI, there aren't really start or end messages, they're
             | spans. A span is technically an "end" message and will have
             | parent or child spans.
        
               | BoiledCabbage wrote:
               | I don't know the details but does a span have a
               | beginning?
               | 
               | Is that beginning "logged" at a separate point in time
               | from when the span end is logged?
               | 
               | > AIUI, there aren't really start or end messages,
               | 
               | Can you explain this sentence a bit more? How does it
               | have a duration without a start and end?
        
               | hinkley wrote:
               | It's been a minute since I worked on this but IIRC no,
               | which means that if the request times out you have to be
               | careful to end the span, and also all of the dependent
               | calls show up at the collector in reverse chronological
               | order.
               | 
               | The thing is that at scale you'd never be able to
               | guarantee that the start of the span showed up at a
               | collector in chronological order anyway, especially due
               | to the queuing intervals being distinct per collection
               | sidecar. But what you could do with two events is
               | discover spans with no orderly ending to them. You could
               | easily truncate traces that go over the span limit
               | instead of just dropping them on the floor (fuck you for
               | this, OTEL, this is the biggest bullshit in the entire
               | spec). And you could reduce the number of traceids in
               | your parsing buffer that have no metadata associated with
               | them, both in aggregate and number of messages in the
               | limbo state per thousand events processed.
        
               | nijave wrote:
               | A span is a discrete event emitted on completion. It
               | contains arbitrary metadata (plus a few mandatory fields
               | if you're following the OTEL spec).
               | 
               | As such, it doesn't really have a beginning or end except
               | that it has fields for duration and timestamps.
               | 
               | I'd check out the OTEL docs since I think seeing the
               | examples as JSON helps clarify things. It looks like they
               | have events attached to spans which is optional.
               | https://opentelemetry.io/docs/concepts/signals/traces/
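                | 
                | An easy way to see that shape locally (a sketch using
                | the SDK's console exporter; note that nothing is
                | exported until the span ends):
                | 
                |   import time
                |   from opentelemetry import trace
                |   from opentelemetry.sdk.trace import TracerProvider
                |   from opentelemetry.sdk.trace.export import (
                |       SimpleSpanProcessor, ConsoleSpanExporter)
                | 
                |   provider = TracerProvider()
                |   provider.add_span_processor(
                |       SimpleSpanProcessor(ConsoleSpanExporter()))
                |   trace.set_tracer_provider(provider)
                | 
                |   tracer = trace.get_tracer("demo")
                |   with tracer.start_as_current_span("step") as span:
                |       span.set_attribute("job.id", "1234")
                |       time.sleep(0.1)
                |   # prints one JSON record with start_time,
                |   # end_time, attributes and trace/span ids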
        
             | hinkley wrote:
              | Ugh. One of the reasons I never turned on the tracing code
              | I painstakingly refactored into our stats code was
              | discovering that OTEL makes no attempt to introduce a span
              | to the collector prior to child calls talking about it. Is
              | that really how you want to do event correlation? Time
              | traveling seems like an expensive operation when you're
              | dealing with 50,000 trace events per second.
              | 
              | The other issue turns out to be our Ops team's problem more
              | than OTEL's. Well, a little of both. If a trace goes over a
              | limit then OTEL just silently drops the entire thing, and
              | the default size on AWS is useful for toy problems, not for
              | retrofitting onto live systems. It's the silent-failure
              | defaults of OTEL that are giant footguns. Give me a fucking
              | error log on data destruction, you asshats.
              | 
              | I'll just use Prometheus next time, which is apparently
              | what our Ops team recommended (except one individual, who
              | was the one I talked to).
        
               | nijave wrote:
               | You can usually turn logging on but a lot of the OTEL
               | stack defaults to best effort and silently drops data.
               | 
               | We had Grafana Agent running which was wrapping the
               | reference implementation OTEL collector written in go and
               | it was pretty easy to see when data was being dropped via
               | logs.
               | 
               | I think some limitation is also on the storage backend.
               | We were using Grafana Cloud Tempo which imposes limits.
               | I'd think using a backend that doesn't enforce recency
               | would help.
               | 
               | With the OTEL collector I'd think you could utilize some
               | processors/connectors or write your own to handle
               | individual spans that get too big. Not sure on backends
               | but my current company uses Datadog and their proprietary
               | solution handles >30k spans per trace pretty easily.
               | 
               | I think the biggest issue is the low cohesion, high DIY
               | nature of OTEL. You can build powerful solutions but you
               | really need to get low level and assemble everything
               | yourself tuning timeouts, limits, etc for your use case.
        
               | hinkley wrote:
               | > I think the biggest issue is the low cohesion, high DIY
               | nature of OTEL
               | 
               | OTEL is the SpringBoot of telemetry and if you think
               | those are fighting words then I picked the right ones.
        
           | hinkley wrote:
            | Every time people talk about OTel I discover half the people
            | are talking about spans rather than stats. For stats it's not
            | a 'fancy logger' because it's condensing the data at various
            | steps.
            | 
            | And if you've ever tried to trace a call tree using
            | correlation IDs and Splunk queries and still say OTEL is
            | 'just a fancy logger', then you're in dangerous territory,
            | even if it's just by way of explanation. Don't feed the
            | masochists. When masochists derail attempts at pain reduction
            | they become sadists.
        
         | sethammons wrote:
          | We had a hell of a time attempting to roll out OTel for that
          | kind of work. Our scale was also billions of requests per day.
          | 
          | We ended up taking tracing out of these jobs and only using it
          | on requests that finish in short order, like UI web requests.
          | For our longer jobs and fanout work, we started passing a
          | metadata object around that appended timing data related to
          | that specific job, and then at egress we would capture the
          | timing metadata and flag abnormalities.
        
       | sali0 wrote:
        | Noob question: I'm currently adding telemetry to my backend.
        | 
        | I was at first implementing OTel throughout my API, but ran into
        | some minor headaches and a lot of boilerplate. I shopped around
        | a bit and saw that Sentry has a lot of nice integrations
        | everywhere, and _seems_ to have all the same features (metrics,
        | traces, error reporting). I'm considering just using Sentry for
        | both backend and frontend and other pieces as well.
        | 
        | Curious if anyone has thoughts on this. Assuming Sentry can
        | fulfill our requirements, the only thing that really concerns me
        | is vendor lock-in. But I'm wondering about other people's
        | thoughts.
        
         | whatevermom wrote:
          | Sentry isn't really a full-on observability platform. It's for
          | error reporting only (annotated with traces and logs). It
          | turns out that for most projects, this is sufficient. Can't
          | comment on the vendor lock-in part.
        
         | srikanthccv wrote:
         | >I was at first implementing otel throughout my api, but ran
         | into some minor headaches and a lot of boilerplate
         | 
         | OTeL also has numerous integrations
         | https://opentelemetry.io/ecosystem/registry/. In contrast,
         | Sentry lacks traditional metrics and other capabilities that
         | OTeL offers. IIRC, Sentry experimented with "DDM" (Delightful
         | Developer Metrics), but this feature was deprecated and removed
         | while still in alpha/beta.
         | 
         | Sentry excels at error tracking and provides excellent browser
         | integration. This might be sufficient for your needs, but if
         | you're looking for the comprehensive observability features
         | that OpenTelemetry provides, you'd likely need a full
         | observability platform.
        
         | dboreham wrote:
         | You can run your own sentry server (or at least last time I
         | worked with it you could). But as others have noted sentry is
         | not going to provide the same functionality as OTel.
        
           | mdaniel wrote:
           | The word "can" is doing a lot of work in your comment, based
           | on the now horrific number of moving parts[1] and I think
           | David has even said the self-hosting story isn't a priority
           | for them. Also, don't overlook the license, if your shop is
           | sensitive to non-FOSS licensing terms
           | 
           | 1: https://github.com/getsentry/self-
           | hosted/blob/25.5.1/docker-...
        
         | vrosas wrote:
         | Think of otel as just a standard data format for your
         | logs/traces/metrics that your backend(s) emit, and some open
         | source libraries for dealing with that data. You can pipe it
         | straight to an observability vendor that accepts these formats
          | (pretty much everyone does - Datadog, Stackdriver, etc.) or you
          | can simply write the data to a database and wire up your own
          | dashboards on top of it (e.g., Grafana).
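          | 
          | Concretely, "piping it straight to a vendor" usually just means
          | pointing the OTLP exporter at a different endpoint (a sketch;
          | the URL and auth header are placeholders):
          | 
          |   from opentelemetry.sdk.trace import TracerProvider
          |   from opentelemetry.sdk.trace.export import (
          |       BatchSpanProcessor)
          |   from opentelemetry.exporter.otlp.proto.http.trace_exporter \
          |       import OTLPSpanExporter
          | 
          |   exporter = OTLPSpanExporter(
          |       endpoint="https://collector.example.com/v1/traces",
          |       headers={"x-api-key": "..."})  # vendor-specific auth
          |   provider = TracerProvider()
          |   provider.add_span_processor(BatchSpanProcessor(exporter))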
         | 
         | Otel can take a little while to understand because, like many
         | standards, it's designed by committee and the
         | code/documentation will reflect that. LLMs can help but the
         | last time I was asking them about otel they constantly gave me
         | code that was out of date with the latest otel libraries.
        
         | stackskipton wrote:
          | Ops type here: OTel is great, but if your metrics are not
          | there, please fix that. In particular, consider just importing
          | prometheus_client and going from there.
          | 
          | Prometheus is dead easy to run, Grafana understands it, and
          | anything involving alerting/monitoring from logs is a bad idea
          | for future you, I PROMISE YOU, PLEASE DON'T!
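          | 
          | Importing prometheus_client and going from there looks roughly
          | like this (a sketch; metric names and the /checkout route are
          | made up):
          | 
          |   from prometheus_client import (
          |       Counter, Histogram, start_http_server)
          | 
          |   REQUESTS = Counter(
          |       "app_requests_total", "Requests handled", ["route"])
          |   LATENCY = Histogram(
          |       "app_request_seconds", "Request latency", ["route"])
          | 
          |   start_http_server(8000)  # serves /metrics for scraping
          | 
          |   def handle_checkout():
          |       REQUESTS.labels(route="/checkout").inc()
          |       with LATENCY.labels(route="/checkout").time():
          |           ...  # do the actual work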
        
           | avtar wrote:
           | > anything involving alerting/monitoring from logs is bad
           | idea for future you
           | 
           | Why is issuing alerts for log events a bad idea?
        
             | _kblcuk_ wrote:
             | It's trivial to alter or remove log lines without knowing
             | or realizing that it affects some alerting or monitoring
             | somewhere. That's why there are dedicated monitoring and
             | alerting systems to start with.
        
               | sethammons wrote:
               | Same with metrics.
               | 
                | If you need an artifact from your system, it should be
                | tested. We test our logs and many types of metrics. Too
                | many incidents have come from logs or metrics changing
                | and no longer causing alerts. I never got to build out my
                | alert test bed that exercises all known alerts in prod,
                | verifying they continue to work.
        
       | totetsu wrote:
        | I spent some time working on this. First I tried to make a
        | GitHub Action that was triggered on completion of your other
        | actions and passed along the context of the triggering action in
        | the environment. It then used the GitHub API to pull extra
        | details of the steps, tasks, logs, etc., turned all of that into
        | a process trace, and sent it over an OTel connection to
        | something like Jaeger or Grafana, to get flamechart views of
        | step performance. I thought maybe it would be better to do this
        | directly from the runner hosts by watching log files, but the
        | API has more detailed information.
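        | 
        | The core of that approach fits in a few lines (a sketch; repo,
        | run_id and the token are placeholders, and exporter setup is
        | omitted):
        | 
        |   from datetime import datetime
        |   import requests
        |   from opentelemetry import trace
        |   from opentelemetry.trace import set_span_in_context
        | 
        |   def ns(iso):  # GitHub timestamps -> epoch nanoseconds
        |       return int(datetime.fromisoformat(
        |           iso.replace("Z", "+00:00")).timestamp() * 1e9)
        | 
        |   repo, run_id = "org/repo", "123456789"
        |   jobs = requests.get(
        |       f"https://api.github.com/repos/{repo}"
        |       f"/actions/runs/{run_id}/jobs",
        |       headers={"Authorization": "Bearer <token>"},
        |   ).json()["jobs"]
        | 
        |   tracer = trace.get_tracer("gha-trace")
        |   for job in jobs:
        |       jspan = tracer.start_span(
        |           job["name"], start_time=ns(job["started_at"]))
        |       ctx = set_span_in_context(jspan)
        |       for step in job["steps"]:
        |           s = tracer.start_span(
        |               step["name"], context=ctx,
        |               start_time=ns(step["started_at"]))
        |           s.end(end_time=ns(step["completed_at"]))
        |       jspan.end(end_time=ns(job["completed_at"]))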
        
       | candiddevmike wrote:
       | How does SigNoz compare to the other "all-in-one" OTel platforms?
       | What part of the open-core bit is behind a paywall?
        
         | makeavish wrote:
          | Only SAML, multiple ingestion keys, and premium support are
          | behind the paywall. SSO is not. Check the pricing page for a
          | detailed comparison: https://signoz.io/pricing/
        
       | 127dot1 wrote:
        | That's a poor title: the article is not about CI/CD in general,
        | it is specifically about GitHub CI/CD and thus is useless for
        | most CI/CD cases.
        
         | dang wrote:
         | Ok, we've added Github to the title above.
        
       | remram wrote:
       | I have thought about that before, but I was blocked by the really
       | poor file support for OTel. I couldn't find an easy way to dump a
       | file from the collector running in my CI job and load it on my
       | laptop for analysis, which is the way I would like to go.
       | 
       | Maybe this has changed?
        
         | sweetgiorni wrote:
         | https://github.com/open-telemetry/opentelemetry-collector-co...
        
           | remram wrote:
           | And the receiver: https://github.com/open-
           | telemetry/opentelemetry-collector-co...
           | 
           | I'll have to try this!
        
       ___________________________________________________________________
       (page generated 2025-06-15 23:01 UTC)