[HN Gopher] GitHub CI/CD observability with OpenTelemetry step b...
___________________________________________________________________
GitHub CI/CD observability with OpenTelemetry step by step guide
Author : ankit01-oss
Score : 124 points
Date : 2025-06-11 12:42 UTC (4 days ago)
(HTM) web link (signoz.io)
(TXT) w3m dump (signoz.io)
| reactordev wrote:
| As someone with some experience in observability at scale: the
| issue with SigNoz, Prometheus, etc. is that they can only operate
| on the data exposed by the underlying infrastructure, whereas the
| IaaS itself has all the information needed to provide a better
| experience. Hence CloudWatch.
|
| That said, if you own your infrastructure, I'd build out a SigNoz
| cluster in a heartbeat. OTel is awesome, but once you set down a
| path for your org, it's going to be extremely painful to switch.
| Choose OTel if you're on hybrid cloud or you have on-premises
| stuff. If you're on AWS, CloudWatch is a better option simply
| because they have the data. Dead simple tracing.
| 6r17 wrote:
| I did have some bad experiences with OTel, and I have a lot of
| freedom on deployment; I'd never heard of SigNoz, will
| definitely check it out. SigNoz works with OTel, I suppose?
|
| I wonder if there are any other adapters for trace ingest
| instead of OTel?
| darkstar_16 wrote:
| Jaeger collector perhaps, but then you'd have to use the
| Jaeger UI. SigNoz has a much nicer UI that feels more
| integrated, but last I checked it had annoying bugs, like not
| keeping the time selection when I navigated between
| screens.
| 6r17 wrote:
| I should definitely look into the tech more; I lazily
| commented, as SigNoz clearly states it ingests more than 50
| different sources.
| elza_1111 wrote:
| yep, SigNoz is OpenTelemetry native. You can instrument your
| application with OpenTelemetry and send telemetry data
| directly to SigNoz.
| bbkane wrote:
| There are a few: I've played with https://uptrace.dev and
| https://openobserve.ai/ . OpenObserve is a single binary, so
| easy to set up
| mdaniel wrote:
| be cognizant of their licenses (AGPLv3); it matters in some
| shops
|
| https://github.com/uptrace/uptrace/blob/v1.7.6/LICENSE
|
| https://github.com/openobserve/openobserve/blob/v0.14.7/LIC
| E...
| FunnyLookinHat wrote:
| I think you're looking at OTel from a strictly infrastructure
| perspective - which CloudWatch does effectively solve without
| any added effort. But OTel really begins to shine when you
| instrument your backends. Some ecosystems (e.g. Node.js) have a
| whole slew of auto-instrumentation, giving you rich traces with
| spans detailing each step of the HTTP request, every SQL query,
| and even usage of AWS services. Making those traces even more
| valuable is that they're linked across services.
|
| We've frequently seen a slowdown or error at the top of our
| stack, and the teams are able to immediately pinpoint the
| problem as a downstream service. Not only that, they can see
| the specific issue in the downstream service almost
| immediately!
|
| Once you get to that level of detail, having your
| infrastructure metrics pulled into your OTel provider does
| start to make some sense. If you observe a slowdown in a
| service, being able to see that the DB CPU is pegged at the
| same time is meaningful, etc.
|
| [Edit - Typo!]
| makeavish wrote:
| Agree with you on this. OTel agents allow exporting all
| host/k8s metrics correlated with your logs and traces, though
| exporting AWS service-specific metrics with OTel is not easy.
| To solve this, SigNoz has 1-Click AWS Integrations:
| https://signoz.io/blog/native-aws-integrations-with-
| autodisc...
|
| Also SigNoz has native correlation between different signals
| out of the box.
|
| PS: I am SigNoz Maintainer
| elza_1111 wrote:
| FYI for anyone reading, OTel does have great auto-
| instrumentation for Python, Java and .NET also
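|
| A minimal sketch of the Python path, assuming the Flask and
| requests instrumentation packages and an OTLP collector on
| localhost (package names are from opentelemetry-python-contrib,
| not from the article):
|
|     from flask import Flask
|     from opentelemetry import trace
|     from opentelemetry.sdk.trace import TracerProvider
|     from opentelemetry.sdk.trace.export import BatchSpanProcessor
|     from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
|         OTLPSpanExporter,
|     )
|     from opentelemetry.instrumentation.flask import FlaskInstrumentor
|     from opentelemetry.instrumentation.requests import RequestsInstrumentor
|
|     # Wire the SDK to an OTLP/gRPC exporter (localhost:4317 by default).
|     provider = TracerProvider()
|     provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
|     trace.set_tracer_provider(provider)
|
|     app = Flask(__name__)
|     FlaskInstrumentor().instrument_app(app)  # spans for incoming requests
|     RequestsInstrumentor().instrument()      # spans for outgoing HTTP calls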
| reactordev wrote:
| Not confusing anything. Yes you can meter your own
| applications, generate your own metrics, but most
| organizations start their observability journey with the
| hardware and latency metrics.
|
| OTel provides a means to sugar any metric with labels and
| attributes, which is great (until you have high cardinality),
| but there are still things at the infrastructure level that
| only CloudWatch knows about (on AWS). If you're running K8s on
| your own hardware, OTel would be my first choice.
| elza_1111 wrote:
| There are integrations that let you monitor your AWS resources
| on SigNoz as well. That said, I personally think CloudWatch is
| painful in so many other ways too.
|
| Check this out: https://signoz.io/blog/6-silent-traps-inside-
| cloudWatch-that...
| mdaniel wrote:
| A child comment mentioned k8s but I also have been chomping at
| the bit to try out the eBPF hooks in https://github.com/pixie-
| io/pixie (or even https://github.com/coroot/coroot or
| https://github.com/parca-dev/parca ) all of which are Apache 2
| licensed
|
| The demo for https://github.com/draios/sysdig was also just
| amazing, but I don't have any idea what the storage
| requirements would be for leaving it running
| bravesoul2 wrote:
| That's a genius idea. So obvious in retrospect.
| hrpnk wrote:
| Has anyone seen OTel being used well for long-running batch/async
| processes? Wonder how the suggestions stack up to monolith builds
| for Apps that take about an hour.
| zdc1 wrote:
| I've tried and failed at tracing transactions that span
| multiple queues (with different backends). In the end I just
| published some custom metrics for the transaction's success
| count / failure count / duration and moved on with my life.
| makeavish wrote:
| You can use SpanLinks to analyse your async processes. This
| guide might be a helpful introduction:
| https://dev.to/clericcoder/mastering-trace-analysis-with-spa...
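|
| A rough sketch of the span-link idea with the Python SDK (the
| message fields carrying the producer's trace/span IDs are made
| up for illustration):
|
|     from opentelemetry import trace
|
|     tracer = trace.get_tracer("batch-worker")
|
|     def process(message):
|         # Rebuild the producer's span context from IDs carried in the
|         # message payload (hypothetical field names).
|         producer_ctx = trace.SpanContext(
|             trace_id=message["trace_id"],
|             span_id=message["span_id"],
|             is_remote=True,
|             trace_flags=trace.TraceFlags(trace.TraceFlags.SAMPLED),
|         )
|         # Start a new root span for this batch step, linked (not
|         # parented) to the producing span.
|         with tracer.start_as_current_span(
|             "process-message", links=[trace.Link(producer_ctx)]
|         ):
|             ...  # do the work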
|
| Also, SigNoz supports rendering a practically unlimited number
| of spans in the trace detail UI and allows filtering them as
| well, which has been really useful in analyzing batch processes:
| https://signoz.io/blog/traces-without-limits/
|
| You can further run aggregation on spans to monitor failures
| and latency.
|
| PS: I am SigNoz maintainer
| ai-christianson wrote:
| Is this better than Honeycomb?
| mdaniel wrote:
| "Better" is always "for what metric" but if nothing else
| having the source code to the stack is always "better" IMHO
| even if one doesn't choose to self-host, and that goes
| double for SigNoz choosing a permissive license, so one
| doesn't have to get lawyers involved to run it
|
| ---
|
| While digging into Honeycomb's open source story, I did
| find these two awesome toys, one relevant to the otel
| discussion and one just neato
|
| https://github.com/honeycombio/refinery _(Apache 2)_ --
| Refinery is a tail-based sampling proxy and operates at the
| level of an entire trace. Refinery examines whole traces
| and intelligently applies sampling decisions to each trace.
| These decisions determine whether to keep or drop the trace
| data in the sampled data forwarded to Honeycomb.
|
| https://github.com/honeycombio/gritql _(MIT)_ -- GritQL is
| a declarative query language for searching and modifying
| source code
| madduci wrote:
| I use OTel running in a GKE cluster to track Jenkins jobs;
| its spans/traces can track long-running jobs pretty well.
| dboreham wrote:
| It doesn't matter how long things take. The best way to
| understand this is to realize that OTel tracing (and all other
| similar things) is really a "fancy logging system". Some agent
| code emits a log message every time something happens (e.g.
| batch job begins, batch job ends). Something aggregates those
| log messages into some place they can be coherently scanned.
| Then something scans those messages generating some
| visualization you view. Everything could be done with text
| messages in text files and some awk script. A tracing system is
| just that with batteries included and a pretty UI. Understood
| this way it should now be clear why the duration of a monitored
| task is not relevant -- once the "begin task" message has been
| generated all that has to happen is the sampling agent
| remembers the span ID. Then when the "end task" message is
| emitted it has the same span ID. That way the two can be
| correlated and rendered as a task with some duration. There's
| always a way to propagate the span ID from place to place (e.g.
| in a http header so correlation can be done between
| processes/machines). This explains sibling comments about not
| being able to track tasks between workflows: the span ID wasn't
| propagated.
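|
| For the "propagate the span ID from place to place" part, a
| minimal Python sketch of the standard W3C traceparent header
| propagation (the URL is a placeholder):
|
|     import requests
|     from opentelemetry import trace
|     from opentelemetry.propagate import inject, extract
|
|     tracer = trace.get_tracer("caller")
|
|     # Caller: start a span and inject its context into the outgoing
|     # request headers (a `traceparent` header by default).
|     with tracer.start_as_current_span("call-downstream"):
|         headers = {}
|         inject(headers)
|         requests.get("http://downstream.example/work", headers=headers)
|
|     # Callee: extract the caller's context and continue the same trace.
|     def handle(request_headers):
|         ctx = extract(request_headers)
|         with tracer.start_as_current_span("do-work", context=ctx):
|             ...  # spans here share the caller's trace ID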
| imiric wrote:
| That's a good way of looking at it, but it assumes that both
| start and end events will be emitted and will successfully
| reach the backend. What happens if one of them doesn't?
| lijok wrote:
| Depends on the visualization system. It can either not
| display the entire trace or communicate to the user that
| the start of the trace hasn't been received or the trace
| hasn't yet concluded. It really is just a bunch of
| structured log lines with a common attribute to tie them
| together.
| candiddevmike wrote:
| AIUI, there aren't really start or end messages, they're
| spans. A span is technically an "end" message and will have
| parent or child spans.
| BoiledCabbage wrote:
| I don't know the details but does a span have a
| beginning?
|
| Is that beginning "logged" at a separate point in time
| from when the span end is logged?
|
| > AIUI, there aren't really start or end messages,
|
| Can you explain this sentence a bit more? How does it
| have a duration without a start and end?
| hinkley wrote:
| It's been a minute since I worked on this but IIRC no,
| which means that if the request times out you have to be
| careful to end the span, and also all of the dependent
| calls show up at the collector in reverse chronological
| order.
|
| The thing is that at scale you'd never be able to
| guarantee that the start of the span showed up at a
| collector in chronological order anyway, especially due
| to the queuing intervals being distinct per collection
| sidecar. But what you could do with two events is
| discover spans with no orderly ending to them. You could
| easily truncate traces that go over the span limit
| instead of just dropping them on the floor (fuck you for
| this, OTEL, this is the biggest bullshit in the entire
| spec). And you could reduce the number of traceids in
| your parsing buffer that have no metadata associated with
| them, both in aggregate and number of messages in the
| limbo state per thousand events processed.
| nijave wrote:
| A span is a discrete event emitted on completion. It
| contains arbitrary metadata (plus a few mandatory fields
| if you're following the OTEL spec).
|
| As such, it doesn't really have a beginning or end except
| that it has fields for duration and timestamps.
|
| I'd check out the OTEL docs since I think seeing the
| examples as JSON helps clarify things. It looks like they
| have events attached to spans which is optional.
| https://opentelemetry.io/docs/concepts/signals/traces/
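|
| A quick way to see that shape locally with the Python SDK: the
| console exporter prints each finished span as a single record
| with start/end timestamps, rather than separate "begin" and
| "end" messages (a sketch, not taken from the docs):
|
|     from opentelemetry import trace
|     from opentelemetry.sdk.trace import TracerProvider
|     from opentelemetry.sdk.trace.export import (
|         ConsoleSpanExporter,
|         SimpleSpanProcessor,
|     )
|
|     # Print finished spans to stdout so you can inspect the fields.
|     provider = TracerProvider()
|     provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
|     trace.set_tracer_provider(provider)
|
|     tracer = trace.get_tracer("demo")
|     with tracer.start_as_current_span("batch-job"):
|         pass  # the span is exported once this block exits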
| hinkley wrote:
| Ugh. One of the reasons I never turned on the tracing code
| I painstakingly refactored into our stats code was
| discovering that OTEL makes no attempts to introduce a span
| to the collector prior to child calls talking about it. Is
| that really how you want to do event correlation? Time
| traveling seems like an expensive operation when you're
| dealing with 50,000 trace events per second.
|
| The other turns out to be our Ops team's problem more than
| OTEL's. Well, a little of both. If a trace goes over a limit
| then OTEL just silently drops the entire thing, and the
| default size on AWS is useful for toy problems, not for
| retrofitting onto live systems. It's the silent failure
| defaults of OTEL that are giant footguns. Give me a fucking
| error log on data destruction, you asshats.
|
| I'll just use Prometheus next time, which is apparently
| what our Ops team recommended (except one individual who
| was the one I talked to).
| nijave wrote:
| You can usually turn logging on but a lot of the OTEL
| stack defaults to best effort and silently drops data.
|
| We had Grafana Agent running which was wrapping the
| reference implementation OTEL collector written in go and
| it was pretty easy to see when data was being dropped via
| logs.
|
| I think some of the limitation is also on the storage backend.
| We were using Grafana Cloud Tempo which imposes limits.
| I'd think using a backend that doesn't enforce recency
| would help.
|
| With the OTEL collector I'd think you could utilize some
| processors/connectors or write your own to handle
| individual spans that get too big. Not sure on backends
| but my current company uses Datadog and their proprietary
| solution handles >30k spans per trace pretty easily.
|
| I think the biggest issue is the low cohesion, high DIY
| nature of OTEL. You can build powerful solutions but you
| really need to get low level and assemble everything
| yourself tuning timeouts, limits, etc for your use case.
| hinkley wrote:
| > I think the biggest issue is the low cohesion, high DIY
| nature of OTEL
|
| OTEL is the SpringBoot of telemetry and if you think
| those are fighting words then I picked the right ones.
| hinkley wrote:
| Every time people talk about OTel I discover half the people
| are talking about spans rather than stats. For stats it's not
| a 'fancy logger' because it's condensing the data at various
| steps.
|
| And if you've ever tried to trace a call tree using
| correlation IDs and Splunk queries and still say OTEL is 'just
| a fancy logger', then you're in dangerous territory, even if it's
| just by way of explanation. Don't feed the masochists. When
| masochists derail attempts at pain reduction they become
| sadists.
| sethammons wrote:
| We had a hell of a time attempting to roll out OTel for that
| kind of work. Our scale was also billions of requests per day.
|
| We ended up taking tracing out of these jobs, and only using on
| requests that finish in short order, like UI web requests. For
| our longer jobs and fanout work, we started passing a metadata
| object around that appended timing data related to that specific
| job, and then at egress we would capture the timing metadata and
| flag abnormalities.
| sali0 wrote:
| Noob question - I'm currently adding telemetry to my backend.
|
| I was at first implementing OTel throughout my API, but ran into
| some minor headaches and a lot of boilerplate. I shopped around
| a bit and saw that Sentry has a lot of nice integrations
| everywhere, and _seems_ to have all the same features (metrics,
| traces, error reporting). I'm considering just using Sentry for
| both backend and frontend and other pieces as well.
|
| Curious if anyone has thoughts on this. Assuming Sentry can
| fulfill our requirements, the only thing that really concerns me
| is vendor lock-in. But I'm wondering what other people think.
| whatevermom wrote:
| Sentry isn't really a full on observability platform. It's for
| error reporting only (that is annotated with traces and logs).
| It turns out that for most projects, this is sufficient. Can't
| comment on the vendor lock-in part.
| srikanthccv wrote:
| > I was at first implementing OTel throughout my API, but ran
| into some minor headaches and a lot of boilerplate
|
| OTel also has numerous integrations:
| https://opentelemetry.io/ecosystem/registry/. In contrast,
| Sentry lacks traditional metrics and other capabilities that
| OTel offers. IIRC, Sentry experimented with "DDM" (Delightful
| Developer Metrics), but this feature was deprecated and removed
| while still in alpha/beta.
|
| Sentry excels at error tracking and provides excellent browser
| integration. This might be sufficient for your needs, but if
| you're looking for the comprehensive observability features
| that OpenTelemetry provides, you'd likely need a full
| observability platform.
| dboreham wrote:
| You can run your own sentry server (or at least last time I
| worked with it you could). But as others have noted sentry is
| not going to provide the same functionality as OTel.
| mdaniel wrote:
| The word "can" is doing a lot of work in your comment, based
| on the now horrific number of moving parts[1] and I think
| David has even said the self-hosting story isn't a priority
| for them. Also, don't overlook the license, if your shop is
| sensitive to non-FOSS licensing terms
|
| 1: https://github.com/getsentry/self-
| hosted/blob/25.5.1/docker-...
| vrosas wrote:
| Think of otel as just a standard data format for your
| logs/traces/metrics that your backend(s) emit, and some open
| source libraries for dealing with that data. You can pipe it
| straight to an observability vendor that accepts these formats
| (pretty much everyone does - datadog, stackdriver, etc) or you
| can simply write the data to a database and wire up your own
| dashboards on top of it (i.e. graphana).
|
| Otel can take a little while to understand because, like many
| standards, it's designed by committee and the
| code/documentation will reflect that. LLMs can help but the
| last time I was asking them about otel they constantly gave me
| code that was out of date with the latest otel libraries.
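|
| Concretely, "piping it straight to a vendor" usually just means
| pointing an OTLP exporter at their endpoint - a minimal Python
| sketch (endpoint and auth header are placeholders; check your
| vendor's docs):
|
|     from opentelemetry import trace
|     from opentelemetry.sdk.resources import Resource
|     from opentelemetry.sdk.trace import TracerProvider
|     from opentelemetry.sdk.trace.export import BatchSpanProcessor
|     from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
|         OTLPSpanExporter,
|     )
|
|     exporter = OTLPSpanExporter(
|         endpoint="https://otlp.vendor.example:4317",  # placeholder
|         headers={"x-api-key": "YOUR_KEY"},            # placeholder
|     )
|     provider = TracerProvider(
|         resource=Resource.create({"service.name": "my-api"})
|     )
|     provider.add_span_processor(BatchSpanProcessor(exporter))
|     trace.set_tracer_provider(provider)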
| stackskipton wrote:
| Ops type here. OTel is great, but if your metrics are not there,
| please fix that. In particular, consider just importing
| prometheus_client and going from there.
|
| Prometheus is bog easy to run, Grafana understands it, and
| anything involving alerting/monitoring from logs is a bad idea
| for future you. I PROMISE YOU, PLEASE DON'T!
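|
| For anyone who hasn't seen it, "just import prometheus_client"
| really is about this much code - a minimal sketch with made-up
| metric names:
|
|     import random
|     import time
|
|     from prometheus_client import Counter, Histogram, start_http_server
|
|     # Expose /metrics on :8000 for Prometheus to scrape.
|     start_http_server(8000)
|
|     REQUESTS = Counter("app_requests_total", "Requests handled", ["status"])
|     LATENCY = Histogram("app_request_seconds", "Request latency in seconds")
|
|     while True:
|         with LATENCY.time():
|             time.sleep(random.random() / 10)  # pretend to do some work
|         REQUESTS.labels(status="ok").inc()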
| avtar wrote:
| > anything involving alerting/monitoring from logs is bad
| idea for future you
|
| Why is issuing alerts for log events a bad idea?
| _kblcuk_ wrote:
| It's trivial to alter or remove log lines without knowing
| or realizing that it affects some alerting or monitoring
| somewhere. That's why there are dedicated monitoring and
| alerting systems to start with.
| sethammons wrote:
| Same with metrics.
|
| If you need an artifact from your system, it should be
| tested. We test our logs and many types of metrics. Too
| many incidents from logs or metrics changing and no
| longer causing alerts. I never got to build out my alert
| test bed that exercises all known alerts in prod,
| verifying they continue to work.
| totetsu wrote:
| I spent some time working on this. First I tried to make a
| GitHub Action that was triggered on completion of your other
| actions and passed along the context of the triggering action
| in the environment, then used the GitHub API to pull out extra
| details of the steps and tasks etc., plus the logs, and turned
| all of that into a process trace sent via an OTel connection to
| something like Jaeger or Grafana, to get flame-chart views of
| step performance. I thought maybe it would be better to do this
| directly from the runner hosts by watching log files, but the
| API has more detailed information.
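|
| A rough sketch of that approach with the Python SDK and the
| GitHub REST API's "list jobs for a workflow run" endpoint (the
| env var names are made up; this is not the article's code):
|
|     import os
|     from datetime import datetime
|
|     import requests
|     from opentelemetry import trace
|     from opentelemetry.sdk.trace import TracerProvider
|     from opentelemetry.sdk.trace.export import BatchSpanProcessor
|     from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
|         OTLPSpanExporter,
|     )
|
|     provider = TracerProvider()
|     provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
|     trace.set_tracer_provider(provider)
|     tracer = trace.get_tracer("gha-trace")
|
|     def ns(ts):  # "2025-06-11T12:42:03Z" -> epoch nanoseconds
|         dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
|         return int(dt.timestamp() * 1e9)
|
|     owner = os.environ["OWNER"]
|     repo = os.environ["REPO"]
|     run_id = os.environ["RUN_ID"]
|     jobs = requests.get(
|         f"https://api.github.com/repos/{owner}/{repo}/actions/runs/{run_id}/jobs",
|         headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
|     ).json()["jobs"]
|
|     # Re-create each job and step as a span with explicit timestamps.
|     for job in jobs:
|         job_span = tracer.start_span(job["name"], start_time=ns(job["started_at"]))
|         ctx = trace.set_span_in_context(job_span)
|         for step in job["steps"]:
|             if step.get("started_at") and step.get("completed_at"):
|                 s = tracer.start_span(
|                     step["name"], context=ctx, start_time=ns(step["started_at"])
|                 )
|                 s.end(end_time=ns(step["completed_at"]))
|         job_span.end(end_time=ns(job["completed_at"]))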
| candiddevmike wrote:
| How does SigNoz compare to the other "all-in-one" OTel platforms?
| What part of the open-core bit is behind a paywall?
| makeavish wrote:
| Only SAML, multiple ingestion keys, and premium support are
| behind the paywall. SSO is not. Check the pricing page for a
| detailed comparison: https://signoz.io/pricing/
| 127dot1 wrote:
| That's a poor title: the article is not about CI/CD in general,
| it is specifically about GitHub CI/CD and thus is useless for
| most CI/CD cases.
| dang wrote:
| Ok, we've added Github to the title above.
| remram wrote:
| I have thought about that before, but I was blocked by the really
| poor file support for OTel. I couldn't find an easy way to dump a
| file from the collector running in my CI job and load it on my
| laptop for analysis, which is the way I would like to go.
|
| Maybe this has changed?
| sweetgiorni wrote:
| https://github.com/open-telemetry/opentelemetry-collector-co...
| remram wrote:
| And the receiver: https://github.com/open-
| telemetry/opentelemetry-collector-co...
|
| I'll have to try this!
___________________________________________________________________
(page generated 2025-06-15 23:01 UTC)