[HN Gopher] eBPF-based auto-instrumentation outperforms manual i...
___________________________________________________________________
eBPF-based auto-instrumentation outperforms manual instrumentation
Author : edenfed
Score : 155 points
Date : 2023-10-30 14:10 UTC (8 hours ago)
(HTM) web link (odigos.io)
(TXT) w3m dump (odigos.io)
| nevodavid10 wrote:
| This is great. Can you elaborate on how the performance is
| better?
| Barakikia wrote:
| Our focus was on latency. We were able to cut it down because
| eBPF-based automatic instrumentation separates the recording
| from the processing.
| grazio wrote:
| How did you actually reduce the latency here?
| RonFeder wrote:
| The main factor in the reduced latency is the separation
| between recording and processing of data. The eBPF programs
| are the only latency overhead for the instrumented process;
| they transfer the collected data to a separate process which
| handles all the exporting. In contrast, manually adding
| instrumentation code to an application adds latency and memory
| footprint for handling the exported data.
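|
| A minimal sketch of the agent side of that split, using the
| cilium/ebpf library (not Odigos' actual code; the map and file
| names here are placeholders): the instrumented process only
| runs the eBPF probes, while this separate agent drains the
| ring buffer and does all the processing and exporting.
|
|     // Agent-side reader; attaching the probes to the target
|     // process is omitted here.
|     package main
|
|     import (
|         "log"
|
|         "github.com/cilium/ebpf"
|         "github.com/cilium/ebpf/ringbuf"
|     )
|
|     func main() {
|         coll, err := ebpf.LoadCollection("probe.o")
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer coll.Close()
|
|         rd, err := ringbuf.NewReader(coll.Maps["events"])
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer rd.Close()
|
|         for {
|             rec, err := rd.Read()
|             if err != nil {
|                 log.Fatal(err)
|             }
|             // Decode rec.RawSample and export spans here,
|             // entirely outside the instrumented application.
|             _ = rec.RawSample
|         }
|     }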
| CSDude wrote:
| Somewhat related, I mainly code in Kotlin. Adding OpenTelemetry
| was just a matter of adding the agent to the command-line args
| (the usual Java/JVM magic most people don't like). Then I had a
| project in Go and I got so tired of all the steps it took
| (setup and ensuring each context is instrumented) that I just
| gave up. We still add our manual instrumentation for
| customization, but auto-instrumentation made adoption much
| easier on day 0.
| edenfed wrote:
| I think eBPF also has great potential to help JVM-based
| languages, especially around performance, even compared to the
| current Java agents, which use bytecode manipulation.
| marwis wrote:
| The article mentions avoiding GC pressure and the separation
| between recording and processing as big wins for performance
| for runtimes like Java, but you could do the same inside Java
| by using a ring buffer, no?
| edenfed wrote:
| Interesting idea. I think that as long as you are able to do
| the processing, serialization, and delivery in another process
| and take that work off your application runtime, you should
| see great performance.
| avita1 wrote:
| How do you solve the context propagation issue with eBPF based
| instrumentation?
|
| E.g. if you get an RPC request coming in and make an outgoing
| RPC request in order to serve it, the traced program needs to
| track some ID for that request from the time it comes in
| through to the place where the HTTP request comes out. And
| then that ID has to get injected into a header on the wire so
| the next program sees the same request ID.
|
| IME that's where most of the overhead (and value) from a manual
| tracing library comes from.
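|
| For comparison, a rough sketch of what that manual propagation
| looks like with the OpenTelemetry Go API (generic code, not
| tied to any particular framework; the downstream URL is made
| up): extract the trace context on the way in, inject it on the
| way out.
|
|     package main
|
|     import (
|         "net/http"
|
|         "go.opentelemetry.io/otel/propagation"
|     )
|
|     var prop = propagation.TraceContext{}
|
|     func handle(w http.ResponseWriter, r *http.Request) {
|         // Pull the incoming trace context out of the headers.
|         ctx := prop.Extract(r.Context(),
|             propagation.HeaderCarrier(r.Header))
|
|         // Outgoing call made while serving this request.
|         out, _ := http.NewRequestWithContext(ctx, "GET",
|             "http://downstream.example/api", nil)
|
|         // Re-inject so the next hop sees the same trace ID.
|         prop.Inject(ctx, propagation.HeaderCarrier(out.Header))
|         if resp, err := http.DefaultClient.Do(out); err == nil {
|             resp.Body.Close()
|         }
|         w.WriteHeader(http.StatusOK)
|     }
|
|     func main() {
|         http.ListenAndServe(":8080", http.HandlerFunc(handle))
|     }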
| edenfed wrote:
| It depends on the programming language being instrumented. For
| Go we are assuming the context.Context object is passed around
| between different functions or goroutines. For Java, we are
| using a combination of ThreadLocal tracing and Runnable tracing
| to support use cases like reactive and multithreaded
| applications.
| camel_gopher wrote:
| That's a very big assumption, at least for Go based
| applications.
| edenfed wrote:
| We are also thinking of implementing a fallback mechanism to
| automatically propagate context on the same goroutine if
| context.Context is not passed.
| nulld3v wrote:
| I don't think it's unreasonable, you need a Context to make
| a gRPC call and you get one when handling a gRPC call. It
| usually doesn't get lost in between.
| otterley wrote:
| True for gRPC, but not necessarily for HTTP - the HTTP
| client and server packages that ship with Go predate the
| Context package by quite a long while.
| spullara wrote:
| Going to be rough for supporting virtual threads then?
| edenfed wrote:
| We have a solution for virtual threads as well. Currently
| working on a blog post describing exactly how. Will update
| once it's released.
| marwis wrote:
| ScopedValue solves that problem: https://docs.oracle.com/en
| /java/javase/21/docs/api/java.base...
| rocmcd wrote:
| 100%. Context propagation is _the_ key to distributed tracing,
| otherwise you're only seeing one side of every transaction.
|
| I was hoping odigos was language/runtime-agnostic since it's
| eBPF-based, but I see it's mentioned in the repo that it only
| supports:
|
| > Java, Python, .NET, Node.js, and Go
|
| Apart from Go (that is a WIP), these are the languages already
| supported with Otel's (non-eBPF-based) auto-instrumentation.
| Apart from a win on latency (which is nice, but could in theory
| be combated with sampling), why else go this route?
| edenfed wrote:
| eBPF instrumentation does not require code changes,
| redeployment or restart to running applications.
|
| We are constantly adding more language support for eBPF
| instrumentation and are aiming to cover the most popular
| programming languages soon.
|
| Btw, not sure that sampling is really the solution to combat
| overhead; after all, you probably do want that data. Trying to
| fix a production issue when the data you need is missing due
| to sampling is not fun.
| rocmcd wrote:
| All good points, thank you.
|
| What's the limit on language support? Is it theoretically
| possible to support any language/runtime? Or does it come
| down to the protocol (HTTP, gRPC, etc) being used by the
| communicating processes?
| edenfed wrote:
| We already solved compiled languages (Go, C, Rust) and JIT
| languages (Java, C#). Interpreted languages (Python, JS) are
| the only ones left; hopefully we will solve these soon as
| well. The big challenge is supporting all the different
| runtimes; once that is solved, implementing support for
| different protocols / open-source libraries is not as
| complicated.
| jetbalsa wrote:
| Got to get PHP on that list :)
| phillipcarter wrote:
| FWIW it's theoretically possible to support any
| language/runtime, but since eBPF is operating at the
| level it's at, there's no magic abstraction layer to plug
| into. Every runtime and/or protocol involves different
| segments of memory and certain bytes meaning certain
| things. It's all in service towards having no additional
| requirements for an end-user to install, but once you're
| in eBPF world everything is runtime-and-protocol-and-
| library-specific.
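|
| A small illustration of that, using the cilium/ebpf library
| (the binary path and symbol below are just examples, not what
| any particular tool hooks): attaching a uprobe means naming an
| exact symbol inside one specific binary, so every runtime and
| library needs its own symbol names and offsets.
|
|     package probes
|
|     import (
|         "github.com/cilium/ebpf"
|         "github.com/cilium/ebpf/link"
|     )
|
|     // attachHTTPProbe hooks an already-loaded eBPF program onto
|     // one Go symbol in one specific binary.
|     func attachHTTPProbe(prog *ebpf.Program) (link.Link, error) {
|         exe, err := link.OpenExecutable("/proc/1234/exe")
|         if err != nil {
|             return nil, err
|         }
|         return exe.Uprobe("net/http.(*Client).do", prog, nil)
|     }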
| RonFeder wrote:
| The eBPF programs handle passing the context through the
| requests by adding a field to the headers, as you mentioned.
| The injected field follows the W3C Trace Context standard.
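|
| Concretely, each outgoing request gets a traceparent header in
| the W3C Trace Context format, e.g. (example values from the
| spec):
|
|     traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
|                  (version - trace-id - parent span-id - flags)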
| heyeb_25 wrote:
| If I have manually implemented all my logs, what do I need to
| do to move to Odigos?
| edenfed wrote:
| Nothing special. If you are working on Kubernetes, it's as
| easy as running the `odigos install` CLI and pointing it at
| your current monitoring system.
| bakery_bake wrote:
| According to what you say, nobody should implement logs manually?
| I will check Odigos.
| edenfed wrote:
| Logs are an easy and familiar API for adding additional data
| to your traces. They still have their place; Odigos just adds
| much more context.
| jrockway wrote:
| They don't really show any of the settings they used, but for
| traces, I imagine if you have a reasonable sampling rate, then
| you aren't going to be running any code for most requests, so it
| won't increase latency. (Looking at their chart, I guess they are
| sampling .1% of requests, since 99.9% is where latency starts
| increasing. I am not sure if I would trace .1% of page loads to
| google.com, as their table implies. Rather, I'd pick something
| like 1 request per second, so that latency does not increase as
| load increases.)
|
| A lot of Go metrics libraries, specifically Prometheus, introduce
| a lot of lock contention around incrementing metrics. This was
| unacceptably slow for our use case at work and I ended up writing
| a metrics system that doesn't take any locks for most cases.
|
| (There is the option to introduce a lock for metrics that are
| emitted on a timed basis; i.e. emit tx_bytes every 10s or 1MiB
| instead of at every Write() call. But this lock is not global to
| the program; it's unique to the metric and key=value "fields" on
| the metric. So you can have a lot of metrics around and not
| contend on locks.)
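|
| For flavor, the hot path boils down to something like this in
| Go (a generic sketch, not Pachyderm's actual implementation):
| counters are plain atomics, and any aggregation or emission
| happens off to the side.
|
|     package metrics
|
|     import "sync/atomic"
|
|     // Counter is incremented lock-free on the hot path; a
|     // collector goroutine reads it periodically, so requests
|     // never block on a metrics mutex.
|     type Counter struct {
|         n atomic.Int64
|     }
|
|     func (c *Counter) Inc(delta int64) { c.n.Add(delta) }
|     func (c *Counter) Load() int64     { return c.n.Load() }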
|
| The metrics are then written to the log, which can be processed
| in real time to synthesize distributed traces and prometheus
| metrics, if you really want them:
| https://github.com/pachyderm/pachyderm/blob/master/src/inter...
| (Our software is self-hosted, and people don't have those systems
| set up, so we mostly consume metrics/traces in log form. When
| customers have problems, we prepare a debug bundle that is mostly
| just logs, and then we can further analyze the logs on our side
| to see event traces, metrics, etc.)
|
| As for eBPF, that's something I've wanted to use to enrich logs
| with more system-level information, but most customers that run
| our software in production aren't allowed to run anything as
| root, and thus eBPF is unavailable to them. People will tolerate
| it for things like Cilium or whatever, but not for ordinary
| applications that users buy and request that their production
| team install for them. Production Linux at big companies is super
| locked down, it seems, much to my disappointment. (Personally, my
| threat model for Linux is that if you are running code on the
| machine, you probably have root through some yet-undiscovered
| kernel bug. Historically, I've been right. But that is not the
| big companies' security teams' mental model, it appears. They
| aren't paranoid enough to run each k8s pod in a hypervisor, but
| are paranoid enough to prevent using CAP_SYS_ADMIN or root.)
| edenfed wrote:
| Thanks for the valuable feedback! We used a constant throughput
| of 10,000 rps. The exact testing setup can be found under "how
| we tested".
|
| I think the example you gave of the lock used by the Prometheus
| library is a great example of why the generation of
| traces/metrics is a great fit for offloading to a different
| process (an agent).
|
| Pachyderm looks very interesting, however I am not sure how you
| can generate distributed traces based on metrics; how do you
| fill in the missing context propagation?
|
| Our way to deal with the eBPF root requirements is to be as
| transparent as possible. This is why we donated the code to the
| CNCF and are developing it as part of the OpenTelemetry
| community. We hope that being open will make users trust us.
| You can see the relevant code here: https://github.com/open-
| telemetry/opentelemetry-go-instrumen...
| jrockway wrote:
| > I am not sure how you can generate distributed traces based
| on metrics
|
| Every log line gets an x-request-id field, and then when you
| combine the logs from the various components, you can see the
| propagation throughout our system. The request ID is a UUIDv4,
| but the mandatory '4' version nibble in the UUID gets replaced
| with a digit that represents where the request came from:
| background task, web UI, CLI, etc. I didn't take the approach of
| creating a separate span ID to show sub-requests. Since you
| have all the logs, this extra piece of information isn't
| super necessary though my coworkers have asked for it a few
| times because every other system has it.
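|
| In code it amounts to something like this (an illustrative
| sketch, not the actual implementation; the origin codes are
| made up):
|
|     package main
|
|     import (
|         "fmt"
|
|         "github.com/google/uuid"
|     )
|
|     // newRequestID returns a UUIDv4 whose version nibble is
|     // overwritten with a digit encoding the request's origin
|     // (e.g. '1' = web UI, '2' = CLI, '3' = background task).
|     func newRequestID(origin byte) string {
|         id := []byte(uuid.NewString())
|         id[14] = origin // canonical form: xxxxxxxx-xxxx-4xxx-...
|         return string(id)
|     }
|
|     func main() {
|         fmt.Println(newRequestID('1'))
|     }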
|
| Since metrics are also log lines, they get the request-id, so
| you can do really neat things like "show me when this
| particular download stalled" or "show me how much bandwidth
| we're using from the upstream S3 server". The aggregations
| can take place after the fact, since you have all the raw
| data in the logs.
|
| If we were running this such that we tailed the logs and sent
| things to Jaeger/Prometheus, a lot of this data would have to
| go away for cardinality reasons. But squirreling the logs
| away safely, and then doing analysis after the fact when a
| problem is suspected ends up being pretty workable. (We still
| do have a Prometheus exporter not based on the logs, for
| customers that do want alerts. For log storage, we bundle
| Loki.)
| otterley wrote:
| The column in the table claiming the "number of page loads that
| would experience the 99th %ile" is mathematically suspect. It
| directly contradicts what a percentile is.
|
| By definition, at 99th percentile, if I have 100 page loads, the
| _one_ with the worst latency would be over the 99th percentile.
| That's not 85.2%, 87.1%, 67.6%, etc. The formula shown in that
| column makes no sense at all.
| edenfed wrote:
| I recommend watching Gil Tene's talk, I think he explains the
| math better than I do:
| https://www.youtube.com/watch?v=lJ8ydIuPFeU
| tpankaj wrote:
| That's not what that column is supposed to mean, afaict. The
| way I read it, it's showing that if the website requires
| hundreds of parallel backend service calls to serve a page
| load, what's the probability that the page load hits the p99
| instrumentation latency?
|
| We have a similar chart at my job to illustrate the point that
| high p99 latency on a backend service doesn't mean only 1% of
| end-user page loads are affected.
| otterley wrote:
| Ah, I see. So, for example, if one page request would result
| in 190 different backend requests to fulfill, then the
| probability that at least one of those subrequests exceeds
| the 99th percentile would be 85.2%. That makes a lot more
| sense.
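|
| (Sanity check: assuming independent subrequest latencies, the
| chance that at least one of N parallel calls exceeds the per-
| call p99 is 1 - 0.99^N, and 1 - 0.99^190 is approximately
| 0.852, which matches the 85.2% in the table.)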
| bjt12345 wrote:
| But what if the 100 page loads are just a sample of the
| population?
| chabad360 wrote:
| How hard is it to use Odigos without k8s? We mainly use docker
| compose for our deployments (because it's convenient, and we
| don't need scale), but I'm having trouble finding anything in the
| documentation that explains the mechanism for hooking into the
| container (and hence I have no clue how to repurpose it).
| edenfed wrote:
| We currently support only Kubernetes environments. Docker
| Compose, VMs, and serverless are on our roadmap and will be
| ready soon.
| ranting-moth wrote:
| Website doesn't display correctly in Firefox on Android. Text
| bleeds off the left and right sides.
| edenfed wrote:
| Thank you for reporting, will fix ASAP.
| zengid wrote:
| Anyone from the dtrace community want to enlighten a n00b about
| how eBPF compares to what dtrace does?
| zengid wrote:
| From the hot takes in this post from 2018 [0], I may be asking
| a contentious question.
|
| [0] https://news.ycombinator.com/item?id=16375938
| edenfed wrote:
| I don't have a lot of experience using DTrace, but AFAIK the
| big advantage of eBPF over DTrace is that you do not need to
| instrument your application with static probes during coding.
| tanelpoder wrote:
| DTrace (on Solaris at least) can instrument any userspace
| symbol or address, no need for static tracepoints in the app.
|
| One problem that DTrace has is that the "pid" provider that
| you use for userspace app tracing only works on processes
| that are already running. So, if more processes with the
| executable of interest launch after you've started DTrace,
| its pid provider won't catch the new ones. Then you end up
| doing some tricks like tracking exec-s of the binary and
| restarting your DTrace script...
| bcantrill wrote:
| That's not exactly correct, and is merely a consequence of
| the fact that you are trying to use the pid provider. The
| issue that you're seeing is that pid probes are created on-
| the-fly -- and if you don't demand that they are created in
| a new process, they in fact won't be. USDT probes generally
| don't have this issue (unless they are explicitly lazily
| created -- and some are). So you don't actually need/want
| to restart your DTrace script, you just want to force
| probes to be created in new processes (which will
| necessitate some tricks, just different ones).
| tanelpoder wrote:
| So how would you demand that they'd be created in a new
| process? I was already using pid* provider years ago when
| I was working on this (and wasn't using static compiled-
| in tracepoints).
| bcantrill wrote:
| They're really very different -- with very different origins
| and constraints. If you want to hear about my own experiences
| with bpftrace, I got into this a bit recently.[0] (And in fact,
| one of my questions about the article is how they deal with
| silently dropped data in eBPF -- which I found to be pretty
| maddening.)
|
| [0] https://www.youtube.com/watch?v=mqvVmYhclAg#t=12m0s
| edenfed wrote:
| By dropped data, do you mean exceeding the size of the
| allocated ring buffer/perf buffer? If so, this is configurable
| by the user, so you can adjust it according to the expected
| load.
| bcantrill wrote:
| eBPF can drop data silently under quite a few conditions,
| unfortunately. And -- most frustratingly -- it's silent, so
| it's not even entirely clear which condition you've fallen
| into. This alone is a pretty significant difference with
| respect to DTrace: when/where DTrace drops data, there is
| _always_ an indicator as to why. And to be clear, this isn't a
| difference merely of implementation (though that too,
| certainly), but of principle: DTrace, at root, is a debugger
| -- and it strives to be as transparent to the user as possible
| as to the truth of the underlying system.
| zengid wrote:
| I listened to this live! That's probably why I was wondering,
| because I remember you talking about something you used in
| Linux that didn't quite live up to your expectations with
| DTrace, but I didn't catch all of the names. Thanks!
| Thaxll wrote:
| Of course it outperforms it, but it's basic instrumentation;
| how do you properly select the labels, for example? In your
| application you will have custom instrumentation for business
| logic, so what do you do? Now you have two systems
| instrumenting the same app?
| edenfed wrote:
| You can enrich the spans created by eBPF by using the
| OpenTelemetry APIs as usual; the eBPF instrumentation is a
| replacement for the instrumentation SDK. The eBPF program will
| detect the data recorded via the APIs and add it to the final
| trace, combining both automatically and manually created data.
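|
| For example, in Go the enrichment is just regular OpenTelemetry
| API usage (a sketch; the attribute name is made up):
|
|     package app
|
|     import (
|         "context"
|
|         "go.opentelemetry.io/otel/attribute"
|         "go.opentelemetry.io/otel/trace"
|     )
|
|     // addBusinessContext attaches custom attributes to whatever
|     // span is already active in ctx -- here, the span started
|     // automatically by the eBPF instrumentation.
|     func addBusinessContext(ctx context.Context, orderID string) {
|         span := trace.SpanFromContext(ctx)
|         span.SetAttributes(attribute.String("order.id", orderID))
|     }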
___________________________________________________________________
(page generated 2023-10-30 23:00 UTC)