[HN Gopher] Show HN: Xcapture-BPF - like Linux top, but with Xray vision
___________________________________________________________________
Show HN: Xcapture-BPF - like Linux top, but with Xray vision
Author : tanelpoder
Score : 371 points
Date : 2024-07-03 20:52 UTC (1 day ago)
(HTM) web link (0x.tools)
(TXT) w3m dump (0x.tools)
| jamesy0ung wrote:
| I've never used eBPF, does anyone have some good resources for
| learning it?
| tanelpoder wrote:
| Brendan Gregg's site (and book) is probably the best starting
| point (he was involved in DTrace work & rollout 20 years ago
| when at Sun) and was/is instrumental in pushing eBPF in Linux
| even further than DTrace ever went:
|
| https://brendangregg.com/ebpf.html
| bcantrill wrote:
| Just a quick clarification: while Brendan was certainly an
| active DTrace user and evangelist, he wasn't involved in the
| development of DTrace itself -- or its rollout. (Brendan came
| to Sun in 2006; DTrace was released in 2003.) As for eBPF
| with respect to DTrace, I would say that they are different
| systems with different goals and approaches rather than one
| eclipsing the other. (There are certainly many things that
| DTrace can do that eBPF/BCC cannot, some of the details of
| which we elaborated on in our 20th anniversary of DTrace's
| initial integration.[0])
|
| Edit: We actually went into much more specific detail on
| eBPF/BCC in contrast to DTrace a few weeks after the 20th
| anniversary podcast.[1]
|
| [0] https://www.youtube.com/watch?v=IeUFzBBRilM
|
| [1] https://www.youtube.com/watch?v=mqvVmYhclAg#t=12m7s
| tanelpoder wrote:
| Thanks, yes I was more or less aware of that (I'd been
| using DTrace since Solaris 10 beta in 2004 or 2003?)... By
| rollout I really meant "getting the word out there"...
| that's half the battle in my experience (that's why this
| post is here! :-)
|
| What I loved about DTrace was that once it was out, even in
| beta, it was pretty complete and worked - all the DTrace
| ports that I've tried, including on Windows (!) a few years
| ago, were very limited or had some showstopper issues. I
| guess eBPF was like that too some years ago, but by now
| it's pretty sweet even for more regular consumers who don't
| keep track of its development.
|
| Edit: Oh, wasn't aware of the timeline, I may have some
| dates (years) wrong in my memory.
| abrookewood wrote:
| Yes, not involved in DTrace itself, but he did write a
| bunch of DTrace Tools which led to an interesting meeting
| with a Sun exec:
| https://www.brendangregg.com/blog/2021-06-04/an-unbelievable...
| anonfordays wrote:
| >As for eBPF with respect to DTrace, I would say that they
| are different systems with different goals and approaches
|
| For sure. Different systems, different times.
|
| >rather than one eclipsing the other.
|
| It does seem that DTrace has been eclipsed though, at least
| in Linux (which runs the vast majority of the world's
| compute). Is there a reason to use DTrace over eBPF for
| tracing and observability in Linux?
|
| >There are certainly many things that DTrace can do that
| eBPF/BCC cannot
|
| This may be true, but that gap is closing. There are
| certainly many things that eBPF can do that DTrace cannot,
| like Cilium.
| tanelpoder wrote:
| Perhaps familiarity with the syntax of DTrace, if coming
| from a Solaris-heavy enterprise background. But then again,
| too many years have passed since Solaris was a major
| mainstream platform. Oracle ships and supports DTrace on
| (Oracle) Linux by the way, but DTrace 2.0 on Linux is a
| scripting frontend that gets compiled to eBPF under the
| hood.
|
| Back when I tried to build xcapture with DTrace, I could
| launch the script and use something like
| /pid$oracle::func:entry/ but IIRC the probe was attached
| only to the processes that already existed and not any
| new ones that were started after loading the DTrace
| probes. Maybe I should have used some lower level APIs or
| something - but eBPF on Linux automatically handles both
| existing and new processes.
| bch wrote:
| > eBPF on Linux automatically handles both existing and
| new processes
|
| Without knowing your particular case, DTrace does too -
| it'd certainly be tricky to use if you're trying to debug
| software that "instantly crashes on startup" if it
| couldn't do that. "execname" (not "pid") is where I'd
| look, or perhaps that part of the predicate is skippable;
| regardless, should be possible.
| tanelpoder wrote:
| For example I used something like
| "pid:module:funcname:entry" probe for userspace things
| (not pid$123 or pid$target, just pid to catch all PIDs
| using the module/funcname of interest). And back when I
| tested, it didn't automatically catch any new PIDs so
| these probes were not fired for them unless I restarted
| my DTrace script (but it was probably before 2010 when I
| last tested it).
|
| Execname is a variable in DTrace and not a probe (?), so
| how would it help with automatically attaching to new
| PIDs? Now that I recall more details, there was no issue
| with statically defined kernel "fbt" probes or
| "profile", but the userspace pid one was where I hit this
| limitation.
| bch wrote:
| > Execname is a variable in DTrace and not a probe (?),
| so how would it help with automatically attaching to new
| PIDs?
|
| You're correct, and I may have provided "a solution" to a
| misunderstanding of your problem - I don't think the "not
| matching new procs/pids" is inherent in DTrace, so indeed
| you might have run into an implementation issue (as it
| was 15 years ago). I misunderstood you as perhaps using a
| predicate matching a specific pid; my fault.
| mgaunard wrote:
| It lets you hook into various points in the kernel; ultimately
| you need to learn how the Linux kernel is structured to make
| the most of it.
|
| Unlike a kernel module, it can only really read data, not
| modify data structures, so it's nice for things like
| tracing kernel events.
|
| The XDP subsystem is specifically designed to let you
| apply filters to network data before it reaches the
| network stack, but it still doesn't give you the same
| level of control or performance as DPDK, since you still
| need the data to go to the kernel.
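|
| As a hedged illustration of the read-only tracing side, a
| minimal BCC example that just watches a scheduler
| tracepoint (my sketch, nothing project-specific):
|
|     from bcc import BPF
|
|     # print every exec() the kernel performs; we can read
|     # tracepoint data, but not modify kernel structures
|     prog = """
|     TRACEPOINT_PROBE(sched, sched_process_exec) {
|         char comm[16];
|         bpf_get_current_comm(&comm, sizeof(comm));
|         bpf_trace_printk("exec: %s\\n", comm);
|         return 0;
|     }
|     """
|     BPF(text=prog).trace_print()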
| tanelpoder wrote:
| Yep (the 0x.tools author here). If you look into my code,
| you'll see that I'm _not_ a good developer :-) But I have a
| decent understanding of Linux kernel flow and kernel/app
| interaction dynamics, thanks to many years of troubleshooting
| large (Oracle) database workloads. So I knew exactly what I
| wanted to measure and how, just had to learn the eBPF parts.
| That's why I picked BCC instead of libbpf, as I was
| somewhat familiar with it already, but a fully dynamic and
| "self-updating" libbpf loading approach is the goal for v3
| (help appreciated!)
| mgaunard wrote:
| Myself I've only built simple things, like tracing sched
| switch events for certain threads, and killing the process
| if they happen (specifically designed as a safety for
| pinned threads).
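|
| Roughly along these lines, as a from-memory BCC sketch (the
| tid is hypothetical, and killing from userspace is just one
| option - newer kernels also have bpf_send_signal() for
| doing it in-kernel):
|
|     from bcc import BPF
|     import ctypes as ct, os, signal
|
|     WATCHED_TID = 12345   # hypothetical pinned thread
|     PROG = """
|     BPF_PERF_OUTPUT(events);
|
|     TRACEPOINT_PROBE(sched, sched_switch) {
|         // a pinned thread should never get switched out;
|         // if it does, report it to userspace
|         if (args->prev_pid == WATCHED_TID) {
|             u32 tid = args->prev_pid;
|             events.perf_submit(args, &tid, sizeof(tid));
|         }
|         return 0;
|     }
|     """
|     b = BPF(text=PROG.replace("WATCHED_TID",
|                               str(WATCHED_TID)))
|
|     def on_event(cpu, data, size):
|         tid = ct.cast(data, ct.POINTER(ct.c_uint32))
|         # the safety action (assumes the tid is signalable
|         # as a pid, i.e. it's the thread group leader)
|         os.kill(tid.contents.value, signal.SIGKILL)
|
|     b["events"].open_perf_buffer(on_event)
|     while True:
|         b.perf_buffer_poll()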
| tanelpoder wrote:
| Same here, until now. I built the earlier xcapture v1
| (also in the repo) about 5 years ago; it just samples
| various /proc/PID/task/TID pseudofiles regularly. Even
| that lets you get pretty far with the thread-level
| activity measurement approach, especially when combined
| with always-on, low-frequency on-CPU sampling with perf.
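|
| The core idea of that v1 sampler fits in a few lines of
| Python - a toy sketch, not the actual xcapture v1 code:
|
|     import glob, time
|
|     while True:
|         # /proc/PID/task/TID/stat: "tid (comm) state ..."
|         for p in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
|             try:
|                 with open(p) as f:
|                     head, rest = f.read().rsplit(")", 1)
|                 comm = head.split("(", 1)[1]
|                 state = rest.split()[0]   # R, S, D, ...
|                 if state in ("R", "D"):   # active threads
|                     print(p.split("/")[4], comm, state)
|             except OSError:
|                 pass   # thread exited during the scan
|         time.sleep(1.0)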
| tptacek wrote:
| I was going to ask "why BCC" (BCC is super clunky) but
| you're way ahead of us. This is great work, thanks for
| posting it.
| tanelpoder wrote:
| Yeah, I already see limitations; the latest one was
| yesterday when I installed earlier Ubuntu versions to see
| how far back this can go - even Ubuntu 22.04 didn't work
| out of the box, and I ended up with a BCC/kernel header
| mismatch issue [1] although the kernel itself supported
| it. A workaround is to download & compile the latest BCC
| yourself, but I don't want to go there, as the
| customers/systems I work on wouldn't go there anyway.
|
| But libbpf with CO-RE will solve these issues as I
| understand, so as long as the _kernel_ supports what you
| need, the CO-RE binary will work.
|
| This raises another issue for me though: it's not easy,
| but _easier_, for enterprises to download and run a single
| Python + single C source file (with <500 code lines to
| review) than a compiled CO-RE binary. My long-term
| plan/hope is that I (we) get the RedHats and AWSes of this
| world to just provide the eventual mature release as a
| standard package.
|
| [1] https://github.com/iovisor/bcc/issues/3993
| tptacek wrote:
| XDP, in its intended configuration, passes pointers to
| packets still on the driver DMA rings (or whatever) directly
| to BPF code, which can modify packets and forward them to
| other devices, bypassing the kernel stack completely. You can
| XDP_PASS a packet if you'd like it to hit the kernel,
| creating an skbuff, and bouncing it through all the kernel's
| network stack code, but the idea is that you don't want to do
| that; if you do, just use TC BPF, which is equivalently
| powerful and more flexible.
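|
| A hedged BCC sketch of that model (the interface name and
| the drop-ICMP rule are just illustrations): the program
| sees raw packet bytes before any skbuff exists, and
| decides per packet whether the kernel stack gets involved
| at all:
|
|     from bcc import BPF
|     import time
|
|     prog = """
|     #define KBUILD_MODNAME "xdp_sketch"
|     #include <uapi/linux/bpf.h>
|     #include <linux/if_ether.h>
|     #include <linux/in.h>
|     #include <linux/ip.h>
|
|     int xdp_prog(struct xdp_md *ctx) {
|         void *data = (void *)(long)ctx->data;
|         void *end  = (void *)(long)ctx->data_end;
|         struct ethhdr *eth = data;
|         if ((void *)(eth + 1) > end)
|             return XDP_PASS;
|         if (eth->h_proto == htons(ETH_P_IP)) {
|             struct iphdr *ip = (void *)(eth + 1);
|             if ((void *)(ip + 1) > end)
|                 return XDP_PASS;
|             if (ip->protocol == IPPROTO_ICMP)
|                 return XDP_DROP;  // no skbuff, no stack
|         }
|         return XDP_PASS;  // this one hits the full stack
|     }
|     """
|     dev = "eth0"   # hypothetical interface
|     b = BPF(text=prog)
|     b.attach_xdp(dev, b.load_func("xdp_prog", BPF.XDP), 0)
|     try:
|         time.sleep(3600)
|     finally:
|         b.remove_xdp(dev, 0)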
| mgaunard wrote:
| Yes, for XDP there is a dedicated API, but for any of the
| other hooks like tracepoints, it's all designed to give you
| read-only access.
|
| The whole CO-RE thing is about having a kernel-version-
| agnostic way of reading fields from kernel data structures.
| tptacek wrote:
| Right, I'm just pushing back on the DPDK thing.
| jiripospisil wrote:
| There's a bunch of examples over at
| https://github.com/iovisor/bcc
| rascul wrote:
| You might find some interesting stuff here
|
| https://ebpf.io/
| lathiat wrote:
| I'll toot my own horn here, but there are plenty of
| presentations about it; Brendan Gregg's are usually pretty
| great.
|
| "bpftrace recipes: 5 real problems solved" - Trent Lloyd
| (Everything Open 2023)
| https://www.youtube.com/watch?v=ZDTfcrp9pJI
| __turbobrew__ wrote:
| I use BCC tools weekly to debug production issues.
| Recently I found we were putting massive pressure on the
| page cache due to having a large number of loopback
| devices, each with its own page cache.
| Enabling direct io on the loopback devices fixed the issue.
|
| eBPF is really a superpower; it lets you do things that
| seem incomprehensible if you don't know about it.
| tptacek wrote:
| I'd love to hear more of this debugging story!
| __turbobrew__ wrote:
| Containers are offered block storage by creating a loopback
| device with a backing file on the kubelet's file system. We
| noticed that on some very heavily utilized nodes, iowait
| was using 60% of all the available cores on the node.
|
| I first confirmed that the NVMe drives were healthy
| according to SMART, then worked up the stack and used BCC
| tools to look at block io latency. Block io latency was
| quite low for the NVMe drives (microseconds) but was
| hundreds of milliseconds for the loopback block devices.
|
| This led me to believe that something was wrong with the
| loopback devices and not the underlying NVMe drives. I used
| cachestat/cachetop and found that the page cache miss rate
| was very high and that we were thrashing the page cache,
| constantly paging data in and out. From there I inspected the
| loopback devices using losetup and found that direct io was
| disabled and the sector size of the loopback device did not
| match the sector size of the backing filesystem.
|
| I modified the loopback devices to use the same sector size
| as the block size of the underlying file system and enabled
| direct io. Instantly, the majority of the page cache was
| freed, iowait went way down, and io throughput went way up.
|
| Without BCC tools I would have never been able to figure this
| out.
|
| Double caching loopback devices is quite the footgun.
|
| Another interesting thing we hit is that our version of
| losetup would happily fail to enable direct io but still
| give you a loopback device; this has since been fixed:
| https://github.com/util-linux/util-linux/commit/d53346ed082d...
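|
| For reference, the fix boiled down to something like this
| (backing file path hypothetical; --sector-size and
| --direct-io are standard util-linux losetup flags):
|
|     import subprocess
|
|     backing = "/var/lib/kubelet/volumes/vol1.img"
|     # attach with a sector size matching the backing fs
|     # block size and with direct io on, so the loop device
|     # doesn't keep a second copy in the page cache
|     subprocess.run(
|         ["losetup", "--find", "--show",
|          "--sector-size", "4096",
|          "--direct-io=on", backing],
|         check=True)
|     # verify: DIO should be 1, LOG-SEC should be 4096
|     subprocess.run(
|         ["losetup", "--list", "-O", "NAME,DIO,LOG-SEC"],
|         check=True)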
| jauntywundrkind wrote:
| There are also Composefs and Puzzlefs, both of which
| attempt to let the page cache work across containers!
|
| https://github.com/containers/composefs
| https://github.com/project-machine/puzzlefs
| FooBarWidget wrote:
| Which container runtime are you using? As far as I know
| both Docker and containerd use overlay filesystems instead
| of loopback devices.
|
| And how did you know that tweaking the sector size to equal
| the underlying filesystem's block size would prevent double
| caching? Where can one get this sort of knowledge?
| __turbobrew__ wrote:
| The loopback devices came from a CSI which creates a
| backing file on the kubelet's filesystem and mounts it
| into the container as a block device. We use containerd.
|
| I knew that enabling direct io would most likely disable
| double caching because that is literally the point of
| enabling direct io on a loopback device. Initially I just
| tried enabling direct io on the loopback devices, but
| that failed with a cryptic "invalid argument" error.
| After some more research I found that direct IO needs the
| sector size to match the filesystem's block size in some
| cases to work.
| M_bara wrote:
| We had something similar about 10 years ago where I worked.
| Customer instances were backed via loopback devices to
| local disks. We didn't think of the double caching - face
| palm - on the loopback devices. What we ended up doing was
| writing a small daemon to posix_fadvise the kernel to skip
| the page cache... your solution is way simpler and more
| elegant...
| hats off to you
| jyxent wrote:
| I've been learning BCC / bpftrace recently to debug a memory
| leak issue on a customer's system, and it has been super
| useful.
| malkia wrote:
| Relatively speaking, how expensive is it to capture the
| call stack when doing sample profiling?
|
| With Intel CET's tech there should be a way to capture a
| shadow stack, which really just contains entry points, but
| I'm wondering if that's going to be used...
| tanelpoder wrote:
| The on-cpu sample profiling is not a big deal for my use cases
| as I don't need the "perf" sampling to happen at 10kHz or
| anything (more like 1-10 Hz, but always on).
|
| But the sched_switch tracepoint is the hottest event, without
| stack sampling it's 200-500ns per event (on my Xeon 63xx CPUs),
| depending on what data is collected. I use #ifdefs to compile
| in only the fields that are actually used (smaller thread_state
| struct, fewer branches and instructions to decode & cache).
| Surprisingly, when I collect the kernel stack, the
| overhead jumps up more than with the user stack (kstack
| goes from say 400ns to 3200ns, while ustack jumps to
| 2800ns per event or so).
|
| I have done almost zero optimizations (and I figure using
| libbpf/BTF/CO-RE will help too). But I'm ok with these numbers
| for most of _my_ workloads of interest, and since eBPF
| programs are not cast in stone, I can do further
| reductions, like actually sampling stacks in the
| sched_switch probe only on every 10th occurrence or
| something.
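|
| That "every 10th occurrence" idea is easy to sketch in BCC
| (an illustrative fragment, not the actual xcapture code):
|
|     from bcc import BPF
|
|     prog = """
|     BPF_STACK_TRACE(stacks, 16384);
|     BPF_PERCPU_ARRAY(counter, u64, 1);
|
|     TRACEPOINT_PROBE(sched, sched_switch) {
|         u32 zero = 0;
|         u64 *n = counter.lookup(&zero);
|         if (!n)
|             return 0;
|         (*n)++;
|         if (*n % 10 != 0)
|             return 0;      // cheap path: no stack walk
|         // pay the kstack cost on every 10th switch only
|         int stack_id = stacks.get_stackid(args, 0);
|         bpf_trace_printk("kstack id %d\\n", stack_id);
|         return 0;
|     }
|     """
|     BPF(text=prog).trace_print()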
|
| So in the worst case, this full-visibility approach might
| not be usable as always-on instrumentation for _some_
| workloads (like some redis/memcached/mysql lookups doing
| 10M context switches/s on a big server), but even with
| such workloads, a
| temporary increase in instrumentation overhead might be ok,
| when there are known recurring problems to troubleshoot.
| malkia wrote:
| Awesome info!!! Thanks a lot!
| metroholografix wrote:
| Folks who find this useful might also be interested in otel-
| profiling-agent [1] which Elastic recently opensourced and
| donated to OpenTelemetry. It's a low-overhead eBPF-based
| continuous profiler which, besides native code, can unwind stacks
| from other widely used runtimes (Hotspot, V8, Python, .NET, Ruby,
| Perl, PHP).
|
| [1] https://github.com/elastic/otel-profiling-agent
| 3abiton wrote:
| I am trying to wrap my head around it; it's still unclear
| to me what it does.
| zikohh wrote:
| That's like most of Grafana's documentation
| kbouck wrote:
| Grafana has one too called Beyla.
|
| https://grafana.com/oss/beyla-ebpf/
___________________________________________________________________
(page generated 2024-07-04 23:01 UTC)