[HN Gopher] Show HN: Xcapture-BPF - like Linux top, but with Xra...
       ___________________________________________________________________
        
       Show HN: Xcapture-BPF - like Linux top, but with Xray vision
        
       Author : tanelpoder
       Score  : 371 points
        Date   : 2024-07-03 20:52 UTC (1 day ago)
        
 (HTM) web link (0x.tools)
 (TXT) w3m dump (0x.tools)
        
       | jamesy0ung wrote:
       | I've never used eBPF, does anyone have some good resources for
       | learning it?
        
         | tanelpoder wrote:
          | Brendan Gregg's site (and book) is probably the best starting
          | point (he was involved in DTrace work & rollout 20 years ago
          | when at Sun, and was/is instrumental in pushing eBPF in Linux
          | even further than DTrace ever went):
         | 
         | https://brendangregg.com/ebpf.html
        
           | bcantrill wrote:
           | Just a quick clarification: while Brendan was certainly an
           | active DTrace user and evangelist, he wasn't involved in the
           | development of DTrace itself -- or its rollout. (Brendan came
           | to Sun in 2006; DTrace was released in 2003.) As for eBPF
           | with respect to DTrace, I would say that they are different
           | systems with different goals and approaches rather than one
           | eclipsing the other. (There are certainly many things that
           | DTrace can do that eBPF/BCC cannot, some of the details of
           | which we elaborated on in our 20th anniversary of DTrace's
           | initial integration.[0])
           | 
           | Edit: We actually went into much more specific detail on
           | eBPF/BCC in contrast to DTrace a few weeks after the 20th
           | anniversary podcast.[1]
           | 
           | [0] https://www.youtube.com/watch?v=IeUFzBBRilM
           | 
           | [1] https://www.youtube.com/watch?v=mqvVmYhclAg#t=12m7s
        
             | tanelpoder wrote:
             | Thanks, yes I was more or less aware of that (I'd been
             | using DTrace since Solaris 10 beta in 2004 or 2003?)... By
             | rollout I really meant "getting the word out there"...
              | that's half the battle in my experience (that's why this
              | post is here! :-)
             | 
             | What I loved about DTrace was that once it was out, even in
             | beta, it was pretty complete and worked - all the DTrace
              | ports that I've tried, including on Windows (!) a few years
              | ago, were very limited or had some showstopper issues. I
              | guess eBPF was like that too some years ago, but by now
              | it's pretty sweet even for more regular consumers who don't
              | keep track of its development.
             | 
             | Edit: Oh, wasn't aware of the timeline, I may have some
             | dates (years) wrong in my memory
        
             | abrookewood wrote:
             | Yes, not involved in DTrace itself, but he did write a
             | bunch of DTrace Tools which led to an interesting meeting
             | with a Sun exec:
             | https://www.brendangregg.com/blog/2021-06-04/an-
             | unbelievable...
        
             | anonfordays wrote:
             | >As for eBPF with respect to DTrace, I would say that they
             | are different systems with different goals and approaches
             | 
             | For sure. Different systems, different times.
             | 
             | >rather than one eclipsing the other.
             | 
             | It does seem that DTrace has been eclipsed though, at least
             | in Linux (which runs the vast majority of the world's
             | compute). Is there a reason to use DTrace over eBPF for
             | tracing and observability in Linux?
             | 
             | >There are certainly many things that DTrace can do that
             | eBPF/BCC cannot
             | 
             | This may be true, but that gap is closing. There are
             | certainly many things that eBPF can do that DTrace cannot,
             | like Cilium.
        
               | tanelpoder wrote:
                | Perhaps familiarity with the syntax of DTrace, if coming
                | from a Solaris-heavy enterprise background. But then again,
               | too many years have passed since Solaris was a major
               | mainstream platform. Oracle ships and supports DTrace on
               | (Oracle) Linux by the way, but DTrace 2.0 on Linux is a
               | scripting frontend that gets compiled to eBPF under the
               | hood.
               | 
               | Back when I tried to build xcapture with DTrace, I could
               | launch the script and use something like
               | /pid$oracle::func:entry/ but IIRC the probe was attached
               | only to the processes that already existed and not any
               | new ones that were started after loading the DTrace
               | probes. Maybe I should have used some lower level APIs or
               | something - but eBPF on Linux automatically handles both
               | existing and new processes.
        
               | bch wrote:
               | > eBPF on Linux automatically handles both existing and
               | new processes
               | 
               | Without knowing your particular case, DTrace does too -
               | it'd certainly be tricky to use if you're trying to debug
               | software that "instantly crashes on startup" if it
               | couldn't do that. "execname" (not "pid") is where I'd
                | look, or perhaps that part of the predicate is skippable;
               | regardless, should be possible.
        
               | tanelpoder wrote:
               | For example I used something like
               | "pid:module:funcname:entry" probe for userspace things
               | (not pid$123 or pid$target, just pid to catch all PIDs
               | using the module/funcname of interest). And back when I
               | tested, it didn't automatically catch any new PIDs so
               | these probes were not fired for them unless I restarted
               | my DTrace script (but it was probably year <2010 when I
               | last tested it).
               | 
               | Execname is a variable in DTrace and not a probe (?), so
               | how would it help with automatically attaching to new
               | PIDs? Now that I recall more details, there was no issue
                | with statically defined kernel "fbt" probes or
               | "profile", but the userspace pid one was where I hit this
               | limitation.
        
               | bch wrote:
               | > Execname is a variable in DTrace and not a probe (?),
               | so how would it help with automatically attaching to new
               | PIDs?
               | 
               | You're correct, and I may have provided "a solution" to a
               | misunderstanding of your problem - I don't think the "not
               | matching new procs/pids" is inherent in DTrace, so indeed
               | you might have run into an implementation issue (as it
               | was 15 years ago). I misunderstood you as perhaps using a
               | predicate matching a specific pid; my fault.
        
         | mgaunard wrote:
         | It lets you hook into various points in the kernel; ultimately
         | you need to learn how the Linux kernel is structured to make
         | the most of it.
         | 
         | Unlike a module, it can only really read data, not modify data
         | structures, so it's nice for things like tracing kernel events.
         | 
          | The XDP subsystem is specifically designed to let you apply
          | filters to network data before it makes it to the network
          | stack, but it still doesn't give you the same level of control
         | or performance as DPDK, since you still need the data to go to
         | the kernel.
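          | 
          | (Illustrative only: a minimal BCC-style sketch of such a
          | read-only hook - it attaches to the sched:sched_process_exec
          | tracepoint and just prints each exec event. Nothing here is
          | from 0x.tools; it's just what a bare-bones BCC program looks
          | like.)
          | 
          |   from bcc import BPF
          | 
          |   # Read-only tracepoint hook: print a line per exec().
          |   b = BPF(text="""
          |   TRACEPOINT_PROBE(sched, sched_process_exec) {
          |       bpf_trace_printk("exec pid=%d\\n", args->pid);
          |       return 0;
          |   }
          |   """)
          |   print("Tracing execs... Ctrl-C to stop")
          |   b.trace_print()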
        
           | tanelpoder wrote:
           | Yep (the 0x.tools author here). If you look into my code,
           | you'll see that I'm _not_ a good developer :-) But I have a
            | decent understanding of Linux kernel flow and kernel/app
           | interaction dynamics, thanks to many years of troubleshooting
           | large (Oracle) database workloads. So I knew exactly what I
           | wanted to measure and how, just had to learn the eBPF parts.
            | That's why I picked BCC instead of libbpf, as I was somewhat
            | familiar with it already, but a fully dynamic and "self-
            | updating" libbpf loading approach is the goal for v3 (help
            | appreciated!)
        
             | mgaunard wrote:
              | Myself, I've only built simple things, like tracing sched
             | switch events for certain threads, and killing the process
             | if they happen (specifically designed as a safety for
             | pinned threads).
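              | 
              | (A rough sketch of that kind of watchdog, assuming the
              | pinned thread's TID and the PID to kill are passed on the
              | command line; killing from userspace on the first event is
              | just one way to wire it up.)
              | 
              |   from bcc import BPF
              |   import os, signal, sys
              | 
              |   tid, pid = int(sys.argv[1]), int(sys.argv[2])
              | 
              |   prog = """
              |   TRACEPOINT_PROBE(sched, sched_switch) {
              |       if (args->prev_pid == TID)
              |           bpf_trace_printk("switched out\\n");
              |       return 0;
              |   }
              |   """.replace("TID", str(tid))
              | 
              |   b = BPF(text=prog)
              |   print("watching tid %d" % tid)
              |   b.trace_fields()   # block until the first event
              |   os.kill(pid, signal.SIGKILL)  # pinned thread moved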
        
               | tanelpoder wrote:
                | Same here, until now. I built the earlier xcapture v1
                | (also in the repo) about 5 years ago; it just samples
                | various /proc/PID/task/TID pseudofiles regularly. It
                | shows you can get pretty far with the thread-level
                | activity measurement approach, especially when combined
                | with always-on, low-frequency on-CPU sampling with perf.
        
             | tptacek wrote:
             | I was going to ask "why BCC" (BCC is super clunky) but
             | you're way ahead of us. This is great work, thanks for
             | posting it.
        
               | tanelpoder wrote:
                | Yeah, I already see limitations; the last one was
                | yesterday when I installed earlier Ubuntu versions to see
                | how far back this can go - and even Ubuntu 22.04 didn't
                | work out of the box: I ended up with a BCC/kernel header
               | mismatch issue [1] although the kernel itself supported
               | it. A workaround was to download & compile the latest BCC
               | yourself, but I don't want to go there as the
               | customers/systems I work on wouldn't go there anyway.
               | 
                | But libbpf with CO-RE will solve these issues as I
                | understand it, so as long as the _kernel_ supports what
                | you need, the CO-RE binary will work.
               | 
                | This raises another issue for me though: it's not easy,
                | but _easier_, for enterprises to download and run a
                | single Python + single C source file (with <500 code
                | lines to review) than a compiled CO-RE binary. But my
                | long-term plan/hope is that I (we) get the RedHats and
                | AWSes of this world to just provide the eventual mature
                | release as a standard package.
               | 
               | [1] https://github.com/iovisor/bcc/issues/3993
        
           | tptacek wrote:
           | XDP, in its intended configuration, passes pointers to
           | packets still on the driver DMA rings (or whatever) directly
           | to BPF code, which can modify packets and forward them to
           | other devices, bypassing the kernel stack completely. You can
           | XDP_PASS a packet if you'd like it to hit the kernel,
           | creating an skbuff, and bouncing it through all the kernel's
           | network stack code, but the idea is that you don't want to do
           | that; if you do, just use TC BPF, which is equivalently
           | powerful and more flexible.
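            | 
            | (To make the "pointers to packets" part concrete, a hedged
            | BCC-style sketch; the device name and the drop-ICMP policy
            | are placeholders, not anything from the article:)
            | 
            |   from bcc import BPF
            |   import time
            | 
            |   prog = """
            |   #define KBUILD_MODNAME "xdp_sketch"
            |   #include <uapi/linux/bpf.h>
            |   #include <linux/in.h>
            |   #include <linux/if_ether.h>
            |   #include <linux/ip.h>
            | 
            |   int xdp_sketch(struct xdp_md *ctx) {
            |       void *data     = (void *)(long)ctx->data;
            |       void *data_end = (void *)(long)ctx->data_end;
            |       struct ethhdr *eth = data;
            |       if ((void *)(eth + 1) > data_end)
            |           return XDP_PASS;
            |       if (eth->h_proto != htons(ETH_P_IP))
            |           return XDP_PASS;  // kernel builds an skbuff
            |       struct iphdr *ip = (void *)(eth + 1);
            |       if ((void *)(ip + 1) > data_end)
            |           return XDP_PASS;
            |       if (ip->protocol == IPPROTO_ICMP)
            |           return XDP_DROP;  // never reaches the stack
            |       return XDP_PASS;
            |   }
            |   """
            | 
            |   b = BPF(text=prog)
            |   fn = b.load_func("xdp_sketch", BPF.XDP)
            |   b.attach_xdp("eth0", fn, 0)  # placeholder device
            |   try:
            |       time.sleep(3600)
            |   except KeyboardInterrupt:
            |       pass
            |   b.remove_xdp("eth0", 0)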
        
             | mgaunard wrote:
             | Yes for XDP there is a dedicated API, but for any of the
             | other hooks like tracepoints, it's all designed to give you
             | read-only access.
             | 
             | The whole CO-RE thing is about having a kernel-version-
             | agnostic way of reading fields from kernel data structures.
        
               | tptacek wrote:
               | Right, I'm just pushing back on the DPDK thing.
        
         | jiripospisil wrote:
         | There's a bunch of examples over at
         | https://github.com/iovisor/bcc
        
         | rascul wrote:
         | You might find some interesting stuff here
         | 
         | https://ebpf.io/
        
         | lathiat wrote:
          | I'll toot my own horn here, but there are plenty of
          | presentations about it; Brendan Gregg's are usually pretty
          | great.
         | 
         | "bpftrace recipes: 5 real problems solved" - Trent Lloyd
         | (Everything Open 2023)
         | https://www.youtube.com/watch?v=ZDTfcrp9pJI
        
       | __turbobrew__ wrote:
       | I use BCC tools weekly to debug production issues. Recently I
       | found we were massively pressuring page caches due to having a
       | large number of loopback devices with their own page cache.
       | Enabling direct io on the loopback devices fixed the issue.
       | 
        | eBPF is really a superpower; it lets you do things that are
        | incomprehensible if you don't know about it.
        
         | tptacek wrote:
         | I'd love to hear more of this debugging story!
        
           | __turbobrew__ wrote:
           | Containers are offered block storage by creating a loopback
           | device with a backing file on the kubelet's file system. We
           | noticed that on some very heavily utilized nodes that iowait
           | was using 60% of all the available cores on the node.
           | 
           | I first confirmed that nvme drives were healthy according to
           | SMART, I then worked up the stack and used BCC tools to look
           | at block io latency. Block io latency was quite low for the
           | NVME drives (microseconds) but was hundreds of milliseconds
           | for the loopback block devices.
           | 
            | This led me to believe that something was wrong with the
           | loopback devices and not the underlying NVMEs. I used
           | cachestat/cachetop and found that the page cache miss rate
           | was very high and that we were thrashing the page cache
           | constantly paging in and out data. From there I inspected the
           | loopback devices using losetup and found that direct io was
           | disabled and the sector size of the loopback device did not
           | match the sector size of the backing filesystem.
           | 
           | I modified the loopback devices to use the same sector size
           | as the block size of the underlying file system and enabled
           | direct io. Instantly, the majority of the page cache was
            | freed, iowait went way down, and io throughput went way up.
           | 
           | Without BCC tools I would have never been able to figure this
           | out.
           | 
           | Double caching loopback devices is quite the footgun.
           | 
           | Another interesting thing we hit is that our version of
           | losetup would happily fail to enable direct io but still give
            | you a loopback device; this has since been fixed:
           | https://github.com/util-linux/util-
           | linux/commit/d53346ed082d...
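            | 
            | (From memory, the diagnosis path above maps to roughly these
            | commands; intervals and column choices are just examples.)
            | 
            |   # block I/O latency histograms, one per disk (BCC)
            |   sudo biolatency -D 10 1
            | 
            |   # page cache hit/miss rates, refreshed every second (BCC)
            |   sudo cachestat 1
            | 
            |   # loop devices: backing file, direct I/O, sector size
            |   losetup --list --output NAME,BACK-FILE,DIO,LOG-SEC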
        
             | jauntywundrkind wrote:
             | There's also either Composefs or Puzzlefs, both of which
             | attempt to let the page cache work across containers!
             | 
             | https://github.com/containers/composefs
             | https://github.com/project-machine/puzzlefs
        
             | FooBarWidget wrote:
             | Which container runtime are you using? As far as I know
             | both Docker and containerd use overlay filesystems instead
             | of loopback devices.
             | 
             | And how did you know that tweaking the sector size to equal
             | the underlying filesystem's block size would prevent double
             | caching? Where can one get this sort of knowledge?
        
               | __turbobrew__ wrote:
               | The loopback devices came from a CSI which creates a
               | backing file on the kubelet's filesystem and mounts it
               | into the container as a block device. We use containerd.
               | 
               | I knew that enabling direct io would most likely disable
               | double caching because that is literally the point of
               | enabling direct io on a loopback device. Initially I just
               | tried enabling direct io on the loopback devices, but
               | that failed with a cryptic "invalid argument" error.
               | After some more research I found that direct IO needs the
                | sector size to match the filesystem's block size in some
               | cases to work.
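                | 
                | (Something along these lines; the backing-file path and
                | the 4096-byte sector size are assumptions, chosen to
                | match the backing filesystem's block size so direct I/O
                | is accepted.)
                | 
                |   # new loop device with direct I/O and 4K sectors
                |   losetup --find --show --direct-io=on \
                |       --sector-size 4096 /var/lib/csi/vol.img
                | 
                |   # or, with a recent losetup, adjust an existing device
                |   losetup --sector-size 4096 /dev/loop0
                |   losetup --direct-io=on /dev/loop0
                | 
                |   # verify DIO=1 and LOG-SEC=4096
                |   losetup --list --output NAME,DIO,LOG-SEC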
        
             | M_bara wrote:
             | We had something similar about 10 years ago where I worked.
             | Customer instances were backed via loopback devices to
              | local disks. We didn't think of this - face palm - on the
              | loopback devices. What we ended up doing was writing a
              | small daemon to posix_fadvise the kernel to skip the page
              | cache... your solution is way simpler and more elegant...
             | hats off to you
        
         | jyxent wrote:
         | I've been learning BCC / bpftrace recently to debug a memory
         | leak issue on a customer's system, and it has been super
         | useful.
        
       | malkia wrote:
        | Relatively speaking, how expensive is it to capture the call
        | stack when doing sample profiling?
        | 
        | With Intel CET's tech there should be a way to capture a
        | shadow stack, which really just contains entry points, but I'm
        | wondering if that's going to be used...
        
         | tanelpoder wrote:
          | The on-cpu sample profiling is not a big deal for my use cases
          | as I don't need the "perf" sampling to happen at 10 kHz or
          | anything (more like 1-10 Hz, but always on).
         | 
          | But the sched_switch tracepoint is the hottest event; without
          | stack sampling it's 200-500 ns per event (on my Xeon 63xx
          | CPUs), depending on what data is collected. I use #ifdefs to
          | compile in only the fields that are actually used (smaller
          | thread_state struct, fewer branches and instructions to decode
          | & cache). Surprisingly, when I collect the kernel stack, the
          | overhead jumps higher than for the user stack (kstack goes
          | from say 400 ns to 3200 ns, while ustack jumps to 2800 ns per
          | event or so).
         | 
          | I have done almost zero optimizations (and I figure using
          | libbpf/BTF/CO-RE will help too). But I'm ok with these numbers
          | for most of _my_ workloads of interest, and since eBPF programs
          | are not cast in stone, I can do further reductions, like
          | sampling stacks in the sched_switch probe only on every 10th
          | occurrence or something.
         | 
         | So in worst case, this full-visibility approach might not be
         | usable as always-on instrumentation for _some_ workloads (like
          | some redis/memcached/mysql lookups doing 10M context
         | switches/s on a big server), but even with such workloads, a
         | temporary increase in instrumentation overhead might be ok,
         | when there are known recurring problems to troubleshoot.
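          | 
          | (A rough BCC-style sketch of that "every 10th occurrence"
          | idea; the counter map and the rate are made up for
          | illustration, and this isn't how xcapture-bpf itself is
          | structured.)
          | 
          |   from bcc import BPF
          | 
          |   prog = """
          |   BPF_STACK_TRACE(kstacks, 16384);
          |   BPF_ARRAY(cnt, u64, 1);
          | 
          |   TRACEPOINT_PROBE(sched, sched_switch) {
          |       int zero = 0;
          |       u64 *c = cnt.lookup(&zero);
          |       if (!c)
          |           return 0;
          |       (*c)++;               // racy across CPUs; fine here
          |       if (*c % 10 != 0)
          |           return 0;         // skip the stack walk 9/10 times
          |       int sid = kstacks.get_stackid(args, 0);
          |       bpf_trace_printk("prev=%d kstack=%d\\n",
          |                        args->prev_pid, sid);
          |       return 0;
          |   }
          |   """
          | 
          |   b = BPF(text=prog)
          |   b.trace_print()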
        
           | malkia wrote:
           | Awesome info!!! Thanks a lot!
        
       | metroholografix wrote:
       | Folks who find this useful might also be interested in otel-
       | profiling-agent [1] which Elastic recently opensourced and
       | donated to OpenTelemetry. It's a low-overhead eBPF-based
       | continuous profiler which, besides native code, can unwind stacks
       | from other widely used runtimes (Hotspot, V8, Python, .NET, Ruby,
       | Perl, PHP).
       | 
       | [1] https://github.com/elastic/otel-profiling-agent
        
         | 3abiton wrote:
          | I am trying to wrap my head around it; still unclear what it
          | does.
        
           | zikohh wrote:
           | That's like most of Grafana's documentation
        
         | kbouck wrote:
          | Grafana has one too, called Beyla.
         | 
         | https://grafana.com/oss/beyla-ebpf/
        
       ___________________________________________________________________
       (page generated 2024-07-04 23:01 UTC)