       ___________________________________________________________________
        
       How we found and fixed an eBPF Linux kernel vulnerability
        
       Author : xxmarkuski
       Score  : 213 points
       Date   : 2024-08-08 10:39 UTC (12 hours ago)
        
 (HTM) web link (bughunters.google.com)
 (TXT) w3m dump (bughunters.google.com)
        
       | katzinsky wrote:
       | The one time I tried to use eBPF it wasn't expressive enough for
       | what I needed.
       | 
       | Does the limited flexibility it provides really justify the added
       | kernel space complexity? I can understand it for packet filtering
       | but some of the other stuff it's used for like sandboxing just
       | isn't convincing.
        
         | knorker wrote:
         | There are other technologies for this, such as DTrace. The
         | kernel's choice isn't eBPF or nothing, it's eBPF or something
         | else like it.
         | 
         | You may not use it much, but some people use it all day. I
         | think FAANG engineers have said that they run tens (hundreds?)
         | of these things on all servers, all the time. And that's
         | excluding one-offs. And FAANG has full time kernel coders on
         | staff, so they're also funding this complexity that they use.
         | 
         | But also yes, I've solved problems by using eBPF. Problems that
         | are basically unsolvable by non-kernel-gurus without eBPF. I
         | rarely need it. But when I need it, there's nothing else that
         | does the trick.
         | 
         | In some cases, even for kernel gurus, it's a choice between
         | eBPF or maintaining a custom kernel patch forever.
        
           | katzinsky wrote:
           | I'm not sure "Google engineers use it" is a very good counter
           | argument. They have a very high tolerance for complexity and
           | like most large corporations what actually gets built and
           | used tends to be driven more by internal politics than
           | technical merit.
        
             | eggnet wrote:
             | Google would maintain a kernel patch or upstream a patch if
             | that was the right choice for a given problem.
        
               | katzinsky wrote:
               | That's really begging the question. I don't believe they
               | would as they have consistently over engineered solutions
               | in the past.
        
             | knorker wrote:
             | I don't mean it as a counter argument, or I don't think the
             | way you mean it, at least.
             | 
             | You may not use it at your smaller scale. But there are
             | millions of machines out there that do use it, and the
             | alternative for the same functionality is much worse.
             | 
             | I bet you never use SCTP sockets either. eBPF is used much
             | more than SCTP.
             | 
             | And its users "fund" its development, so it's not a burden
             | to those who don't use it.
             | 
             | But are you sure your systems don't use it? Run "bpftool
             | prog" to see. Whatever you see there someone thought was
             | better than the alternative.
        
           | lynxmachine wrote:
           | > I've solved problems by using eBPF. Problems that are
           | basically unsolvable by non-kernel-gurus without eBPF. I
           | rarely need it.
           | 
           | Would you mind giving some examples? I recently started
           | learning about eBPF from Liz Rice's book and am curious
           | about what makes eBPF the correct choice in a particular
           | scenario.
        
           | znpy wrote:
           | > There are other technologies for this, such as DTrace. The
           | kernel's choice isn't eBPF or nothing, it's eBPF or something
           | else like it.
           | 
           | To add to this point: I successfully used SystemTap a few
           | years ago to debug an issue I was having.
           | 
           | Before going further: keep in mind that my point of view (at
           | the time) was the one of somebody working as a devops
           | engineer, debugging some annoyances with containers (managed
           | by Kubernetes) going OOM. I'm no kernel developer and I have
           | a basic-good understanding of the C language based on first-
           | years university course and geekyness/nerdyness. So in this
           | context I'm a glorified hobbyist.
           | 
           | Learning SystemTap is easier in my opinion. I followed a
           | tutorial by Red Hat to get the hang of the manual parts, but
           | after that I remember it being fairly easy:
           | 
           | 1. Try to reproduce the issue you're having (fairly easy for
           | me)
           | 
           | 2. Skim the Linux kernel source code for the part that you
           | think might be relevant (for me it was the OOM killer)
           | 
           | 3. Add probes in there, see if they fire when you reproduce
           | the issue
           | 
           | 4. Look back at the source code of the kernel and see what
           | chain of data structures and fields you can follow to reach
           | the piece of information you need
           | 
           | 5. Improve your probes
           | 
           | 6. If successful, you're done
           | 
           | 7. Goto 4
           | 
           | I think it took like one or two days between following the
           | tutorial and getting a working probe.
           | 
           | It was a pleasant couple of days.
        
           | fch42 wrote:
           | DTrace and eBPF are "not so different" in the sense that
           | dtrace programs / hooks are also a form of low-level code /
           | instruction set that the kernel (dtrace driver) validates at
           | load. It's an "internal" artifact of dtrace though,
           | https://github.com/illumos/illumos-
           | gate/blob/master/usr/src/... and to my knowledge, nothing
           | like a clang/gcc "dtrace target" exists to translate more-or-
           | less arbitrary higher-level language "to low-level dtrace".
           | 
           | The additional flexibility eBPF gets from this is amazing
           | really. DTrace, by contrast, is a more targeted (and for its
           | intended use cases, in some situations still superior to
           | eBPF) but also less general tool.
           | 
           | (citrus vs. stone fruit ...)
        
             | cryptonector wrote:
             | DTrace's bytecode machine is also very very limited. eBPF's
             | is much less limited. Limiting the scope of what a probe
             | can do is very important.
        
               | bcantrill wrote:
               | Yes, thank you. Long before eBPF existed, we spent a ton
               | of time on the safety of DTrace[0][1] -- there's a bunch
               | of subtlety to it. The proof is in the pudding, however:
               | thanks to our strict adherence to the safety constraint,
               | we have absolute confidence in using DTrace in
               | production.
               | 
               | [0] https://bcantrill.dtrace.org/2005/07/19/dtrace-
               | safety/
               | 
               | [1] https://www.usenix.org/legacy/publications/library/pr
               | oceedin..., §3.3
        
               | saagarjha wrote:
               | I'm curious which of these tenets you feel would have
               | prevented the bug demonstrated, besides "oh we tried
               | harder"? I don't see any of those that seem unique to
               | DTrace other than limiting where probes can be placed.
        
               | cryptonector wrote:
               | The DTrace bytecode VM is simply more limited:
               | 
               | - it cannot branch backwards (this is also true of eBPF)
               | - it can only do ternary operator branches
               | - it cannot define functions
               | - functions it can call are limited to some builtin ones
               | - it can only scribble on the one pre-allocated probe
               |   buffer
               | - it can only access the probe's defined parameters
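
For illustration only: a minimal Python sketch of the "cannot branch backwards" rule from the list above. The instruction format (`(opcode, offset)` tuples) and the name `has_only_forward_branches` are made up for this example; this is not DTrace's DIF or the kernel's BPF encoding. The idea is that if every jump lands strictly ahead of itself, a program of N instructions runs at most N steps, so termination needs no further proof.

```python
from typing import List, Tuple

# Toy instruction: (opcode, jump_offset). Offsets are relative to the
# *next* instruction. Hypothetical format, not DIF or eBPF bytecode.
Insn = Tuple[str, int]


def has_only_forward_branches(prog: List[Insn]) -> bool:
    """Return True if every jump lands strictly ahead and inside the program."""
    for pc, (op, off) in enumerate(prog):
        if op != "jmp":
            continue
        target = pc + 1 + off
        # Reject backward jumps and jumps past the last instruction.
        if off < 0 or target >= len(prog):
            return False
    return True


# A program whose only jump skips one instruction forward.
forward_only = [("load", 0), ("jmp", 1), ("add", 0), ("exit", 0)]
# A program that jumps back to an earlier instruction (a loop).
with_loop = [("load", 0), ("add", 0), ("jmp", -2), ("exit", 0)]

print(has_only_forward_branches(forward_only))
print(has_only_forward_branches(with_loop))
```

A verifier that accepts loops has to do strictly more work than this: it must also show that each loop is bounded.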
        
               | tptacek wrote:
               | eBPF programs can absolutely branch backwards. You may be
               | thinking of cBPF.
        
               | cryptonector wrote:
               | I was thinking of the original BPF. I didn't realize that
               | eBPF added back branching.
        
               | tptacek wrote:
               | If the verifier can prove to itself that a loop is
               | bounded, it'll accept it. A good starting place for eBPF
               | itself: if a normal ARM program could do it, eBPF can do
               | it. It's a fully functional ISA.
        
               | cryptonector wrote:
               | I'm w/ the DTrace guys on this. A Turing-complete VM is a
               | bad idea for this purpose.
        
               | tptacek wrote:
               | It depends on what you're using it for. If you want to
               | expose this to untrusted code, yes, but I wouldn't be
               | comfortable doing that with DTrace either.
        
               | cryptonector wrote:
               | There's two untrusted code cases here: untrusted DTrace
               | scripts / users, and untrusted targets for inspection.
               | The latter has to be possible to examine, so the
               | observability tools (like DTrace) have to be secure for
               | that purpose. This means you want to make it difficult to
               | overflow buffers in the observability tools.
               | 
               | There's also a need to make sure that even trusted users
               | don't accidentally cause too much observability load.
               | That's why DTrace has a circular probe buffer pool, it's
               | why it drops probes under load, it's why it pre-allocates
               | each probe's buffer by computing how much the probe's
               | actions will write to it, it's why it doesn't allow
               | looping (since that would make the probe's effect less
               | predictable), etc.
               | 
               | Bryan, Adam, and Mike designed it this way two plus
               | decades ago, and Linux still hasn't caught up.
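
A rough Python sketch of the two ideas described above: sizing a probe's buffer from what its actions can write, and dropping records under load instead of growing. The class `ProbeBuffer`, the helper `required_size`, and the action sizes are all hypothetical; this is not how DTrace is actually implemented, only the shape of the trade-off.

```python
class ProbeBuffer:
    """Fixed-capacity buffer that drops records instead of growing."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.records = []
        self.drops = 0

    def write(self, record: bytes) -> bool:
        # Never allocate on the hot path: if the record does not fit,
        # count a drop and move on.
        if self.used + len(record) > self.capacity:
            self.drops += 1
            return False
        self.records.append(record)
        self.used += len(record)
        return True


def required_size(action_sizes):
    """Bytes one probe firing can write: the sum of its fixed-size actions."""
    return sum(action_sizes)


# Hypothetical probe with three fixed-size actions (8, 4 and 4 bytes).
per_firing = required_size([8, 4, 4])
buf = ProbeBuffer(capacity=2 * per_firing)

# Three firings against room for two: the third is dropped, not queued.
results = [buf.write(b"x" * per_firing) for _ in range(3)]
print(results, buf.drops)
```

Because the per-firing size is known before the probe is enabled, the worst-case memory cost is fixed up front, which is only possible when the probe body cannot loop.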
        
               | tptacek wrote:
               | Linux has a different design than DTrace; eBPF is more
               | capable as a trusted tool, and less capable for untrusted
               | tools. It doesn't make sense to say one approach has
               | "caught up" to the other, unless you really believe the
               | verifier will reach a state where nobody's going to find
               | verifier bugs --- at which point eBPF will be strictly
               | superior. Beyond that, it's a matter of taste. What seems
               | clearly to be true is that eBPF is wildly more popular.
        
               | cryptonector wrote:
               | And I should say that DTrace probe actions _can
               | dereference pointers_, but NULL dereferences do not
               | cause crashes, and rich type data is generally available.
        
               | bcantrill wrote:
               | Well, we didn't merely "try harder" -- we treated safety
               | as a constraint which informed every aspect of the
               | design. And yes, treating safety as a constraint rather
               | than merely an objective results in different
               | implementation decisions. From the article:
               | 
               |  _This working model significantly increases the attack
               | surface of the kernel, since it allows executing
               | arbitrary code at a high privilege level. Because of this
               | risk, programs have to be verified before they can be
               | loaded. This ensures that all eBPF security assumptions
               | are met. The verifier, which consists of complex code, is
               | responsible for this task._
               | 
               |  _Given how difficult the task of validating that a
               | program is safe to execute is, there have been many
               | vulnerabilities found within the eBPF verifier. When one
               | of these vulnerabilities is exploited, the result is
               | usually a local privilege escalation exploit (or
               | container escape in containerized environments). While
               | the verifier's code has been audited extensively, this
               | task also becomes harder as new features are added to
               | eBPF and the complexity of the verifier grows_
               | 
               | DTrace was developed over 20 years ago; there have not
               | been "many vulnerabilities" found in the verifier -- and
               | we have not grown the complexity of the verifier over
               | time. You can dismiss these as implementation details,
               | but these details reflect different views of the problem
               | and its constraints.
        
               | saagarjha wrote:
               | No, like, the bug that was demonstrated seems to be
               | fairly fundamental to running any sort of bytecode in the
               | kernel: they need to verify all branches, and this is
               | potentially slow, so they optimize it (which is where the
               | bug is). What are you doing differently? It seems to me
               | that you're either not going to optimize this or you are?
        
               | tptacek wrote:
               | The DTrace instruction set is more limited than that of
               | the eBPF VM; eBPF is essentially a fully functional ISA,
               | where DTrace was (if I'm remembering this right) designed
               | around the D script language. An eBPF program is often
               | just a clang C program, and you're trusting the kernel
               | verifier to reject it if it can't be proven safe.
               | Further: eBPF programs are JIT'd to actual machine code;
               | once you've loaded and verified an eBPF program, it has
               | conceptually all the same power as, say, shellcode you
               | managed to load into the kernel via an LPE.
               | 
               | That's not to say that security researchers couldn't find
               | DTrace vulnerabilities if they, for instance, built
               | DIF/DOF fuzzers of 2023 levels of sophistication for
               | them. I don't know that anyone's doing that, because
               | DTrace is more or less a dead letter.
        
         | ssahoo wrote:
         | Wouldn't even the classic loadable kernel mode driver be a
         | better choice than a patch and eBPF? I know they are unsafe but
         | people who deal with it know the power comes with
         | responsibility.
        
           | tptacek wrote:
           | No? SREs roll eBPF programs on the fly just in the process of
           | debugging problems; if you tried to do that with an LKM,
           | you'd almost certainly blow up your system. People who write
           | Linux kernel code routinely crash their systems in the
           | process of development.
        
       | techwiz137 wrote:
       | In my country we have a saying. "Porcupine in the pants". Sounds
       | like for all the good it can do, it isn't written safely and
       | carefully.
        
         | deskr wrote:
         | With experience you'll realise that despite things being done
         | safely and carefully, mistakes can and do pop up.
        
       | tptacek wrote:
       | A reminder that on the platforms where eBPF is most commonly
       | used, verifier bugs don't matter much, because unprivileged code
       | isn't allowed to load eBPF programs to begin with. Bugs like this
       | are thus root -> ring0 vulnerabilities. That's not nothing, but for
       | serverside work it's usually worth the tradeoff, especially
       | because eBPF's track record for kernel LPEs is actually pretty
       | strong compared to the kernel as a whole.
       | 
       | In the setting eBPF is used today, most of the value of the
       | verifier is that it's hard to _accidentally_ crash your kernel
       | with a bad eBPF program. That is comically untrue about an
       | ordinary LKM.
        
         | chc4 wrote:
         | The PoC uses eBPF maps as their out-of-bounds pointer, but it
         | sounds like it would also be exploitable via non-extended BPF
         | programs loadable via seccomp since it's just improper scalar
         | value range tracking, which doesn't require any privileges on
         | most platforms.
         | 
         | And, of course, root -> ring0 is less of a problem with
         | unprivileged user namespaces where you can make yourself
         | "root", as we've seen in every eBPF bug PoC since distros
         | started turning that on (and have since turned it off again,
         | mostly)
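
To make "scalar value range tracking" concrete, here is a toy Python sketch of the general idea: keep a conservative `[lo, hi]` interval for each scalar and allow an indexed access only if the whole interval is in bounds. The `Range` class and `access_is_safe` are invented for this example and are far simpler than the kernel verifier's real bookkeeping. The sketch does show why a wrong bound is dangerous: if `hi` is computed too small, an out-of-bounds access passes the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Conservative bounds on a scalar: lo <= value <= hi."""
    lo: int
    hi: int

    def add(self, other: "Range") -> "Range":
        # The sum of two bounded values is bounded by the sums of the bounds.
        return Range(self.lo + other.lo, self.hi + other.hi)

    def and_mask(self, mask: int) -> "Range":
        # For a non-negative mask, (x & mask) always lies in [0, mask].
        if mask < 0:
            raise ValueError("mask must be non-negative in this sketch")
        return Range(0, mask)


def access_is_safe(index: Range, size: int) -> bool:
    """Allow buf[index] only if every possible index is inside the buffer."""
    return 0 <= index.lo and index.hi < size


unknown = Range(0, 255)            # e.g. a byte read from untrusted input
masked = unknown.and_mask(0x0F)    # now known to be in [0, 15]
shifted = masked.add(Range(1, 1))  # in [1, 16]

print(access_is_safe(masked, 16))   # every index 0..15 fits a 16-slot buffer
print(access_is_safe(shifted, 16))  # index 16 is possible, so reject
```

If a buggy tracker reported `shifted` as `[1, 15]`, the second check would pass and index 16 would be reachable at run time.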
        
         | 10000truths wrote:
         | Verifier bugs matter because resolving them is a prerequisite
         | for secure unprivileged use of eBPF.
        
           | tptacek wrote:
           | Put it this way: verifier bugs matter, but people probably
           | don't do unscheduled fleetwide updates to fix them.
        
         | dumpling777 wrote:
         | Let's not forget also that we can give CAP_BPF to containers.
         | With things like Cilium on the rise, the attack vector of
         | landing in a container environment that has CAP_BPF is more
         | and more realistic.
        
           | tptacek wrote:
           | I don't believe shared-kernel container systems are real
           | security boundaries to begin with, so, to me, a container
           | running with CAP_BPF isn't much different than any other
           | program a machine owner might opt to run; the point is that
           | you trust the workload, and so the verifier is more of a
           | safety net than a vault door.
        
       | mrbluecoat wrote:
       | > "Uno no es ninguno" (One is none)
       | 
       | I believe that translates to "One is not none"
       | 
       | https://bughunters.google.com/blog/6303226026131456/a-deep-d...
        
         | DanielVZ wrote:
         | That's the direct translation but for some reason in Spanish our
         | double negations are usually just negations.
        
         | kmarc wrote:
         | It doesn't; it translates to "One is none". This is the infamous
         | double negation that many foreign speakers (including me)
         | struggle with.
         | 
         | https://spanish.stackexchange.com/questions/26777/how-does-d...
        
         | samatman wrote:
         | Perhaps we should translate this as "one ain't nothin'".
        
       | TacticalCoder wrote:
       | > "Uno no es ninguno" (One is none)
       | 
       | Literally "One not is none", aka "One is _not_ none".
        
         | jolmg wrote:
         | In Spanish, it's common for double negatives to not actually be
         | double negatives. For example, if you wanted to say "there's
         | nothing here", you'd say "no hay nada aquí", which word-for-
         | word means "there's not nothing here".
         | 
         | Checking out the Royal Spanish Academy, here's what they say
         | about it:
         | 
         | https://www.rae.es/espanol-al-dia/doble-negacion-no-vino-nad...
         | 
         | > The so-called "double negation" is due to the obligatory
         | negative agreement that must be established in Spanish, and
         | other Romance languages, in certain circumstances (see New
         | Grammar, § 48.3d), which results in the joint presence in the
         | statement of the adverb _no_ and other elements that also have
         | a negative meaning.
         | 
         | > The concurrence of these two "negations" does not annul the
         | negative meaning of the statement.
        
       ___________________________________________________________________
       (page generated 2024-08-08 23:00 UTC)