[HN Gopher] How we found and fixed an eBPF Linux kernel vulnerab...
       ___________________________________________________________________
        
       How we found and fixed an eBPF Linux kernel vulnerability
        
       Author : xxmarkuski
       Score  : 257 points
       Date   : 2024-08-08 10:39 UTC (1 days ago)
        
 (HTM) web link (bughunters.google.com)
 (TXT) w3m dump (bughunters.google.com)
        
       | katzinsky wrote:
       | The one time I tried to use eBPF it wasn't expressive enough for
       | what I needed.
       | 
       | Does the limited flexibility it provides really justify the added
       | kernel space complexity? I can understand it for packet filtering
       | but some of the other stuff it's used for like sandboxing just
       | isn't convincing.
        
         | knorker wrote:
         | There are other technologies for this, such as DTrace. The
         | kernel's choice isn't eBPF or nothing, it's eBPF or something
         | else like it.
         | 
         | You may not use it much, but some people use it all day. I
         | think FAANG engineers have said that they run tens (hundreds?)
         | of these things on all servers, all the time. And that's
         | excluding one-offs. And FAANG has full time kernel coders on
         | staff, so they're also funding this complexity that they use.
         | 
         | But also yes, I've solved problems by using eBPF. Problems that
         | are basically unsolvable by non-kernel-gurus without eBPF. I
         | rarely need it. But when I need it, there's nothing else that
         | does the trick.
         | 
         | In some cases, even for kernel gurus, it's a choice between
         | eBPF or maintaining a custom kernel patch forever.
        
           | katzinsky wrote:
           | I'm not sure "Google engineers use it" is a very good counter
           | argument. They have a very high tolerance for complexity and
           | like most large corporations what actually gets built and
           | used tends to be driven more by internal politics than
           | technical merit.
        
             | eggnet wrote:
             | Google would maintain a kernel patch or upstream a patch if
             | that was the right choice for a given problem.
        
               | katzinsky wrote:
               | That's really begging the question. I don't believe they
               | would as they have consistently over engineered solutions
               | in the past.
        
               | DaiPlusPlus wrote:
               | > Google would maintain a kernel patch
               | 
               | I look forward to seeing that patch on Google Graveyard
               | in a couple years' time.
        
             | knorker wrote:
             | I don't mean it as a counter argument, or I don't think the
             | way you mean it, at least.
             | 
             | You may not use it at your smaller scale. But there are
             | millions of machines out there that do use it, and the
             | alternative for the same functionality is much worse.
             | 
             | I bet you never use SCTP sockets either. eBPF is used much
             | more than SCTP.
             | 
             | And its users "fund" its development, so it's not a burden
             | to those who don't use it.
             | 
             | But are you sure your systems don't use it? Run "bpftool
             | prog" to see. Whatever you see there someone thought was
             | better than the alternative.
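
For readers who want to script the "bpftool prog" check, here is a minimal sketch. It assumes bpftool is installed and that its global `-j` (JSON) flag is available; the helper name `list_bpf_programs` and the fields printed at the end are illustrative assumptions, not something from the thread. Listing programs usually needs root, so the helper returns None rather than raising when the query fails.

```python
import json
import shutil
import subprocess


def list_bpf_programs(timeout=5.0):
    """Return loaded eBPF programs as a list of dicts, or None if unavailable."""
    if shutil.which("bpftool") is None:
        return None  # bpftool is not installed or not on PATH
    try:
        result = subprocess.run(
            ["bpftool", "-j", "prog", "show"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None  # most often: insufficient privileges
    try:
        programs = json.loads(result.stdout)
    except json.JSONDecodeError:
        return None
    return programs if isinstance(programs, list) else None


if __name__ == "__main__":
    programs = list_bpf_programs()
    if programs is None:
        print("could not query bpftool (missing, or needs root)")
    else:
        for prog in programs:
            # Field names are assumed from typical bpftool JSON output.
            print(prog.get("id"), prog.get("type"), prog.get("name", "<unnamed>"))
```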
        
           | lynxmachine wrote:
           | > I've solved problems by using eBPF. Problems that are
           | basically unsolvable by non-kernel-gurus without eBPF. I
           | rarely need it.
           | 
           | Would you mind giving some examples? I recently started
           | learning about eBPF from Liz Rice's book and am curious
           | about what makes eBPF the correct choice in a particular
           | scenario.
        
           | znpy wrote:
           | > There are other technologies for this, such as DTrace. The
           | kernel's choice isn't eBPF or nothing, it's eBPF or something
           | else like it.
           | 
           | To add on this point: I successfully used SystemTap a few
           | years ago to debug an issue I was having.
           | 
           | Before going further: keep in mind that my point of view (at
           | the time) was the one of somebody working as a devops
           | engineer, debugging some annoyances with containers (managed
           | by Kubernetes) going OOM. I'm no kernel developer and I have
           | a basic-to-good understanding of the C language based on
           | first-year university courses and geekiness/nerdiness. So in
           | this context I'm a glorified hobbyist.
           | 
           | Learning SystemTap is easier in my opinion. I followed a
           | tutorial by Red Hat to get the hang of the manual parts, but
           | after that I remember it being fairly easy:
           | 
           | 1. Try to reproduce the issue you're having (fairly easy for
           | me)
           | 
           | 2. Skim the Linux kernel source code for the part that you
           | think might be relevant (for me it was the OOM killer)
           | 
           | 3. Add probes in there, see if they fire when you reproduce
           | the issue
           | 
           | 4. Look back at the source code of the kernel and see what
           | chain of data structures and fields you can follow to reach
           | the piece of information you need
           | 
           | 5. Improve your probes
           | 
           | 6. If successful, you're done
           | 
           | 7. Goto 4
           | 
           | I think it took like one or two days between following the
           | tutorial and getting a working probe.
           | 
           | It was a pleasant couple of days.
        
           | fch42 wrote:
           | DTrace and eBPF are "not so different" in the sense that
           | dtrace programs / hooks are also a form of low-level code /
           | instruction set that the kernel (dtrace driver) validates at
           | load. It's an "internal" artifact of dtrace though,
           | https://github.com/illumos/illumos-
           | gate/blob/master/usr/src/... and to my knowledge, nothing
           | like a clang/gcc "dtrace target" exists to translate more-or-
           | less arbitrary higher-level language "to low-level dtrace".
           | 
           | The additional flexibility eBPF gets from this is amazing
           | really. While dtrace is a more-targeted (and for its intended
           | usecases, in some situations still superior to eBPF) but also
           | less-general tool.
           | 
           | (citrus vs. stone fruit ...)
        
             | cryptonector wrote:
             | DTrace's bytecode machine is also very very limited. eBPF's
             | is much less limited. Limiting the scope of what a probe
             | can do is very important.
        
               | bcantrill wrote:
               | Yes, thank you. Long before eBPF existed, we spent a ton
               | of time on the safety of DTrace[0][1] -- there's a bunch
               | of subtlety to it. The proof is in the pudding, however:
               | thanks to our strict adherence to the safety constraint,
               | we have absolute confidence in using DTrace in
               | production.
               | 
               | [0] https://bcantrill.dtrace.org/2005/07/19/dtrace-
               | safety/
               | 
               | [1] https://www.usenix.org/legacy/publications/library/pr
               | oceedin..., SS3.3
        
               | saagarjha wrote:
               | I'm curious which of these tenets you feel would have
               | prevented the bug demonstrated, besides "oh we tried
               | harder"? I don't see any of those that seem unique to
               | DTrace other than limiting where probes can be placed.
        
               | cryptonector wrote:
               | The DTrace bytecode VM is simply more limited:
               | 
               | - it cannot branch backwards (this is also true of eBPF)
               | - it can only do ternary operator branches
               | - it cannot define functions
               | - functions it can call are limited to some builtin ones
               | - it can only scribble on the one pre-allocated probe
               |   buffer
               | - it can only access the probe's defined parameters
        
               | tptacek wrote:
               | eBPF programs can absolutely branch backwards. You may be
               | thinking of cBPF.
        
               | cryptonector wrote:
               | I was thinking of the original BPF. I didn't realize that
               | eBPF added back branching.
        
               | tptacek wrote:
               | If the verifier can prove to itself that a loop is
               | bounded, it'll accept it. A good starting place for eBPF
               | itself: if a normal ARM program could do it, eBPF can do
               | it. It's a fully functional ISA.
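
To make the contrast in this sub-thread concrete, here is a toy sketch of two acceptance rules: the "no backward jumps" rule described for classic BPF and DTrace, and a "loops are fine if they provably terminate" rule. Everything in it is an assumption made for illustration: the four-opcode instruction set, the function names, and the step budget are invented, and the real eBPF verifier is far more involved (it reasons about ranges of unknown values and explores both sides of a branch). Because this toy program has no inputs, simply running it under a step budget is enough to decide termination.

```python
# Toy instruction set (hypothetical, not real BPF encoding):
#   ("mov", reg, imm)          reg = imm
#   ("add", reg, imm)          reg += imm
#   ("jlt", reg, imm, target)  jump to index `target` if reg < imm
#   ("exit",)                  stop


def forward_only(prog):
    """Classic rule: every jump must target a later instruction."""
    return all(ins[3] > pc for pc, ins in enumerate(prog) if ins[0] == "jlt")


def provably_bounded(prog, budget=1000):
    """Accept only if simulation reaches 'exit' within `budget` steps."""
    regs = {}
    pc = 0
    for _ in range(budget):
        if not 0 <= pc < len(prog):
            return False  # control flow fell off the program
        ins = prog[pc]
        op = ins[0]
        if op == "exit":
            return True
        if op == "mov":
            regs[ins[1]] = ins[2]
        elif op == "add":
            regs[ins[1]] = regs.get(ins[1], 0) + ins[2]
        elif op == "jlt":
            if regs.get(ins[1], 0) < ins[2]:
                pc = ins[3]
                continue
        else:
            return False  # unknown opcode
        pc += 1
    return False  # budget exhausted: cannot show that the loop ends


straight_line = [("mov", "r0", 7), ("exit",)]
counted_loop = [("mov", "r0", 0), ("add", "r0", 1), ("jlt", "r0", 10, 1), ("exit",)]
endless_loop = [("mov", "r0", 0), ("jlt", "r0", 1, 1), ("exit",)]

if __name__ == "__main__":
    for name, prog in [("straight_line", straight_line),
                       ("counted_loop", counted_loop),
                       ("endless_loop", endless_loop)]:
        print(name, forward_only(prog), provably_bounded(prog))
```

The counted loop is rejected by the forward-only rule but accepted by the bounded rule, which is the extra expressiveness being debated; the endless loop is rejected by both.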
        
               | cryptonector wrote:
               | I'm w/ the DTrace guys on this. A Turing-complete VM is a
               | bad idea for this purpose.
        
               | tptacek wrote:
               | It depends on what you're using it for. If you want to
               | expose this to untrusted code, yes, but I wouldn't be
               | comfortable doing that with DTrace either.
        
               | cryptonector wrote:
               | There's two untrusted code cases here: untrusted DTrace
               | scripts / users, and untrusted targets for inspection.
               | The latter has to be possible to examine, so the
               | observability tools (like DTrace) have to be secure for
               | that purpose. This means you want to make it difficult to
               | overflow buffers in the observability tools.
               | 
               | There's also a need to make sure that even trusted users
               | don't accidentally cause too much observability load.
               | That's why DTrace has a circular probe buffer pool, it's
               | why it drops probes under load, it's why it pre-allocates
               | each probe's buffer by computing how much the probe's
               | actions will write to it, it's why it doesn't allow
               | looping (since that would make the probe's effect less
               | predictable), etc.
               | 
               | Bryan, Adam, and Mike designed it this way two plus
               | decades ago, and Linux still hasn't caught up.
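
The load-shedding idea described above (fixed buffer, record size known up front, drop instead of block) can be sketched as a toy model. This is not DTrace's implementation: the class name `ProbeBuffer`, its methods, and the single-buffer layout are assumptions made for illustration, and a Python list only stands in for genuinely pre-allocated memory.

```python
class ProbeBuffer:
    """Fixed-capacity record buffer that drops instead of blocking or growing."""

    def __init__(self, capacity_bytes, record_size):
        # The record size is fixed up front, so the number of slots is known
        # before any probe fires.
        self.record_size = record_size
        self.max_records = capacity_bytes // record_size
        self.records = []
        self.drops = 0

    def fire(self, payload):
        """Record one probe firing; return False if it had to be dropped."""
        too_big = len(payload) > self.record_size
        full = len(self.records) >= self.max_records
        if too_big or full:
            self.drops += 1  # shed load rather than slow down the traced system
            return False
        self.records.append(payload)
        return True

    def drain(self):
        """Consumer side: take everything recorded so far and free the slots."""
        out, self.records = self.records, []
        return out


if __name__ == "__main__":
    buf = ProbeBuffer(capacity_bytes=64, record_size=16)  # room for 4 records
    results = [buf.fire(b"x" * 16) for _ in range(6)]
    print(results, "drops:", buf.drops)
```

The point of the design is that the cost of a probe firing stays bounded no matter how fast probes fire; the price is that data is lost, which the drop counter makes visible.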
        
               | tptacek wrote:
               | Linux has a different design than DTrace; eBPF is more
               | capable as a trusted tool, and less capable for untrusted
               | tools. It doesn't make sense to say one approach has
               | "caught up" to the other, unless you really believe the
               | verifier will reach a state where nobody's going to find
               | verifier bugs --- at which point eBPF will be strictly
               | superior. Beyond that, it's a matter of taste. What seems
               | clearly to be true is that eBPF is wildly more popular.
        
               | cryptonector wrote:
               | It's really hard to bring a host to its knees using
               | DTrace, yet it's quite powerful for observability. In my
               | opinion it is better to start with that then add extra
               | power where it's needed.
        
               | tptacek wrote:
               | I understand the argument, but it's clear which one
               | succeeded in the market. Meanwhile: we take pretty good
               | advantage of the extra power eBPF gives us over what
               | DTrace would, so I'm happy to be on the golden path for
               | the platform here. Like I said, though: this is a matter
               | of taste.
        
               | umanwizard wrote:
               | eBPF isn't Turing complete because it has to be able to
               | prove that loops are bounded.
        
               | cryptonector wrote:
               | And I should say that DTrace probe actions _can
               | dereference pointers_, but NULL dereferences do not
               | cause crashes, and rich type data is generally available.
        
               | bcantrill wrote:
               | Well, we didn't merely "try harder" -- we treated safety
               | as a constraint which informed every aspect of the
               | design. And yes, treating safety as a constraint rather
               | than merely an objective results in different
               | implementation decisions. From the article:
               | 
               |  _This working model significantly increases the attack
               | surface of the kernel, since it allows executing
               | arbitrary code at a high privilege level. Because of this
               | risk, programs have to be verified before they can be
               | loaded. This ensures that all eBPF security assumptions
               | are met. The verifier, which consists of complex code, is
               | responsible for this task._
               | 
               |  _Given how difficult the task of validating that a
               | program is safe to execute is, there have been many
               | vulnerabilities found within the eBPF verifier. When one
               | of these vulnerabilities is exploited, the result is
               | usually a local privilege escalation exploit (or
               | container escape in containerized environments). While
               | the verifier's code has been audited extensively, this
               | task also becomes harder as new features are added to
               | eBPF and the complexity of the verifier grows_
               | 
               | DTrace was developed over 20 years ago; there have not
               | been "many vulnerabilities" found in the verifier -- and
               | we have not grown the complexity of the verifier over
               | time. You can dismiss these as implementation details,
               | but these details reflect different views of the problem
               | and its constraints.
        
               | saagarjha wrote:
               | No, like, the bug that was demonstrated seems to be
               | fairly fundamental to running any sort of bytecode in the
               | kernel: they need to verify all branches, and this is
               | potentially slow, so they optimize it (which is where the
               | bug is). What are you doing differently? It seems to me
               | that you're either not going to optimize this or you are?
        
               | tptacek wrote:
               | The DTrace instruction set is more limited than that of
               | the eBPF VM; eBPF is essentially a fully functional ISA,
               | where DTrace was (if I'm remembering this right) designed
               | around the D script language. An eBPF program is often
               | just a clang C program, and you're trusting the kernel
               | verifier to reject it if it can't be proven safe.
               | Further: eBPF programs are JIT'd to actual machine code;
               | once you've loaded and verified an eBPF program, it has
               | conceptually all the same power as, say, shellcode you
               | managed to load into the kernel via an LPE.
               | 
               | That's not to say that security researchers couldn't find
               | DTrace vulnerabilities if they, for instance, built
               | DIF/DOF fuzzers of 2023 levels of sophistication for
               | them. I don't know that anyone's doing that, because
               | DTrace is more or less a dead letter.
        
               | solarengineer wrote:
               | For those who read this thread - DTrace is in use in
               | Solaris and in Illumos, and various of us who use Illumos
               | for our production use cases (like Oxide does) still very
               | much use DTrace.
               | 
               | I appreciate the rest of tptacek's comment which is
               | informative. I also acknowledge that there may not be
               | fuzzers written that have been disclosed.
        
               | tptacek wrote:
               | Oh, sorry, totally fair call-out. There's like a huge
               | implicit "on Linux" thing in my brain about all this
               | stuff.
               | 
               | I'd also be open to an argument that the code quality in
               | DTrace is higher! I spent a week trying to unwind the
               | verifier so I could port a facsimile of it to userland.
               | It is a lot. My point about fuzzers and stuff isn't that
               | I'm concerned DTrace is full of bugs; I'd be surprised if
               | it was. My thing is just that everything written in
               | memory-unsafe kernel code falls to Google Project
               | Zero-grade vulnerability research, at some point.
               | 
               | That's true of the rest of the kernel, too! So from a
               | threat perspective, maybe it doesn't matter. I think my
               | bias here --- that's all it is --- is that neither of
               | these instrumentation schemes are things I'd want to
               | expose to a shared-kernel cotenant.
               | 
               | Thanks for helping me clarify this.
        
         | ssahoo wrote:
         | Wouldn't even the classic loadable kernel mode driver be a
         | better choice than a patch and eBPF? I know they are unsafe,
         | but people who deal with them know that the power comes with
         | responsibility.
        
           | tptacek wrote:
           | No? SREs roll eBPF programs on the fly just in the process of
           | debugging problems; if you tried to do that with an LKM,
           | you'd almost certainly blow up your system. People who write
           | Linux kernel code routinely crash their systems in the
           | process of development.
        
       | techwiz137 wrote:
       | In my country we have a saying. "Porcupine in the pants". Sounds
       | like for all the good it can do, it isn't written safely and
       | carefully.
        
         | deskr wrote:
         | With experience you'll realise that despite things being done
         | safely and carefully, mistakes can and do pop up.
        
           | bugtodiffer wrote:
           | True. There are some nasty bugs in some very well written
           | code.
        
       | tptacek wrote:
       | A reminder that on the platforms eBPF is most commonly used,
       | verifier bugs don't matter much, because unprivileged code isn't
       | allowed to load eBPF programs to begin with. Bugs like this are
       | thus root -> ring0 vulnerabilities. That's not nothing, but for
       | serverside work it's usually worth the tradeoff, especially
       | because eBPF's track record for kernel LPEs is actually pretty
       | strong compared to the kernel as a whole.
       | 
       | In the setting eBPF is used today, most of the value of the
       | verifier is that it's hard to _accidentally_ crash your kernel
       | with a bad eBPF program. That is comically untrue about an
       | ordinary LKM.
        
         | chc4 wrote:
         | The PoC uses eBPF maps as their out-of-bounds pointer, but it
         | sounds like it would also be exploitable via non-extended BPF
         | programs loadable via seccomp since it's just improper scalar
         | value range tracking, which doesn't require any privileges on
         | most platforms.
         | 
         | And, of course, root -> ring0 is less of a problem with
         | unprivileged user namespaces where you can make yourself
         | "root", as we've seen in every eBPF bug PoC since distros
         | started turning that on (and have since turned it off again,
         | mostly)
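
As a rough illustration of why wrong scalar range tracking matters, here is a toy interval model. It is not the bug from the article and not the kernel's data structures: the `Range` type, the function names, the map size, and the deliberately wrong refinement are all invented for this sketch. The idea is that a verifier accepts a memory access only if every value in the tracked range is in bounds, so a range that is too narrow lets a real out-of-bounds value through.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Inclusive bounds the toy verifier believes a scalar lies within."""
    lo: int
    hi: int


def refine_le(r, k):
    """Range of x on the branch where the check `x <= k` succeeded."""
    return Range(r.lo, min(r.hi, k))


def refine_le_buggy(r, k):
    """Deliberately wrong: treats `x <= k` as if it were `x < k`."""
    return Range(r.lo, min(r.hi, k - 1))


def access_is_safe(r, size):
    """An index is accepted only if the whole range fits in [0, size)."""
    return 0 <= r.lo and r.hi < size


MAP_SIZE = 8
unknown_byte = Range(0, 255)  # e.g. a value read from untrusted input

# Program being checked (in pseudocode): if x <= 8: touch(arr[x])
correct_verdict = access_is_safe(refine_le(unknown_byte, 8), MAP_SIZE)
buggy_verdict = access_is_safe(refine_le_buggy(unknown_byte, 8), MAP_SIZE)

# At run time, x = 8 passes the guard but indexes one past the end.
x = 8
passes_guard = x <= 8
out_of_bounds = not (0 <= x < MAP_SIZE)

if __name__ == "__main__":
    print("correct tracking accepts:", correct_verdict)
    print("buggy tracking accepts:", buggy_verdict)
    print("x=8 passes guard and is out of bounds:", passes_guard and out_of_bounds)
```

With correct tracking the program is rejected; with the off-by-one refinement it is accepted even though a concrete input reaches an out-of-bounds index.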
        
           | tptacek wrote:
           | I just want to say that this is a hell of a nerd snipe.
        
             | chc4 wrote:
             | LMAO
             | 
             | Ok that's fair. check_seccomp_filter actually has a more
             | restrictive list than just "BPF with no backwards jumps",
             | and in particular doesn't allow BPF_IND in the BPF_LDX, so
             | you can't read out of bounds because you can't use a
             | dynamic displacement...but BPF_STX _is_ allowed, so you can
             | probably write out of bounds? BPF_W is the seccomp_data
             | address and the control flow diagram they show to compute
             | incorrect scalar ranges doesn't require any backwards
             | jumps...
        
               | tptacek wrote:
               | I feel like I just played the Uno Reverse card on the
               | nerd snipe.
        
         | 10000truths wrote:
         | Verifier bugs matter because resolving them is a prerequisite
         | for secure unprivileged use of eBPF.
        
           | tptacek wrote:
           | Put it this way: verifier bugs matter, but people probably
           | don't do unscheduled fleetwide updates to fix them.
        
           | mort96 wrote:
           | Verifier bugs matter _for the kernel, which wants eBPF to be
           | secure even for unprivileged accounts_.
           | 
           | Verifier bugs don't matter _that much, for most Linux users,
           | right now, because unprivileged accounts can't use eBPF._
        
         | dumpling777 wrote:
         | Let's not forget also that we can give CAP_BPF to containers.
         | With things like Cilium on the rise, the attack vector of
         | landing in a container environment that has CAP_BPF is more
         | and more realistic.
        
           | tptacek wrote:
           | I don't believe shared-kernel container systems are real
           | security boundaries to begin with, so, to me, a container
           | running with CAP_BPF isn't much different than any other
           | program a machine owner might opt to run; the point is that
           | you trust the workload, and so the verifier is more of a
           | safety net than a vault door.
        
             | kortilla wrote:
             | That pessimistic view is not shared by everyone who is
             | working on namespaces, cgroups, etc so I think that's a
             | pretty unproductive comment in this context.
             | 
             | It reminds me of early days in hypervisors when someone
             | would get an exploit to break out of the isolation and
             | someone would dismiss it because "virtual machines aren't
             | real isolation anyway".
             | 
             | Look, I get it and I frankly agree with you in the current
             | state of the world, but this is the time to shut up and get
             | out of the way of people trying to make forward progress.
             | Breakouts of containers are a big deal for people pushing
             | the boundary there.
        
               | tptacek wrote:
               | I don't know who you're really talking to (it's not me),
               | but all I'm saying is that CAP_BPF doesn't bother me
               | much, because it's problematic only for a security
               | boundary that is already problematic with a much lower
               | degree of difficulty for attackers than the eBPF
               | verifier.
        
       | mrbluecoat wrote:
       | > "Uno no es ninguno" (One is none)
       | 
       | I believe that translates to "One is not none"
       | 
       | https://bughunters.google.com/blog/6303226026131456/a-deep-d...
        
         | DanielVZ wrote:
         | That's the direct translation, but for some reason in Spanish
         | our double negations are usually just negations.
        
         | kmarc wrote:
         | It doesn't; it translates to "One is none". This is the
         | infamous double negation many foreign speakers (including me)
         | struggle with.
         | 
         | https://spanish.stackexchange.com/questions/26777/how-does-d...
        
         | samatman wrote:
         | Perhaps we should translate this as "one ain't nothin'".
        
       | TacticalCoder wrote:
       | > "Uno no es ninguno" (One is none)
       | 
       | Literally "One not is none", aka "One is _not_ none".
        
         | jolmg wrote:
         | In Spanish, it's common for double negatives to not actually be
         | double negatives. For example, if you wanted to say "there's
         | nothing here", you'd say "no hay nada aqui", which word-for-
         | word means "there's not nothing here".
         | 
         | Checking out the Royal Spanish Academy, here's what they say
         | about it:
         | 
         | https://www.rae.es/espanol-al-dia/doble-negacion-no-vino-nad...
         | 
         | > The so-called "double negation" is due to the obligatory
         | negative agreement that must be established in Spanish, and
         | other Romance languages, in certain circumstances (see New
         | Grammar, SS 48.3d), which results in the joint presence in the
         | statement of the adverb _no_ and other elements that also have
         | a negative meaning.
         | 
         | > The concurrence of these two "negations" does not annul the
         | negative meaning of the statement.
        
           | stirfish wrote:
           | I like to think of it as additive negatives, as opposed to
           | multiplicative negatives.
        
           | cassepipe wrote:
           | It's true, but I don't think this would apply to such a
           | simple statement as in this case; otherwise, how would you
           | say "One is _not_ none" in Spanish?
        
             | dgb23 wrote:
             | My guess is you wouldn't use negation.
        
             | mejutoco wrote:
             | Uno no es ninguno or uno no es cero or uno es diferente de
             | cero all communicate this correctly IMO.
        
               | cassepipe wrote:
               | But "Uno no es ninguno" is the original phrase that's
               | given for "One is none"
        
           | b0afc375b5 wrote:
           | I guess this is similar to English: "I ain't no snitch",
           | which is a double negative but is equivalent to its single
           | negative counterpart.
        
           | mejutoco wrote:
           | Same in French: "Je ne sais pas" means I do not know, not I
           | do not not know (aka I know).
           | 
           | In any case, the meaning of the sentence above: "uno no es
           | ninguno" in Spanish is clearly one is not zero, or one is not
           | none, or one is different than none.
           | 
           | "Uno no es nada" could be "one is nothing", and "one is not
           | nothing". It all depends on the frame of reference (in this
           | case English), but for this sentence, the "one is not none"
           | is correct IMO. I would never even do a second pass on that
           | sentence, as a native Spanish speaker (appeal to authority, I
           | know)
        
       ___________________________________________________________________
       (page generated 2024-08-09 23:02 UTC)