[HN Gopher] Io_uring and seccomp (2022)
       ___________________________________________________________________
        
       Io_uring and seccomp (2022)
        
       Author : pncnmnp
       Score  : 74 points
       Date   : 2024-10-09 14:42 UTC (3 days ago)
        
 (HTM) web link (blog.0x74696d.com)
        
       | leni536 wrote:
       | > But if you've got a separation of duties where a sysadmin sets
       | up seccomp filtering generically across applications
       | 
       | Is this even possible, regardless of io_uring?
        
         | amarshall wrote:
         | Well, the article brings up containers as an example. If the
         | sysadmin controls "your" parent or root process (e.g. the login
         | shell), they can just perform seccomp filtering there and it
         | applies to everything within it (like any other sandbox).
        
           | 0x74696d wrote:
           | (author here) I'm one of the maintainers of HashiCorp's
           | Nomad, so that example was likely inspired by the separation
           | of duties that's part of our security model. In that
           | environment, there's a subset of task (e.g. container)
           | configuration that's controlled by the cluster admin and a
           | subset that's controlled by the job author deploying onto the
           | cluster.
        
         | klooney wrote:
         | Yes, systemd will let you do that, as will
         | docker/containerd/podman.
        
       | deathanatos wrote:
       | This seems like an instance of an anti-pattern I've seen, which
       | is conflating "permission" with "API call".
       | 
       | IIRC, AWS does this, where permission is by API call. As an
       | example, you can have permission to call ssm:GetParameter _n_
       | times, but if you try to combine those _n_ API calls into a batch
       | with GetParameters, that's a different IAM permission, _even though
       | exactly the same thing is occurring._
        
         | thayne wrote:
         | I find that so frustrating. Another example is uploading an
         | image to ECR (Elastic Container Registry). You need like four
         | different permissions to do it, which I think correspond to
         | individual http requests, but it is usually just a single
         | docker/podman/skopeo command, and I can't think of a situation
         | where you would want to grant permission to initiate an upload
         | but not complete it.
         | 
         | Multipart uploads in S3 have a similar problem.
        
       | cpuguy83 wrote:
       | Both Docker and containerd have been blocking io_uring in their
       | default seccomp profiles for about a year now because of its many
       | security issues.
        
         | hinkley wrote:
         | Has anyone speculated yet about how much slower a secure
         | io_uring has to be? Is it still a net win once you lock it down
         | fully?
        
           | JackSlateur wrote:
           | As far as I know, io_uring is quite secure: a user cannot
           | perform a syscall through it unless they have the privileges
           | required to perform that syscall directly.
           | 
           | I would gladly get more details about the exact purpose of
           | seccomp in a container environment. Reading around a bit
           | online, I find that docker "uses seccomp to block mount(2),
           | which could be used to escape the container", which makes no
           | sense to me because mount(2) requires CAP_SYS_ADMIN.
        
             | cpuguy83 wrote:
             | io_uring CVEs:
             | https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring
             | 
             | seccomp is used for defense in depth. If someone manages to
             | escalate privileges by some means, the seccomp policy will
             | still prevent them from doing nasty things or escalating
             | further.
        
             | poincaredisk wrote:
             | That's not contradictory. Capabilities in docker are also
             | limited, but both are used as a part of defense in depth.
        
           | cpuguy83 wrote:
           | That would be impossible to know. The main thing with
           | io_uring is that it lets you avoid context switches (i.e.
           | system calls) when performing a number of operations.
        
         | bri3d wrote:
         | Google has done the same in ChromeOS, Android, and purportedly
         | its production servers, for around a year and a half. For this
         | reason it's also disabled in several of the kernelCTF
         | configurations, and in the ones where it remains (GKE), it only
         | pays out at half rate in the bug bounty.
        
       | eqvinox wrote:
       | Using seccomp with a default-open filter is a terrible idea to
       | begin with; it wasn't really designed for any of this. Seccomp in
       | its most basic form didn't even have a filter list, it just
       | allowed read() and write(). (And close() or something, don't
       | quote me on the details, the point is it was a fixed list.)
       | You're supposed to use it with a default-closed filter and fully
       | enumerate what you need. (Yes, that's hard in a lot of cases, but
       | still.)
       | 
       | There have been other cases where syscalls got cloned, mostly to
       | add new parameters, but either way seccomp with an "open" filter
       | can only ever be defense-in-depth, not a critical line in itself.
       | 
       | (Don't misunderstand, defense-in-depth is good, and keep using
       | seccomp for it. But an open seccomp filter MUST be considered
       | bypassable.)
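
A minimal sketch of the distinction eqvinox draws between default-open and default-closed filters. This is a toy model in plain Python, not a real seccomp filter (real filters are BPF programs evaluated by the kernel); `make_filter` and the syscall lists are illustrative assumptions, not taken from any actual profile.

```python
# Toy model of seccomp filter policy. Nothing here talks to the kernel.
ALLOW, DENY = "allow", "deny"

def make_filter(listed, listed_action, default_action):
    """Return a function mapping a syscall name to ALLOW or DENY."""
    listed = frozenset(listed)

    def check(syscall):
        return listed_action if syscall in listed else default_action

    return check

# Default-open: deny a few known-dangerous calls, allow everything else.
default_open = make_filter({"mount", "ptrace", "socket"}, DENY, ALLOW)

# Default-closed: allow only what the program is known to need.
default_closed = make_filter({"read", "write", "close", "exit_group"}, ALLOW, DENY)

# A syscall that did not exist when either list was written.
new_syscall = "io_uring_enter"
print(default_open(new_syscall))    # allow
print(default_closed(new_syscall))  # deny
```

In this model the default-open filter lets the newer syscall through without anyone deciding that it should, which is the sense in which an open filter has to be treated as bypassable.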
        
         | poincaredisk wrote:
         | > it just allowed read() and write().
         | 
         | A fun consequence of this is that even though there was a
         | function to check if seccomp is enabled or not, it could only
         | ever do one of two things: return "not enabled" or crash the
         | process.
         | 
         | I agree with everything you wrote. I'll add that having a
         | whitelist is not easy either; I've witnessed many situations where
         | seccomp sandbox broke because glibc/python interpreter started
         | using a different syscall (for example openat with AT_FDCWD
         | instead of open).
        
           | eqvinox wrote:
           | > I've witnessed many situations where seccomp sandbox broke
           | because glibc/python interpreter started using a different
           | syscall (for example openat with AT_FDCWD instead of open)
           | 
           | ACK, that's what I meant with "hard in a lot of cases"... to
           | be honest I think this is a failure of the ecosystem at
           | large. It's a bit of a half feature without some kind of
           | higher-level userspace mechanism to collect who needs what,
           | especially when a bunch of libraries are involved. It's
           | admittedly a very hard problem, e.g. just because something
           | is linking libcurl as a 2nd or 3rd level dependency doesn't
           | mean you intend your process to ever make network
           | connections... I don't think it's unsolvable though.
        
       | FridgeSeal wrote:
       | Surely this is a seccomp shortcoming, or kernel auth shortcoming,
       | rather than an io_uring problem?
       | 
       | That is, seccomp is (apparently? I've never used it myself)
       | capable of intercepting direct calls. Obviously, that design
       | isn't going to be able to handle "indirect" calls in its default
       | implementation.
       | 
       | Either seccomp needs a way to act on the buffer or intercept
       | io_uring calls, or there's a need for a new auth mechanism that's
       | capable of handling io_uring-style APIs.
       | 
       | Torpedoing the whole API (a la GCP) feels like throwing the baby
       | out with the bathwater.
        
         | tptacek wrote:
         | That framing doesn't make sense. System calls and their
         | arguments are an obvious security boundary and have been a
         | sandboxing component for decades. io_uring blows that boundary
         | apart. The "problem" is io_uring, not seccomp.
         | 
         | If you want to make a case for io_uring being benign for
         | security, the right argument is probably against all unmediated
         | shared-kernel multitenancy (i.e. do multitenancy only through
         | virtualization or WASM/V8-type language runtimes, and nothing
         | else). It doesn't make sense to say system call filters are
         | flawed because someone came up with an omni-syscall that breaks
         | those filters.
        
           | asveikau wrote:
           | The syscall implementations themselves do checks and return
           | EPERM/EACCES when appropriate. The mechanism for doing the
           | syscall can change. I mean, in the 90s it happened via int
           | 0x80, then we got sysenter, then the vDSO. io_uring just
           | moved part of it to user mode.
           | 
           | It seems like a totally reasonable design to me to "just" put
           | the right hooks into the filter mechanism and make it get
           | called the same way regardless of the syscall mechanism.
        
           | thayne wrote:
           | The obvious solution is to block operations over io_uring if
           | the equivalent syscall would have been blocked by seccomp.
           | But I'm not sure if there is some reason that wouldn't work.
           | 
           | Another possibility would be to allow setting restrictions on
           | all io_uring operations for the current and all child
           | processes, although that would be less convenient than using
           | the existing seccomp system.
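
A rough sketch of thayne's first suggestion: reject an io_uring operation whenever the equivalent syscall would have been blocked. This is a hypothetical model only, since the kernel does not consult seccomp filters for io_uring operations. The opcode names are real io_uring opcodes, but the `OP_TO_SYSCALL` table and `filter_submissions` are assumptions made for illustration.

```python
# Hypothetical model: map each io_uring opcode to the syscall it stands in for,
# then apply the existing syscall policy to submitted operations.
OP_TO_SYSCALL = {
    "IORING_OP_OPENAT": "openat",
    "IORING_OP_CONNECT": "connect",
    "IORING_OP_READ": "read",
    "IORING_OP_WRITE": "write",
    "IORING_OP_CLOSE": "close",
}

def filter_submissions(ops, syscall_allowed):
    """Split submitted ops into (accepted, rejected) using the syscall policy."""
    accepted, rejected = [], []
    for op in ops:
        syscall = OP_TO_SYSCALL.get(op)
        # Opcodes with no known syscall equivalent are rejected (fail closed).
        if syscall is not None and syscall_allowed(syscall):
            accepted.append(op)
        else:
            rejected.append(op)
    return accepted, rejected

blocked = {"connect"}  # stands in for a seccomp policy that blocks connect(2)
accepted, rejected = filter_submissions(
    ["IORING_OP_READ", "IORING_OP_CONNECT", "IORING_OP_WRITE"],
    lambda name: name not in blocked,
)
print(accepted)  # ['IORING_OP_READ', 'IORING_OP_WRITE']
print(rejected)  # ['IORING_OP_CONNECT']
```

One open question the sketch glosses over is what to do with opcodes that have no single syscall equivalent; here they are simply rejected.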
        
             | tptacek wrote:
             | I assume it's not so much that it can't be done, just that
             | it hasn't been done yet.
        
       | 0x74696d wrote:
       | Author here! The motivating example of this post is frankly
       | pretty lousy in retrospect (and was even so soon after writing,
       | given the friendly reminder from Giovanni Campagna that `socket`
       | wasn't one of the io_uring opcodes). At best this is an
       | interesting limitation of seccomp. Maybe relevant if you were
       | using gVisor?
        
       | theamk wrote:
       | I was thinking about how one would change io_uring design to be
       | compatible with seccomp, and came up with a very simple one:
       | 
       | A new io_uring fd comes with all operations disabled by default.
       | The user has to call "io_uring_register(fd, ENABLE_OP, op)" before
       | an operation is used for the first time. Then a seccomp filter can
       | easily filter ENABLE_OP calls to prohibit certain operations.
       | 
       | It could even be added now in a backward-compatible way: add a new
       | feature flag to io_uring_setup that enables it. Then one could set
       | a seccomp filter to only accept setup requests with this feature
       | set, and deny all others. Together, this should allow cooperating
       | programs to pass the seccomp filter, while programs that won't
       | register ops could not use io_uring at all.
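
A toy model of theamk's proposal, assuming the semantics described in the comment: every operation starts disabled, enabling one goes through a register call, and a seccomp-like filter can veto that call by looking at its argument. `ENABLE_OP` is the commenter's hypothetical register opcode, not an existing one, and the `Ring` class is purely illustrative.

```python
class Ring:
    """Toy io_uring instance where every operation starts disabled."""

    def __init__(self, register_filter):
        # register_filter stands in for a seccomp filter on io_uring_register.
        self._register_filter = register_filter
        self._enabled = set()

    def register_enable_op(self, op):
        # Models the proposed io_uring_register(fd, ENABLE_OP, op) call.
        if not self._register_filter(op):
            raise PermissionError(f"filter denied enabling {op}")
        self._enabled.add(op)

    def submit(self, op):
        if op not in self._enabled:
            raise PermissionError(f"{op} was never enabled on this ring")
        return f"{op} submitted"

ring = Ring(register_filter=lambda op: op in {"read", "write"})
ring.register_enable_op("read")
print(ring.submit("read"))  # read submitted

try:
    ring.register_enable_op("connect")
except PermissionError as exc:
    print(exc)  # filter denied enabling connect

try:
    ring.submit("write")
except PermissionError as exc:
    print(exc)  # write was never enabled on this ring
```

Linux has a related but differently shaped mechanism in IORING_REGISTER_RESTRICTIONS, where a ring is created disabled and an allowlist of operations is registered before it is enabled; the sketch above models the comment's proposal, not that interface.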
        
         | eqvinox wrote:
         | I agree and think your approach would work, but I need to point
         | out that seccomp BPF filters can also match on syscall
         | _arguments_. For example, you can allow fcntl(F_DUPFD, ...) but
         | deny fcntl(F_SETLEASE, ...). For some syscalls (fcntl, ioctl,
         | setsockopt, ...), this is rather important.
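
To illustrate the argument-matching point, here is a small default-deny model in which a rule names a syscall, an argument position, and the value that argument must have. It is a plain-Python sketch rather than a real BPF filter; the `RULES` format and `allowed` helper are assumptions, while the two constants are the Linux values of the fcntl commands mentioned in the comment.

```python
F_DUPFD = 0        # Linux value of fcntl's F_DUPFD command
F_SETLEASE = 1024  # Linux value of fcntl's F_SETLEASE command

# Each rule: (syscall name, argument index, required value). Default is deny.
# For fcntl(fd, cmd, ...), the command is argument 1.
RULES = [("fcntl", 1, F_DUPFD)]

def allowed(syscall, args):
    """Allow the call only if some rule matches its name and argument value."""
    return any(
        name == syscall and idx < len(args) and args[idx] == value
        for name, idx, value in RULES
    )

print(allowed("fcntl", (3, F_DUPFD, 10)))    # True
print(allowed("fcntl", (3, F_SETLEASE, 0)))  # False
```

With io_uring the corresponding values sit in submission queue entries in memory shared with the kernel rather than in syscall arguments, so a filter of this kind on io_uring_enter would not see them.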
        
       ___________________________________________________________________
       (page generated 2024-10-12 23:02 UTC)