[HN Gopher] Io_uring and seccomp (2022)
___________________________________________________________________
Io_uring and seccomp (2022)
Author : pncnmnp
Score : 74 points
Date : 2024-10-09 14:42 UTC (3 days ago)
(HTM) web link (blog.0x74696d.com)
(TXT) w3m dump (blog.0x74696d.com)
| leni536 wrote:
| > But if you've got a separation of duties where a sysadmin sets
| up seccomp filtering generically across applications
|
| Is this even possible, regardless of io_uring?
| amarshall wrote:
| Well the article brings up containers as an example. If the
| sysadmin controls "your" parent or root process (e.g. the login
| shell), they can just perform seccomp filtering there and it
| applies to everything within it (like any other sandbox).
| 0x74696d wrote:
| (author here) I'm one of the maintainers of HashiCorp's
| Nomad, so that example was likely inspired by the separation
| of duties that's part of our security model. In that
| environment, there's a subset of task (ex. container)
| configuration that's controlled by the cluster admin and a
| subset that's controlled by the job author deploying onto the
| cluster.
| klooney wrote:
| Yes, systemd will let you do that, as will
| docker/containerd/podman.
| deathanatos wrote:
| This seems like an instance of an anti-pattern I've seen, which
| is conflating "permission" with "API call".
|
| IIRC, AWS does this, where permission is by API call. As an
| example, you can have permission to call ssm:GetParameter _n_
| times, but if you try to combine those _n_ API calls into a batch
| with GetParameters, that's a different IAM perm, _even though
| exactly the same thing is occurring._
| thayne wrote:
| I find that so frustrating. Another example is uploading an
| image to ECR (elastic container registry). You need like four
| different permissions to do it, which I think correspond to
| individual http requests, but it is usually just a single
| docker/podman/skopeo command, and I can't think of a situation
| where you would want to grant permission to initiate an upload
| but not complete it.
|
| Multipart uploads in s3 have a similar problem.
| cpuguy83 wrote:
| Both Docker and containerd have blocked io_uring in the
| default profile for about a year now due to too many security
| issues with it.
| hinkley wrote:
| Has anyone speculated yet about how much slower a secure
| io_uring has to be? Is it still a net win once you lock it down
| fully?
| JackSlateur wrote:
| As far as I know, io_uring is quite secure: a user cannot
| perform a syscall through it unless it has the privileges
| required to perform this syscall directly
|
| I would gladly get more details about the exact purpose of
| seccomp in a container environment. Reading a bit of
| internet, I find that docker "uses seccomp to block mount(2),
| which could be used to escape the container", which makes no
| sense to me because mount(2) requires CAP_SYS_ADMIN
| cpuguy83 wrote:
| io_uring CVEs:
| https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring
|
| seccomp is used for defense in depth. If someone managed to
| escalate privileges through some means the seccomp policy
| will still prevent them from doing nasty things or
| escalating further.
| poincaredisk wrote:
| That's not contradictory. Capabilities in docker are also
| limited, but both are used as a part of defense in depth.
| cpuguy83 wrote:
| That would be impossible to know. The main thing with
| io_uring is it makes it so you don't need to context switch
| (ie make system calls) to perform a number of operations.
| bri3d wrote:
| Google has done the same in ChromeOS, Android, and purportedly
| its production servers, for around a year and a half. For
| this reason it's also disabled in several of the kernelCTF
| configurations, and in the ones where it remains (GKE), it only
| pays out at half-rate in bug bounty.
| eqvinox wrote:
| Using seccomp with a default-open filter is a terrible idea to
| begin with; it wasn't really designed for any of this. Seccomp in
| its most basic form didn't even have a filter list, it just
| allowed read() and write(). (And close() or something, don't
| quote me on the details, the point is it was a fixed list.)
| You're supposed to use it with a default-closed filter and fully
| enumerate what you need. (Yes, that's hard in a lot of cases, but
| still.)
|
| There have been other cases where syscalls got cloned, mostly to
| add new parameters, but either way seccomp with an "open" filter
| can only ever be defense-in-depth, not a critical line in itself.
|
| (Don't misunderstand, defense-in-depth is good, and keep using
| seccomp for it. But an open seccomp filter MUST be considered
| bypassable.)
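The difference between a default-open and a default-closed filter can be sketched with a toy model. This is not real seccomp (real filters are BPF programs evaluated by the kernel); syscalls are plain strings here, and the two policies are hypothetical. `openat2` stands in for the "cloned syscall" case: it is a newer variant of `openat` that a denylist written earlier would not mention.

```python
# Toy model only: real seccomp filters are BPF programs run by the kernel.
# Syscall names are plain strings and both policies are made up.

def make_denylist_filter(denied):
    """Default-open: every syscall is allowed unless explicitly listed."""
    denied = frozenset(denied)
    return lambda syscall: syscall not in denied


def make_allowlist_filter(allowed):
    """Default-closed: every syscall is denied unless explicitly listed."""
    allowed = frozenset(allowed)
    return lambda syscall: syscall in allowed


# A policy that tries to stop file opens by naming the syscalls it knows.
default_open = make_denylist_filter({"open", "openat"})

# A policy that enumerates only what the program is expected to need.
default_closed = make_allowlist_filter({"read", "write", "close", "exit_group"})

for name in ("open", "openat2", "io_uring_setup"):
    print(f"{name:15} open-filter={default_open(name)} "
          f"closed-filter={default_closed(name)}")
```

In this sketch the default-open policy lets `openat2` and `io_uring_setup` through simply because nobody listed them, while the default-closed policy rejects both without needing to know they exist.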
| poincaredisk wrote:
| > it just allowed read() and write().
|
| A fun consequence of this is that even though there was a
| function to check if seccomp is enabled or not, it could only
| ever do one of two things: return "not enabled" or crash the
| process.
|
| I agree with everything you wrote. I'll add that maintaining a
| whitelist is not easy either: I've witnessed many situations where
| seccomp sandbox broke because glibc/python interpreter started
| using a different syscall (for example openat with AT_FDCWD
| instead of open)
| eqvinox wrote:
| > I've witnessed many situations where seccomp sandbox broke
| because glibc/python interpreter started using a different
| syscall (for example openat with AT_FDCWD instead of open)
|
| ACK, that's what I meant with "hard in a lot of cases"... to
| be honest I think this is a failure of the ecosystem at
| large. It's a bit of a half feature without some kind of
| higher-level userspace mechanism to collect who needs what,
| especially when a bunch of libraries are involved. It's
| admittedly a very hard problem, e.g. just because something
| is linking libcurl as a 2nd or 3rd level dependency doesn't
| mean you intend your process to ever make network
| connections... I don't think it's unsolvable though.
| FridgeSeal wrote:
| Surely this is a seccomp shortcoming, or kernel auth shortcoming,
| rather than an io_uring problem?
|
| That is, seccomp is (apparently? I've never used it myself)
| capable of intercepting direct calls. Obviously, that design
| isn't going to be able to handle "indirect" calls in its default
| implementation.
|
| Either seccomp needs a way to act on the buffer or intercept
| io_uring calls, or there's a need for a new auth mechanism that's
| capable of handling io_uring-style APIs.
|
| Torpedoing the whole API (a la GCP) feels like throwing the baby
| out with the bath water.
| tptacek wrote:
| That framing doesn't make sense. System calls and their
| arguments are an obvious security boundary and have been a
| sandboxing component for decades. io_uring blows that boundary
| apart. The "problem" is io_uring, not seccomp.
|
| If you want to make a case for io_uring being benign for
| security, the right argument is probably against all unmediated
| shared-kernel multitenancy (ie: multitenancy either through
| virtualization, or WASM/V8-type language runtimes, and nothing
| else). It doesn't make sense to say system call filters are
| flawed because someone came up with an omni-syscall that breaks
| those filters.
| asveikau wrote:
| The syscall implementations themselves do checks and return
| EPERM/EACCES when appropriate. The mechanism for doing the
| syscall can change. I mean, in the 90s it happened via int
| 0x80, then we got sysenter, then the vdso. io_uring just
| moved part of it to user mode.
|
| It seems like a totally reasonable design to me to "just" put
| the right hooks into the filter mechanism and make it get
| called the same way regardless of the syscall mechanism.
| thayne wrote:
| The obvious solution is to block operations over io_uring if
| the equivalent syscall would have been blocked by seccomp.
| But I'm not sure if there is some reason that wouldn't work.
|
| Another possibility would be to allow setting restrictions on
| all io_uring operations for the current and all child
| processes, although that would be less convenient than using
| the existing seccomp system.
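The first idea above (refuse a ring operation whenever the equivalent syscall would be refused) can be sketched as a lookup from opcode to syscall. This is a toy model, not how the kernel works today: the opcode names are real io_uring opcodes, but the mapping table, `make_policy`, and the fail-closed choice for unmapped opcodes are all assumptions made for illustration.

```python
# Toy model: consult the same syscall policy for ring operations.
# The table is a small illustrative subset, not a complete mapping.
OP_TO_SYSCALL = {
    "IORING_OP_READ": "read",
    "IORING_OP_WRITE": "write",
    "IORING_OP_OPENAT": "openat",
    "IORING_OP_CONNECT": "connect",
}


def make_policy(allowed_syscalls):
    """Build a syscall check and a ring-operation check from one allowlist."""
    allowed = frozenset(allowed_syscalls)

    def syscall_allowed(name):
        return name in allowed

    def ring_op_allowed(op):
        syscall = OP_TO_SYSCALL.get(op)
        # Assumption: an opcode with no known equivalent is refused.
        return syscall is not None and syscall_allowed(syscall)

    return syscall_allowed, ring_op_allowed


# Hypothetical policy: file I/O is fine, outbound connections are not.
syscall_allowed, ring_op_allowed = make_policy({"read", "write", "openat"})

for op in OP_TO_SYSCALL:
    print(op, ring_op_allowed(op))
```

One open question the sketch glosses over is operations that have no single syscall equivalent; here they are simply rejected.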
| tptacek wrote:
| I assume it's not so much that it can't be done, just that
| it hasn't been done yet.
| 0x74696d wrote:
| Author here! The motivating example of this post is frankly
| pretty lousy in retrospect (and was even so soon after writing,
| given the friendly reminder from Giovanni Campagna that `socket`
| wasn't one of the io_uring opcodes). At best this is an
| interesting limitation of seccomp. Maybe relevant if you were
| using gVisor?
| theamk wrote:
| I was thinking about how one would change io_uring design to be
| compatible with seccomp, and came up with a very simple one:
|
| A new io_uring fd comes with all operations disabled by default.
| User has to call "io_uring_register(fd, ENABLE_OP, op)" before
| operation is used for the first time. Then seccomp filter can
| easily filter enable_op calls to prohibit certain operations.
|
| It could even be added now in a backward-compatible way: add a new
| feature to io_uring_setup that enables it. Then one could set
| seccomp filter to only accept setup requests with this feature
| set, and deny all others. Together, this should allow cooperating
| programs to pass seccomp filter, while programs that won't
| register ops could not use io_uring at all.
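A toy model of this proposal, to make the control flow concrete. Nothing here is a real kernel interface: `ToyRing`, `setup_ring`, `enable_op`, the `opt_in` flag, and `ALLOWED_OPS` are hypothetical names, and `enable_op` stands in for the `io_uring_register(fd, ENABLE_OP, op)` call suggested above. The point is that enabling an operation is an ordinary syscall a filter can see, while submissions never reach the filter and are checked by the ring itself.

```python
# Toy model of "all operations disabled until registered".
# Every name below is hypothetical and exists only for illustration.
ALLOWED_OPS = frozenset({"READ", "WRITE"})  # what the admin's filter permits


class Denied(Exception):
    """Raised where the filter or the ring would refuse a request."""


class ToyRing:
    def __init__(self):
        self.enabled = set()  # a new ring starts with nothing enabled

    def enable_op(self, op):
        # Visible to the filter: this is a regular syscall with an argument.
        if op not in ALLOWED_OPS:
            raise Denied(f"filter rejects enabling {op}")
        self.enabled.add(op)

    def submit(self, op):
        # Not visible to the filter: the ring enforces its own enabled set.
        if op not in self.enabled:
            raise Denied(f"{op} was never enabled on this ring")
        return f"queued {op}"


def setup_ring(opt_in):
    # The filter only lets through setup calls that request the new mode.
    if not opt_in:
        raise Denied("filter rejects setup without the opt-in feature")
    return ToyRing()


ring = setup_ring(opt_in=True)
ring.enable_op("READ")
print(ring.submit("READ"))
```

A cooperating program enables what it needs up front; a program that skips the opt-in never gets a ring at all.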
| eqvinox wrote:
| I agree and think your approach would work, but I need to point
| out that seccomp BPF filters can also match on syscall
| _arguments_. For example, you can allow fcntl(F_DUPFD, ...) but
| deny fcntl(F_SETLEASE, ...). For some syscalls (fcntl, ioctl,
| setsockopt, ...), this is rather important.
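A toy illustration of matching on arguments, using the fcntl example above. Real seccomp BPF programs compare the raw argument values of the call and cannot follow pointers; this sketch models a call as a name plus a tuple of integers. The `Call` type and `fcntl_filter` are hypothetical, and the two constants are the Linux values, defined locally so the sketch runs anywhere.

```python
from typing import NamedTuple, Tuple

# Linux values for the two fcntl commands, defined locally for portability.
F_DUPFD = 0
F_SETLEASE = 1024


class Call(NamedTuple):
    """Hypothetical stand-in for what a filter sees: a name and raw arguments."""
    name: str
    args: Tuple[int, ...]


def fcntl_filter(call: Call) -> str:
    """Return "ALLOW" or "ERRNO" for a call, deciding on the fcntl command."""
    if call.name != "fcntl":
        return "ERRNO"  # this sketch is default-closed for everything else
    cmd = call.args[1]  # fcntl(fd, cmd, ...): cmd is the second argument
    return "ALLOW" if cmd == F_DUPFD else "ERRNO"


print(fcntl_filter(Call("fcntl", (3, F_DUPFD, 0))))     # ALLOW
print(fcntl_filter(Call("fcntl", (3, F_SETLEASE, 0))))  # ERRNO
```

The same limitation explains why io_uring is awkward for such filters: submission entries sit in memory shared with the kernel, so their contents never appear as syscall arguments a filter could inspect.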
___________________________________________________________________
(page generated 2024-10-12 23:02 UTC)