[HN Gopher] Linux kernel heap buffer overflow in fs_context.c si...
___________________________________________________________________
Linux kernel heap buffer overflow in fs_context.c since version 5.1
Author : todsacerdoti
Score : 180 points
Date : 2022-01-20 15:35 UTC (7 hours ago)
(HTM) web link (seclists.org)
(TXT) w3m dump (seclists.org)
| carlhjerpe wrote:
| Somehow I first thought I was affected personally, but I'm on
| 5.15. Version numbers crossing 10 messes my head up more often
| than I'd be willing to admit in person.
| Eduard wrote:
| So you are affected...?
| carlhjerpe wrote:
| I need to stop posting HN on the subway, yes I'm affected.
| This might be a bad time to be on NixOS unstable. But it
| seems Hydra has been green for 3 hours, so I should get a
| patch soon.
| stormbrew wrote:
| > An unprivileged user can use unshare(CLONE_NEWNS|CLONE_NEWUSER)
| to enter a namespace with the CAP_SYS_ADMIN permission, and then
| proceed with exploitation to root the system.
|
| I'm confused by this, don't you need CAP_SYS_ADMIN to
| unshare(CLONE_NEWNS) to begin with?
|
| From unshare(2):
|
| > CLONE_NEWNS
|
| > This flag has the same effect as the clone(2) CLONE_NEWNS flag.
| Unshare the mount namespace, so that the calling process has a
| private copy of its namespace which is not shared with any other
| process. Specifying this flag automatically implies CLONE_FS as
| well. Use of CLONE_NEWNS requires the CAP_SYS_ADMIN capability.
| For further information, see mount_namespaces(7).
|
| Edit: Oh does this work specifically _because_ you're also
| unsharing into a new user namespace where you have that
| capability? This is kind of wild tbh.
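The mechanism described above can be sketched in a few lines of C. This is only an illustration, assuming a Linux host; the helper name `try_unshare_userns_mountns` is hypothetical, and it uses the raw `unshare` system call so that the flag constants can come from `<linux/sched.h>`. Per user_namespaces(7), when `CLONE_NEWUSER` is combined with other `CLONE_NEW*` flags in one call, the user namespace is created first, which is why the mount namespace request can succeed without privileges in the parent namespace.

```c
#include <linux/sched.h>   /* CLONE_NEWUSER, CLONE_NEWNS */
#include <sys/syscall.h>   /* SYS_unshare */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child and let it try to enter a new user
 * namespace and a new mount namespace in a single unshare call.
 * Returns 1 if the child succeeded, 0 if the kernel refused (for example
 * because unprivileged user namespaces are disabled), -1 on fork/wait errors.
 */
int try_unshare_userns_mountns(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: CLONE_NEWNS alone would need CAP_SYS_ADMIN; together with
         * CLONE_NEWUSER the caller gains privileges over the new namespaces. */
        long rc = syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWNS);
        _exit(rc == 0 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

Running the attempt in a child keeps the calling process in its original namespaces, so the helper can be called from an ordinary program just to probe whether the feature is available.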
| alerighi wrote:
| Some distributions disable user namespaces by default because
| they are considered a dangerous feature. And it probably is, in
| the end.
| staticassertion wrote:
| Yeah, a lot of the Linux kernel code was reachable by root, and
| for a long time the attitude of a lot of kernel maintainers was
| that privesc from root didn't matter much.
|
| But now any code can be root in its own namespace... so all of
| this code that's far less scrutinized is now reachable.
| boring_twenties wrote:
| Exactly the main argument that was being made against
| enabling this by default all those years ago.
| jakeinspace wrote:
| Sorry to be somewhat off-topic, but I have a Linux kernel bug
| question. I found a very small kernel bug (no obvious security
| implication, only affecting 32-bit builds) at work a few weeks
| back while working on a custom kernel patch. I sent an email to
| the maintainers for that kernel subsystem, but didn't hear back.
| I'm not quite sure if I should keep pestering them until I get a
| response, or if I should be doing something else to get it
| addressed. Any suggestions from someone with experience?
| onphonenow wrote:
| If you are filing a bug you will be ignored (or could be
| ignored) forever.
|
| If you send in a patch - you are MUCH MUCH more likely to get a
| response. As it should be.
|
| What is obvious to you as a patch may not be obvious to others,
| so if you can write and test your patch that would go a long
| way towards getting things to move forward. Bug reports are
| noise to many maintainers (they know there are lots, their
| focus is on code that fixes bugs).
| tych0 wrote:
| Standard advice is to wait two full weeks, then bump your
| thread (or rebase the patch and send a v2 if there's new
| conflicts with the maintainer's tree).
| charcircuit wrote:
| It sounded like he just filed a bug.
| jakeinspace wrote:
| It's a 1-line patch, I didn't send it in my initial email but
| maybe that was a faux pas.
| tych0 wrote:
| I'd say sending the patch with git send-email
| --in-reply-to=<the header of your last email> is good. A patch
| is much easier to apply than write :)
| jakeinspace wrote:
| Thanks! Would you recommend I send to the maintainer(s),
| and Cc the mailing list?
| tych0 wrote:
| Yeah, I'd just use whatever the output of
| ./scripts/get_maintainer.pl says. It will also suggest
| any recent committers to that area of the code, which
| I've found useful in the past. Usually I put the
| maintainers as To:, and everyone else as Cc:.
| sweettea wrote:
| There are both maintainers and lists listed in MAINTAINERS (L:
| entries) -- did you Cc the mailing list? It might be good to
| bump the mailing list email if it's been several weeks, asking
| if there's more information you could provide.
| jakeinspace wrote:
| This is for timekeeping, which doesn't look to have its own
| mailing list (just points me to the vger linux-kernel mail
| list). I didn't Cc that mailing list, although I could.
| bonzini wrote:
| What subsystem is it?
| jakeinspace wrote:
| Timekeeping
| kidd0 wrote:
| Does it affect AWS EC2?
| Diggsey wrote:
| I assume that `size + len + 2` can't _over_flow :)
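For readers who have not seen the advisory, here is a small user-space sketch of the two forms of the length check. It assumes the vulnerable check had the subtraction shape `len > PAGE_SIZE - 2 - size` and the fix the addition shape quoted in this thread; the function names and `DEMO_PAGE_SIZE` are made up for illustration and this is not the kernel code itself.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE ((size_t)4096)  /* assumed page size for the demo */

/* Subtraction form: once size > DEMO_PAGE_SIZE - 2 the right-hand side
 * wraps to a huge unsigned value, so the check stops rejecting anything. */
bool rejects_with_subtraction(size_t size, size_t len)
{
    return len > DEMO_PAGE_SIZE - 2 - size;
}

/* Addition form: cannot wrap as long as size and len are far below
 * SIZE_MAX, which holds when both are bounded by roughly a page. */
bool rejects_with_addition(size_t size, size_t len)
{
    return size + len + 2 > DEMO_PAGE_SIZE;
}
```

With `size = 4095` and `len = 100`, the subtraction form lets the request through while the addition form rejects it; for small sizes the two agree.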
| mjw1007 wrote:
| The Debian 11 release notes say:
|
| << From Linux 5.10, all users are allowed to create user
| namespaces by default. This will allow programs such as web
| browsers and container managers to create more restricted
| sandboxes for untrusted or less-trusted code, without the need to
| run as root or to use a setuid-root helper.
|
| The previous Debian default was to restrict this feature to
| processes running as root, because it exposed more security
| issues in the kernel. However, as the implementation of this
| feature has matured, we are now confident that the risk of
| enabling it is outweighed by the security benefits it provides.
|
| If you prefer to keep this feature restricted, set the sysctl:
| user.max_user_namespaces = 0
|
| Note that various desktop and container features will not work
| with this restriction in place, including web browsers,
| WebKitGTK, Flatpak and GNOME thumbnailing. >>
|
| Does anyone know a reason to keep this feature enabled on a
| server, other than Docker's rootless mode?
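To see what a given machine is currently set to, the sysctl quoted above can be read through procfs. A minimal sketch, assuming the usual mapping of `user.max_user_namespaces` to `/proc/sys/user/max_user_namespaces`; the function name is hypothetical.

```c
#include <stdio.h>

/* Returns the current limit, or -1 if the file is missing or unreadable
 * (for example on a non-Linux system or inside a restricted sandbox). */
long read_max_user_namespaces(void)
{
    FILE *f = fopen("/proc/sys/user/max_user_namespaces", "r");
    if (f == NULL)
        return -1;

    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A result of 0 would mean the restriction from the release notes is in place; any positive value means user namespaces can still be created up to that limit.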
| prpl wrote:
| If you have a multi tenant server and don't want to provide
| root access to users but want them to be able to run
| containers, otherwise it's probably not necessary
| dathinab wrote:
| Some programs use it to sandbox themselves without needing root.
| Though currently I can only think of desktop apps that do
| so.
| faisal_ksa wrote:
| I wonder if Rust (or any other memory-safe systems language in the
| future) could have avoided this exploit. If not, what could we do
| to avoid such exploits?
| gpm wrote:
| One method of forbidding the entire category of bugs is "bounds
| checks on integer arithmetic". Rust implements this in debug
| mode, but not by default in release mode, because it comes at a
| performance cost. To make this sort of solution ubiquitous you
| really want better hardware support to make bounds checking
| cheap.
|
| Realistically I think it is unlikely you would have written the
| same exploit in rust even with integer overflow wrapping by
| default, because in idiomatic rust you end up using types with
| lengths attached to them, and memcpy methods that check that
| you didn't fuck up the lengths before copying. You absolutely
| could end up writing it in rust though (using unsafe code, but
| at some level unsafe code is inevitable for this sort of work),
| and you could if you really wanted to implement a similar set
| of safer buffer types in C that would provide a similar degree
| of prevention (though it would be more cumbersome to use than
| in rust).
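As a rough idea of what such a "safer buffer type in C" might look like, here is a minimal sketch. The `bounded_buf` struct and its append function are hypothetical names invented for this example, not an existing kernel or libc API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying buffer: the capacity travels with the pointer. */
struct bounded_buf {
    char  *data;
    size_t cap;   /* total bytes available in data */
    size_t used;  /* bytes already written, always <= cap */
};

/* Append n bytes, refusing (and changing nothing) if they do not fit.
 * The check subtracts used from cap, which cannot wrap because used <= cap. */
bool bounded_buf_append(struct bounded_buf *b, const void *src, size_t n)
{
    if (n > b->cap - b->used)
        return false;
    memcpy(b->data + b->used, src, n);
    b->used += n;
    return true;
}
```

The point of the design is that callers never do the length arithmetic themselves; the one place that does it keeps the invariant `used <= cap`, so the subtraction is safe by construction.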
| menaerus wrote:
| This has nothing to do with memory; it comes down to how CPUs
| work with integers. This means that a (low-level)
| programming language fundamentally cannot solve this problem
| but only alleviate it, either by:
|
| 1. Changing the semantics of integer arithmetic (e.g. saturate
| on overflow)
|
| 2. Keeping the semantics but babysitting the computation at
| runtime so that the overflow/underflow can never happen
| (expensive)
| duped wrote:
| Modern CPUs will alert you to overflow and underflow. Rust
| actually panics on overflow or underflow conditions in debug
| builds by default.
|
| It is not expensive to check for underflow at runtime in
| security-critical code, and is actually mandatory for cases
| like this as it is UB in C.
| menaerus wrote:
| Sorry, but you're wrong in both of your claims.
|
| First, unsigned integer underflow and overflow are _not_ UB.
| It is a very well-defined operation (wrap-around arithmetic),
| and the bug in question is not the result of undefined
| behavior, and Rust or whatever other bs I keep hearing
| around would not have solved it. It's a fundamental
| artifact of how CPUs work.
|
| Secondly, CPUs have been "alerting" through their carry and
| overflow bits in registers since forever so this isn't some
| exclusive feature that only rust compiler writers were
| smart enough to take advantage of. The same code can be and
| is written where it matters in C and C++ code too.
|
| It's not only a question of whether such extra checks are
| expensive (which they are, given that integer arithmetic is
| such a fundamental operation, and your favorite language
| disables them in release builds for the sake of, I guess,
| nothing?) but also a question of the well-known
| _semantics_ of unsigned integer arithmetic. That's simply
| the way they work and I see no near future where CPU
| hardware engineers would change that (they will not).
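The wrap-around semantics under discussion are easy to see in plain C. A small sketch with hypothetical function names: unsigned arithmetic is defined to wrap modulo 2^N, and a wrapped sum can be detected after the fact without reading any CPU flag.

```c
#include <stdbool.h>

/* Unsigned subtraction wraps modulo UINT_MAX + 1; this is defined behavior. */
unsigned int wrapping_sub(unsigned int a, unsigned int b)
{
    return a - b;
}

/* Portable carry check: an unsigned sum wrapped if and only if the result
 * is smaller than one of the operands. */
bool add_carries(unsigned int a, unsigned int b)
{
    return a + b < a;
}
```

So `wrapping_sub(0, 1)` yields the largest `unsigned int` rather than an error, which is exactly the kind of value that turns a length check into a no-op.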
| im3w1l wrote:
| You could imagine a version of the arithmetic
| instructions that traps on overflow. Or maybe a prefix
| for the normal instruction. Then it can be basically free
| in the happy path.
| mustache_kimono wrote:
| I'm not an expert, but I will say it may be easier to avoid an
| over/underflow with:
| https://doc.rust-lang.org/std/primitive.u32.html#method.satu...
|
| And to check if one has occurred with:
| https://doc.rust-lang.org/std/primitive.u32.html#method.chec...
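Rust's `u32` has `saturating_sub` and `checked_sub` methods (whether the truncated links above point at exactly these two is an assumption). Hand-written C equivalents might look like the sketch below; the function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Like Rust's u32::saturating_sub: clamp at zero instead of wrapping. */
uint32_t saturating_sub_u32(uint32_t a, uint32_t b)
{
    return a > b ? a - b : 0;
}

/* Like Rust's u32::checked_sub: report failure instead of wrapping.
 * On failure *out is left untouched. */
bool checked_sub_u32(uint32_t a, uint32_t b, uint32_t *out)
{
    if (b > a)
        return false;
    *out = a - b;
    return true;
}
```

Either form would turn "remaining space" into 0 or an explicit error when the buffer is already over-full, instead of a huge positive number.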
| snvzz wrote:
| There's likely many more of these.
|
| As a reminder, Linux has millions of lines of code, and all of
| them run with supervisor privileges.
|
| This is not a good architecture. Generally, you'd try to minimize
| the attack surface.
|
| Multiserver, microkernel systems based on capabilities is where
| it's at.
|
| seL4 is the better microkernel to build such a system on.
| athrowaway3z wrote:
| There exist people who own their own hardware and are not
| providing an API to run arbitrary code.
|
| If they get together and build an OS they are generally more
| interested in throughput than security models.
|
| Both have pros and cons but the fact is: one is more popular
| with the "just get something working" crowd for better and
| worse.
| a-dub wrote:
| i predict that tanenbaum will ultimately win the famous
| monolithic vs. microkernel design debate.
|
| monolithic kernels are good for building features quickly and
| runtime performance, but the security design reminds me of 90s
| era computer security approaches, where firewalls were supposed
| to stop all threats and behind them security was lax on
| internal networks. microkernels are much more similar to
| today's more effective defense in depth approaches.
|
| what does it matter if your kernel is fast and featureful if
| you cannot trust it?
| snvzz wrote:
| >what does it matter if your kernel is fast and featureful if
| you cannot trust it?
|
| And if the kernel is Linux, I'm not so sure about fast.
|
| Relative to Linux, seL4 has:
|
| - Order of magnitude faster context switch.
|
| - Order of magnitude lower scheduler latency.
|
| - Order of magnitude faster Inter-Process Communication.
| VWWHFSfQ wrote:
| It's my understanding that the microkernel architecture is
| slow. Nearly unusably slow. And that's why nobody uses it. Am I
| off-base? I'm interested!
| dijit wrote:
| Yes, it's going to be slower.
|
| All of those security checks between components and memory
| passing will cause it to be slower.
|
| But that doesn't mean it isn't a worthwhile trade-off.
|
| People write software in python despite it being slower than
| C++.
| hutrdvnj wrote:
| Except that anything that is actually performance critical
| is written as a C extension module for Python.
|
| I think that low level filesystem operations are very
| performance critical.
| jeffbee wrote:
| I doubt that mounting a filesystem is performance-
| critical. You could afford to fork an unprivileged user-
| space process written in Perl to parse these mount
| options and that would be fast enough for everyone.
| YarickR2 wrote:
| Tell that to dockerd mounting images layer by layer,
| with k8s doing all kinds of emptyDir/PVC mounts on top.
| Pod start-up speeds are abysmal now; they would be
| positively glacial with userspace permission validation.
| jeffbee wrote:
| That's exactly my point. People who are doing lots of
| mounts are already demonstrably not performance-sensitive
| to a difference of a few milliseconds. They already
| waited 20 minutes for the stupid cluster autoscaler to
| provision a machine for them. They DGAF.
| snvzz wrote:
| There's more myth than truth[0]. In the early days, they were
| slow. Mach, used in OSX, is a representative of those early
| days.
|
| Liedtke's L4 proved that a performant microkernel is
| possible.
|
| Later, SMP changed the scenario considerably, as all of a
| sudden the microkernel multiserver fits SMP like a glove,
| while monolithic kernels need the complexity of locks to
| handle it.
|
| [0] https://news.ycombinator.com/item?id=10824382
| jeffbee wrote:
| As more and more things move out of the kernel, the perceived
| performance problems of microkernels look less important. If
| you are doing your network protocols in user space, and your
| thread scheduling is in user space, and you're not using a
| traditional filesystem much, then suddenly nobody cares how
| fast the kernel is.
| athrowaway3z wrote:
| I'm really not understanding what you're saying here.
|
| What the hell is a microkernel here? What kind of security
| are you talking about?
|
| User space filesystem and network implementations still
| need access to the hardware. Multiplexing that access is a
| kernels job. The more you want to separate and hide that
| between clients the higher the cost.
|
| As far as I understand your argument you are saying "If an
| application has a dedicated hard drive there will be little
| overhead"
| jeffbee wrote:
| People think a monolithic kernel can be faster because of
| high-level abstractions that make many calls within the
| kernel, for example if I write to a TCP socket and
| everything else is handled for me, in the kernel, by
| function calls only. People believe this is faster than
| having an isolated network stack that has to pass
| messages to an isolated network driver and all that.
|
| But increasingly people realize that the performance of
| writing to a TCP socket in a monolithic kernel also pretty
| much sucks and you get a much better result as you move more
| and more of it into user space. For example you decide,
| correctly, that TCP is obsolete and you switch to QUIC.
| Now the existence of the Linux TCP stack is of no value
| to you. You furthermore discover that Linux firewall,
| traffic control, and routing also kinda sucks, so you
| start using raw networking. Then you discover that trying
| to get your frames processed on the right core at the
| right moment isn't great in Linux, so you just take over
| the whole net device with DPDK.
|
| Now, _nothing_ in the whole Linux network stack is of any
| use to you at all.
|
| The same thing can happen with storage. Maybe you started
| with files on XFS but then eventually you were using
| disaggregated storage where a service takes over the
| whole device with SPDK, and all the storage users are
| talking to the service instead of the kernel.
| stormbrew wrote:
| Yeah this, really. For the most part even monolithic
| kernels have kind of reversed the trend in the last decade or
| so and there's a lot of push to move critical code out of
| the kernel. A lot of new kernel APIs are built to avoid
| context switching, and part of that usually involves moving
| to a more asynchronous kind of communication between
| process and kernel.
|
| Often these APIs even look an awful lot like late
| generation microkernel shared memory buffer protocols. DRI
| and io_uring in Linux, for example.
|
| A lot of the "microkernels are inherently slow" meme is
| built on how earlier port-based message passing kernels
| like Mach worked.
| nwmcsween wrote:
| This depends on the definition of the word microkernel. Under
| the classical definition, yes, it will be much slower due to
| IPC; something like an exokernel, though, will be much
| faster than even a monolithic kernel.
| whimsicalism wrote:
| The compiler can't warn about something like this? I guess
| unsigned integer underflow can be the intended behavior often.
| mustache_kimono wrote:
| It could, for other reasons, never underflow. C expects that
| you know what you're doing, and C expected you did a bounds
| check. But I agree. Cases like this should have a lint warn on
| them, saying -- "Wake up programmer!"
| menaerus wrote:
| Fundamentally this problem cannot be solved at compile time
| because, well, the code is dealing with values that are only
| known at runtime. So I don't think the compiler can do much
| here other than hint that you may rewrite your expression,
| and that only reduces the risk of a potential error: e.g.
| `if (len + 2 + size > PAGE_SIZE)` still remains susceptible to
| unsigned integer overflow, and to handle the problem fully one
| must either:
|
| 1. Write a lot of convoluted if..else logic such as
| https://wiki.sei.cmu.edu/confluence/display/c/INT30-C.+Ensur...
| and
| https://wiki.sei.cmu.edu/confluence/display/c/INT32-C.+Ensur...
|
| 2. Or use compiler built-in intrinsics, e.g.
| https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...
|
| But almost nobody does that ... except probably where it
| really matters (not the OS kernel).
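For the second option, a minimal sketch using the GCC/Clang `__builtin_add_overflow` builtin might look like this. The function name `options_fit` and its parameters are hypothetical; only the builtin itself is a real compiler feature, and it returns true when the addition wrapped.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if size + len + 2 fits within limit; false if it exceeds the limit
 * or if any intermediate addition wraps around. */
bool options_fit(size_t size, size_t len, size_t limit)
{
    size_t total;
    if (__builtin_add_overflow(size, len, &total))
        return false;
    if (__builtin_add_overflow(total, (size_t)2, &total))
        return false;
    return total <= limit;
}
```

Compared with hand-written pre-checks, the builtin keeps the arithmetic in its natural order and typically compiles down to an add followed by a branch on the carry flag.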
| whimsicalism wrote:
| Not solved at compile-time, but warned at compile time?
| touisteur wrote:
| Could be solved at compile time with proof of absence of
| runtime errors, which somehow forces you to handle all
| cases for any input.
| menaerus wrote:
| Warned about what exactly? Literally any operation on two
| unsigned integers can either underflow or overflow and
| any of those would still be correct and expected
| behavior.
| mustache_kimono wrote:
| What you say makes sense. I was obviously wishcasting. ;)
| dahfizz wrote:
| It's not obvious what the warning would be, unless you want a
| warning attached to every single arithmetic operation? The
| compiler can't know what `size` will be in this case.
| whimsicalism wrote:
| Comparisons in an if statement involving subtracting two
| unsigned variables from each other?
| pxeger1 wrote:
| This is CVE-2022-0185 if you need to know it
| caaqil wrote:
| See it in action:
| https://twitter.com/ryaagard/status/1483592308352294917
| encryptluks2 wrote:
| Arch shows some warnings about unprivileged user namespaces but
| it is enabled by default I believe which allows for rootless
| podman/docker. I didn't realize we'd actually see an exploit so
| soon
| tremon wrote:
| What's the origin of the legacy_parse_param size parameter (from
| struct fs_context->fs_private->data_size)? Is this a mount
| option, a format-time fs configuration option, or does it require
| writing a specially-crafted inode to disk? The exploit says the
| user needs CAP_SYS_ADMIN, so I'm guessing it's the first one?
| shakna wrote:
| From the commit [0] that added it:
|
| > Legacy filesystems are supported by the provision of a set of
| legacy fs_context operations that build up a list of mount
| options and then invoke fs_type->mount() from within the
| fs_context ->get_tree() operation. This allows all filesystems
| to be accessed using fs_context.
|
| And then the description of the function itself:
|
| > Add a parameter to a legacy config. We build up a comma-
| separated list of options.
|
| It looks to be the first one (a mount option).
|
| [0]
| https://github.com/torvalds/linux/commit/3e1aeb00e6d132efc15...
| AshamedCaptain wrote:
| I am surprised that mounting is now allowed inside containers.
| Doesn't this expose a load of new attack surface for the
| kernel? All this pesky academic filesystem code does not
| inspire a lot of confidence when parsing user data/disk
| images....
| phendrenad2 wrote:
| Does anyone offer security fix backports for Linux? If I'm stuck
| on Linux 5.1, is my only recourse to update or patch it myself?
| rwmj wrote:
| It's a one line patch in old code so it should apply easily.
| However if you encounter this kind of problem a lot I'd highly
| advise some kind of long-term supported Linux distribution. (I
| work at Red Hat on RHEL, and that's what people pay us for)
| singron wrote:
| There are Linux stable branches that backport fixes. It looks
| like all the affected branches have the fix now: linux-5.16.y
| linux-5.15.y linux-5.10.y linux-5.4.y linux-rolling-lts
| linux-rolling-stable
|
| EDIT: notably absent are linux-5.1.y and other non-LTS
| releases. If you can't stay on the most recent stable release,
| you should use LTS releases.
| adfsdsaf wrote:
| If you're stuck on 5.1, you probably have a ton of other
| vulnerabilities too. 5.1 isn't even an LTS release, so support
| for it was dropped in 2019.
|
| 5.4 is the first LTS of 5.x, and is supported through 2025. You
| should try to find a way to get on an LTS kernel, or plan on
| managing a lot of kernel patches.
___________________________________________________________________
(page generated 2022-01-20 23:00 UTC)