[HN Gopher] A Linux 5.10 patch has caused user-space regressions
       ___________________________________________________________________
        
       A Linux 5.10 patch has caused user-space regressions
        
       Author : chmaynard
       Score  : 160 points
       Date   : 2022-05-26 16:48 UTC (6 hours ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | lamontcg wrote:
       | Don't like the lecturing tone of this article.
       | 
       | This seems entirely successful to me and retired crappy and
       | insecure behavior.
       | 
       | The idea that you can never break anything means that some
       | historical mistakes can never get fixed.
       | 
       | Then someone complains about how the language / operating system
       | is a bolted on mess of nonsense and people go chasing after the
       | next hot thing which has the benefit of just being newer and
       | being able to get away with breaking changes because people on
       | the bleeding edge put up with it.
       | 
       | The linux kernel is probably striking the right balance here, and
       | I doubt that this one breakage is a sign of the decay and
       | downfall of the Empire.
        
       | theteapot wrote:
       | Shouldn't they have left it for 6.0?
        
         | mhitza wrote:
         | Kernel doesn't adhere to a SemVer policy
        
           | theteapot wrote:
           | Maybe they should. Maybe, they, should.
        
             | SAI_Peregrinus wrote:
             | Naw. Kernel version number increases. Can't be more than
             | 255 in any field, except the EXTRAVERSION which is a
             | string. Kernel version numbering is exposed in the
             | userspace ABI, so changing it to have meaning would be a
             | breaking change for no substantial benefit. This[1] LWN
             | article is a good overview of the history. Effectively the
             | major gets incremented when the minor gets to 20 and Linus
             | runs out of fingers & toes to keep track.
             | 
             | [1] https://lwn.net/Articles/871989/
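
The 255 limit mentioned above comes from how the version is packed into a single integer, one byte per field for minor and patch level. A minimal sketch of that packing, modelled on the kernel's KERNEL_VERSION(a, b, c) macro; the Python function names are made up for illustration, and the clamping variant reflects my understanding of what newer kernel headers do rather than a quote from them:

```python
def kernel_version_naive(major, minor, patch):
    """Pack a version as (major << 16) + (minor << 8) + patch."""
    return (major << 16) + (minor << 8) + patch


def kernel_version_clamped(major, minor, patch):
    """Same packing, but with the patch level capped at 255 (assumed behavior)."""
    return (major << 16) + (minor << 8) + min(patch, 255)


def unpack(code):
    """Split a packed code back into (major, minor, patch)."""
    return (code >> 16, (code >> 8) & 0xFF, code & 0xFF)


if __name__ == "__main__":
    print(hex(kernel_version_naive(5, 10, 0)))
    # Without clamping, a patch level of 256 spills into the minor field.
    print(unpack(kernel_version_naive(4, 9, 256)))
    print(unpack(kernel_version_clamped(4, 9, 256)))
```

Any user-space code that compares against the packed value relies on this layout, which is why changing the meaning of the fields would itself be an ABI break.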
        
               | theteapot wrote:
               | > Can't be more than 255 in any field, except the
               | EXTRAVERSION which is a string. Kernel version numbering
               | is exposed in the userspace ABI, so changing it to have
               | meaning would be a breaking change.
               | 
               | Sounds like they should leave this change for 6.0 then.
        
               | charcircuit wrote:
               | >Can't be more than 255 in any field
               | 
               | They can just change the max. It's not that big of a
               | deal.
        
               | mrlonglong wrote:
               | How many digits does a penguin have?
        
               | ufo wrote:
               | I believe it's 14. (Birds feet have 4 toes and their
               | wings have 3 fingers)
        
       | ranger207 wrote:
       | As an aside, if you enjoy this article please consider
       | subscribing to LWN at [0]. Normally articles are only available
       | for free a week after they're posted; one of the perks of
       | subscribing is being able to send a link like the one posted here
       | so non-subscribers can view it before the week has passed. LWN's
       | main source of revenue as I understand it is subscriptions, and
       | the work they do is definitely worth it
       | 
       | [0] https://lwn.net/subscribe/Info
        
       | mark_undoio wrote:
       | Our software[1] got broken by a kernel change a few years ago.
       | The experience was quite interesting.
       | 
       | We were making use of `/proc/PID/pagemap`, which is a kernel-
       | generated file that _previously_ would show the physical
       | addresses of all the pages in a process's address space.
       | Unfortunately, with the Rowhammer exploit, exposing this
       | information - even for one's own processes - to unprivileged
       | users went from being harmless to a security risk.
       | 
       | The first we saw of the change was when newer kernels started
       | reporting zeros for all physical addresses, unless we ran as
       | root. We raised this on the LKML, explaining that we'd been relying
       | on this feature to implement a somewhat esoteric optimisation.
       | 
       | Linus replied very helpfully - the security fix trumped userspace
       | compatibility but he could see a secure way of getting us the
       | information we really needed, given the technique we'd described.
       | He invited us to submit a kernel patch and gave a few hints about
       | potential gotchas.
       | 
       | I did work one up but the kernel community actually jumped on it
       | as an opportunity to do more cleanup, so I ended up just signing
       | off on the patch they produced. It was all a remarkably smooth
       | and efficient process.
       | 
       | [1] Time travel debugging - http://undo.io
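
For readers who have not used the file, here is a rough sketch of how a /proc/PID/pagemap entry can be read and decoded. It assumes the bit layout from the kernel's pagemap documentation (bits 0-54 hold the page frame number when the page is present, bit 62 means swapped, bit 63 means present); the helper names are invented for this example and are not from the commenter's software:

```python
import os
import struct

PFN_MASK = (1 << 55) - 1   # bits 0-54: page frame number (if present)
SWAPPED = 1 << 62          # bit 62: page is in swap
PRESENT = 1 << 63          # bit 63: page is present in RAM


def decode_pagemap_entry(entry):
    """Decode one 64-bit pagemap entry into a small dict."""
    present = bool(entry & PRESENT)
    swapped = bool(entry & SWAPPED)
    # The PFN field is only meaningful for a present, non-swapped page.
    pfn = (entry & PFN_MASK) if present and not swapped else None
    return {"present": present, "swapped": swapped, "pfn": pfn}


def read_pagemap_entry(pid, vaddr):
    """Read the entry covering virtual address vaddr (Linux only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    offset = (vaddr // page_size) * 8      # one 8-byte entry per page
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(offset)
        raw = f.read(8)
    if len(raw) != 8:
        raise ValueError("address not covered by pagemap")
    (entry,) = struct.unpack("=Q", raw)    # native byte order
    return decode_pagemap_entry(entry)
```

On kernels with the restriction described above, an unprivileged reader would still see the present bit but a PFN of 0, which matches the "zeros for all physical addresses" symptom.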
        
         | khuey wrote:
         | Heh, as someone who maintains another exotic debugger[1] my
         | experience has been that the kernel development process is kind
         | of a pain. We've had good experiences with people fixing
         | regressions once they're discovered but getting new features
         | into the kernel has been difficult. I think it took 10
         | revisions for me to get cpuid faulting into the kernel,
         | including multiple review cycles where I was first told "change
         | X to Y" and then in a subsequent cycle told "change Y to X".
         | 
         | [1] https://rr-project.org/
        
         | Havoc wrote:
         | > the security fix trumped userspace compatibility
         | 
         | TIL
         | 
         | All stories about kernels describe it as an absolute that
         | userspace is never broken, but that actually makes sense
        
           | Quekid5 wrote:
           | Yeah, there are rare exceptions to The Rule.
        
         | tanelpoder wrote:
         | Yeah, for similar reasons, the /proc/PID/wchan now shows just
         | "0" (for other users' processes) on newer kernels, unless you
         | run as root. Same with /proc/PID/stack, but it's implemented in
          | a different way: I can open() that file successfully, but the
          | read() syscall on the opened file descriptor returns an EACCES
          | error...
          | 
          |     $ ls -l /proc/$$/stack
          |     -r-------- 1 tanel tanel 0 May 26 21:52 /proc/967141/stack
          |     $
          |     $ cat /proc/$$/stack
          |     cat: /proc/967141/stack: Permission denied
          |     $
          |     $ sudo cat /proc/$$/stack
          |     [<0>] do_wait+0x1c3/0x230
          |     [<0>] kernel_wait4+0xaf/0x150
          |     [<0>] __do_sys_wait4+0x85/0x90
          |     [<0>] __x64_sys_wait4+0x1e/0x20
          |     [<0>] do_syscall_64+0x49/0xc0
          |     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
         | 
         | Edit: Adding one more comment - my impression has been that the
         | "no userland-visible changes" promise applies to system calls -
         | how procfs presents data as human-readable text in the /proc
         | files has changed every now and then before (I recall the sar
         | command showing wrong numbers after a kernel update, for
         | example).
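
The open-succeeds-but-read-fails behavior is easy to miss if a tool only checks the result of open(). A small sketch of a probe that reports which stage failed; the function name is hypothetical, and whether a given procfs file fails at open or at read depends on the kernel version and the file:

```python
import errno


def probe_proc_file(path):
    """Return (stage, detail): stage is 'open', 'read', or 'ok'.

    For 'open' and 'read', detail is the symbolic errno name (e.g. 'EACCES').
    For 'ok', detail is the file contents.
    """
    try:
        f = open(path, "r")
    except OSError as exc:
        return ("open", errno.errorcode.get(exc.errno, str(exc.errno)))
    with f:
        try:
            data = f.read()
        except OSError as exc:
            return ("read", errno.errorcode.get(exc.errno, str(exc.errno)))
    return ("ok", data)


if __name__ == "__main__":
    import os
    print(probe_proc_file(f"/proc/{os.getpid()}/stack")[0])
```

Run as an unprivileged user against /proc/PID/stack, a probe like this would be expected to report the failure at the read stage rather than at open, matching the transcript above.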
        
       | sc68cal wrote:
       | I think that leaving splice() as it was, is worse than breaking
       | the "NEVER BREAK USERSPACE" rule, in this specific instance. It
       | should not become a regular occurrence, but I don't see a better
       | way. They tried to stick to it, things happened, but I am
       | optimistic that they will fix it and try better the next time.
        
       | i13e wrote:
       | Not sure if here is the place for this but here goes. Personally
       | I've been dealing with a regression of sorts on my laptop: it no
       | longer suspends without being quickly woken up again on the main
       | kernel release (works fine on LTS). Does anyone know if and where
       | I could file a bug report?
        
         | Beltalowda wrote:
         | I have some issues as well; I heard that [1] was the cause for
         | some breakage, which should be fixed in 5.18, but I didn't
         | verify as I haven't had much reason to use suspend lately.
         | 
         | [1]:
         | https://github.com/torvalds/linux/commit/dfbba2518aac4204203...
        
         | __turbobrew__ wrote:
         | I have noticed this as well
        
         | csdvrx wrote:
         | > my laptop: it no longer suspends without being quickly woken
         | up again on the main kernel release
         | 
         | Long story short: there are a lot of things in your laptop
         | generating interrupts.
         | 
         | Some of them you want to ignore, because it would cause the
         | behavior you describe (preventing sleep)
         | 
         | Some of them you really want to listen to closely, because if
         | it's an interrupt generated by say brushing on the power
         | button, not listening to it means not waking up from sleep
         | (traditional example: GPE96 on dells, cf for example
         | https://bugzilla.kernel.org/show_bug.cgi?id=102281) until a
         | longer press on the button generates an ACPI event or a
         | powerup.
         | 
         | You can configure or finetune that with /proc/acpi/wakeup which
         | hopefully will give you more context as to what other people
         | have explained here.
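
To see which devices are currently allowed to wake the machine, the table in /proc/acpi/wakeup can be parsed with a few lines. This is a sketch under the assumption that the file has the usual "Device / S-state / Status / Sysfs node" columns; the sample text below is illustrative and not taken from any real machine, and the function name is made up:

```python
def parse_acpi_wakeup(text):
    """Parse the /proc/acpi/wakeup table into a list of dicts."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "Device":
            continue
        entries.append({
            "device": fields[0],
            "s_state": fields[1],
            # Status is printed as "*enabled" or "*disabled".
            "enabled": fields[2].lstrip("*") == "enabled",
            "sysfs_node": fields[3] if len(fields) > 3 else None,
        })
    return entries


# Illustrative sample only; device names vary between machines.
SAMPLE = """\
Device  S-state   Status   Sysfs node
XHC       S3    *enabled   pci:0000:00:14.0
LID0      S4    *enabled   platform:PNP0C0D:00
PXSX      S4    *disabled
"""

if __name__ == "__main__":
    for entry in parse_acpi_wakeup(SAMPLE):
        print(entry["device"], "enabled" if entry["enabled"] else "disabled")
```

On a real system the text would come from reading /proc/acpi/wakeup. Writing a device name back into that file as root toggles its wakeup state, which is a quick way to test whether a particular device is the one waking the laptop.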
        
         | euank wrote:
         | > Does anyone know if and where I could file a bug report?
         | 
         | The linux kernel does take bug reports:
         | https://docs.kernel.org/admin-guide/reporting-issues.html
         | 
          | However, that bug probably isn't specific enough as you've
          | described it, unless you can find the commit causing it (such
          | as via a git bisect,
          | https://docs.kernel.org/admin-guide/bug-bisect.html), or come
          | up with a clearer repro.
         | 
         | Alternatively, if you're seeing the issue on a distro-
         | maintained kernel (such as on fedora/ubuntu/debian with their
         | kernel package), reporting the issue to the distro maintainers
         | may be more appropriate.
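
Conceptually, a bisect is a binary search between a known-good and a known-bad commit. The sketch below models only that idea on a linear history; it is not how git implements it, and the commit names and the is_bad callback are placeholders for "build this kernel, boot it, and try to suspend":

```python
def first_bad_commit(commits, is_bad):
    """Find the first bad commit by binary search.

    commits is ordered oldest to newest; commits[0] is assumed good and
    commits[-1] is assumed bad. Returns (commit, number_of_tests_run).
    """
    lo, hi = 0, len(commits) - 1   # invariant: commits[lo] good, commits[hi] bad
    steps = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        steps += 1
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], steps


if __name__ == "__main__":
    history = [f"c{i}" for i in range(16)]            # hypothetical commits
    culprit, tests = first_bad_commit(history, lambda c: int(c[1:]) >= 11)
    print(culprit, tests)
```

The point for a bug report is the cost: the number of kernels to build and test grows with the logarithm of the number of commits between the good and bad releases, not with the number of commits itself.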
        
         | corbet wrote:
          | I've seen similar things. Messing with /proc/acpi/wakeup
          | (https://unix.stackexchange.com/questions/698185/laptop-wakes...)
          | made the problem go away for me. I've been meaning to try to
          | bisect it down and find the real problem, but haven't had the
          | time to do that yet...
        
         | mathstuf wrote:
         | I noticed it recently too, but disconnecting my Bluetooth
         | headset first allowed it to suspend properly (it seems like the
         | wifi chip kept it awake before making like a yo-yo back and
         | forth).
        
           | zonotope wrote:
           | Me 3 and it's maddening. I've posted this [1] on the Arch
           | Linux forums but haven't found a fix. I've seen it mentioned
           | around the web that it's a known bug, but I have yet to find
           | a bug report
           | 
           | 1: https://bbs.archlinux.org/viewtopic.php?id=276064
        
             | csdvrx wrote:
             | It's an open question whether Bluetooth should prevent
             | sleep, or let sleep go through.
             | 
             | On a well done "modern suspend/suspend to idle", I use
             | disconnected modern sleep + Windows Media Player on Windows
             | 11 to listen to songs on my bluetooth headphones for a cost
             | of about 1% of the battery per hour (as measured and
             | plotted with powercfg) which can come down to about half of
             | it, 0.5%/h when not using Bluetooth.
             | 
             | I wouldn't want Bluetooth to prevent sleep (a 1% drop per
             | hour is better than not sleeping!)
             | 
             | I also wouldn't want sleep to prevent me from using my
             | Bluetooth headset (the difference between 0.5% and 1% is
             | significant, but it doesn't matter much in practice if my
             | computer can be usable in the morning)
             | 
             | This is one of the rare examples of "modern suspend"
             | delivering on its promises, and blowing the good old ACPI
             | S3 away: I never saw a drop of battery <5% on S3 suspend-
             | to-ram unless it also involved S4 in a hybrid sleep of
             | "ACPI S3 suspend-to-ram then S4 suspend-to-disk after a
             | while or when I run out of power whichever comes first"!
        
       | mc4ndr3 wrote:
       | with_resource(mutex file handle etc etc *o, func callback) is a
       | safer pattern, for this reason. Thank you, Python
       | 
       | (with_resource is a generic example. Search for such capabilities
       | in your library.)
       | 
       | Computers exist to automate. Let the computer remember to close
       | resources upon scope end.
       | 
       | Go's defer is also helpful in this regard.
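
A minimal sketch of the scoped-resource pattern described above, written as a Python context manager. The AddressLimit class is a hypothetical stand-in for a piece of per-thread state that must be restored after temporary use; it is an analogy, not kernel code:

```python
from contextlib import contextmanager


class AddressLimit:
    """Hypothetical stand-in for state that must be restored after use."""

    def __init__(self):
        self.mode = "USER"


@contextmanager
def kernel_access(limit):
    """Temporarily switch to KERNEL mode and always restore the old mode."""
    previous = limit.mode
    limit.mode = "KERNEL"
    try:
        yield limit
    finally:
        # Runs on normal exit and when the body raises.
        limit.mode = previous


if __name__ == "__main__":
    limit = AddressLimit()
    with kernel_access(limit):
        print(limit.mode)
    print(limit.mode)
```

The restore step lives in one place, so a caller cannot forget it on an early return or an error path. That is the property the pattern offers over a manually paired set/restore call.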
        
         | viraptor wrote:
         | That was my first thought - would they leave the set_fs
         | solution in place if C had a native solution for resource
         | blocks / RAII+drop.
        
           | shaded-enmity wrote:
           | A simple policy that both set_fs() calls need to happen
           | within the same function body with corresponding CI test
           | based on AST/DWARF inspection would have also prevented it.
           | Do you really want to rely on stack unwinding/destructors for
           | security sensitive code when stack is usually the first thing
           | that gets controlled by the attacker? Exception handling
            | (SEH) on Windows is an exploitation vector of its own.
        
             | viraptor wrote:
             | I'm talking about the general idea not specific
             | implementation. Having something happen at function/block
             | exit doesn't mean a runtime configurable behaviour. If you
             | don't have exceptions, it's pretty easy to statically
             | compile that behaviour and guarantee it rather than rely on
             | checks.
        
               | shaded-enmity wrote:
               | You still need to implement a CI rule so that I don't
               | just call `set_fs()` without using
               | `preferred_syntax(set_fs)`, don't you?
        
               | viraptor wrote:
               | We're talking a theoretical language here, so... maybe?
               | It could be also that set_fs is usable only in the
               | preferred_syntax mode.
        
           | ars wrote:
            | It's not that hard to do in C. Make a macro that takes a
            | function; it runs:
            | 
            |     set_fs(KERN)
            |     *function_pointer()
            |     set_fs(USER)
            | 
            | They did it before, so they knew about that option, and if
            | they didn't do it, they must have had a reason.
        
             | simias wrote:
             | It's true, unfortunately the lack of lambdas in C makes
             | this pattern cumbersome. You could achieve it with a macro
             | instead, I suppose.
        
           | SolarNet wrote:
           | Except that things are weird in a kernel, one cannot
           | guarantee that language behavior is enforceable. Even with a
           | RAII/drop style solution there is a possibility that kernel
           | weirdness will prevent it from running correctly.
           | 
           | The issue with forgetting function calls like this are the
           | edge cases, and the kernel has a lot more of those.
        
       | teraflop wrote:
       | Even if you don't want to directly support splice() for every
       | possible filesystem or file descriptor, I don't understand why it
       | couldn't be "emulated" in those cases, by having the kernel just
       | pretend it was given a series of read()/write() calls. That might
       | not be as efficient, but surely it would be better than just
       | breaking completely.
        
         | amluto wrote:
         | It used to. But this emulation worked by doing set_fs() so that
         | the internal (outdated) read/write implementation could follow
         | a pointer into _kernel_ memory, and that facility is gone now.
        
           | teraflop wrote:
           | Ah, I think I get it. So the problem isn't just that these
           | drivers don't specifically support splice() -- it's that they
            | _also_ only support reading/writing with userspace buffers,
           | and set_fs() was just a hack around that limitation that is
           | no longer supported. That's kind of unfortunate.
        
             | amluto wrote:
             | Correct. The modern interface supports reading and writing
             | from a rather generic concept of a buffer, and drivers that
             | support that are usable with splice().
        
         | londons_explore wrote:
         | Perhaps because some of the semantics of a spliced pipe aren't
         | exactly the same as read() followed by write()?
        
         | derefr wrote:
         | The point of splice(2) is to be a fast-path; it's not in POSIX,
         | so software that wants to guarantee portability has to write
         | the slow path too, and only use splice(2) if it's present.
         | 
          | Such a blind/naive compatibility shim would likely be much
          | slower than whatever hand-written slow path the developer has
          | in place. If the whole point of a call is to be a
          | fast/efficient version of something else, then the call is
          | _breaking its semantics_ if it isn't faster or more efficient
          | than the alternative.
         | 
         | Because of this, it's better to just make autoconf et al
         | _detect splice(2) as absent for the given use-case_ (and fall
         | back to the hand-rolled portable slow path), rather than
         | relying on a naive kernel shim.
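
One way to keep both paths in user space is to decide at run time rather than at build time: try splice() and drop to a read()/write() loop if the kernel rejects it for this particular pair of descriptors. The sketch below assumes Linux and Python 3.10 or later for os.splice (it is guarded with getattr so the fallback still works elsewhere); the function name and the choice of EINVAL and ENOSYS as "not supported here" signals are assumptions for illustration:

```python
import errno
import os


def copy_with_fallback(src_fd, dst_fd, count, chunk=65536):
    """Copy up to count bytes from src_fd to dst_fd; return bytes copied."""
    splice = getattr(os, "splice", None)   # missing before Python 3.10 / off Linux
    use_splice = splice is not None
    copied = 0
    while copied < count:
        want = min(chunk, count - copied)
        if use_splice:
            try:
                n = splice(src_fd, dst_fd, want)
            except OSError as exc:
                if exc.errno not in (errno.EINVAL, errno.ENOSYS):
                    raise
                use_splice = False         # switch to the portable slow path
                continue
            if n == 0:                     # end of input
                break
            copied += n
        else:
            data = os.read(src_fd, want)
            if not data:                   # end of input
                break
            view = memoryview(data)
            while view:                    # handle short writes
                written = os.write(dst_fd, view)
                view = view[written:]
            copied += len(data)
    return copied
```

Because the decision is made per call, a binary built on one kernel keeps working when run on another where splice() has stopped being available for a given file type.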
        
           | bonzini wrote:
           | You cannot in general use autoconf to detect runtime
           | behavior.
        
             | baisq wrote:
             | Doesn't that happen often anyway?
        
               | bonzini wrote:
                | Nope. You can run programs to detect some behavior, for
                | example printing sizeof(void *) or checking how some
                | rounding is performed. Those tests of course may _affect_
                | the program behavior at runtime, but what they test is
                | still the compilation environment.
        
             | derefr wrote:
             | Many of autoconf's more fraught checks work by attempting
             | to compile and run little C programs to see what happens.
             | So if splice(2) no longer works when the source is e.g.
             | /dev/random, then autoconf could attempt to compile+run a
             | program that splice(2)s from /dev/random to a known-working
             | sink, and see whether that program SIGSEGVs or not when
             | run; and use that to decide whether to _allow_ an --enable-
             | splice configure flag to be passed, vs. bailing out if such
             | a flag is passed.
             | 
             | Compare and contrast: microarchitectural optimizations. You
             | can ask autoconf to enable them, but there's the implicit
             | assumption that if you're doing so, the build environment
             | _is_ going to be the runtime environment.
             | 
             | Autoconf can only protect you so far; it has to assume that
              | if you're passing power-user-only optimization
             | flags, it's because you have a good plan for what you're
             | going to be doing with the resulting binary.
             | 
             | Note that this is part of why embedded-system cross-
             | compilation toolchains aren't just GCC/LLVM plugin packages
             | you can install via your package manager -- they're quite
             | complex, often embedding e.g. a target-machine emulator for
             | autoconf's compiled detections to be run inside of, so that
             | it gets the right answers.
        
             | rootw0rm wrote:
             | I don't think they're suggesting runtime use, but rather
             | building software after the splice(2) change. That's how I
             | read it at least.
        
               | vlovich123 wrote:
                | Aside from the "where you build isn't where you run"
                | problem, there's the "this syscall doesn't work when
                | running on filesystem X or accessing /dev/urandom" problem.
               | 
               | Not only is the autoconf solution not fixing the problem,
               | it's placing a massive undue burden on developers. Linus
               | has been inconsistent here. Telemetry would have been
               | helpful here in aiding this work (ie support it with the
               | slow path but report the event so that distro maintainers
               | could provide feedback on broken paths). Once you think
               | you've eliminated the long tail of issues, then remove
               | and see if anything remains broken that telemetry didn't
               | catch.
        
               | bonzini wrote:
               | It would break if compiled on old kernel and run on new
               | kernel. When CI is containerized the kernel might not
               | even have anything to do with the distro that you're
               | building on.
        
       | manchmalscott wrote:
       | The only time I've had a kernel update break things was a weird
       | Proxmox bug where if you booted with the (at the time) latest
       | kernel, it would fail to start the first VM, and nothing you did
       | from the UI or the command line could touch it without timing
       | out. I rebooted to the previous kernel release and it just started
       | working again with no other changes.
       | 
       | That definitely broke my assumption that kernel updates were well
       | vetted for regressions.
        
         | bityard wrote:
         | "Don't break userspace" is more like a _goal_ than a law.
         | 
         | The kernel is an extremely complex piece of code, the
         | developers can't be asked to test every kernel release against
         | every piece of userspace software on every hardware
         | configuration. That's one of the reasons that new code and
         | significant changes require a bunch of mailing list discussion,
         | reviews, and signing-off.
         | 
         | Also, Proxmox ships with its own supported kernel, it shouldn't
         | be a big surprise that you ran into issues while straying off
         | the beaten path.
        
           | londons_explore wrote:
           | > The kernel is an extremely complex piece of code,
           | 
           | With no regression tests, nearly no unit tests, and lots of
           | bits of supported hardware that nobody on the Dev team uses
           | day to day...
           | 
           | It's a miracle that quality is as high as it is.
        
             | bonzini wrote:
             | There are plenty of tests, just not in Linus's tree. KVM
             | has not one but three suites of unit and integration tests,
             | for example, but only 60ish tests are in the kernel's
             | tools/testing/selftests directory.
        
         | bombcar wrote:
         | I suspect kernel updates aren't vetted that well in general;
         | the "don't break user space" is a ruling from Linus about what
          | not to _do_ - usually with regards to the user space API (don't
          | remove or change things that already exist).
         | 
         | And this article is an example of where the decision was made
         | _to_ break user space.
        
           | josefx wrote:
           | Is it even a hard break? It seems to be more of a short lived
           | regression while patches for filesystem drivers are still
           | coming in to restore support and I don't think Linux ever
           | guaranteed a stable driver API.
        
         | simlevesque wrote:
         | > kernel updates were well vetted for regressions.
         | 
          | That seems like the kind of thing that is easier said than
          | done.
        
         | caslon wrote:
         | > The only time I've had a kernel update break things
         | 
         | > That definitely broke my assumption that kernel updates were
         | well vetted for regressions.
         | 
         | Shouldn't that be evidence for the contrary? One break in a
         | decade or two of use isn't so bad. It's actually pretty good.
        
         | yjftsjthsd-h wrote:
         | > That definitely broke my assumption that kernel updates were
         | well vetted for regressions.
         | 
         | I would argue that it's also a failure by the distro; unless
         | there was something special about your exact setup that would
         | have made the bug not show up in testing, I would argue that
         | proxmox should have been testing updates to catch that kind of
         | problem before users noticed.
        
         | 867-5309 wrote:
         | I've just had that with the most recent Proxmox update and also
         | had no option but to regress
        
           | Fnoord wrote:
            | I had the issue as well but I managed to fix it by changing
           | my boot parameters (exact change differs per bootloader and
           | hardware). Now I happily boot with .15 series (up from .13).
           | 
           | I expect some more users of Proxmox given Broadcom taking
            | over VMware (e.g. I'd like to migrate away from ESXi to Proxmox
           | as I don't trust Broadcom). Hopefully it does the product and
           | community well.
        
       | loeg wrote:
       | > But it is true that this type of episode makes the kernel's "no
       | regressions" rule look a bit more like just a guideline. It does
       | not take too many of those to create breakage to the project's
       | reputation that is hard to splice back together.
        
       ___________________________________________________________________
       (page generated 2022-05-26 23:00 UTC)