[HN Gopher] A Linux 5.10 patch has caused user-space regressions
___________________________________________________________________
A Linux 5.10 patch has caused user-space regressions
Author : chmaynard
Score : 160 points
Date : 2022-05-26 16:48 UTC (6 hours ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| lamontcg wrote:
| Don't like the lecturing tone of this article.
|
| This seems entirely successful to me and retired crappy and
| insecure behavior.
|
| The idea that you can never break anything means that some
| historical mistakes can never get fixed.
|
| Then someone complains about how the language / operating system
| is a bolted on mess of nonsense and people go chasing after the
| next hot thing which has the benefit of just being newer and
| being able to get away with breaking changes because people on
| the bleeding edge put up with it.
|
| The linux kernel is probably striking the right balance here, and
| I doubt that this one breakage is a sign of the decay and
| downfall of the Empire.
| theteapot wrote:
| Shouldn't they have left it for 6.0?
| mhitza wrote:
| Kernel doesn't adhere to a SemVer policy
| theteapot wrote:
| Maybe they should. Maybe, they, should.
| SAI_Peregrinus wrote:
| Naw. Kernel version number increases. Can't be more than
| 255 in any field, except the EXTRAVERSION which is a
| string. Kernel version numbering is exposed in the
| userspace ABI, so changing it to have meaning would be a
| breaking change for no substantial benefit. This[1] LWN
| article is a good overview of the history. Effectively the
| major gets incremented when the minor gets to 20 and Linus
| runs out of fingers & toes to keep track.
|
| [1] https://lwn.net/Articles/871989/
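To make the 255 limit mentioned above concrete: the kernel packs its version into one integer (LINUX_VERSION_CODE) via the KERNEL_VERSION() macro, giving the minor and patch fields one byte each. The following is a minimal Python sketch of that packing. The function name is made up for illustration, and the clamping of the patch level at 255 reflects what recent kernels do as far as I know, so treat it as an assumption rather than a spec.

```python
def kernel_version_code(major: int, minor: int, patch: int) -> int:
    """Pack a version triple roughly the way KERNEL_VERSION(a, b, c) does.

    Hypothetical helper for illustration only; it is not a kernel API.
    """
    # The patch (sublevel) field only has 8 bits of room, so values above
    # 255 are clamped instead of overflowing into the minor field.
    clamped_patch = min(patch, 255)
    return (major << 16) + (minor << 8) + clamped_patch


if __name__ == "__main__":
    print(hex(kernel_version_code(5, 10, 0)))
```

Because userspace compares these integers directly, widening a field would shift every other field, which is why the layout is effectively frozen.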
| theteapot wrote:
| > Can't be more than 255 in any field, except the
| EXTRAVERSION which is a string. Kernel version numbering
| is exposed in the userspace ABI, so changing it to have
| meaning would be a breaking change.
|
| Sounds like they should leave this change for 6.0 then.
| charcircuit wrote:
| >Can't be more than 255 in any field
|
| They can just change the max. It's not that big of a
| deal.
| mrlonglong wrote:
| How many digits does a penguin have?
| ufo wrote:
| I believe it's 14. (Birds' feet have 4 toes and their
| wings have 3 fingers)
| ranger207 wrote:
| As an aside, if you enjoy this article please consider
| subscribing to LWN at [0]. Normally articles are only available
| for free a week after they're posted; one of the perks of
| subscribing is being able to send a link like the one posted here
| so non-subscribers can view it before the week has passed. LWN's
| main source of revenue as I understand it is subscriptions, and
| the work they do is definitely worth it
|
| [0] https://lwn.net/subscribe/Info
| [deleted]
| mark_undoio wrote:
| Our software[1] got broken by a kernel change a few years ago.
| The experience was quite interesting.
|
| We were making use of `/proc/PID/pagemap`, which is a kernel-
| generated file that _previously_ would show the physical
| addresses of all the pages in a process's address space.
| Unfortunately, with the Rowhammer exploit, exposing this
| information - even for one's own processes - to unprivileged
| users went from being harmless to a security risk.
|
| The first we saw of the change was when newer kernels started
| reporting zeros for all physical addresses, unless we ran as
| root. We raised this on the LKML, explaining that we'd been relying
| on this feature to implement a somewhat esoteric optimisation.
|
| Linus replied very helpfully - the security fix trumped userspace
| compatibility but he could see a secure way of getting us the
| information we really needed, given the technique we'd described.
| He invited us to submit a kernel patch and gave a few hints about
| potential gotchas.
|
| I did work one up but the kernel community actually jumped on it
| as an opportunity to do more cleanup, so I ended up just signing
| off on the patch they produced. It was all a remarkably smooth
| and efficient process.
|
| [1] Time travel debugging - http://undo.io
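For readers who have not looked at `/proc/PID/pagemap`: it is a binary file with one 64-bit entry per virtual page. Below is a rough Python sketch of reading and decoding one entry. The bit positions follow the kernel's pagemap documentation as I understand it (bit 63 = present, bit 62 = swapped, bits 0-54 = page frame number); the function names are hypothetical, and this is not the code described in the comment above.

```python
import os
import struct

PAGEMAP_ENTRY_BYTES = 8
PFN_MASK = (1 << 55) - 1   # bits 0-54: page frame number (when present)
SWAPPED_BIT = 1 << 62      # bit 62: page is in swap
PRESENT_BIT = 1 << 63      # bit 63: page is present in RAM


def decode_pagemap_entry(raw: int) -> dict:
    """Split one raw 64-bit pagemap entry into the fields used here."""
    present = bool(raw & PRESENT_BIT)
    swapped = bool(raw & SWAPPED_BIT)
    # Without sufficient privilege, newer kernels report the PFN bits as
    # zero, which is the behaviour change the comment above ran into.
    pfn = (raw & PFN_MASK) if present else 0
    return {"present": present, "swapped": swapped, "pfn": pfn}


def read_pagemap_entry(pid: int, vaddr: int) -> dict:
    """Read the pagemap entry covering virtual address `vaddr` of `pid`."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    offset = (vaddr // page_size) * PAGEMAP_ENTRY_BYTES
    # Unbuffered, so only the 8 bytes of this one entry are requested.
    with open(f"/proc/{pid}/pagemap", "rb", buffering=0) as pagemap:
        pagemap.seek(offset)
        data = pagemap.read(PAGEMAP_ENTRY_BYTES)
    if len(data) != PAGEMAP_ENTRY_BYTES:
        raise OSError("short read from pagemap")
    (raw,) = struct.unpack("=Q", data)  # native-endian unsigned 64-bit
    return decode_pagemap_entry(raw)
```

Run unprivileged on a recent kernel, `read_pagemap_entry(os.getpid(), some_address)` would be expected to show `present` as true for a mapped page but a `pfn` of 0.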
| khuey wrote:
| Heh, as someone who maintains another exotic debugger[1] my
| experience has been that the kernel development process is kind
| of a pain. We've had good experiences with people fixing
| regressions once they're discovered but getting new features
| into the kernel has been difficult. I think it took 10
| revisions for me to get cpuid faulting into the kernel,
| including multiple review cycles where I was first told "change
| X to Y" and then in a subsequent cycle told "change Y to X".
|
| [1] https://rr-project.org/
| Havoc wrote:
| > the security fix trumped userspace compatibility
|
| TIL
|
| All stories about kernels describe it as an absolute that
| userspace is never broken, but that actually makes sense
| Quekid5 wrote:
| Yeah, there are rare exceptions to The Rule.
| tanelpoder wrote:
| Yeah, for similar reasons, the /proc/PID/wchan now shows just
| "0" (for other users' processes) on newer kernels, unless you
| run as root. Same with /proc/PID/stack, but it's implemented in
| a different way, I can open() that file successfully, but the
| read() syscall on the opened file descriptor returns an EACCES
| error...
|
|     $ ls -l /proc/$$/stack
|     -r-------- 1 tanel tanel 0 May 26 21:52 /proc/967141/stack
|     $ cat /proc/$$/stack
|     cat: /proc/967141/stack: Permission denied
|     $ sudo cat /proc/$$/stack
|     [<0>] do_wait+0x1c3/0x230
|     [<0>] kernel_wait4+0xaf/0x150
|     [<0>] __do_sys_wait4+0x85/0x90
|     [<0>] __x64_sys_wait4+0x1e/0x20
|     [<0>] do_syscall_64+0x49/0xc0
|     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
|
| Edit: Adding one more comment - my impression has been that the
| "no userland-visible changes" promise applies to system calls -
| how procfs presents data as human-readable text in the /proc
| files has changed every now and then before (I recall the sar
| command showing wrong numbers after a kernel update, for
| example).
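The open-succeeds-but-read-fails behaviour described above is easy to miss if a tool only checks the result of open(). A small Python sketch that distinguishes the two failure points follows; `probe_readable` is a made-up helper name, and the exact errno you get back depends on the kernel and the file in question.

```python
import errno
import os


def probe_readable(path: str) -> str:
    """Report whether open() or the first read() is what fails for `path`."""
    try:
        handle = open(path, "rb", buffering=0)
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return f"open failed: {name}"
    with handle:
        try:
            handle.read(1)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            return f"read failed: {name}"
    return "ok"
```

Calling `probe_readable(f"/proc/{os.getpid()}/stack")` as an unprivileged user on a kernel that behaves like the one in the transcript would be expected to report a read failure rather than an open failure.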
| sc68cal wrote:
| I think that leaving splice() as it was, is worse than breaking
| the "NEVER BREAK USERSPACE" rule, in this specific instance. It
| should not become a regular occurrence, but I don't see a better
| way. They tried to stick to it, things happened, but I am
| optimistic that they will fix it and try better the next time.
| i13e wrote:
| Not sure if here is the place for this but here goes. Personally
| I've been dealing with a regression of sorts on my laptop: it no
| longer suspends without being quickly woken up again on the main
| kernel release (works fine on LTS). Does anyone know if and where
| I could file a bug report?
| Beltalowda wrote:
| I have some issues as well; I heard that [1] was the cause for
| some breakage, which should be fixed in 5.18, but I didn't
| verify as I haven't had much reason to use suspend lately.
|
| [1]:
| https://github.com/torvalds/linux/commit/dfbba2518aac4204203...
| __turbobrew__ wrote:
| I have noticed this as well
| csdvrx wrote:
| > my laptop: it no longer suspends without being quickly woken
| up again on the main kernel release
|
| Long story short: there are a lot of things in your laptop
| generating interrupts.
|
| Some of them you want to ignore, because it would cause the
| behavior you describe (preventing sleep)
|
| Some of them you really want to listen to closely, because if
| it's an interrupt generated by say brushing on the power
| button, not listening to it means not waking up from sleep
| (traditional example: GPE96 on dells, cf for example
| https://bugzilla.kernel.org/show_bug.cgi?id=102281) until a
| longer press on the button generates an ACPI event or a
| powerup.
|
| You can configure or finetune that with /proc/acpi/wakeup which
| hopefully will give you more context as to what other people
| have explained here.
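As a rough illustration of what `/proc/acpi/wakeup` contains: it is a small text table of devices, the deepest sleep state they can wake from, and whether wakeup is currently enabled. Writing a device name back to the file toggles that device. The Python sketch below parses the table; the sample text and device names are invented for illustration, and the column layout is assumed from typical output, so it may differ on your machine.

```python
# Illustrative sample only; real device names and nodes vary per machine.
SAMPLE = (
    "Device\tS-state\t  Status   Sysfs node\n"
    "LID0\t  S4\t*enabled   platform:PNP0C0D:00\n"
    "XHC\t  S3\t*disabled  pci:0000:00:14.0\n"
)


def parse_acpi_wakeup(text: str) -> list:
    """Parse the wakeup table into a list of dicts, skipping the header."""
    entries = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or unexpected line
        entries.append({
            "device": fields[0],
            "s_state": fields[1],
            "enabled": fields[2].lstrip("*") == "enabled",
            "sysfs_node": fields[3] if len(fields) > 3 else None,
        })
    return entries


def toggle_wakeup(device: str, path: str = "/proc/acpi/wakeup") -> None:
    """Flip the enabled/disabled state of one device (needs root)."""
    with open(path, "w") as wakeup:
        wakeup.write(device)


if __name__ == "__main__":
    for entry in parse_acpi_wakeup(SAMPLE):
        print(entry["device"], "enabled" if entry["enabled"] else "disabled")
```

Disabling wake sources one at a time and retrying suspend is a common way to find which device is responsible for the immediate wakeup.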
| euank wrote:
| > Does anyone know if and where I could file a bug report?
|
| The linux kernel does take bug reports:
| https://docs.kernel.org/admin-guide/reporting-issues.html
|
| However, that bug probably isn't specific enough as you've
| described it, unless you can find the commit causing it (such
| as via a git bisect:
| https://docs.kernel.org/admin-guide/bug-bisect.html), or come
| up with a clearer repro.
|
| Alternatively, if you're seeing the issue on a distro-
| maintained kernel (such as on fedora/ubuntu/debian with their
| kernel package), reporting the issue to the distro maintainers
| may be more appropriate.
| corbet wrote:
| I've seen similar things. Messing with /proc/acpi/wakeup
| (https://unix.stackexchange.com/questions/698185/laptop-wakes...)
| made the problem go away for me. I've been meaning to
| try to bisect it down and find the real problem, but haven't
| had the time to do that yet...
| mathstuf wrote:
| I noticed it recently too, but disconnecting my Bluetooth
| headset first allowed it to suspend properly (it seems like the
| wifi chip kept it awake before making like a yo-yo back and
| forth).
| zonotope wrote:
| Me 3 and it's maddening. I've posted this [1] on the Arch
| Linux forums but haven't found a fix. I've seen it mentioned
| around the web that it's a known bug, but I have yet to find
| a bug report
|
| 1: https://bbs.archlinux.org/viewtopic.php?id=276064
| csdvrx wrote:
| It's an open question whether Bluetooth should prevent
| sleep, or let sleep go through.
|
| On a well done "modern suspend/suspend to idle", I use
| disconnected modern sleep + Windows Media Player on Windows
| 11 to listen to songs on my bluetooth headphones for a cost
| of about 1% of the battery per hour (as measured and
| plotted with powercfg) which can come down to about half of
| it, 0.5%/h when not using Bluetooth.
|
| I wouldn't want Bluetooth to prevent sleep (a 1% drop per
| hour is better than not sleeping!)
|
| I also wouldn't want sleep to prevent me from using my
| Bluetooth headset (the difference between 0.5% and 1% is
| significant, but it doesn't matter much in practice if my
| computer can be usable in the morning)
|
| This is one of the rare examples of "modern suspend"
| delivering on its promises, and blowing the good old ACPI
| S3 away: I never saw a drop of battery <5% on S3 suspend-
| to-ram unless it also involved S4 in a hybrid sleep of
| "ACPI S3 suspend-to-ram then S4 suspend-to-disk after a
| while or when I run out of power whichever comes first"!
| mc4ndr3 wrote:
| with_resource(mutex file handle etc etc *o, func callback) is a
| safer pattern, for this reason. Thank you, Python
|
| (with_resource is a generic example. Search for such capabilities
| in your library.)
|
| Computers exist to automate. Let the computer remember to close
| resources upon scope end.
|
| Go's defer is also helpful in this regard.
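A minimal Python sketch of the scoped pattern described above, applied to a set_fs()-style "switch a global mode, then switch it back" operation. The `_address_limit` variable and the function names are hypothetical stand-ins; nothing here touches the real kernel.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the per-task address limit set_fs() changed.
_address_limit = "USER"


def current_limit() -> str:
    """Return the simulated address limit."""
    return _address_limit


@contextmanager
def address_limit(new_limit: str):
    """Temporarily switch the simulated limit, always restoring the old one."""
    global _address_limit
    saved = _address_limit
    _address_limit = new_limit
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike, so the restore
        # cannot be forgotten on an early-return or error path.
        _address_limit = saved


if __name__ == "__main__":
    with address_limit("KERNEL"):
        print("inside:", current_limit())
    print("after:", current_limit())
```

The point is only that the restore step is tied to scope exit rather than to the programmer remembering a second call.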
| viraptor wrote:
| That was my first thought - would they leave the set_fs
| solution in place if C had a native solution for resource
| blocks / RAII+drop.
| shaded-enmity wrote:
| A simple policy that both set_fs() calls need to happen
| within the same function body with corresponding CI test
| based on AST/DWARF inspection would have also prevented it.
| Do you really want to rely on stack unwinding/destructors for
| security sensitive code when stack is usually the first thing
| that gets controlled by the attacker? Exception handling
| (SEH) on Windows is an exploitation vector of its own.
| viraptor wrote:
| I'm talking about the general idea not specific
| implementation. Having something happen at function/block
| exit doesn't mean a runtime configurable behaviour. If you
| don't have exceptions, it's pretty easy to statically
| compile that behaviour and guarantee it rather than rely on
| checks.
| shaded-enmity wrote:
| You still need to implement a CI rule so that I don't
| just call `set_fs()` without using
| `preferred_syntax(set_fs)`, don't you?
| viraptor wrote:
| We're talking a theoretical language here, so... maybe?
| It could be also that set_fs is usable only in the
| preferred_syntax mode.
| ars wrote:
| It's not that hard to do in C. Make a macro that takes a
| function, it runs:
|
|     set_fs(KERN)
|     *function_pointer()
|     set_fs(USER)
|
| They did it before, so they knew about that option, and if
| they didn't do it, they must have had a reason.
| simias wrote:
| It's true, unfortunately the lack of lambdas in C makes
| this pattern cumbersome. You could achieve it with a macro
| instead, I suppose.
| SolarNet wrote:
| Except that things are weird in a kernel, one cannot
| guarantee that language behavior is enforceable. Even with a
| RAII/drop style solution there is a possibility that kernel
| weirdness will prevent it from running correctly.
|
| The issue with forgetting function calls like this are the
| edge cases, and the kernel has a lot more of those.
| teraflop wrote:
| Even if you don't want to directly support splice() for every
| possible filesystem or file descriptor, I don't understand why it
| couldn't be "emulated" in those cases, by having the kernel just
| pretend it was given a series of read()/write() calls. That might
| not be as efficient, but surely it would be better than just
| breaking completely.
| amluto wrote:
| It used to. But this emulation worked by doing set_fs() so that
| the internal (outdated) read/write implementation could follow
| a pointer into _kernel_ memory, and that facility is gone now.
| teraflop wrote:
| Ah, I think I get it. So the problem isn't just that these
| drivers don't specifically support splice() -- it's that they
| _also_ only support reading/writing with userspace buffers,
| and set_fs() was just a hack around that limitation that is
| no longer supported. That's kind of unfortunate.
| amluto wrote:
| Correct. The modern interface supports reading and writing
| from a rather generic concept of a buffer, and drivers that
| support that are usable with splice().
| londons_explore wrote:
| Perhaps because some of the semantics of a spliced pipe aren't
| exactly the same as read() followed by write()?
| derefr wrote:
| The point of splice(2) is to be a fast-path; it's not in POSIX,
| so software that wants to guarantee portability has to write
| the slow path too, and only use splice(2) if it's present.
|
| Such a blind/naive compatibility shim, would likely be much
| slower than whatever hand-written slow path the developer has
| in place. If the whole point of a call is to be a
| fast/efficient version of something else, then the call is
| _breaking its semantics_ if it isn't more fast/efficient than
| the alternative.
|
| Because of this, it's better to just make autoconf et al
| _detect splice(2) as absent for the given use-case_ (and fall
| back to the hand-rolled portable slow path), rather than
| relying on a naive kernel shim.
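A rough sketch of the "fast path with a hand-written slow path" structure described above, done at runtime rather than at configure time. It assumes Python 3.10+ on Linux for `os.splice` and treats EINVAL/ENOSYS as "splice is not usable for this pair of descriptors"; `copy_fd` is a made-up name, and real software may want to handle more error cases.

```python
import errno
import os


def copy_fd(src_fd: int, dst_fd: int, count: int) -> int:
    """Copy up to `count` bytes from src_fd to dst_fd; return bytes copied."""
    copied = 0
    splice = getattr(os, "splice", None)  # absent on older Pythons / non-Linux
    if splice is not None:
        try:
            while copied < count:
                moved = splice(src_fd, dst_fd, count - copied)
                if moved == 0:  # end of input
                    return copied
                copied += moved
            return copied
        except OSError as exc:
            # Anything other than "not supported here" is a real error.
            if exc.errno not in (errno.EINVAL, errno.ENOSYS):
                raise
    # Portable slow path: plain read()/write() through a userspace buffer.
    while copied < count:
        chunk = os.read(src_fd, min(65536, count - copied))
        if not chunk:
            break
        view = memoryview(chunk)
        while view:
            written = os.write(dst_fd, view)
            view = view[written:]
        copied += len(chunk)
    return copied
```

Unlike a build-time check, this decides per call, so it still works when the binary is built on one kernel and run on another, or when only some sources (a particular filesystem or device) reject splice().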
| bonzini wrote:
| You cannot in general use autoconf to detect runtime
| behavior.
| baisq wrote:
| Doesn't that happen often anyway?
| bonzini wrote:
| Nope. You can run programs to detect some behavior, for
| example printing sizeof(void *) or checking how some
| rounding is performed. Those tests of course may _affect_
| the program behavior at runtime, but what they test is
| still the compilation environment.
| derefr wrote:
| Many of autoconf's more fraught checks work by attempting
| to compile and run little C programs to see what happens.
| So if splice(2) no longer works when the source is e.g.
| /dev/random, then autoconf could attempt to compile+run a
| program that splice(2)s from /dev/random to a known-working
| sink, and see whether that program SIGSEGVs or not when
| run; and use that to decide whether to _allow_ an
| --enable-splice configure flag to be passed, vs. bailing out
| if such a flag is passed.
|
| Compare and contrast: microarchitectural optimizations. You
| can ask autoconf to enable them, but there's the implicit
| assumption that if you're doing so, the build environment
| _is_ going to be the runtime environment.
|
| Autoconf can only protect you so far; it has to assume that
| if you're passing power-user-only optimization
| flags, it's because you have a good plan for what you're
| going to be doing with the resulting binary.
|
| Note that this is part of why embedded-system cross-
| compilation toolchains aren't just GCC/LLVM plugin packages
| you can install via your package manager -- they're quite
| complex, often embedding e.g. a target-machine emulator for
| autoconf's compiled detections to be run inside of, so that
| it gets the right answers.
| rootw0rm wrote:
| I don't think they're suggesting runtime use, but rather
| building software after the splice(2) change. That's how I
| read it at least.
| vlovich123 wrote:
| Aside from the "where you build isn't where you run
| problem", there's the "this syscall doesn't work when
| running on filesystem X or accessing /dev/urandom".
|
| Not only is the autoconf solution not fixing the problem,
| it's placing a massive undue burden on developers. Linus
| has been inconsistent here. Telemetry would have been
| helpful here in aiding this work (ie support it with the
| slow path but report the event so that distro maintainers
| could provide feedback on broken paths). Once you think
| you've eliminated the long tail of issues, then remove
| and see if anything remains broken that telemetry didn't
| catch.
| bonzini wrote:
| It would break if compiled on old kernel and run on new
| kernel. When CI is containerized the kernel might not
| even have anything to do with the distro that you're
| building on.
| [deleted]
| manchmalscott wrote:
| The only time I've had a kernel update break things was a weird
| Proxmox bug where if you booted with the (at the time) latest
| kernel, it would fail to start the first VM, and nothing you did
| from the UI or the command line could touch it without timing
| out. Rebooting to the previous kernel release and it just started
| working again with no other changes.
|
| That definitely broke my assumption that kernel updates were well
| vetted for regressions.
| bityard wrote:
| "Don't break userspace" is more like a _goal_ than a law.
|
| The kernel is an extremely complex piece of code, the
| developers can't be asked to test every kernel release against
| every piece of userspace software on every hardware
| configuration. That's one of the reasons that new code and
| significant changes require a bunch of mailing list discussion,
| reviews, and signing-off.
|
| Also, Proxmox ships with its own supported kernel, it shouldn't
| be a big surprise that you ran into issues while straying off
| the beaten path.
| londons_explore wrote:
| > The kernel is an extremely complex piece of code,
|
| With no regression tests, nearly no unit tests, and lots of
| bits of supported hardware that nobody on the Dev team uses
| day to day...
|
| It's a miracle that quality is as high as it is.
| bonzini wrote:
| There are plenty of tests, just not in Linus's tree. KVM
| has not one but three suites of unit and integration tests,
| for example, but only 60ish tests are in the kernel's
| tools/testing/selftests directory.
| bombcar wrote:
| I suspect kernel updates aren't vetted that well in general;
| the "don't break user space" is a ruling from Linus about what
| not to _do_ - usually with regards to the user space API
| (don't remove or change things that already exist).
|
| And this article is an example of where the decision was made
| _to_ break user space.
| josefx wrote:
| Is it even a hard break? It seems to be more of a short lived
| regression while patches for filesystem drivers are still
| coming in to restore support and I don't think Linux ever
| guaranteed a stable driver API.
| simlevesque wrote:
| > kernel updates were well vetted for regressions.
|
| That seems like the kind of thing that is easier said than
| done.
| caslon wrote:
| > The only time I've had a kernel update break things
|
| > That definitely broke my assumption that kernel updates were
| well vetted for regressions.
|
| Shouldn't that be evidence for the contrary? One break in a
| decade or two of use isn't so bad. It's actually pretty good.
| yjftsjthsd-h wrote:
| > That definitely broke my assumption that kernel updates were
| well vetted for regressions.
|
| I would argue that it's also a failure by the distro; unless
| there was something special about your exact setup that would
| have made the bug not show up in testing, I would argue that
| proxmox should have been testing updates to catch that kind of
| problem before users noticed.
| 867-5309 wrote:
| I've just had that with the most recent Proxmox update and also
| had no option but to regress
| Fnoord wrote:
| I had the issue as well but I managed to fix it by changing
| my boot parameters (exact change differs per bootloader and
| hardware). Now I happily boot with .15 series (up from .13).
|
| I expect some more users of Proxmox given Broadcom taking
| over VMware (e.g. I'd like to migrate away from ESXi to Proxmox
| as I don't trust Broadcom). Hopefully it does the product and
| community well.
| loeg wrote:
| > But it is true that this type of episode makes the kernel's "no
| regressions" rule look a bit more like just a guideline. It does
| not take too many of those to create breakage to the project's
| reputation that is hard to splice back together.
___________________________________________________________________
(page generated 2022-05-26 23:00 UTC)