[HN Gopher] A deep dive into Linux's new mseal syscall
___________________________________________________________________
A deep dive into Linux's new mseal syscall
Author : todsacerdoti
Score : 141 points
Date : 2024-10-25 14:58 UTC (8 hours ago)
(HTM) web link (blog.trailofbits.com)
(TXT) w3m dump (blog.trailofbits.com)
| ykonstant wrote:
| Interesting. The article mentions "spicy discussions" in the
| kernel mailing list. Is there any insider who can summarize
| objections and concerns? I tend to avoid reading the mailing list
| itself since it can get too spicy, and my headaches are already
| strong enough!
|
| The mechanism itself seems reasonable, but I am surprised that
| something like this doesn't already exist in the kernel.
| ziddoap wrote:
| Not sure if there was much more to it than the thread linked
| to, but it was basically Linus being Linus. He said stuff that
| made sense in a pretty blunt fashion.
|
| There were flags proposed that allowed the seal to be ignored.
|
| > _So you say "we can't munmap in this *one* place, but all
| others ignore the sealing"._
|
| Later was the spice.
|
| > _And dammit, once something is sealed, it is SEALED. None of
| this crazy "one place honors the sealing, random other places
| do not"._
|
| And later, even spicier, Linus says that seals cannot be
| ignored and that is non-negotiable. Any further suggestions to
| ignore a seal via a flag would result in the person being added
| to Linus' ignore list. (He, of course, said this with some
| profanities and capitals sprinkled in.)
| js2 wrote:
| Wasn't just Linus. Earlier, from Theo de Raadt:
|
| > I don't think you understand the problem space well enough
| to come up with your own solution for it. I spent a year on
| this, and ship a complete system using it. You are asking
| such simplistic questions above it shocks me.
|
| https://lwn.net/ml/linux-kernel/95482.1697587015@cvs.openbsd...
|
| Via https://lwn.net/Articles/948129/
| 0xbadcafebee wrote:
| Not a great perspective... "It took me a year [or more] to
| understand this. The fact that you don't understand it
| shocks me." Dude, not everybody's as smart or experienced
| as you. Here's an opportunity to be a mentor.
| nativeit wrote:
| > Google has no shortage of experienced developers who
| could have reviewed this submission before it was posted
| publicly, but that does not appear to have happened, with
| the result that a relatively inexperienced developer was
| put into a difficult position. Feedback on the proposal
| was resisted rather than listened to. The result was an
| interaction that pleased nobody.
| refulgentis wrote:
| > Google has no shortage of experienced developers who
| could have reviewed this submission before it was posted
| publicly,
|
| You'd be surprised. My understanding from folks on Chrome
| OS is they've already shed most, if not all, of the
| most experienced old hands. (n.b. Chrome OS was absorbed
| by Android and development on it has, by and large, ceased
| according to the same sources directly, and indirectly via
| Blind.)
| vlovich123 wrote:
| My reading of this is a lot more generous to the
| maintainers and a lot less sympathetic to the author than
| yours is. The maintainers highlighted the problems and
| the author came back basically with "I don't believe you
| so let's go with my approach to stay more general" - it's
| one thing to disagree, it's another to straight up not
| acknowledge the feedback. The author ate a lot of very
| senior people's time arguing instead of listening to them
| and learning from their experience, and that was
| justifiably frustrating, forcing much more direct
| feedback. The kind of mistake the author made - having to
| enforce at each individual syscall level instead of it
| being a protection on the memory itself enforced on all
| accesses - indicates a poor understanding of how to think
| about security and build security APIs which is a problem
| when you're proposing a security API.
|
| It's particularly impressive how misguided the patch is
| given that they took inspiration from the OpenBSD API
| implementation, changed both API & implementation, & then
| argued with both Linus and Theo who started Linux &
| OpenBSD respectively and were trying to give direct
| feedback about how OpenBSD is different and why it took
| the approach it did.
|
| Hopefully the author has taken the more forceful feedback
| as a learning opportunity about listening to feedback
| when the people giving it to you have a lot more
| experience. Or their team is coaching them about what
| went wrong now that this became so visible to learn what
| they got wrong.
|
| From Matthew Wilcox who is another senior Linux
| maintainer:
|
| > I concur with Theo & Linus. You don't know what you're
| doing. I think the underlying idea of mimmutable() is
| good, but how you've split it up and how you've
| implemented it is terrible.
|
| It's delivered directly and bluntly but it's not mean or
| personal. The author proposed a bad patch & argued from a
| position of ignorance.
| terribleperson wrote:
| The time of the people who maintain the free and open
| source software we rely on is not free. From the people
| I've talked to, maintainers of successful projects are
| overworked and underappreciated.
|
| Mentorship from one of those people would be valuable,
| but arguing with them about the implementation of
| something you don't understand isn't how you get that
| mentorship.
| lathiat wrote:
| This may help a bit: https://lwn.net/Articles/948129/
| ykonstant wrote:
| Very nice, thanks!
|
| Edit: I always find it funny that these articles on the
| mailing list tend to read like a sports announcer describing
| a boxing match!
| greenavocado wrote:
| https://lwn.net/ml/linux-kernel/7071.1697661373@cvs.openbsd....
|
| From: Theo de Raadt <deraadt-AT-openbsd.org>
| To: Jeff Xu <jeffxu-AT-google.com>
|
| > On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox
| > <willy@infradead.org> wrote:
| > >
| > > Let's start with the purpose. The point of
| > > mimmutable/mseal/whatever is to fix the mapping of an
| > > address range to its underlying object, be it a particular
| > > file mapping or anonymous memory. After the call succeeds,
| > > it must not be possible to make any address in that virtual
| > > range point into any other object.
| > >
| > > The secondary purpose is to lock down permissions on that
| > > range. Possibly to fix them where they are, possibly to
| > > allow RW->RO transitions.
| > >
| > > With those purposes in mind, you should be able to deduce
| > > for any syscall or any madvise(), ... whether it should be
| > > allowed.
| > >
| > I got it.
| >
| > IMO: The approaches mimmutable() and mseal() took are
| > different, but we all want to seal the memory from attackers
| > and make the linux application safer.
|
| I think you are building mseal for chrome, and chrome alone.
|
| I do not think this will work out for the rest of the
| application space because
|
| 1) it is too complicated
| 2) experience with mimmutable() says that applications don't do
|    any of it themselves, it is all in execve(), libc
|    initialization, and ld.so.
|
| You don't strike me as an execve, libc, or ld.so developer.
| greenavocado wrote:
| From: Matthew Wilcox <willy-AT-infradead.org>
| To: Jeff Xu <jeffxu-AT-google.com>
|
| ...
|
| Yes, thank you for demonstrating that you have no idea what you
| need to block.
|
| > It is practical to keep syscall extentable, when the business
| > logic is the same.
|
| I concur with Theo & Linus. You don't know what you're doing.
| I think the underlying idea of mimmutable() is good, but how
| you've split it up and how you've implemented it is terrible.
|
| ...
| metadat wrote:
| Will it be possible to override / disable the `mseal' syscall
| with the LD_PRELOAD trick?
| the8472 wrote:
| https://lwn.net/Articles/978010/ says there'll be a glibc
| tunable
| eska wrote:
| _mseal digresses from prior memory protection schemes on Linux
| because it is a syscall tailored specifically for exploit
| mitigation against remote attackers seeking code execution
| rather than potentially local ones looking to exfiltrate
| sensitive secrets in-memory._
|
| If a remote attacker can change the local environment then they
| must have already broken into your system.
| chucky_z wrote:
| You can override the mseal call wrapper but not the syscall
| itself.
|
| This is an interesting thought so I looked it up and this is
| how (all?) preload syscall overrides work. You override the
| wrapper but not the syscall itself, so if you're doing direct
| syscalls I don't think that can be overridden. Technically you
| could override the syscall function itself maybe?
| jmmv wrote:
| > Technically you could override the syscall function itself
| maybe?
|
| But then you can just write assembly code to issue the system
| call.
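
A minimal sketch of the point above, assuming x86_64 Linux and GCC/Clang inline assembly. The helper names `raw_syscall3` and `raw_getpid` are made up for illustration, and `getpid` is used only as a harmless stand-in for any syscall number. On x86_64 no libc symbol is involved in the call, so there is nothing for LD_PRELOAD to interpose; on other architectures the sketch falls back to libc's `syscall()`, which could still be interposed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a 3-argument system call.
   Returns the raw kernel result (negative errno on failure). */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    /* x86_64 convention: number in rax, arguments in rdi, rsi, rdx.
       The syscall instruction clobbers rcx and r11. */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback: goes through libc's syscall(), so it is interposable. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -(long)errno : ret;
#endif
}

/* getpid issued without calling the libc getpid() wrapper. */
static long raw_getpid(void)
{
    return raw_syscall3(SYS_getpid, 0, 0, 0);
}
```

The same helper could be handed the mseal syscall number with an address, a length, and flags of 0; a preloaded library that only replaces wrapper functions would never see that call.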
| Dwedit wrote:
| Probably not LD_PRELOAD. It would need to be an imported
| function in order for LD_PRELOAD to have any effect. A raw
| syscall would not be interceptable that way.
|
| Discussion about intercepting linux syscalls:
| https://stackoverflow.com/questions/69859/how-could-i-interc...
|
| But building your own patched kernel that pretends that mseal
| works would be the simplest way to "disable" that feature.
| Programs that use mseal could still do sanity checks to see if
| mseal actually works or not. Then a compromised kernel would
| need secret ways to disable mseal after it has been applied, to
| stop the apps from checking for a non-functional mseal.
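
A rough sketch of the kind of sanity check described above, not taken from the article. It assumes 462 is the mseal syscall number (used only if the system headers do not define `SYS_mseal`), and `seal_is_enforced` is a hypothetical helper name. The idea is to seal a scratch page and then confirm that a later `mprotect` on it is refused with `EPERM`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal is syscall 462; older headers may not define it. */
#ifndef SYS_mseal
#define SYS_mseal 462
#endif

/* Hypothetical check.
   Returns  1 if the seal is enforced (mprotect fails with EPERM),
            0 if mseal reported success but mprotect still worked,
           -1 if the page could not be mapped or sealed at all. */
static int seal_is_enforced(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    void *p = mmap(NULL, len, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (syscall(SYS_mseal, p, len, 0UL) != 0) {
        munmap(p, len);          /* not sealed, so unmapping still works */
        return -1;
    }

    /* On a genuinely sealed mapping, permission changes are rejected. */
    if (mprotect(p, len, PROT_READ | PROT_WRITE) == -1 && errno == EPERM)
        return 1;

    return 0;                    /* seal was accepted but not honored */
}
```

A return value of 0 is the interesting case for this subthread: it would suggest something between the program and a real kernel is pretending that mseal succeeded.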
| jandrese wrote:
| I'm not sure what protection you could expect on any system
| where the kernel has been replaced by the attacker. Sure they
| can bypass mseal, but they are also bypassing _all other
| security_ on the box.
| Dwedit wrote:
| Two different considerations for when you'd want to deny
| memory to other processes:
|
| - Protecting against outside attackers
| - Digital Rights Management
|
| Faking "mseal" is something you might intentionally do if
| you are trying to break DRM, and something you would not
| want to do if you are trying to defend against outside
| attackers.
| cataphract wrote:
| Depends whether the program calls into libc or inlines the
| syscalls, I imagine. Though you could use other mechanisms like
| seccomp.
| monocasa wrote:
| There's a bunch of ways to override it if you have early
| control over the process. Another example: ptrace the
| executable, watch the system calls, and skip over any
| mseal(2)s.
|
| This system call is meant for a different threat model than
| "attacker has early access to your process before it started
| initializing".
| unwind wrote:
| Meta: the mseal() prototype in the article needs some editing; it
| is not syntactically correct as shown now. The first argument is
| shown as
|
|     unsigned start addr
|
| but should probably be
|
|     unsigned long start_addr
| hifromwork wrote:
| Seems to be OK now:
|
|     int mseal(unsigned long start, size_t len, unsigned long flags)
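
For readers who want to try the prototype quoted above, here is a small sketch that issues the call through `syscall(2)` rather than relying on a libc wrapper. It assumes 462 is the mseal syscall number when the headers lack `SYS_mseal`; `my_mseal` and `seal_one_page` are illustrative names, not part of any library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal is syscall 462; older headers may not define it. */
#ifndef SYS_mseal
#define SYS_mseal 462
#endif

/* Hypothetical wrapper mirroring the prototype quoted in the thread. */
static int my_mseal(unsigned long start, size_t len, unsigned long flags)
{
    return (int)syscall(SYS_mseal, start, len, flags);
}

/* Map one anonymous page and try to seal it.
   Returns 0 if the page was sealed, the kernel's errno if mseal failed
   (for example ENOSYS on kernels without it), or -1 if mmap failed. */
static int seal_one_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (my_mseal((unsigned long)p, len, 0) == 0)
        return 0;                /* page stays mapped from here on */

    int err = errno;
    munmap(p, len);              /* not sealed, so it can be released */
    return err;
}
```

The flags argument is passed as 0 here. The start address is page-aligned because it comes straight from `mmap`.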
| throw0101a wrote:
| mseal() and what comes after, October 20, 2023:
| https://lwn.net/Articles/948129/
|
| mseal() gets closer, January 19, 2024:
| https://lwn.net/Articles/958438/
|
| Memory sealing for the GNU C Library, June 12, 2024:
| https://lwn.net/Articles/978010/
| westurner wrote:
| - "Memory Sealing "Mseal" System Call Merged for Linux 6.10"
| (2024) https://news.ycombinator.com/item?id=40474510#40474551 :
|
| > _How should CPython support the mseal() syscall?_
| xterminator wrote:
| OpenBSD has had it since forever [1]. Why is such an obvious
| feature only reaching Linux now?
|
| [1]https://man.openbsd.org/mimmutable.2
| gilgamesh3 wrote:
| >OpenBSD has had it since forever.
|
| OpenBSD introduced mimmutable in OpenBSD 7.3, which was
| released 10/4/2023 (for US people, it would be 4/10/2023), so
| it isn't "forever".
|
| Meanwhile Linux and FreeBSD have had "memfd_create" forever while
| OpenBSD doesn't have anonymous files and relies on "shm_open".
| MBCook wrote:
| A question about using this call:
|
| Chrome is the one who wants it. But you can't unmap sealed pages
| because an attacker could then re-map them with different flags.
|
| So that basically means this can never be used on pages allocated
| at runtime unless you intend to hold them for the entire process
| lifetime, right?
|
| Doesn't that mean it can't be used for all the memory used by,
| say, the JS sandbox which would be a very very tempting target?
|
| Or is the idea that you deal with this by always running that
| kind of stuff in a different process where you can seal the
| memory and then you can just kill the process when you're done?
|
| I'm not familiar with how Chrome manages memory/processes, so I'm
| not exactly sure why this wouldn't be an issue.
|
| Is this also the reason why the articles about this often mention
| it's not useful to most programs (outside of how memory is set up
| at process startup)?
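
The constraint behind this question can be seen in a small experiment. This is a sketch under assumptions: 462 is taken as the mseal syscall number when the headers lack `SYS_mseal`, and `munmap_errno_after_seal` is a made-up helper name. It seals a runtime allocation and then reports what `munmap` says about it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mseal is syscall 462; older headers may not define it. */
#ifndef SYS_mseal
#define SYS_mseal 462
#endif

/* Hypothetical experiment.
   Returns the errno from munmap() on a sealed page (EPERM expected),
   0 if munmap unexpectedly succeeded,
   or -1 if the page could not be mapped or sealed in the first place. */
static int munmap_errno_after_seal(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (syscall(SYS_mseal, p, len, 0UL) != 0) {
        munmap(p, len);          /* unsealed memory can still be freed */
        return -1;
    }

    /* Sealed: the mapping cannot be released until the process exits. */
    if (munmap(p, len) == 0)
        return 0;
    return errno;
}
```

Each successful call leaves one page mapped until the process exits, which is the cost being asked about: sealing suits mappings that are meant to live as long as the process does.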
| PhilipRoman wrote:
| >Doesn't that mean it can't be used for all the memory used by,
| say, the JS sandbox which would be a very very tempting target?
|
| Multiprocessing is an option here. I think chrome uses it
| extensively, so that might be the play here. You need separate
| processes for other stuff anyway, like isolation via
| namespaces.
| sim7c00 wrote:
| i am sad operating systems need to have such calls implemented
| while most modern (x86_64) architectures have so many features to
| facilitate safe and sound programming and computing. legacy crap
| and mentality, and trying to patch old systems built on paradigms
| not matching the current world and knowledge rather than
| rebuilding, really put a brake on progress in computing, and put
| literally billions at risk.
|
| not to say these things aren't steps in the right direction, but
| if you let go of current ideals on how operating systems work,
| and take into account current systems, knowledge about them, and
| knowledge about what people want from systems, you can envision
| systems free from the burden and risks put on developers and
| users today.
|
| yes architecture bugs exist, but software hardly takes advantage
| of current features truly, so arguing about architectural bugs is
| a moot point. there are cheaper ways to compromise, and always will
| be if things are built on shaky foundations
___________________________________________________________________
(page generated 2024-10-25 23:00 UTC)