[HN Gopher] A deep dive into Linux's new mseal syscall
       ___________________________________________________________________
        
       A deep dive into Linux's new mseal syscall
        
       Author : todsacerdoti
       Score  : 141 points
       Date   : 2024-10-25 14:58 UTC (8 hours ago)
        
 (HTM) web link (blog.trailofbits.com)
 (TXT) w3m dump (blog.trailofbits.com)
        
       | ykonstant wrote:
       | Interesting. The article mentions "spicy discussions" in the
       | kernel mailing list. Is there any insider who can summarize
       | objections and concerns? I tend to avoid reading the mailing list
       | itself since it can get too spicy, and my headaches are already
       | strong enough!
       | 
       | The mechanism itself seems reasonable, but I am surprised that
       | something like this doesn't already exist in the kernel.
        
         | ziddoap wrote:
         | Not sure if there was much more to it than the thread linked
         | to, but it was basically Linus being Linus. He said stuff that
         | made sense in a pretty blunt fashion.
         | 
         | There were flags proposed that allowed the seal to be ignored.
         | 
         | > _So you say "we can't munmap in this *one* place, but all
         | others ignore the sealing"._
         | 
         | Later was the spice.
         | 
         | > _And dammit, once something is sealed, it is SEALED. None of
         | this crazy "one place honors the sealing, random other places
         | do not"._
         | 
         | And later, even spicier, Linus says that seals cannot be
         | ignored and that is non-negotiable. Any further suggestions to
         | ignore a seal via a flag would result in the person being added
         | to Linus' ignore list. (He, of course, said this with some
         | profanities and capitals sprinkled in.)
        
           | js2 wrote:
           | Wasn't just Linus. Earlier, from Theo de Raadt:
           | 
           | > I don't think you understand the problem space well enough
           | to come up with your own solution for it. I spent a year on
           | this, and ship a complete system using it. You are asking
           | such simplistic questions above it shocks me.
           | 
           | https://lwn.net/ml/linux-kernel/95482.1697587015@cvs.openbsd...
           | 
           | Via https://lwn.net/Articles/948129/
        
             | 0xbadcafebee wrote:
             | Not a great perspective... "It took me a year [or more] to
             | understand this. The fact that you don't understand it
             | shocks me." Dude, not everybody's as smart or experienced
             | as you. Here's an opportunity to be a mentor.
        
               | nativeit wrote:
               | > Google has no shortage of experienced developers who
               | could have reviewed this submission before it was posted
               | publicly, but that does not appear to have happened, with
               | the result that a relatively inexperienced developer was
               | put into a difficult position. Feedback on the proposal
               | was resisted rather than listened to. The result was an
               | interaction that pleased nobody.
        
               | refulgentis wrote:
               | > Google has no shortage of experienced developers who
               | could have reviewed this submission before it was posted
               | publicly,
               | 
               | You'd be surprised. My understanding from folks on Chrome
               | OS is that they've already shed most, if not all, of the
               | most experienced old hands. (n.b. Chrome OS was absorbed
               | by Android and development on it has, by and large,
               | ceased, according to the same sources directly, and
               | indirectly via Blind.)
        
               | vlovich123 wrote:
               | My reading of this is a lot more generous to the
               | maintainers and a lot less sympathetic to the author than
               | yours is. The maintainers highlighted the problems and
               | the author came back basically with "I don't believe you
               | so let's go with my approach to stay more general" - it's
               | one thing to disagree, it's another to straight up not
               | acknowledge the feedback. The author ate a lot of very
               | senior people's time arguing instead of listening to them
               | and learning from their experience, and that was
               | justifiably frustrating, forcing much more direct
               | feedback. The kind of mistake the author made - having to
               | enforce at each individual syscall level instead of it
               | being a protection on the memory itself enforced on all
               | accesses - indicates a poor understanding of how to think
               | about security and build security APIs which is a problem
               | when you're proposing a security API.
               | 
               | It's particularly impressive how misguided the patch is
               | given that they took inspiration from the OpenBSD API
               | implementation, changed both API & implementation, & then
               | argued with both Linus and Theo who started Linux &
               | OpenBSD respectively and were trying to give direct
               | feedback about how OpenBSD is different and why it took
               | the approach it did.
               | 
               | Hopefully the author has taken the more forceful feedback
               | as a learning opportunity about listening to feedback
               | when the people giving it to you have a lot more
               | experience. Or their team is coaching them on what went
               | wrong, now that this has become so visible.
               | 
               | From Matthew Wilcox who is another senior Linux
               | maintainer:
               | 
               | > I concur with Theo & Linus. You don't know what you're
               | doing. I think the underlying idea of mimmutable() is
               | good, but how you've split it up and how you've
               | implemented it is terrible.
               | 
               | It's delivered directly and bluntly but it's not mean or
               | personal. The author proposed a bad patch & argued from a
               | position of ignorance.
        
               | terribleperson wrote:
               | The time of the people who maintain the free and open
               | source software we rely on is not free. From the people
               | I've talked to, maintainers of successful projects are
               | overworked and underappreciated.
               | 
               | Mentorship from one of those people would be valuable,
               | but arguing with them about the implementation of
               | something you don't understand isn't how you get that
               | mentorship.
        
         | lathiat wrote:
         | This may help a bit: https://lwn.net/Articles/948129/
        
           | ykonstant wrote:
           | Very nice, thanks!
           | 
           | Edit: I always find it funny that these articles on the
           | mailing list tend to read like a sports announcer describing
           | a boxing match!
        
         | greenavocado wrote:
         | https://lwn.net/ml/linux-kernel/7071.1697661373@cvs.openbsd....
         | 
         | From: Theo de Raadt <deraadt-AT-openbsd.org>
         | To:   Jeff Xu <jeffxu-AT-google.com>
         | 
         | > On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox
         | > <willy@infradead.org> wrote:
         | > >
         | > > Let's start with the purpose.  The point of
         | > > mimmutable/mseal/whatever is to fix the mapping of an
         | > > address range to its underlying object, be it a particular
         | > > file mapping or anonymous memory.  After the call succeeds,
         | > > it must not be possible to make any address in that virtual
         | > > range point into any other object.
         | > >
         | > > The secondary purpose is to lock down permissions on that
         | > > range.  Possibly to fix them where they are, possibly to
         | > > allow RW->RO transitions.
         | > >
         | > > With those purposes in mind, you should be able to deduce
         | > > for any syscall or any madvise(), ... whether it should be
         | > > allowed.
         | > >
         | > I got it.
         | >
         | > IMO: The approaches mimmutable() and mseal() took are
         | > different, but we all want to seal the memory from attackers
         | > and make the linux application safer.
         | 
         | I think you are building mseal for chrome, and chrome alone.
         | 
         | I do not think this will work out for the rest of the
         | application space because
         | 
         | 1) it is too complicated
         | 2) experience with mimmutable() says that applications don't do
         |    any of it themselves, it is all in execve(), libc
         |    initialization, and ld.so.
         | 
         | You don't strike me as an execve, libc, or ld.so developer.
        
         | greenavocado wrote:
         | From: Matthew Wilcox <willy-AT-infradead.org>
         | To:   Jeff Xu <jeffxu-AT-google.com>
         | 
         | ...
         | 
         | Yes, thank you for demonstrating that you have no idea what you
         | need to block.
         | 
         | > It is practical to keep syscall extentable, when the business
         | > logic is the same.
         | 
         | I concur with Theo & Linus.  You don't know what you're doing.
         | I think the underlying idea of mimmutable() is good, but how
         | you've split it up and how you've implemented it is terrible.
         | 
         | ...
        
       | metadat wrote:
       | Will it be possible to override / disable the `mseal' syscall
       | with the LD_PRELOAD trick?
        
         | the8472 wrote:
         | https://lwn.net/Articles/978010/ says there'll be a glibc
         | tunable
        
         | eska wrote:
         | _mseal digresses from prior memory protection schemes on Linux
         | because it is a syscall tailored specifically for exploit
         | mitigation against remote attackers seeking code execution
         | rather than potentially local ones looking to exfiltrate
         | sensitive secrets in-memory._
         | 
         | If a remote attacker can change the local environment then they
         | must have already broken into your system.
        
         | chucky_z wrote:
         | You can override the mseal call wrapper but not the syscall
         | itself.
         | 
         | This is an interesting thought so I looked it up and this is
         | how (all?) preload syscall overrides work. You override the
         | wrapper but not the syscall itself, so if you're doing direct
         | syscalls I don't think that can be overridden. Technically you
         | could override the syscall function itself maybe?
        
           | jmmv wrote:
           | > Technically you could override the syscall function itself
           | maybe?
           | 
           | But then you can just write assembly code to issue the system
           | call.
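
To make the point above concrete, here is a minimal sketch of a system call issued with inline assembly, so that no libc symbol (neither an mseal wrapper nor syscall()) is involved and LD_PRELOAD has nothing to interpose. It assumes x86_64 Linux; the helper names raw_syscall3 and raw_mseal are hypothetical, and the number 462 for mseal is an assumption based on Linux 6.10 headers.

```c
#include <stddef.h>

#ifndef MSEAL_NR
#define MSEAL_NR 462 /* assumption: mseal's syscall number on Linux 6.10+ */
#endif

/* Issue a three-argument system call without calling into libc.
 * x86_64 only. On Linux a negative return value is -errno. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "0"(nr), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return ret;
#else
    (void)nr; (void)a1; (void)a2; (void)a3;
    return -38; /* -ENOSYS: this sketch only covers x86_64 */
#endif
}

/* Hypothetical helper: seal [addr, addr + len) with no libc involvement.
 * The flags argument is passed as 0. */
static long raw_mseal(void *addr, size_t len)
{
    return raw_syscall3(MSEAL_NR, (long)addr, (long)len, 0);
}
```

A preloaded library can still replace any exported function the program calls, but it cannot see an instruction sequence like this one; intercepting it takes a mechanism outside the process's symbol table, such as ptrace or seccomp.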
        
         | Dwedit wrote:
         | Probably not LD_PRELOAD. It would need to be an imported
         | function in order for LD_PRELOAD to have any effect. A raw
         | syscall would not be interceptable that way.
         | 
         | Discussion about intercepting linux syscalls:
         | https://stackoverflow.com/questions/69859/how-could-i-interc...
         | 
         | But building your own patched kernel that pretends that mseal
         | works would be the simplest way to "disable" that feature.
         | Programs that use mseal could still do sanity checks to see if
         | mseal actually works or not. Then a compromised kernel would
         | need secret ways to disable mseal after it has been applied, to
         | stop the apps from checking for a non-functional mseal.
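
The kind of sanity check described above could look roughly like the following sketch: map a scratch page, seal it, then try to change its protection and expect the kernel to refuse with EPERM. The function name mseal_is_enforced is hypothetical, the fallback number 462 is an assumption for headers that predate mseal, and the scratch page is deliberately leaked because a sealed mapping cannot be unmapped.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_mseal
#define SYS_mseal 462 /* assumption: older headers do not define it */
#endif

/* Returns 1 if a sealed page really resists mprotect(),
 *         0 if mseal reported success but the page could still be changed,
 *        -1 if mseal is unavailable or the scratch mapping failed. */
static int mseal_is_enforced(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (syscall(SYS_mseal, p, page, 0UL) != 0) {
        munmap(p, page); /* not sealed, so unmapping is still allowed */
        return -1;
    }

    /* With a real seal in place, changing the protection must fail. */
    if (mprotect(p, page, PROT_READ) == 0)
        return 0;
    return errno == EPERM ? 1 : 0;
}
```

A return value of 0 would be the suspicious case: the call claimed success, yet the mapping was still modifiable.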
        
           | jandrese wrote:
           | I'm not sure what protection you could expect on any system
           | where the kernel has been replaced by the attacker. Sure they
           | can bypass mseal, but they are also bypassing _all other
           | security_ on the box.
        
             | Dwedit wrote:
             | Two different considerations for when you'd want to deny
             | memory to other processes:
             | 
             | Protecting against outside attackers
             | 
             | Digital Rights Management
             | 
             | Faking "mseal" is something you might intentionally do if
             | you are trying to break DRM, and something you would not
             | want to do if you are trying to defend against outside
             | attackers.
        
         | cataphract wrote:
         | Depends whether the program calls into libc or inlines the
         | syscalls, I imagine. Though you could use other mechanisms like
         | seccomp.
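
As an illustration of the seccomp route, the sketch below installs a classic BPF filter that makes mseal fail with ENOSYS whether the program uses a libc wrapper or a raw syscall. It is a simplified example under stated assumptions: the function name make_mseal_fail_with_enosys is hypothetical, 462 is assumed as the syscall number when the headers lack it, and the filter skips the architecture check that a production filter should include.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h> /* syscall(), for callers that want to probe the result */

#ifndef SYS_mseal
#define SYS_mseal 462 /* assumption: older headers do not define it */
#endif

/* Install a filter for the calling thread (inherited across fork and exec):
 * mseal returns -1 with errno ENOSYS, everything else is allowed.
 * Returns 0 on success, -1 on failure. */
static int make_mseal_fail_with_enosys(void)
{
    struct sock_filter filter[] = {
        /* load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if it is mseal, fall through to the ERRNO return, else skip it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mseal, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Because the filter has to be installed before the target code runs, this falls under the same "early control over the process" scenario as the other overrides discussed in this subthread.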
        
         | monocasa wrote:
         | There's a bunch of ways to override it if you have early
         | control over the process. Another example: ptrace the
         | executable, watch the system calls, and skip over any
         | mseal(2)s.
         | 
         | This system call is meant for a different threat model than
         | "attacker has early access to your process before it started
         | initializing".
        
       | unwind wrote:
       | Meta: the mseal() prototype in the article needs some editing; it
       | is not syntactically correct as shown now. The first argument is
       | shown as
       | 
       |     unsigned start addr
       | 
       | But should probably be
       | 
       |     unsigned long start_addr
        
         | hifromwork wrote:
         | Seems to be OK now:
         | 
         |     int mseal(unsigned long start, size_t len, unsigned long flags)
        
       | throw0101a wrote:
       | mseal() and what comes after, October 20, 2023:
       | https://lwn.net/Articles/948129/
       | 
       | mseal() gets closer, January 19, 2024:
       | https://lwn.net/Articles/958438/
       | 
       | Memory sealing for the GNU C Library, June 12, 2024:
       | https://lwn.net/Articles/978010/
        
       | westurner wrote:
       | - "Memory Sealing "Mseal" System Call Merged for Linux 6.10"
       | (2024) https://news.ycombinator.com/item?id=40474510#40474551 :
       | 
       | > _How should CPython support the mseal() syscall?_
        
       | xterminator wrote:
       | OpenBSD has had it since forever [1]. Why is such an obvious
       | feature only reaching Linux now?
       | 
       | [1]https://man.openbsd.org/mimmutable.2
        
         | gilgamesh3 wrote:
         | >OpenBSD has had it since forever.
         | 
         | OpenBSD introduced mimmutable in OpenBSD 7.3, which was
         | released 10/4/2023 (for US people, it would be 4/10/2023), so
         | it isn't "forever".
         | 
         | Meanwhile Linux and FreeBSD have had "memfd_create" forever while
         | OpenBSD doesn't have anonymous files and relies on "shm_open".
        
       | MBCook wrote:
       | A question about using this call:
       | 
       | Chrome is the one who wants it. But you can't unmap sealed pages
       | because an attacker could then re-map them with different flags.
       | 
       | So that basically means this can never be used on pages allocated
       | at runtime unless you intend to hold them for the entire process
       | lifetime, right?
       | 
       | Doesn't that mean it can't be used for all the memory used by,
       | say, the JS sandbox which would be a very very tempting target?
       | 
       | Or is the idea that you deal with this by always running that
       | kind of stuff in a different process where you can seal the
       | memory and then you can just kill the process when you're done?
       | 
       | I'm not familiar with how Chrome manages memory/processes, so I'm
       | not exactly sure why this wouldn't be an issue.
       | 
       | Is this also the reason why the articles about this often mention
       | it's not useful to most programs (outside of how memory is set up
       | at process startup)?
        
         | PhilipRoman wrote:
         | >Doesn't that mean it can't be used for all the memory used by,
         | say, the JS sandbox which would be a very very tempting target?
         | 
         | Multiprocessing is an option here. I think chrome uses it
         | extensively, so that might be the play here. You need separate
         | processes for other stuff anyway, like isolation via
         | namespaces.
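
The process-per-task idea can be sketched as follows: a forked child maps and seals a region, confirms that munmap is refused, and then exits, at which point the kernel tears down the whole address space, sealed region included. This is only an illustration of the lifetime argument, not a description of how Chrome does it; the names sealed_worker and run_sealed_worker are hypothetical and 462 is an assumed fallback syscall number.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_mseal
#define SYS_mseal 462 /* assumption: older headers do not define it */
#endif

/* Runs in the child. Exit codes: 0 = sealed and munmap refused,
 * 2 = mseal unavailable, 1 = anything unexpected. */
static int sealed_worker(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    if (syscall(SYS_mseal, p, len, 0UL) != 0)
        return 2;

    /* ... sandboxed work using the sealed region would go here ... */

    /* The region cannot be released while the process is alive. */
    if (munmap(p, len) == 0)
        return 1;
    return errno == EPERM ? 0 : 1;
}

/* Fork a worker and return its exit code, or -1 on fork/wait failure.
 * The sealed memory disappears when the child exits. */
static int run_sealed_worker(size_t len)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(sealed_worker(len));

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_sealed_worker(65536) from a parent process yields 0 on a kernel that enforces sealing and 2 where mseal is not available.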
        
       | sim7c00 wrote:
       | i am sad operating systems need to have such calls implemented
       | while most modern (x86_64) architectures have so many features to
       | facilitate safe and sound programming and computing. legacy crap
       | and mentality, and trying to patch old systems built on paradigms
       | not matching the current world and knowledge rather than
       | rebuilding really put a brake on progress in computing, and put
       | literally billions at risk.
       | 
       | not to say these things aren't steps in the right direction, but
       | if you let go of current ideals on how operating systems work,
       | and take into account current systems, knowledge about them, and
       | knowledge about what people want from systems, you can envision
       | systems free from the burden and risks put on developers and
       | users today.
       | 
       | yes architecture bugs exist, but software hardly takes advantage
       | of current features truly, so arguing about architectural bugs is
       | a moot point. there are cheaper ways to compromise, and always will
       | be if things are built on shaky foundations
        
       ___________________________________________________________________
       (page generated 2024-10-25 23:00 UTC)