[HN Gopher] Process Creation in Io_uring
       ___________________________________________________________________
        
       Process Creation in Io_uring
        
       Author : semiquaver
       Score  : 58 points
       Date   : 2024-12-20 15:23 UTC (1 days ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | cperciva wrote:
       | The "clone the entire address space and then call exec" idiom is
       | indeed wildly inefficient -- that's why the horror which is vfork
       | was invented -- but I'm not convinced that putting everything
       | which sits between fork and execve into io_uring (or, as a
       | comment snarkily suggests, ebpf) is the solution. There's just
       | too many things userland might want to do.
       | 
       | I wonder if the best solution lies somewhere in the vicinity of
       | "fork but only copy a small part of the address space" -- rather
       | than copying the entire address space as in fork (only to use a
       | tiny portion and throw away the rest) or copying none of the
       | address space as in vfork (the paging tables are shared between
       | parent and child until exec). If we can identify what memory the
       | child will need to access before calling _exit or exec (say, "the
       | current function and its local variables"), then we could create
       | an address space with just a few paging table entries.
       | 
       | Kind of like the "zygote" forking model (early in the main
       | process lifetime, a zygote process gets forked off, and when the
       | main process wants another worker it asks the zygote to fork one
       | off) except that the "zygote" is more like an induced pluripotent
       | stem cell, having been reverted from an adult state.
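
A minimal sketch of the classic zygote model described above, assuming a Unix system with fork(). The names start_zygote and request_worker, and the one-byte request protocol over pipes, are illustrative assumptions rather than any real project's design. The worker body is a placeholder that exits immediately.

```python
import os


def start_zygote():
    """Fork a zygote early; return (zygote_pid, request_fd, reply_fd)."""
    req_r, req_w = os.pipe()   # parent -> zygote: one byte per request
    rep_r, rep_w = os.pipe()   # zygote -> parent: 8-byte worker pid
    pid = os.fork()
    if pid != 0:
        # Main process: keep only its own ends of the two pipes.
        os.close(req_r)
        os.close(rep_w)
        return pid, req_w, rep_r

    # Zygote: stays small and only forks workers on request.
    os.close(req_w)
    os.close(rep_r)
    while os.read(req_r, 1):           # empty read: parent closed the pipe
        worker = os.fork()
        if worker == 0:
            os._exit(0)                # placeholder for real worker code
        os.write(rep_w, worker.to_bytes(8, "little"))
        os.waitpid(worker, 0)          # reap so no zombie is left behind
    os._exit(0)


def request_worker(request_fd, reply_fd):
    """Ask the zygote to fork a worker and return the worker's pid."""
    os.write(request_fd, b"f")
    return int.from_bytes(os.read(reply_fd, 8), "little")
```

The point of the design is that workers are forked from the small zygote rather than from the main process, so the cost of each fork does not grow with the main process's address space.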
        
         | znpy wrote:
         | > Kind of like the "zygote" forking model (early in the main
         | process lifetime, a zygote process gets forked off, and when
         | the main process wants another worker it asks the zygote to
         | fork one off) except that the "zygote" is more like an induced
         | pluripotent stem cell, having been reverted from an adult
         | state.
         | 
         | interestingly enough, i thought of the same concept, except i
         | did not get to implement that (for a few reasons). is "zygote"
         | a term you made up or is it an established pattern?
        
           | cperciva wrote:
           | Well established pattern. See e.g. in Chromium:
           | https://chromium.googlesource.com/chromium/src/+/HEAD/docs/l...
           | 
           | (I don't think the Chromium developers invented it either,
           | it's just a convenient reference.)
        
             | bean-weevil wrote:
             | This paragraph surprised me:
             | 
             | > Furthermore it is the only reasonable way to keep a
             | reference to a binary and a set of shared libraries that
             | can be exec'ed. In the model used on Windows and Mac,
             | renderers are exec'ed as needed from the chrome binary.
             | However, if the chrome binary, or any of its shared
             | libraries are updated while Chrome is running, we'll end up
             | exec'ing the wrong version. A version x browser might be
             | talking to a version y renderer. Our IPC system does not
             | support this (and does not want to!).
             | 
             | I think the Chrome team overthought this. If you update
             | Firefox and try to perform an action which spawns a new
             | process, it just politely demands the user restart the
             | browser.
        
               | theamk wrote:
               | And I hate this; it is super inconvenient when auto-updates
               | are enabled. I am glad the Chrome authors went out of
               | their way to fix this.
               | 
               | (The other option would be to convince Linux
               | distributions to implement a special updater, but I am
               | sure implementing the zygote thing was easier.)
        
         | LegionMammal978 wrote:
         | That doesn't sound much different from regular vfork()? It
         | isn't that evil, you just need a small assembly shim (or if
         | you're courageous, a bit of massaging the compiler output) to
         | safely call another function with its own stack frame, as well
         | as some care to disable signal handlers in the child. It's
         | mostly for silly setuid-binary reasons that the libc people
         | tend to dislike it.
         | 
         | Also, there's no way that libc people would want to work with
         | the compiler people to locate the current stack frame to copy.
         | So you'd end up with an assembly shim with a definite stack
         | size anyway.
        
         | jcranmer wrote:
         | > if we can identify what memory the child will need to access
         | before calling _exit or exec (say, "the current function and
         | its local variables")
         | 
         | I mean, a good deal of the parameters you need for the relevant
         | syscalls are strings, which means it's not sufficient to copy
         | just the stack frame, but all the memory reachable from the
         | stack frame. Which is a nontrivial problem if you're assuming
         | C/C++-style code.
        
         | vacuity wrote:
         | Funnily enough, there are also embryonic processes, which go
         | along with the zygote naming scheme. It's a similar idea of a
         | small process specifically for forking, but a more
         | comprehensive model.
         | 
         | Described in these nicely written comments by others:
         | 
         | https://news.ycombinator.com/item?id=32794270
         | https://news.ycombinator.com/item?id=30510318
         | https://news.ycombinator.com/item?id=29697645
        
         | skissane wrote:
         | > I wonder if the best solution lies somewhere in the vicinity
         | of "fork but only copy a small part of the address space"
         | 
         | I think the best solution would be if every relevant syscall
         | took a process handle, so you can run it either in the current
         | process or in a non-started child process
         | 
         | That's not going to happen on Linux because it would be a
         | radical change to the Linux syscall API. But if one were
         | designing an OS from scratch today I think it would make sense
         | to do things that way.
        
           | IAmLiterallyAB wrote:
           | That is basically how Fuchsia handles it:
           | https://fuchsia.dev/fuchsia-src/reference/kernel_objects/pro...
        
         | PeterWhittaker wrote:
         | But Linux doesn't clone the entire address space: it copies the
         | page table, read-only (RO): if the child attempts to write,
         | then it uses copy-on-write (COW).
         | 
         | So if fork/clone is followed immediately by exec/execve/etc.,
         | there is minimal copying.
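
A small sketch of the observable semantics only, assuming a Unix system with fork(): after the fork, a write in the child is not visible in the parent. It does not show the page-table mechanics, and in CPython the interpreter's own bookkeeping also touches pages, so this is not a measurement of how much gets copied. The function name child_write_is_private is hypothetical.

```python
import os


def child_write_is_private():
    """Return (parent_view, child_view) of a buffer modified after fork."""
    data = bytearray(b"parent")
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        data[:] = b"child!"        # the write lands in the child's own copy
        os.write(w, bytes(data))   # report what the child now sees
        os._exit(0)
    os.close(w)
    child_view = os.read(r, 64)
    os.close(r)
    os.waitpid(pid, 0)
    return bytes(data), child_view
```

The parent still sees its original bytes while the child reports the modified ones, which is the behaviour copy-on-write has to preserve for both sides.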
        
           | sweetjuly wrote:
           | Surely it also marks the parent process' pages as COW too? If
           | only the child is RO but the parent still has RW mappings to
           | the same physical pages, writes from the parent will be
           | observed in the child, which is wrong for COW. You either
           | have to copy the pages immediately (in which case there's no
           | COW) or you have to make all mappings to the physical page
           | RO.
           | 
           | The implication of this is that even if you immediately
           | execve in the child, you still have to pay for the cost of
           | setting COW on the entire address space and then later
           | faulting on every single writable page in the parent process.
           | The performance impact might not be massive, but it's not
           | nothing.
        
             | toast0 wrote:
             | > You either have to copy the pages immediately (in which
             | case there's no COW) or you have to make all mappings to
             | the physical page RO.
             | 
             | One of these enhanced fork exec calls stops the parent
             | until the child execs. Then you don't need to touch the
             | parent page mappings or worry about concurrency. (Although
             | it's not ideal if the parent is threaded)
        
           | cperciva wrote:
           | Yes, that's what I said. Fork doesn't copy the _data_ but it
           | does copy the _address space_. For a large process (say, a
           | database with multiple GB of data) that's a lot of paging
           | tables -- many MB of them if you're using 4 kB pages.
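
A back-of-the-envelope sketch of the "many MB" figure above. It assumes x86-64-style paging (8-byte entries, so 512 entries per 4 kB table), that every page is actually mapped, and it counts only the lowest-level tables. The function name is hypothetical, and real kernels may avoid copying some of these tables.

```python
PAGE_SIZE = 4096   # 4 kB pages, as in the comment above
PTE_SIZE = 8       # assumption: 8-byte page-table entries (x86-64)


def leaf_page_table_bytes(mapped_bytes):
    """Estimate the size of the lowest-level page tables for a mapping."""
    pages = -(-mapped_bytes // PAGE_SIZE)          # ceiling division
    entries_per_table = PAGE_SIZE // PTE_SIZE      # 512 entries per table
    tables = -(-pages // entries_per_table)        # ceiling division
    return tables * PAGE_SIZE                      # each table fills a page
```

Under these assumptions each GiB of mapped memory needs about 2 MiB of leaf tables, so a process with 8 GiB mapped would carry roughly 16 MiB of them.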
        
         | thayne wrote:
         | I don't think adding this to io_uring is at all bad. But I
         | don't think it is enough to solve the problem, if for no other
         | reason than that it requires using the machinery of io_uring,
         | which adds quite a bit of complexity.
         | 
         | However, maybe I'm missing something, but it seems like Linux
         | already has functionality that could make spawning a process a
         | lot more efficient and thread-safe. My idea is basically to use
         | clone or clone3 to create a new process in a new thread group
         | that shares the original process's memory (that is, with
         | CLONE_VM but not CLONE_THREAD), and to pass a function pointer
         | to call (instead of returning in the child process) and a
         | heap-allocated stack for the child process to use.
         | 
         | Then there is no need to copy the address space, and you can do
         | more things to prep before calling exec, since other threads
         | can still release locks, you can write to memory, etc.
         | 
         | The downsides I see are that you wouldn't be able to safely
         | modify the current environment variables since that would
         | impact the parent process, and there might be some weirdness
         | with the child process having copies of file descriptors
         | instead of the originals. The first is easy to work around
         | though, and the latter probably wouldn't be an issue in most
         | cases.
         | 
         | Another thought I've had is that if there was a more efficient
         | single syscall for spawning a process that combined fork and
         | exec, even if it is a lot less flexible than fork/exec or the
         | io_uring equivalent, something simple could probably meet the
         | needs of most applications and benefit performance and safety
         | in the common case where you don't need complex setup before
         | calling execve.
        
           | remexre wrote:
           | Does the glibc clone wrapper not already do this?
        
       | 10000truths wrote:
       | I agree with Pavel that extending the clone syscall is a better
       | idea than this patch set. The flexibility that Josh and Gabriel
       | talk about seems wholly unnecessary. In every use of fork-(do
       | stuff)-exec I've ever seen, the below two observations remained
       | true:
       | 
       | 1. Everything needed in the "do stuff" part was known prior to
       | the call to fork
       | 
       | 2. Any failures in the "do stuff" part would scrap the child
       | process and report an error to the parent process
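
The two observations above match a common fork-(do stuff)-exec pattern, sketched here under stated assumptions: everything the child needs is passed in before the fork, and any failure is reported to the parent through a close-on-exec pipe. The function name fork_exec and the optional cwd step are illustrative, and the sketch relies on Python's os.pipe() creating non-inheritable descriptors.

```python
import os


def fork_exec(argv, cwd=None):
    """Fork, do setup in the child, exec argv; raise OSError on failure."""
    err_r, err_w = os.pipe()   # both ends are close-on-exec by default
    pid = os.fork()
    if pid == 0:
        try:
            os.close(err_r)
            if cwd is not None:
                os.chdir(cwd)              # the "do stuff" part
            os.execv(argv[0], argv)        # on success this never returns
        except OSError as exc:
            # Report the errno, then scrap the child.
            os.write(err_w, exc.errno.to_bytes(4, "little"))
        finally:
            os._exit(127)
    os.close(err_w)
    data = os.read(err_r, 4)   # empty read: exec succeeded and closed err_w
    os.close(err_r)
    if data:
        os.waitpid(pid, 0)     # reap the failed child
        code = int.from_bytes(data, "little")
        raise OSError(code, os.strerror(code))
    return pid
```

A successful exec closes the write end automatically, so the parent can tell "child is running the new program" from "setup or exec failed" with a single read.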
        
         | wbl wrote:
         | 3: the stuff has to be done in the child to avoid problems.
         | Like in shells.
        
       | PaulDavisThe1st wrote:
       | For now, I'd settle for an RT-safe way to create a new process
       | that then calls execve. AFAIK, this doesn't exist for Linux and
       | may not exist for any *nix kernel at this time (not sure about
       | this second part).
        
         | duskwuff wrote:
         | Darwin has a posix_spawn() syscall. I'm not sure if it's RT-
         | safe, but it is actually a syscall - it's not a wrapper for
         | vfork+execve like it is on Linux.
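
For reference, a minimal sketch of the posix_spawn() interface mentioned above, using Python's os.posix_spawn wrapper (Python 3.8+). The helper name spawn_and_capture is hypothetical and the example assumes /bin/sh exists. It says nothing about how a given libc or kernel implements the call, or about RT-safety.

```python
import os


def spawn_and_capture(argv):
    """Spawn argv with stdout redirected to a pipe; return (output, code)."""
    r, w = os.pipe()
    pid = os.posix_spawn(
        argv[0],
        argv,
        os.environ,
        # File actions replace the usual code between fork and exec:
        # make the pipe's write end the child's stdout.
        file_actions=[(os.POSIX_SPAWN_DUP2, w, 1)],
    )
    os.close(w)                      # the parent keeps only the read end
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:                # EOF: the child closed its stdout
            break
        chunks.append(chunk)
    os.close(r)
    _, status = os.waitpid(pid, 0)
    return b"".join(chunks), os.WEXITSTATUS(status)
```

The file_actions list covers the common "redirect a descriptor, then exec" case without any user code running in the child, which is the trade-off of flexibility for simplicity discussed in this thread.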
        
           | PaulDavisThe1st wrote:
           | Not RT-safe.
        
       ___________________________________________________________________
       (page generated 2024-12-21 18:01 UTC)