[HN Gopher] Process Creation in Io_uring
___________________________________________________________________
Process Creation in Io_uring
Author : semiquaver
Score : 58 points
Date : 2024-12-20 15:23 UTC (1 days ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| cperciva wrote:
| The "clone the entire address space and then call exec" idiom is
| indeed wildly inefficient -- that's why the horror which is vfork
| was invented -- but I'm not convinced that putting everything
| which sits between fork and execve into io_uring (or, as a
| comment snarkily suggests, eBPF) is the solution. There are just
| too many things userland might want to do.
|
| I wonder if the best solution lies somewhere in the vicinity of
| "fork but only copy a small part of the address space" -- rather
| than copying the entire address space as in fork (only to use a
| tiny portion and throw away the rest) or copying none of the
| address space as in vfork (the page tables are shared between
| parent and child until exec), if we can identify what memory the
| child will need to access before calling _exit or exec (say, "the
| current function and its local variables") then we could create
| an address space with just a few page table entries.
|
| Kind of like the "zygote" forking model (early in the main
| process lifetime, a zygote process gets forked off, and when the
| main process wants another worker it asks the zygote to fork one
| off) except that the "zygote" is more like an induced pluripotent
| stem cell, having been reverted from an adult state.
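
A minimal sketch of the zygote pattern described above, in Python for brevity. The helper names (`start_zygote`, `run_via_zygote`) and the NUL-separated request format are made up for illustration; real implementations such as Chromium's are considerably more involved. The zygote here execs each worker and waits for it, purely so the example has a result to report.

```python
import os
import socket


def _exit_code(status):
    """Turn a waitpid() status into an exit code (negative for signals)."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


def _zygote_loop(sock):
    """Runs inside the zygote: fork and exec one worker per request."""
    while True:
        # Assumption: each request is small and arrives in a single recv().
        request = sock.recv(65536)
        if not request:  # the main process closed its end
            return
        argv = request.decode().split("\0")
        worker = os.fork()  # cheap as long as the zygote itself stays small
        if worker == 0:
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # only reached if exec failed
        _, status = os.waitpid(worker, 0)
        sock.sendall(str(_exit_code(status)).encode())


def start_zygote():
    """Fork the zygote; returns (zygote_pid, socket_to_zygote)."""
    main_end, zygote_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        try:
            main_end.close()
            _zygote_loop(zygote_end)
        finally:
            os._exit(0)
    zygote_end.close()
    return pid, main_end


def run_via_zygote(sock, argv):
    """Ask the zygote to run argv and return the worker's exit code."""
    sock.sendall("\0".join(argv).encode())
    return int(sock.recv(64).decode())
```

The point of calling `start_zygote()` early is that later spawns fork from the small helper, so their cost does not depend on how large the main process has grown since.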
| znpy wrote:
| > Kind of like the "zygote" forking model (early in the main
| process lifetime, a zygote process gets forked off, and when
| the main process wants another worker it asks the zygote to
| fork one off) except that the "zygote" is more like an induced
| pluripotent stem cell, having been reverted from an adult
| state.
|
| interestingly enough, i thought of the same concept, except i
| did not get to implement that (for a few reasons). is "zygote"
| a term you made up or is it an established pattern?
| cperciva wrote:
| Well established pattern. See e.g. in Chromium: https://chrom
| ium.googlesource.com/chromium/src/+/HEAD/docs/l...
|
| (I don't think the Chromium developers invented it either,
| it's just a convenient reference.)
| bean-weevil wrote:
| This paragraph surprised me: > Furthermore it is the only
| reasonable way to keep a reference to a binary and a set of
| shared libraries that can be exec'ed. In the model used on
| Windows and Mac, renderers are exec'ed as needed from the
| chrome binary. However, if the chrome binary, or any of its
| shared libraries are updated while Chrome is running, we'll
| end up exec'ing the wrong version. A version x browser
| might be talking to a version y renderer. Our IPC system
| does not support this (and does not want to!).
|
| I think the Chrome team overthought this. If you update
| firefox and try to perform an action which spawns a new
| process, it just politely demands the user restart the
| browser.
| theamk wrote:
| And I hate this; it is super inconvenient when auto-updates are
| enabled. I am glad the Chrome authors went out of their way to
| fix this.
|
| (The other option would be to convince Linux
| distributions to implement a special updater, but I am sure
| implementing the zygote thing was easier.)
| LegionMammal978 wrote:
| That doesn't sound much different from regular vfork()? It
| isn't that evil, you just need a small assembly shim (or if
| you're courageous, a bit of massaging the compiler output) to
| safely call another function with its own stack frame, as well
| as some care to disable signal handlers in the child. It's
| mostly for silly setuid-binary reasons that the libc people
| tend to dislike it.
|
| Also, there's no way that libc people would want to work with
| the compiler people to locate the current stack frame to copy.
| So you'd end up with an assembly shim with a definite stack
| size anyway.
| jcranmer wrote:
| > if we can identify what memory the child will need to access
| before calling _exit or exec (say, "the current function and
| its local variables")
|
| I mean, a good deal of the parameters you need for the relevant
| syscalls are strings, which means it's not sufficient to copy
| just the stack frame, but all the memory reachable from the
| stack frame. Which is a nontrivial problem if you're assuming
| C/C++-style code.
| vacuity wrote:
| Funnily enough, there are also embryonic processes, which go
| along with the zygote naming scheme. Similar idea of a small
| process specifically for forking, but this is a more
| comprehensive model.
|
| Described in these nicely written comments by others:
|
| https://news.ycombinator.com/item?id=32794270
| https://news.ycombinator.com/item?id=30510318
| https://news.ycombinator.com/item?id=29697645
| skissane wrote:
| > I wonder if the best solution lies somewhere in the vicinity
| of "fork but only copy a small part of the address space"
|
| I think the best solution would be if every relevant syscall
| took a process handle, so you can run it either in the current
| process or in a non-started child process
|
| That's not going to happen on Linux because it would be a
| radical change to the Linux syscall API. But if one were
| designing an OS from scratch today I think it would make sense
| to do things that way.
| IAmLiterallyAB wrote:
| That is basically how Fuchsia handles it
| https://fuchsia.dev/fuchsia-
| src/reference/kernel_objects/pro...
| PeterWhittaker wrote:
| But Linux doesn't clone the entire address space: it copies the
| page tables and maps the pages read-only; if the child attempts
| to write, it uses COW.
|
| So if fork/clone is followed immediately by exec/execve/etc.,
| there is minimal copying.
| sweetjuly wrote:
| Surely it also marks the parent process' pages as COW too? If
| only the child is RO but the parent still has RW mappings to
| the same physical pages, writes from the parent will be
| observed in the child, which is wrong for COW. You either
| have to copy the pages immediately (in which case there's no
| COW) or you have to make all mappings to the physical page
| RO.
|
| The implication of this is that even if you immediately
| execve in the child, you still have to pay for the cost of
| setting COW on the entire address space and then later
| faulting on every single writable page in the parent process.
| The performance impact might not be massive, but it's not
| nothing.
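
The semantics being argued about here can be observed from userland. The sketch below (the function name is invented for illustration) has the parent write to a buffer after fork and then checks whether the child can see that write. It shows only the visible behaviour, not how the kernel achieves it, and Python's own memory management sits on top of whatever the kernel does.

```python
import os


def child_sees_parent_write():
    """Return True if a write made by the parent after fork() is visible
    in the child. With correct fork semantics this should be False."""
    data = bytearray(b"before")
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(ready_w)
            os.read(ready_r, 1)  # block until the parent has done its write
            saw_write = bytes(data) == b"after!"
        finally:
            # Exit code 1 means "the child observed the parent's write".
            os._exit(1 if locals().get("saw_write") else 0)
    os.close(ready_r)
    data[:] = b"after!"      # parent writes to its copy of the page
    os.write(ready_w, b"x")  # then lets the child look
    os.close(ready_w)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 1
```

If only the child's mappings were read-only while the parent kept writable mappings to the same physical pages, this would return True, which is the failure mode the comment describes.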
| toast0 wrote:
| > You either have to copy the pages immediately (in which
| case there's no COW) or you have to make all mappings to
| the physical page RO.
|
| One of these enhanced fork exec calls stops the parent
| until the child execs. Then you don't need to touch the
| parent page mappings or worry about concurrency. (Although
| it's not ideal if the parent is threaded)
| cperciva wrote:
| Yes, that's what I said. Fork doesn't copy the _data_ but it
| does copy the _address space_. For a large process (say, a
| database with multiple GB of data) that's a lot of page
| tables -- many MB of them if you're using 4 kB pages.
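
A back-of-the-envelope version of that arithmetic, as a rough sketch. It assumes 8-byte page table entries (as on x86-64), counts only the last-level tables, and assumes every page of the mapping is populated. How much of this the kernel actually copies at fork time depends on the kind of mapping, so treat the result as an upper-bound estimate.

```python
PAGE_SIZE = 4 * 1024  # 4 kB pages, as in the comment above
PTE_SIZE = 8          # assumption: 8-byte entries, as on x86-64

GiB = 1024 ** 3
MiB = 1024 ** 2


def leaf_page_table_bytes(mapped_bytes, page_size=PAGE_SIZE, pte_size=PTE_SIZE):
    """Bytes of last-level page table needed to map `mapped_bytes`."""
    pages = -(-mapped_bytes // page_size)  # ceiling division
    return pages * pte_size


# 8 GiB of populated memory needs 2**21 entries of 8 bytes each.
print(leaf_page_table_bytes(8 * GiB) // MiB, "MiB")
```

Under these assumptions the cost works out to about 2 MiB of page table per GiB mapped.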
| thayne wrote:
| I don't think adding this to io_uring is at all bad. But I
| don't think it is enough to solve the problem, if for no other
| reason than that it requires using the machinery of
| io_uring, which adds quite a bit of complexity.
|
| However, maybe I'm missing something, but it seems like Linux
| already has functionality that could make spawning a process a
| lot more efficient and thread-safe. My idea is basically to use
| clone or clone3 to create a new process in a new thread group
| that shares the original process's memory (that is, with
| CLONE_VM but not CLONE_THREAD), and to pass a function pointer
| to call (instead of returning in the child process) and a
| heap-allocated stack for the child process to use.
|
| Then there is no need to copy the address space, and you can do
| more things to prep before calling exec, since other threads
| can still release locks, you can write to memory, etc.
|
| The downsides I see are that you wouldn't be able to safely
| modify the current environment variables since that would
| impact the parent process, and there might be some weirdness
| with the child process having copies of file descriptors
| instead of the originals. The first is easy to work around
| though, and the latter probably wouldn't be an issue in most
| cases.
|
| Another thought I've had is that if there was a more efficient
| single syscall for spawning a process that combined fork and
| exec, even if it is a lot less flexible than fork/exec or the
| io_uring equivalent, something simple could probably meet the
| needs of most applications and benefit performance and safety
| in the common case where you don't need complex setup before
| calling execve.
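
For the "less flexible but simple" case, the existing posix_spawn interface is roughly what that looks like from userland: the setup steps are declared up front instead of being run as code in the child. A small sketch using Python's `os.posix_spawnp` follows; the `spawn_simple` helper and its `stdout_path` parameter are invented for illustration. As noted elsewhere in this thread, on Linux posix_spawn is a library wrapper rather than a single syscall, so this shows the shape of the API, not a kernel-level fix.

```python
import os


def spawn_simple(argv, stdout_path=None):
    """Start argv via posix_spawn, optionally redirecting stdout to a file.

    Returns the child's pid; the caller is responsible for waiting on it.
    """
    file_actions = []
    if stdout_path is not None:
        # Declarative equivalent of open() + dup2() between fork and exec.
        file_actions.append((
            os.POSIX_SPAWN_OPEN,
            1,  # file descriptor to set up in the child (stdout)
            stdout_path,
            os.O_WRONLY | os.O_CREAT | os.O_TRUNC,
            0o644,
        ))
    return os.posix_spawnp(
        argv[0], argv, os.environ, file_actions=file_actions or None
    )
```

The trade-off is the one described above: anything not expressible as a file action or spawn attribute cannot be done this way.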
| remexre wrote:
| Does the glibc clone wrapper not already do this?
| 10000truths wrote:
| I agree with Pavel that extending the clone syscall is a better
| idea than this patch set. The flexibility that Josh and Gabriel
| talk about seems wholly unnecessary. In every use of fork-(do
| stuff)-exec I've ever seen, the below two observations remained
| true:
|
| 1. Everything needed in the "do stuff" part was known prior to
| the call to fork
|
| 2. Any failures in the "do stuff" part would scrap the child
| process and report an error to the parent process
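
Those two observations map onto a common idiom: do the setup in the child, and if anything fails, send the errno back over a close-on-exec pipe and exit. A hedged sketch follows; the `spawn` helper and its `cwd` parameter are illustrative stand-ins for the "do stuff" step, not any particular library's API.

```python
import os
import struct


def spawn(argv, cwd=None):
    """fork, optionally chdir, then exec argv.

    Returns the child's pid on success. Raises OSError in the parent if
    the setup or the exec failed in the child.
    """
    # Python creates pipe fds close-on-exec, so a successful exec closes
    # the write end and the parent reads EOF.
    err_r, err_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(err_r)
            if cwd is not None:
                os.chdir(cwd)  # the "do stuff" part, known before fork
            os.execvp(argv[0], argv)
        except OSError as exc:
            os.write(err_w, struct.pack("i", exc.errno or 0))
        finally:
            os._exit(127)  # scrap the child on any failure
    os.close(err_w)
    with os.fdopen(err_r, "rb") as pipe:
        payload = pipe.read()
    if payload:
        os.waitpid(pid, 0)  # reap the failed child
        code = struct.unpack("i", payload)[0]
        raise OSError(code, os.strerror(code))
    return pid
```

Because everything the child does is decided before the fork, the same information could in principle be handed to a single spawn-style call instead.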
| wbl wrote:
| 3: the stuff has to be done in the child to avoid problems.
| Like in shells.
| PaulDavisThe1st wrote:
| For now, I'd settle for an RT-safe way to create a new process
| that then calls execve. AFAIK, this doesn't exist for Linux and
| may not exist for any *nix kernel at this time (not sure about
| this second part).
| duskwuff wrote:
| Darwin has a posix_spawn() syscall. I'm not sure if it's RT-
| safe, but it is actually a syscall - it's not a wrapper for
| vfork+execve like it is on Linux.
| PaulDavisThe1st wrote:
| Not RT-safe.
___________________________________________________________________
(page generated 2024-12-21 18:01 UTC)