Process creation in io_uring

By Jonathan Corbet
December 20, 2024

Back in 2022, Josh Triplett presented a plan to implement a "spawn new process" functionality in the io_uring subsystem. There was a fair amount of interest at the time, but developers got distracted, and the work did not progress. Now, Gabriel Krisman Bertazi has returned with a patch series updating and improving Triplett's work. While interest in this functionality remains, it may still take some time before it is ready for merging into the mainline.

A new process in Linux is created with one of the variants of the clone() system call. As its name suggests, clone() creates a copy of the calling process, running the same code. Much of the time, though, the newly created process quickly calls execve() or execveat() to run a different program, perhaps after performing a bit of cleanup. There has long been interest in a system call that would combine these operations efficiently, but nothing like that has ever found its way into the Linux kernel. There is a posix_spawn() function, but that is implemented in the C library using clone() and execve().

Arguably, part of the problem is that, while the clone()-to-execve() pattern is widespread, the details of what happens between those two calls can vary quite a bit. Some files may need to be closed, signal handling changed, scheduling policies tweaked, environment adjusted, and so on; the specific pattern will be different for every case. posix_spawn() tries to provide a general mechanism to specify these actions but, as can be seen by looking at the function's argument list, it quickly becomes complex.

Io_uring, meanwhile, is primarily thought of as a way of performing operations asynchronously. User space can queue operations in a ring buffer; the kernel consumes that buffer, executes the operations asynchronously, then puts the results into another ring buffer (the "completion ring") as each operation completes. Initially, only basic I/O operations were supported, but the list of operations has grown over the years. At this point, io_uring can be thought of as a sort of alternative system-call interface for Linux that is inherently asynchronous.

An important io_uring feature, for the purposes of implementing something like posix_spawn(), is the ability to create chains of linked operations. When the kernel encounters a chain, it will only initiate the first operation; the next operation in the chain will only run after the first completes. The failure of an operation in a chain will normally cause all remaining operations to be canceled, but a "hard link" between two operations will cause execution to continue regardless of the success of the first of the two.
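To make the chain mechanism concrete, here is a minimal sketch using the existing liburing API that links a write to a following fsync; the file descriptor and message are placeholders, and error handling is omitted:

    /* Minimal sketch of a linked io_uring chain using the existing liburing API.
     * The file descriptor and buffer are placeholders; error handling is omitted. */
    #include <liburing.h>
    #include <string.h>

    int write_then_fsync(int fd, const char *msg)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        io_uring_queue_init(8, &ring, 0);

        /* First operation in the chain: write the buffer.  IOSQE_IO_LINK makes
         * the next operation wait for this one and be canceled if it fails;
         * IOSQE_IO_HARDLINK would let the chain continue even on failure. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, msg, strlen(msg), 0);
        sqe->flags |= IOSQE_IO_LINK;

        /* Second operation: fsync, which only runs after the write completes. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, 0);

        io_uring_submit(&ring);

        /* Reap both completions from the completion ring. */
        for (int i = 0; i < 2; i++) {
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
    }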
Linking operations in this way essentially allows simple programs to be loaded into the kernel for asynchronous execution; these programs can run in parallel with any other io_uring operations that have been submitted.

The new patch set creates two new io_uring operations, each with some special semantics. The first of those is IORING_OP_CLONE, which causes the creation of a new process to execute any operations that follow in the same chain. In a difference from a full clone() call, though, much of the calling task's context is unavailable to the process created by IORING_OP_CLONE. Without that context, io_uring operations in the newly created process can no longer be asynchronous; every operation in the chain must complete immediately, or the chain will fail. In practice, that means that operations like closing files can be executed, but complicated I/O operations are no longer possible. Krisman hopes to be able to at least partially lift that constraint in the future.

Once the chain completes, the new process will be terminated, with one important exception: if it invokes the second new operation, IORING_OP_EXEC, which performs the equivalent of an execveat() call, replacing the running program with a new executable. At this point, the new process is completely detached from the original, is running its own program, and the processing of the io_uring chain is complete; the process will, rather than being terminated, go off to run the new program.

Placing any other operations after IORING_OP_EXEC in the chain usually makes no sense; any operations after a successful IORING_OP_EXEC will be canceled. It also does not make sense to use IORING_OP_EXEC in any context other than a new process created with IORING_OP_CLONE, so that usage is not allowed.

There is one case where it can be useful to link operations into the chain after IORING_OP_EXEC -- efficiently implementing a path search in the kernel. Often, the execution of a new program involves searching for it in a number of directories, usually specified by the PATH environment variable. One way of doing this in the io_uring context, as shown in this test program, is to enqueue a series of IORING_OP_EXEC operations, each trying a different location in the path. If hard links are used to chain these operations, execution will continue past failed operations until the one that actually finds the target program succeeds; after that, any subsequent operations will be discarded. The entire search runs in the kernel, without the need to repeatedly switch between kernel and user space.

Most of the comments on the proposal so far have come from Pavel Begunkov, who has expressed some concerns about it. He did not like some aspects of the implementation, the special quirks associated with IORING_OP_CLONE and the process it creates, and the use of links, "which already a bad sign for a bunch of reasons" (he did not specify what the reasons are). He suggested that io_uring might not be the best place for this functionality; perhaps a list of operations could be passed to a future version of clone() instead, mirroring how the posix_spawn() interface works.

Krisman answered that combining everything into a single system call would add complexity while making the solution less flexible. Io_uring makes it easy to put together a set of operations to be run in the kernel in an arbitrary order. The hope is to increase the set of possible operations over time, enabling the implementation of complex logic for the spawning of a new task. It is hard to see how combining all of this functionality into a single system call could work as well.
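To illustrate the PATH-search idiom described above, here is a hypothetical sketch written against the proposed (and still unmerged) IORING_OP_CLONE and IORING_OP_EXEC operations. liburing has no helpers for these opcodes, the opcode values below are placeholders, the use of the SQE's addr field for the path is an assumption modeled loosely on the patch set's test program rather than a stable ABI, and argument/environment passing is omitted entirely:

    /* Hypothetical sketch of the in-kernel PATH search described in the article,
     * using the proposed (unmerged) IORING_OP_CLONE/IORING_OP_EXEC operations.
     * These opcodes are not in any released kernel header, so placeholder values
     * are defined here; the use of sqe->addr for the path is an assumption. */
    #include <liburing.h>
    #include <string.h>

    #ifndef IORING_OP_CLONE
    #define IORING_OP_CLONE (IORING_OP_LAST + 1)   /* placeholder value */
    #define IORING_OP_EXEC  (IORING_OP_LAST + 2)   /* placeholder value */
    #endif

    static void prep_exec_guess(struct io_uring_sqe *sqe, const char *path)
    {
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_EXEC;              /* assumed opcode */
        sqe->addr = (unsigned long)path;           /* assumed: path to try */
    }

    void queue_spawn_ls(struct io_uring *ring)
    {
        struct io_uring_sqe *sqe;

        /* Create the new process; the linked operations that follow run in its
         * context.  A plain link cancels the rest of the chain if this fails. */
        sqe = io_uring_get_sqe(ring);
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_CLONE;             /* assumed opcode */
        sqe->flags |= IOSQE_IO_LINK;

        /* Try each PATH entry in turn.  Hard links let the chain continue past
         * failed attempts; everything after a successful exec is discarded. */
        sqe = io_uring_get_sqe(ring);
        prep_exec_guess(sqe, "/usr/local/bin/ls");
        sqe->flags |= IOSQE_IO_HARDLINK;

        sqe = io_uring_get_sqe(ring);
        prep_exec_guess(sqe, "/usr/bin/ls");
        sqe->flags |= IOSQE_IO_HARDLINK;

        sqe = io_uring_get_sqe(ring);
        prep_exec_guess(sqe, "/bin/ls");

        io_uring_submit(ring);
    }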
In any case, this is early-stage work; getting it to a point where it can be considered for the mainline will require smoothing a number of the rough edges and reducing the number of limitations. It will also certainly require wider review; this work is proposing a significant addition to the kernel's user-space ABI that would have to be supported indefinitely. The developers involved will surely want to get the details right before committing to that support.

-----------------------------------------

BPF!
Posted Dec 20, 2024 16:32 UTC (Fri) by willy (subscriber, #9762)

Clearly the right solution is to load a BPF program into the kernel to do the clone and setup.

/s in case it wasn't clear.

BPF!
Posted Dec 20, 2024 17:47 UTC (Fri) by gutschke (subscriber, #27910)

I am not even sure the "/s" is warranted. clone()/exec() is a very powerful pattern that nicely fits in with how POSIX has designed its API. The ability to customize the newly launched process prior to loading the binary is crucial in a lot of scenarios. And I don't see that going away.

But ever since the advent of threads (and possibly even in the presence of signals), this has gotten incredibly difficult to do correctly. There are just too many subtle race conditions that involve hidden state in the various run-time libraries or even in the dynamic link loader. If there was a way to do everything that you can currently do with system calls from userspace, but moved entirely into the kernel, most of these problems would immediately go away. So, I see a lot of value in being able to call clone() and exec() from a BPF program, or maybe from io_uring. The elephant in the room with BPF is that this new API would then likely be limited to privileged processes.

You can approximate a solution in userspace by very carefully picking what system calls you invoke, and by avoiding any calls into libc, including accidental calls into the dynamic link loader. This involves some amount of assembly code to be 100% reliable. It's very tedious and extremely fragile. It is often not worth the effort and instead you have to live with the occasional random crash.

In some cases, a possible work-around is to launch a "zygote" helper process that executes before any threads are created. The latter is difficult to ensure though, as some libraries create threads when they are loaded into memory.

BPF!
Posted Dec 20, 2024 19:05 UTC (Fri) by Cyberax (supporter, #52523)

> clone()/exec() is a very powerful pattern that nicely fits in with how POSIX has designed its API. The ability to customize the newly launched process prior to loading the binary is crucial in a lot of scenarios. And I don't see that going away.

POSIX's API is badly designed. clone() creates a copy of the entire VM and then just discards it. It's a lot of uselessly wasted work.

A better API would create an "empty shell" suspended process, then the calling process can poke it (using FD-based APIs), and finally un-suspend it. There's a strange aversion in Linux/UNIX land to this model (it's too sane), so we get closer and closer to it with these kinds of workarounds.
BPF!
Posted Dec 20, 2024 19:49 UTC (Fri) by epa (subscriber, #39769)

It's not only wasted work, but it makes it hard not to overcommit memory (at least in the case of a full fork()). If a process with a gigabyte of address space forks, requiring a gigabyte of free memory is far too cautious if it will exec() shortly afterwards, yet if you assume it always exec()s you will get caught out if the child process starts to use the memory you promised it.

BPF!
Posted Dec 21, 2024 1:08 UTC (Sat) by comex (subscriber, #71521)

clone() is not POSIX. POSIX includes fork(), vfork(), and posix_spawn().

BPF!
Posted Dec 21, 2024 1:47 UTC (Sat) by gutschke (subscriber, #27910)

posix_spawn() is well-intentioned, but it doesn't really address the main problem with all of these APIs. As far as I can tell, POSIX doesn't guarantee posix_spawn() to be thread-safe. And when I looked at the source code (admittedly, this was years ago), the implementation in glibc most definitely didn't make any effort to ensure thread-safety. Also, posix_spawn() is just too limited to be a general solution. It's a fine response to the problem of Windows not having a fork()/exec() API. But it isn't really a solution for safely starting processes from any context.

fork() is a decent general solution for single-threaded applications, and that's why we've been using it for so many decades. The kernel-level API is amenable to writing thread-safe code using fork()/exec(). But that requires that after fork() returns in the client, no further entries into any libraries are allowed. In fact, I am not even convinced that it is always safe to call the glibc version of fork() instead of making a direct system call. Both the various wrappers that glibc puts around system calls, and the hidden invocations of the dynamic link loader, are potential sources of deadlocks or crashes. Depending on how your program has been linked, this can even mean that you can no longer access any global symbols. Everything has to be on the local stack.

The upshot of all of this is that you not only need to carefully screen the system calls that you want to make for potential process-wide side-effects, you also have to call them from inlined assembly instead of deferring to glibc. In addition, fork() only really works with memory overcommitting enabled, and for large programs this system call can be expensive.

vfork() solves the overcommitting problem, but it requires even more careful programming. I don't see how it can be made to work in a fully portable fashion, but it probably is the best solution for code that should run on more than just Linux. Some amount of porting to different OSs will be involved if you need to spawn a new process from within a multi-threaded environment.

clone() is the pragmatic solution. Once you come to the realization that this code is impossible to implement within the constraints of POSIX alone, you might as well take advantage of everything that Linux can provide to you. It's going to be hairy code to write, but there really is no way around it. Also, just to point out the obvious, the glibc wrapper around clone() is completely unsuitable for the purposes of what we need here. But a direct system call will work fine.

Of course, in 99% of the cases, you won't hit any of the race conditions. They are a little tricky to trigger accidentally, and a lot of them are relatively benign.
Who cares about an occasional errno value that isn't set correctly, or a file descriptor that sometimes leaks to a child process? Only in very rare cases will you trigger a deadlock, crash, or worse. So, many programs simply don't bother, and nobody ever notices that the code is buggy. It's the really big programs that everyone uses that need to worry about these things, as you suddenly have millions of running instances and countless numbers of spawned processes. If there is a way for something to go wrong, it eventually will.

A zygote process is a time-tested alternative. And that's great, assuming you can modify the startup phase of the program. If you can guarantee that your code executes before any threads are created, then a zygote that is fork()ed proactively will avoid all of these complications. But with bigger pieces of software that rely on lots of third-party libraries, that's not always feasible. These days, you should assume that all code is always multi-threaded -- if only because the graphics libraries decide to start threads as soon as they get linked into the program, or something similarly frustrating.

BPF!
Posted Dec 21, 2024 15:18 UTC (Sat) by khim (subscriber, #9252)

> A zygote process is a time-tested alternative.

A zygote solves an entirely different problem: how to start not one process, but many processes, while executing the initialization part only once. It works, but that's an entirely different task.

> vfork() solves the overcommitting problem, but it requires even more careful programming. I don't see how it can be made to work in a fully portable fashion, but it probably is the best solution for code that should run on more than just Linux.

It's also the simplest way to do everything reliably and efficiently on Linux. For some unfathomable reason, everyone's attention is on an unsolvable problem: how to prepare a new process state using remnants of the old code that are intertwined with the state of your program. Just ditch all that! Start from a clean state! Create new setup code, push whatever you need/want in there, then execute vfork/exec (with zero steps between them, using fexecve) and voilà: no races, no possibility of corrupting anything; everything is very clear, simple, and guaranteed.

The only downside: you have to develop that in an arch-dependent way... but so what? If you compare that to the insane amount of effort one would need to support all these bazillion zygote-based solutions, then adding some kind of portable wrapper with arch-dependent guts, even for the 3-4 most popular architectures, is not too hard.

The best property of that solution: it's not supposed to be perfect! If you find out that it doesn't work, nobody stops you from redoing that portable API and adding or removing something from it. Because you ship it with your code or in a shared library, it's replaceable without any in-kernel politics.

P.S. I think it can be called the "double-exec" solution, and it requires Linux-specific syscalls, but the best part: all these syscalls are already there and are not even especially new.

Empty shell
Posted Dec 21, 2024 4:58 UTC (Sat) by IAmLiterallyABee (subscriber, #144892)

> A better API would create an "empty shell" suspended process, then the calling process can poke it (using FD-based APIs), and finally un-suspend it

IIRC, Fuchsia does something like that: https://fuchsia.dev/fuchsia-src/reference/kernel_objects/...
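(For reference, here is a minimal sketch of the vfork()/fexecve() "double exec" approach khim outlines above, using only syscalls that have been available for years; the embedded helper binary, which would perform the cleanup and then exec the real target, is hypothetical, and error handling is omitted.)

    /* Minimal sketch of the "double exec" idea: write a tiny, statically linked
     * setup helper into a memfd, then vfork() and fexecve() it.  The helper
     * binary here is hypothetical; error handling is omitted. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    extern const unsigned char setup_helper[];      /* hypothetical embedded helper */
    extern const unsigned long setup_helper_len;

    pid_t spawn_via_memfd(char *const argv[], char *const envp[])
    {
        /* Put the helper program into an anonymous, fd-backed file. */
        int fd = memfd_create("setup-helper", MFD_CLOEXEC);
        write(fd, setup_helper, setup_helper_len);

        /* vfork() avoids duplicating the parent's address space; the child does
         * nothing between vfork() and exec except the exec itself, so the usual
         * "don't touch libc after vfork" constraints are respected. */
        pid_t pid = vfork();
        if (pid == 0) {
            fexecve(fd, argv, envp);   /* run the helper, which later execs the target */
            _exit(127);                /* reached only if fexecve() fails */
        }
        close(fd);
        return pid;
    }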
BPF!
Posted Dec 20, 2024 19:33 UTC (Fri) by magfr (subscriber, #16052)

I have been intrigued by the BeOS variant since I first saw it. They have some variant of posix_spawn which can always be called, and they also have fork/exec, but only allow those system calls in single-threaded environments.

To further mess with people, this clone abstraction isn't strong enough to handle all cases - I have a little variation on tee which forks, sets up the child as a daemon process which does the writing, and then execs in the parent in order to keep the parent/child link with the grandparent. (The child terminates on end of input.)

BPF!
Posted Dec 21, 2024 3:05 UTC (Sat) by geofft (subscriber, #59789)

> The elephant in the room with BPF is that this new API would then likely be limited to privileged processes.

Yeah, that was also my concern. It seems like people are not going to be comfortable making eBPF available to unprivileged users any time soon. On the other hand, classic BPF is still around and is accessible to unprivileged users in a few ways, most notably via seccomp mode 2, but also by creating an unprivileged user+net namespace (allowed by default in the upstream kernel and in most but not all distros) and using it for its original purpose of packet filtering. Could you allow userspace to upload a cBPF program and some data for its use and have that be enough to make system calls?

I think my specific proposal would be to extend clone3's struct clone_args with three fields: a pointer to a cBPF program in user memory, and a pointer and length of memory to copy-on-write into the new process. So if you want traditional behavior for some reason, you can specify NULL and ~0 and deal with the overcommit issues of doing that, but more likely you just need a page or two of memory for the filename, argv, maybe the value of $PATH, and maybe some additional info like how to reorder file descriptors. Add a new cBPF opcode BPF_SYSCALL that is only valid in this context, which makes the syscall stored in the BPF accumulator with the arguments in the BPF registers and returns a value to the accumulator. This syscall is treated as a real syscall (it is not eBPF's BPF_CALL, pointer arguments point to userspace, etc.). When it calls execve, normal behavior resumes.

If the cBPF program returns instead of calling either execve or exit, then it returns to the userspace instruction pointer where clone3 was originally called, so you can use it just like a normal use of clone if you want. If that address is no longer mapped, the process dies with a segfault.

BPF!
Posted Dec 20, 2024 17:50 UTC (Fri) by edeloget (subscriber, #88392)

Right now, the commands really look like a set of instructions which are executed by a specific in-kernel VM, so my guess is more that, with enough time, the complexity of the subsystem will grow enough to warrant the creation of an "io uring language" of some sort. Which will /then/ be interpreted by a BPF program :)

BPF!
Posted Dec 20, 2024 18:08 UTC (Fri) by adobriyan (subscriber, #30858)

> in-kernel VM

It will be incomplete until it is possible to create a new uring with the uring interface __attribute__((sarcasm)).
Why not just have a one-step spawn?
Posted Dec 20, 2024 18:44 UTC (Fri) by jbills (subscriber, #161176)

Dumb question: why can't we just have a single-step function that starts a new process with a clean state, without needing to do a whole load of operations in that process's context? Other operating systems get away with process creation without a magic dance.

Why not just have a one-step spawn?
Posted Dec 20, 2024 23:14 UTC (Fri) by NYKevin (subscriber, #129325)

The basic problem is that, historically, the standard behavior is that "everything" is inherited, unless explicitly listed at [1]. If you change the rules now, lots of old libraries will not handle it gracefully. There are also a lot of awkward questions about miscellaneous process-wide state, such as the umask and working directory. Do those get "zeroed out" in some sensible way, or do you just inherit them?

The other basic problem is that systemd --user has already solved quite a lot of practical use cases anyway, so there is reduced motivation to expand the kernel's semantics when we already have code that works today.

[1]: https://pubs.opengroup.org/onlinepubs/9799919799/function...

Why not just have a one-step spawn?
Posted Dec 21, 2024 0:11 UTC (Sat) by jbills (subscriber, #161176)

I mean, it makes sense to keep the legacy API the way it is, but if we are designing an entirely new API surface in io_uring, why not do something better?

Why not just have a one-step spawn?
Posted Dec 21, 2024 1:18 UTC (Sat) by NYKevin (subscriber, #129325)

The question is, if you build it, will they come? Not if it's flagrantly incompatible with everything... unless you combine it with exec and end up with posix_spawn, of course, but then you need umpteen different flags to tell posix_spawn how to do its job, which is not fun either.

Why not just have a one-step spawn?
Posted Dec 21, 2024 2:28 UTC (Sat) by josh (subscriber, #17465)

The design of the io_uring-based mechanism should allow using it to implement posix_spawn in many cases. (Some flags may require new uring operations.)

Why not just have a one-step spawn?
Posted Dec 21, 2024 15:25 UTC (Sat) by khim (subscriber, #9252)

> The basic problem is that, historically, the standard behavior is that "everything" is inherited, unless explicitly listed at.

And why is that an issue? A new process can always ditch whatever it doesn't need. Heck, you may supply it with all the information needed to do that. You only need five syscalls: memfd_create/write/vfork/execveat/execveat. No need for io_uring, BPF, or other madness; everything entirely in userspace, using syscalls that have existed for years and years.

Why not just have a one-step spawn?
Posted Dec 21, 2024 0:18 UTC (Sat) by josh (subscriber, #17465)

I think this would be a good idea. At the very least, there might be value in having a CLONE_ flag that makes the new process have an empty memory map rather than inheriting the caller's memory map.

However, typically, you do want to inherit at least some state from the current process. You *have* to inherit permissions (though root could override them), you may want to inherit at least some file descriptors, and so on. And in practice you *may* want the option of having access to your memory map before doing the exec, at least for some operations.
It might well be useful to have pidfd operations to set up a new process from an existing one, but there's value in batching those operations, in the style of uring.

Why not just have a one-step spawn?
Posted Dec 21, 2024 0:39 UTC (Sat) by willy (subscriber, #9762)

It's generally considered good form to have at least one text segment in your address space... you can try to munmap(NULL, -1) if you want, but it will not end well.

Why not just have a one-step spawn?
Posted Dec 21, 2024 0:46 UTC (Sat) by josh (subscriber, #17465)

If you don't have any userspace (yet), and your userspace is going to get completely replaced by an execveat, what would go wrong if you have zero pages mapped in the address space?

Why not just have a one-step spawn?
Posted Dec 21, 2024 0:55 UTC (Sat) by willy (subscriber, #9762)

Oh, when you said CLONE flag, I thought you were talking about clone(). From your reply it seems like you're talking about some other operation where the caller operates on its child.

Why not just have a one-step spawn?
Posted Dec 21, 2024 1:12 UTC (Sat) by josh (subscriber, #17465)

I was talking about clone(), but I was imagining a mode in which you combine "no initial memory map" with "don't start running yet". You'd then do your setup remotely, and then make some pidfd call to allow the process to start running. That would work well in the io_uring case too, where you could keep the pidfd in an in-ring file descriptor and do a series of operations on it.

Why not just have a one-step spawn?
Posted Dec 21, 2024 2:11 UTC (Sat) by gutschke (subscriber, #27910)

We don't have a full set of system calls for remotely doing everything that a process can do by itself. Every once in a while, there has been talk of a new system call to inject system calls into child processes. But it never seems to go far. Until then, you need to at least have some memory that is already mapped into the child. And presumably, you could then use ptrace() to make the child do what you need to do. But by the time you jump through all these hoops, you might as well create a new process that has some pre-mapped memory pages that the parent filled out before starting the child. It's still a major pain to program, but better than starting with no initial memory map.

I could see working with a version of clone() that takes an aligned memory address and number of pages to preserve. It won't be fun to program, but that's something that could be implemented in a library once, and then nobody else needs to worry about it. It'll solve a number of the concerns that people have with (v)fork() and clone().

Why not just have a one-step spawn?
Posted Dec 21, 2024 2:13 UTC (Sat) by josh (subscriber, #17465)

This is exactly the motivation that led me to propose io_uring as the primary mechanism here. That way, we don't have to add a distinct set of system calls for remotely manipulating a process; we can use the same set of io_uring operations we already have.

Why not just have a one-step spawn?
Posted Dec 21, 2024 15:39 UTC (Sat) by khim (subscriber, #9252)

A much simpler approach would be to just add some code that would do that setup in the empty process.
And we already have the memfd_create/execveat combo that can do that. If you want, add a flag to clone() that would call execveat. And then new code in an entirely empty image can do whatever it needs to prepare for the execution of the real binary that you want to execute.

Why shove io_uring into something that can already be done entirely from userspace? Buzzword compliance?

Why not just have a one-step spawn?
Posted Dec 21, 2024 16:10 UTC (Sat) by corbet (editor, #1)

Khim, if you have a better idea, please submit a patch showing it. But please stop insulting the work of others; that does not help anybody.

Why not just have a one-step spawn?
Posted Dec 21, 2024 16:44 UTC (Sat) by khim (subscriber, #9252)

> But please stop insulting the work of others, that does not help anybody.

Where do you see insults? I've faced the need to mangle simple and easy-to-understand-and-implement ideas into pretzels to include all the right buzzwords at my $DAYJOBs often enough that I can easily see buzzword compliance as an explicit, or more likely implicit, part of the requirements. And very often it's even the most important one: if you couldn't cause enough buzz around your idea then it would die (except if there are some concrete tasks for concrete customers that may need it) even if it's pretty good, but with enough buzz around your idea you may push it even if it's totally stupid and would hurt everyone in the long run.

> Khim, if you have a better idea, please submit a patch showing it.

There is no patch because the in-kernel parts are already done... years ago, in fact. And to discuss the userspace part we need some idea about who plans to use that mechanism, why, and how. The list of interested parties is not in the article, thus it's hard for me to offer anything concrete, because it's not clear to me how much flexibility is needed or wanted. An implementation of posix_spawn is doable but would be a significant amount of work without any clear benefits: do we have lots of users of that syscall? If yes, then where are they; if not, then why are they so rare?

IOW: I don't see enough of the picture related to that work to judge it fairly, and if "buzzword compliance" is part of the reasoning (even an implicit one) then it could be that an io_uring-based solution is the best way forward. Especially if it's a solution-in-search-of-a-problem: it's much easier to make someone excited about an io_uring solution than about a solution that just combines well-known syscalls in a way that makes posix_spawn safer.

Why not just have a one-step spawn?
Posted Dec 21, 2024 16:48 UTC (Sat) by corbet (editor, #1)

"Buzzword compliance" takes the work of people who are trying to improve the system and casts it as something useless. If it were my work, I would find that insulting. I do not believe that the people working on this are concerned about buzzwords; they are trying to solve real problems. Please try being a bit more respectful toward them.

Why not just have a one-step spawn?
Posted Dec 21, 2024 3:14 UTC (Sat) by geofft (subscriber, #59789)

I think the idea of allowing a subset of pages to be preserved into the new program makes sense (I just suggested a variant of it in another comment).
Agree that the complexity can be dealt with in a library once, but also I think it's less hard to program than you'd fear - one approach that would make it relatively pleasant to implement would be to write a tiny standalone binary to do the post-fork actions, and embed that compiled binary as a big constant in this helper library. Then the pre-fork operation (which the library would do for you) is to mmap some new pages to hold the binary and its stack, copy over the binary and fill in the stack appropriately, and tell clone3 to start running that binary from its entry point. Then immediately munmap those pages in the parent. (Or, if you want to get fancy, make a clone3 flag to move pages from the parent to the child instead of CoWing them.)

This lets you avoid thinking too much about how the compiler is laying out memory and what parts you need to preserve, because you're essentially running a new program in the child. (In other words, it basically gives you kernel support for the "zygote" approach.)

Why not just have a one-step spawn?
Posted Dec 21, 2024 15:32 UTC (Sat) by khim (subscriber, #9252)

> However, typically, you do want to inherit at least some state from the current process.

Just package it neatly and pass it into a new process, damn it!

> At the very least, there might be value in having a CLONE_ flag that makes the new process have an empty memory map rather than inheriting the caller's memory map.

That's not possible: you need something in the process that you may execute. You cannot start from zero. But if you would instead pass an fd number that contains the image that should be loaded there, then with a simple, almost trivial in-kernel change you would enable fully-userspace solutions.

But hey, that's too simple! There are not enough buzzwords in that approach! How can we accept something so sane? Nope, we need to push for io_uring, BPF, or maybe even webasm! A more complicated, more invasive, yet much more buzzword-compliant approach!

Why not just have a one-step spawn?
Posted Dec 21, 2024 1:07 UTC (Sat) by comex (subscriber, #71521)

That's essentially posix_spawn. On Linux, posix_spawn is just a userland wrapper for a vfork/exec dance. But on macOS, posix_spawn is its own syscall. The kernel creates the new process without having to bother with forking the virtual memory space and all that.

Why not just have a one-step spawn?
Posted Dec 21, 2024 17:26 UTC (Sat) by ma4ris8 (subscriber, #170509)

Let's have a threaded program. It opens and closes file descriptors. Some of those have FD_CLOSE. The task is to create a child program. The child program will have three file descriptors, the parent's three fds mapped as the child's stdin, stdout and stderr. Close all unrelated file descriptors. Perhaps Valgrind's file descriptors, with fds near the upper bound of 1024, are also allowed to pass through.

One way is to fork, then open /proc/self/fd, close unrelated fds, and remap related ones into 0, 1 and 2. After that, exec the final child with a clean state. If the parent has a large memory footprint, this is heavy.

The other way is to do posix_spawn(). Spawn an intermediate process, which closes unrelated fds and remaps related ones into 0, 1 and 2. After cleanup, execute the final child process. The drawback is having the middle process do the cleanup, but if the parent has a large memory footprint, this is light compared to fork.

Third way: how to do it so that the cleanups could be done in an elegant and memory-safe way, without the separate middle process?
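(For reference, one way to get much of this without a hand-rolled middle step is to let posix_spawn()'s file actions do the fd cleanup. The sketch below assumes glibc 2.34 or later for posix_spawn_file_actions_addclosefrom_np(); the descriptor arguments are placeholders, it does not handle the pass-through case for high-numbered fds, and error handling is omitted.)

    /* Rough sketch: spawn a child with chosen descriptors on stdin/stdout/stderr
     * and everything else closed, using posix_spawn() file actions.
     * posix_spawn_file_actions_addclosefrom_np() is a glibc extension (2.34+). */
    #define _GNU_SOURCE
    #include <spawn.h>
    #include <unistd.h>

    extern char **environ;

    int spawn_with_clean_fds(pid_t *pid, const char *path, char *const argv[],
                             int in_fd, int out_fd, int err_fd)
    {
        posix_spawn_file_actions_t fa;
        posix_spawn_file_actions_init(&fa);

        /* Map the parent's chosen descriptors onto the child's stdin/stdout/stderr. */
        posix_spawn_file_actions_adddup2(&fa, in_fd, STDIN_FILENO);
        posix_spawn_file_actions_adddup2(&fa, out_fd, STDOUT_FILENO);
        posix_spawn_file_actions_adddup2(&fa, err_fd, STDERR_FILENO);

        /* Close every descriptor from 3 upward in the child so nothing leaks. */
        posix_spawn_file_actions_addclosefrom_np(&fa, 3);

        int err = posix_spawn(pid, path, &fa, NULL, argv, environ);

        posix_spawn_file_actions_destroy(&fa);
        return err;
    }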
zygotes
Posted Dec 21, 2024 0:20 UTC (Sat) by josh (subscriber, #17465)

Fun trick you could pull with this, once it has full support for arbitrary io_uring operations: clone a new process, do some initial setup, do a futex wait or blocking read or wait on a uring message, and when that completes, do an execveat of the new process. Now you can have a "pool" of ready-to-start processes, blocked in the kernel, waiting to exec.

More work to do for tracing execs
Posted Dec 21, 2024 1:57 UTC (Sat) by kxxt (subscriber, #172895)

I suppose this means more work to do for tracing execs in the future... On x86_64, tracing execs is as easy as hooking __x64_sys_execve{,at} and __ia32_compat_sys_execve{,at} for 32-bit (execsnoop doesn't handle it, but my tracexec handles it). Of course there is sched_process_exec, but only for successful execs.

Missing a beat
Posted Dec 21, 2024 2:06 UTC (Sat) by marcH (subscriber, #57642)

> The first of those is IORING_OP_CLONE, which causes the creation of a new process to execute any operations that follow in the same chain. In a difference from a full clone() call, though, much of the calling task's context is unavailable to the process created by IORING_OP_CLONE.

And that is the whole point, right?

> Without that context, io_uring operations in the newly created process can no longer be asynchronous; every operation in the chain must complete immediately, or the chain will fail.

I'm afraid I'm missing a beat here. I mean, I miss the... link (pun intended) between "without context" and "asynchronous". Could someone elaborate?

Missing a beat
Posted Dec 21, 2024 3:35 UTC (Sat) by geofft (subscriber, #59789)

My guess, and someone correct me if I'm wrong: once you've called IORING_OP_CLONE, there's no userspace for this process yet, and you can't return from a syscall like io_uring_enter if you don't have a userspace to return to. So all the operations you do in processing the ring have to be operations that are handled synchronously in kernelspace and keep the syscall happening; none of them can be an operation that would cause the kernel to return from the syscall and say "yeah, I'll do this asynchronously," until you've done an exec and loaded a new program into userspace. (And when you exec, you want to start the process from the beginning like normal; you don't want to act like you're returning from an io_uring system call, hence the rule that you can have no further io_uring operations.)

Copyright (c) 2024, Eklektix, Inc. Comments and public postings are copyrighted by their creators. Linux is a registered trademark of Linus Torvalds.