Process creation in io_uring

By Jonathan Corbet
December 20, 2024

Back in 2022, Josh Triplett presented a plan to implement a "spawn new process" functionality in the io_uring subsystem. There was a fair amount of interest at the time, but developers got distracted, and the work did not progress. Now, Gabriel Krisman Bertazi has returned with a patch series updating and improving Triplett's work. While interest in this functionality remains, it may still take some time before it is ready for merging into the mainline.

A new process in Linux is created with one of the variants of the clone() system call. As its name suggests, clone() creates a copy of the calling process, running the same code. Much of the time, though, the newly created process quickly calls execve() or execveat() to run a different program, perhaps after performing a bit of cleanup. There has long been interest in a system call that would combine these operations efficiently, but nothing like that has ever found its way into the Linux kernel. There is a posix_spawn() function, but that is implemented in the C library using clone() and execve().

Arguably, part of the problem is that, while the clone()-to-execve() pattern is widespread, the details of what happens between those two calls can vary quite a bit. Some files may need to be closed, signal handling changed, scheduling policies tweaked, environment adjusted, and so on; the specific pattern will be different for every case. posix_spawn() tries to provide a general mechanism to specify these actions but, as can be seen by looking at the function's argument list, it quickly becomes complex.

Io_uring, meanwhile, is primarily thought of as a way of performing operations asynchronously. User space can queue operations in a ring buffer; the kernel consumes that buffer, executes the operations asynchronously, then puts the results into another ring buffer (the "completion ring") as each operation completes. Initially, only basic I/O operations were supported, but the list of operations has grown over the years. At this point, io_uring can be thought of as a sort of alternative system-call interface for Linux that is inherently asynchronous.

An important io_uring feature, for the purposes of implementing something like posix_spawn(), is the ability to create chains of linked operations. When the kernel encounters a chain, it will only initiate the first operation; the next operation in the chain will only run after the first completes. The failure of an operation in a chain will normally cause all remaining operations to be canceled, but a "hard link" between two operations will cause execution to continue regardless of the success of the first of the two.
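To make the chain mechanism concrete, here is a minimal sketch using the existing liburing API that links a write to a following fsync; the file descriptor and message are placeholders, and error handling is omitted:

    /* Minimal sketch of a linked io_uring chain using the existing liburing API.
     * The file descriptor and buffer are placeholders; error handling is omitted. */
    #include <liburing.h>
    #include <string.h>

    int write_then_fsync(int fd, const char *msg)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        io_uring_queue_init(8, &ring, 0);

        /* First operation in the chain: write the buffer.  IOSQE_IO_LINK makes
         * the next operation wait for this one and be canceled if it fails;
         * IOSQE_IO_HARDLINK would let the chain continue even on failure. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, msg, strlen(msg), 0);
        sqe->flags |= IOSQE_IO_LINK;

        /* Second operation: fsync, which only runs after the write completes. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, 0);

        io_uring_submit(&ring);

        /* Reap both completions from the completion ring. */
        for (int i = 0; i < 2; i++) {
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
    }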
Linking operations in this way essentially allows simple programs to be loaded into the kernel for asynchronous execution; these programs can run in parallel with any other io_uring operations that have been submitted.

The new patch set creates two new io_uring operations, each with some special semantics. The first of those is IORING_OP_CLONE, which causes the creation of a new process to execute any operations that follow in the same chain. In a difference from a full clone() call, though, much of the calling task's context is unavailable to the process created by IORING_OP_CLONE. Without that context, io_uring operations in the newly created process can no longer be asynchronous; every operation in the chain must complete immediately, or the chain will fail. In practice, that means that operations like closing files can be executed, but complicated I/O operations are no longer possible. Krisman hopes to be able to at least partially lift that constraint in the future.

Once the chain completes, the new process will be terminated, with one important exception: if it invokes the second new operation, IORING_OP_EXEC, which performs the equivalent of an execveat() call, replacing the running program with a new executable. At this point, the new process is completely detached from the original, is running its own program, and the processing of the io_uring chain is complete; the process will, rather than being terminated, go off to run the new program.

Placing any other operations after IORING_OP_EXEC in the chain usually makes no sense; any operations after a successful IORING_OP_EXEC will be canceled. It also does not make sense to use IORING_OP_EXEC in any context other than a new process created with IORING_OP_CLONE, so that usage is not allowed.

There is one case where it can be useful to link operations into the chain after IORING_OP_EXEC -- efficiently implementing a path search in the kernel. Often, the execution of a new program involves searching for it in a number of directories, usually specified by the PATH environment variable. One way of doing this in the io_uring context, as shown in this test program, is to enqueue a series of IORING_OP_EXEC operations, each trying a different location in the path. If hard links are used to chain these operations, execution will continue past failed operations until the one that actually finds the target program succeeds; after that, any subsequent operations will be discarded. The entire search runs in the kernel, without the need to repeatedly switch between kernel and user space.

Most of the comments on the proposal so far have come from Pavel Begunkov, who has expressed some concerns about it. He did not like some aspects of the implementation, the special quirks associated with IORING_OP_CLONE and the process it creates, and the use of links, "which already a bad sign for a bunch of reasons" (he did not specify what the reasons are). He suggested that io_uring might not be the best place for this functionality; perhaps a list of operations could be passed to a future version of clone() instead, mirroring how the posix_spawn() interface works.

Krisman answered that combining everything into a single system call would add complexity while making the solution less flexible. Io_uring makes it easy to put together a set of operations to be run in the kernel in an arbitrary order. The hope is to increase the set of possible operations over time, enabling the implementation of complex logic for the spawning of a new task. It is hard to see how combining all of this functionality into a single system call could work as well.
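To illustrate the PATH-search idiom described above, here is a hypothetical sketch written against the proposed (and still unmerged) IORING_OP_CLONE and IORING_OP_EXEC operations. liburing has no helpers for these opcodes, the opcode values below are placeholders, the use of the SQE's addr field for the path is an assumption modeled loosely on the patch set's test program rather than a stable ABI, and argument/environment passing is omitted entirely:

    /* Hypothetical sketch of the in-kernel PATH search described in the article,
     * using the proposed (unmerged) IORING_OP_CLONE/IORING_OP_EXEC operations.
     * These opcodes are not in any released kernel header, so placeholder values
     * are defined here; the use of sqe->addr for the path is an assumption. */
    #include <liburing.h>
    #include <string.h>

    #ifndef IORING_OP_CLONE
    #define IORING_OP_CLONE (IORING_OP_LAST + 1)   /* placeholder value */
    #define IORING_OP_EXEC  (IORING_OP_LAST + 2)   /* placeholder value */
    #endif

    static void prep_exec_guess(struct io_uring_sqe *sqe, const char *path)
    {
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_EXEC;              /* assumed opcode */
        sqe->addr = (unsigned long)path;           /* assumed: path to try */
    }

    void queue_spawn_ls(struct io_uring *ring)
    {
        struct io_uring_sqe *sqe;

        /* Create the new process; the linked operations that follow run in its
         * context.  A plain link cancels the rest of the chain if this fails. */
        sqe = io_uring_get_sqe(ring);
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_CLONE;             /* assumed opcode */
        sqe->flags |= IOSQE_IO_LINK;

        /* Try each PATH entry in turn.  Hard links let the chain continue past
         * failed attempts; everything after a successful exec is discarded. */
        sqe = io_uring_get_sqe(ring);
        prep_exec_guess(sqe, "/usr/local/bin/ls");
        sqe->flags |= IOSQE_IO_HARDLINK;

        sqe = io_uring_get_sqe(ring);
        prep_exec_guess(sqe, "/usr/bin/ls");
        sqe->flags |= IOSQE_IO_HARDLINK;

        sqe = io_uring_get_sqe(ring);
        prep_exec_guess(sqe, "/bin/ls");

        io_uring_submit(ring);
    }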
In any case, this is early-stage work; getting it to a point where it can be considered for the mainline will require smoothing a number of the rough edges and reducing the number of limitations. It will also certainly require wider review; this work is proposing a significant addition to the kernel's user-space ABI that would have to be supported indefinitely. The developers involved will surely want to get the details right before committing to that support.

-----------------------------------------

BPF!
Posted Dec 20, 2024 16:32 UTC (Fri) by willy (subscriber, #9762)

Clearly the right solution is to load a BPF program into the kernel to do the clone and setup.

/s in case it wasn't clear.

BPF!
Posted Dec 20, 2024 17:47 UTC (Fri) by gutschke (subscriber, #27910)

I am not even sure the "/s" is warranted. clone()/exec() is a very powerful pattern that nicely fits in with how POSIX has designed its API. The ability to customize the newly launched process prior to loading the binary is crucial in a lot of scenarios. And I don't see that going away.

But ever since the advent of threads (and possibly even in the presence of signals), this has gotten incredibly difficult to do correctly. There are just too many subtle race conditions that involve hidden state in the various run-time libraries or even in the dynamic link loader. If there was a way to do everything that you can currently do with system calls from userspace, but moved entirely into the kernel, most of these problems would immediately go away. So, I see a lot of value in being able to call clone() and exec() from a BPF program, or maybe from io_uring. The elephant in the room with BPF is that this new API would then likely be limited to privileged processes.

You can approximate a solution in userspace by very carefully picking what system calls you invoke, and by avoiding any calls into libc, including accidental calls into the dynamic link loader. This involves some amount of assembly code to be 100% reliable. It's very tedious and extremely fragile. It is often not worth the effort and instead you have to live with the occasional random crash.

In some cases, a possible work-around is to launch a "zygote" helper process that executes before any threads are created. The latter is difficult to ensure though, as some libraries create threads when they are loaded into memory.

BPF!
Posted Dec 20, 2024 19:05 UTC (Fri) by Cyberax (supporter, #52523)

> clone()/exec() is a very powerful pattern that nicely fits in with how POSIX has designed its API. The ability to customize the newly launched process prior to loading the binary is crucial in a lot of scenarios. And I don't see that going away.

POSIX's API is badly designed. clone() creates a copy of the entire VM and then just discards it. It's a lot of uselessly wasted work.

A better API would create an "empty shell" suspended process, then the calling process can poke it (using FD-based APIs), and finally un-suspend it. There's a strange aversion in Linux/UNIX land to this model (it's too sane), so we get closer and closer to it with these kinds of workarounds.
BPF!
Posted Dec 20, 2024 19:49 UTC (Fri) by epa (subscriber, #39769)

It's not only wasted work, but it makes it hard not to overcommit memory (at least in the case of a full fork()). If a process with a gigabyte of address space forks, requiring a gigabyte of free memory is far too cautious if it will exec() shortly afterwards, yet if you assume it always exec()s you will get caught out if the child process starts to use the memory you promised it.

BPF!
Posted Dec 21, 2024 1:08 UTC (Sat) by comex (subscriber, #71521)

clone() is not POSIX. POSIX includes fork(), vfork(), and posix_spawn().

BPF!
Posted Dec 21, 2024 1:47 UTC (Sat) by gutschke (subscriber, #27910)

posix_spawn() is well-intentioned, but it doesn't really address the main problem with all of these APIs. As far as I can tell, POSIX doesn't guarantee posix_spawn() to be thread-safe. And when I looked at the source code (admittedly, this was years ago), the implementation in glibc most definitely didn't make any effort to ensure thread-safety. Also, posix_spawn() is just too limited to be a general solution. It's a fine response to the problem of Windows not having a fork()/exec() API. But it isn't really a solution for safely starting processes from any context.

fork() is a decent general solution for single-threaded applications, and that's why we've been using it for so many decades. The kernel-level API is amenable to writing thread-safe code using fork()/exec(). But that requires that after fork() returns in the client, no further entries into any libraries are allowed. In fact, I am not even convinced that it is always safe to call the glibc version of fork() instead of making a direct system call. Both the various wrappers that glibc puts around system calls, and the hidden invocations of the dynamic link loader, are potential sources of deadlocks or crashes. Depending on how your program has been linked, this can even mean that you can no longer access any global symbols. Everything has to be on the local stack.

The upshot of all of this is that you not only need to carefully screen the system calls that you want to make for potential process-wide side-effects, you also have to call them from inlined assembly instead of deferring to glibc. In addition, fork() only really works with memory overcommitting enabled, and for large programs this system call can be expensive.

vfork() solves the overcommitting problem, but it requires even more careful programming. I don't see how it can be made to work in a fully portable fashion, but it probably is the best solution for code that should run on more than just Linux. Some amount of porting to different OSs will be involved if you need to spawn a new process from within a multi-threaded environment.

clone() is the pragmatic solution. Once you come to the realization that this code is impossible to implement within the constraints of POSIX alone, you might as well take advantage of everything that Linux can provide to you. It's going to be hairy code to write, but there really is no way around it. Also, just to point out the obvious, the glibc wrapper around clone() is completely unsuitable for the purposes of what we need here. But a direct system call will work fine.

Of course, in 99% of the cases, you won't hit any of the race conditions. They are a little tricky to trigger accidentally, and a lot of them are relatively benign.
Who cares about an occasional errno value that isn't set correctly, or a file descriptor that sometimes leaks to a child process? Only in very rare cases will you trigger a deadlock, crash, or worse. So, many programs simply don't bother, and nobody ever notices that the code is buggy. It's the really big programs that everyone uses that need to worry about these things, as you suddenly have millions of running instances and countless numbers of spawned processes. If there is a way for something to go wrong, it eventually will.

A zygote process is a time-tested alternative. And that's great, assuming you can modify the startup phase of the program. If you can guarantee that your code executes before any threads are created, then a zygote that is fork()ed proactively will avoid all of these complications. But with bigger pieces of software that rely on lots of third-party libraries, that's not always feasible. These days, you should assume that all code is always multi-threaded -- if only because the graphics libraries decide to start threads as soon as they get linked into the program, or something similarly frustrating.

BPF!
Posted Dec 21, 2024 15:18 UTC (Sat) by khim (subscriber, #9252)

> A zygote process is a time-tested alternative.

A zygote solves an entirely different problem: how to start not one process, but many processes, while executing the initialization part only once. It works, but that's an entirely different task.

> vfork() solves the overcommitting problem, but it requires even more careful programming. I don't see how it can be made to work in a fully portable fashion, but it probably is the best solution for code that should run on more than just Linux.

It's also the simplest way to do everything reliably and efficiently on Linux. For some unfathomable reason, everyone's attention is on an unsolvable problem: how to prepare a new process state using remnants of the old code that are intertwined with the state of your program. Just ditch all that! Start from a clean state! Create new setup code, push whatever you need/want in there, then execute vfork/exec (with zero steps between them, using fexecve) and voilà: no races, no possibility of corrupting anything; everything is very clear, simple, and guaranteed.

The only downside: you have to develop that in an arch-dependent way... but so what? If you compare that to the insane amount of effort one would need to support all these bazillion zygote-based solutions, then adding some kind of portable wrapper with arch-dependent guts, even for the 3-4 most popular architectures, is not too hard.

The best property of that solution: it's not supposed to be perfect! If you find out that it doesn't work, nobody stops you from redoing that portable API and adding or removing something from it. Because you ship it with your code or in a shared library, it's replaceable without any in-kernel politics.

P.S. I think it can be called the "double-exec" solution, and it requires Linux-specific syscalls, but the best part: all these syscalls are already there and are not even especially new.

Empty shell
Posted Dec 21, 2024 4:58 UTC (Sat) by IAmLiterallyABee (subscriber, #144892)

> A better API would create an "empty shell" suspended process, then the calling process can poke it (using FD-based APIs), and finally un-suspend it

IIRC, Fuchsia does something like that: https://fuchsia.dev/fuchsia-src/reference/kernel_objects/...
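(For reference, here is a minimal sketch of the vfork()/fexecve() "double exec" approach khim outlines above, using only syscalls that have been available for years; the embedded helper binary, which would perform the cleanup and then exec the real target, is hypothetical, and error handling is omitted.)

    /* Minimal sketch of the "double exec" idea: write a tiny, statically linked
     * setup helper into a memfd, then vfork() and fexecve() it.  The helper
     * binary here is hypothetical; error handling is omitted. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    extern const unsigned char setup_helper[];      /* hypothetical embedded helper */
    extern const unsigned long setup_helper_len;

    pid_t spawn_via_memfd(char *const argv[], char *const envp[])
    {
        /* Put the helper program into an anonymous, fd-backed file. */
        int fd = memfd_create("setup-helper", MFD_CLOEXEC);
        write(fd, setup_helper, setup_helper_len);

        /* vfork() avoids duplicating the parent's address space; the child does
         * nothing between vfork() and exec except the exec itself, so the usual
         * "don't touch libc after vfork" constraints are respected. */
        pid_t pid = vfork();
        if (pid == 0) {
            fexecve(fd, argv, envp);   /* run the helper, which later execs the target */
            _exit(127);                /* reached only if fexecve() fails */
        }
        close(fd);
        return pid;
    }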
BPF!
Posted Dec 20, 2024 19:33 UTC (Fri) by magfr (subscriber, #16052)

I have been intrigued by the BeOS variant since I first saw it. They have some variant of posix_spawn which can always be called, and they also have fork/exec, but only allow those system calls in single-threaded environments.

To further mess with people, this clone abstraction isn't strong enough to handle all cases - I have a little variation on tee which forks, sets up the child as a daemon process which does the writing, and then execs in the parent in order to keep the parent/child link with the grandparent. (The child terminates on end of input.)

BPF!
Posted Dec 21, 2024 3:05 UTC (Sat) by geofft (subscriber, #59789)

> The elephant in the room with BPF is that this new API would then likely be limited to privileged processes.

Yeah, that was also my concern. It seems like people are not going to be comfortable making eBPF available to unprivileged users any time soon. On the other hand, classic BPF is still around and is accessible to unprivileged users in a few ways, most notably via seccomp mode 2, but also by creating an unprivileged user+net namespace (allowed by default in the upstream kernel and in most but not all distros) and using it for its original purpose of packet filtering. Could you allow userspace to upload a cBPF program and some data for its use and have that be enough to make system calls?

I think my specific proposal would be to extend clone3's struct clone_args with three fields: a pointer to a cBPF program in user memory, and a pointer and length of memory to copy-on-write into the new process. So if you want traditional behavior for some reason, you can specify NULL and ~0 and deal with the overcommit issues of doing that, but more likely you just need a page or two of memory for the filename, argv, maybe the value of $PATH, and maybe some additional info like how to reorder file descriptors. Add a new cBPF opcode BPF_SYSCALL that is only valid in this context, which makes the syscall stored in the BPF accumulator with the arguments in the BPF registers and returns a value to the accumulator. This syscall is treated as a real syscall (it is not eBPF's BPF_CALL, pointer arguments point to userspace, etc.). When it calls execve, normal behavior resumes.

If the cBPF program returns instead of calling either execve or exit, then it returns to the userspace instruction pointer where clone3 was originally called, so you can use it just like a normal use of clone if you want. If that address is no longer mapped, the process dies with a segfault.

BPF!
Posted Dec 20, 2024 17:50 UTC (Fri) by edeloget (subscriber, #88392)

Right now, the commands really look like a set of instructions which are executed by a specific in-kernel VM, so my guess is more that, with enough time, the complexity of the subsystem will grow enough to warrant the creation of an "io uring language" of some sort. Which will /then/ be interpreted by a BPF program :)

BPF!
Posted Dec 20, 2024 18:08 UTC (Fri) by adobriyan (subscriber, #30858)

> in-kernel VM

It will be incomplete until it is possible to create a new uring with the uring interface __attribute__((sarcasm)).
Why not just have a one-step spawn?
Posted Dec 20, 2024 18:44 UTC (Fri) by jbills (subscriber, #161176)

Dumb question: why can't we just have a single-step function that starts a new process with a clean state, without needing to do a whole load of operations in that process's context? Other operating systems get away with process creation without a magic dance.

Why not just have a one-step spawn?
Posted Dec 20, 2024 23:14 UTC (Fri) by NYKevin (subscriber, #129325)

The basic problem is that, historically, the standard behavior is that "everything" is inherited, unless explicitly listed at [1]. If you change the rules now, lots of old libraries will not handle it gracefully. There are also a lot of awkward questions about miscellaneous process-wide state, such as the umask and working directory. Do those get "zeroed out" in some sensible way, or do you just inherit them?

The other basic problem is that systemd --user has already solved quite a lot of practical use cases anyway, so there is reduced motivation to expand the kernel's semantics when we already have code that works today.

[1]: https://pubs.opengroup.org/onlinepubs/9799919799/function...

Why not just have a one-step spawn?
Posted Dec 21, 2024 0:11 UTC (Sat) by jbills (subscriber, #161176)

I mean, it makes sense to keep the legacy API the way it is, but if we are designing an entirely new API surface in io_uring, why not do something better?

Why not just have a one-step spawn?
Posted Dec 21, 2024 1:18 UTC (Sat) by NYKevin (subscriber, #129325)

The question is, if you build it, will they come? Not if it's flagrantly incompatible with everything... unless you combine it with exec and end up with posix_spawn, of course, but then you need umpteen different flags to tell posix_spawn how to do its job, which is not fun either.

Why not just have a one-step spawn?
Posted Dec 21, 2024 2:28 UTC (Sat) by josh (subscriber, #17465)

The design of the io_uring-based mechanism should allow using it to implement posix_spawn in many cases. (Some flags may require new uring operations.)

Why not just have a one-step spawn?
Posted Dec 21, 2024 15:25 UTC (Sat) by khim (subscriber, #9252)

> The basic problem is that, historically, the standard behavior is that "everything" is inherited, unless explicitly listed at.

And why is that an issue? A new process can always ditch whatever it doesn't need. Heck, you may supply it with all the information needed to do that. You only need five syscalls: memfd_create/write/vfork/execveat/execveat. No need for io_uring, BPF, or other madness; everything entirely in userspace, using syscalls that have existed for years and years.

Why not just have a one-step spawn?
Posted Dec 21, 2024 0:18 UTC (Sat) by josh (subscriber, #17465)

I think this would be a good idea. At the very least, there might be value in having a CLONE_ flag that makes the new process have an empty memory map rather than inheriting the caller's memory map.

However, typically, you do want to inherit at least some state from the current process. You *have* to inherit permissions (though root could override them), you may want to inherit at least some file descriptors, and so on. And in practice you *may* want the option of having access to your memory map before doing the exec, at least for some operations.
It might well be useful to have pidfd operations to set up a new process from an existing one, but there's value in batching those operations, in the style of uring.

Why not just have a one-step spawn?
Posted Dec 21, 2024 0:39 UTC (Sat) by willy (subscriber, #9762)

It's generally considered good form to have at least one text segment in your address space... you can try to munmap(NULL, -1) if you want, but it will not end well.

Why not just have a one-step spawn?
Posted Dec 21, 2024 0:46 UTC (Sat) by josh (subscriber, #17465)

If you don't have any userspace (yet), and your userspace is going to get completely replaced by an execveat, what would go wrong if you have zero pages mapped in the address space?

Why not just have a one-step spawn?
Posted Dec 21, 2024 0:55 UTC (Sat) by willy (subscriber, #9762)

Oh, when you said CLONE flag, I thought you were talking about clone(). From your reply it seems like you're talking about some other operation where the caller operates on its child.

Why not just have a one-step spawn?
Posted Dec 21, 2024 1:12 UTC (Sat) by josh (subscriber, #17465)

I was talking about clone(), but I was imagining a mode in which you combine "no initial memory map" with "don't start running yet". You'd then do your setup remotely, and then make some pidfd call to allow the process to start running. That would work well in the io_uring case too, where you could keep the pidfd in an in-ring file descriptor and do a series of operations on it.

Why not just have a one-step spawn?
Posted Dec 21, 2024 2:11 UTC (Sat) by gutschke (subscriber, #27910)

We don't have a full set of system calls for remotely doing everything that a process can do by itself. Every once in a while, there has been talk of a new system call to inject system calls into child processes. But it never seems to go far. Until then, you need to at least have some memory that is already mapped into the child. And presumably, you could then use ptrace() to make the child do what you need to do. But by the time you jump through all these hoops, you might as well create a new process that has some pre-mapped memory pages that the parent filled out before starting the child. It's still a major pain to program, but better than starting with no initial memory map.

I could see working with a version of clone() that takes an aligned memory address and number of pages to preserve. It won't be fun to program, but that's something that could be implemented in a library once, and then nobody else needs to worry about it. It'll solve a number of the concerns that people have with (v)fork() and clone().

Why not just have a one-step spawn?
Posted Dec 21, 2024 2:13 UTC (Sat) by josh (subscriber, #17465)

This is exactly the motivation that led me to propose io_uring as the primary mechanism here. That way, we don't have to add a distinct set of system calls for remotely manipulating a process; we can use the same set of io_uring operations we already have.

Why not just have a one-step spawn?
Posted Dec 21, 2024 15:39 UTC (Sat) by khim (subscriber, #9252)

A much simpler approach would be to just add some code that would do that setup in the empty process.
And we already have the memfd_create/execveat combo that can do that. If you want, add a flag to clone() that would call execveat. And then new code in an entirely empty image can do whatever it needs to prepare for the execution of the real binary that you want to execute.

Why shove io_uring into something that can already be done entirely from userspace? Buzzword compliance?

Why not just have a one-step spawn?
Posted Dec 21, 2024 16:10 UTC (Sat) by corbet (editor, #1)

Khim, if you have a better idea, please submit a patch showing it. But please stop insulting the work of others; that does not help anybody.

Why not just have a one-step spawn?
Posted Dec 21, 2024 16:44 UTC (Sat) by khim (subscriber, #9252)

> But please stop insulting the work of others, that does not help anybody.

Where do you see insults? I've faced the need to mangle simple and easy-to-understand-and-implement ideas into pretzels to include all the right buzzwords at my $DAYJOBs often enough that I can easily see buzzword compliance as an explicit, or more likely implicit, part of the requirements. And very often it's even the most important one: if you couldn't cause enough buzz around your idea then it would die (except if there are some concrete tasks for concrete customers that may need it) even if it's pretty good, but with enough buzz around your idea you may push it even if it's totally stupid and would hurt everyone in the long run.

> Khim, if you have a better idea, please submit a patch showing it.

There is no patch because the in-kernel parts are already done... years ago, in fact. And to discuss the userspace part we need some idea about who plans to use that mechanism, why, and how. The list of interested parties is not in the article, thus it's hard for me to offer anything concrete, because it's not clear to me how much flexibility is needed or wanted. An implementation of posix_spawn is doable but would be a significant amount of work without any clear benefits: do we have lots of users of that syscall? If yes, then where are they; if not, then why are they so rare?

IOW: I don't see enough of the picture related to that work to judge it fairly, and if "buzzword compliance" is part of the reasoning (even an implicit one) then it could be that an io_uring-based solution is the best way forward. Especially if it's a solution-in-search-of-a-problem: it's much easier to make someone excited about an io_uring solution than about a solution that just combines well-known syscalls in a way that makes posix_spawn safer.

Why not just have a one-step spawn?
Posted Dec 21, 2024 16:48 UTC (Sat) by corbet (editor, #1)

"Buzzword compliance" takes the work of people who are trying to improve the system and casts it as something useless. If it were my work, I would find that insulting. I do not believe that the people working on this are concerned about buzzwords; they are trying to solve real problems. Please try being a bit more respectful toward them.

Why not just have a one-step spawn?
Posted Dec 21, 2024 3:14 UTC (Sat) by geofft (subscriber, #59789)

I think the idea of allowing a subset of pages to be preserved into the new program makes sense (I just suggested a variant of it in another comment).
Agree that the complexity can be dealt with in a library once, but also I think it's less hard to program than you'd fear - one approach that would make it relatively pleasant to implement would be to write a tiny standalone binary to do the post-fork actions, and embed that compiled binary as a big constant in this helper library. Then the pre-fork operation (which the library would do for you) is to mmap some new pages to hold the binary and its stack, copy over the binary and fill in the stack appropriately, and tell clone3 to start running that binary from its entry point. Then immediately munmap those pages in the parent. (Or, if you want to get fancy, make a clone3 flag to move pages from the parent to the child instead of CoWing them.)

This lets you avoid thinking too much about how the compiler is laying out memory and what parts you need to preserve, because you're essentially running a new program in the child. (In other words, it basically gives you kernel support for the "zygote" approach.)

Why not just have a one-step spawn?
Posted Dec 21, 2024 15:32 UTC (Sat) by khim (subscriber, #9252)

> However, typically, you do want to inherit at least some state from the current process.

Just package it neatly and pass it into a new process, damn it!

> At the very least, there might be value in having a CLONE_ flag that makes the new process have an empty memory map rather than inheriting the caller's memory map.

That's not possible: you need something in the process that you may execute. You cannot start from zero. But if you would instead pass an fd number that contains the image that should be loaded there, then with a simple, almost trivial in-kernel change you would enable fully-userspace solutions.

But hey, that's too simple! There are not enough buzzwords in that approach! How can we accept something so sane? Nope, we need to push for io_uring, BPF, or maybe even webasm! A more complicated, more invasive, yet much more buzzword-compliant approach!

Why not just have a one-step spawn?
Posted Dec 21, 2024 1:07 UTC (Sat) by comex (subscriber, #71521)

That's essentially posix_spawn. On Linux, posix_spawn is just a userland wrapper for a vfork/exec dance. But on macOS, posix_spawn is its own syscall. The kernel creates the new process without having to bother with forking the virtual memory space and all that.

Why not just have a one-step spawn?
Posted Dec 21, 2024 17:26 UTC (Sat) by ma4ris8 (subscriber, #170509)

Let's have a threaded program. It opens and closes file descriptors. Some of those have FD_CLOSE. The task is to create a child program. The child program will have three file descriptors, the parent's three fds mapped as the child's stdin, stdout and stderr. Close all unrelated file descriptors. Perhaps Valgrind's file descriptors, with fds near the upper bound of 1024, are also allowed to pass through.

One way is to fork, then open /proc/self/fd, close unrelated fds, and remap related ones into 0, 1 and 2. After that, exec the final child with a clean state. If the parent has a large memory footprint, this is heavy.

The other way is to do posix_spawn(). Spawn an intermediate process, which closes unrelated fds and remaps related ones into 0, 1 and 2. After cleanup, execute the final child process. The drawback is having the middle process do the cleanup, but if the parent has a large memory footprint, this is light compared to fork.

Third way: how to do it so that the cleanups could be done in an elegant and memory-safe way, without the separate middle process?
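(For reference, one way to get much of this without a hand-rolled middle step is to let posix_spawn()'s file actions do the fd cleanup. The sketch below assumes glibc 2.34 or later for posix_spawn_file_actions_addclosefrom_np(); the descriptor arguments are placeholders, it does not handle the pass-through case for high-numbered fds, and error handling is omitted.)

    /* Rough sketch: spawn a child with chosen descriptors on stdin/stdout/stderr
     * and everything else closed, using posix_spawn() file actions.
     * posix_spawn_file_actions_addclosefrom_np() is a glibc extension (2.34+). */
    #define _GNU_SOURCE
    #include <spawn.h>
    #include <unistd.h>

    extern char **environ;

    int spawn_with_clean_fds(pid_t *pid, const char *path, char *const argv[],
                             int in_fd, int out_fd, int err_fd)
    {
        posix_spawn_file_actions_t fa;
        posix_spawn_file_actions_init(&fa);

        /* Map the parent's chosen descriptors onto the child's stdin/stdout/stderr. */
        posix_spawn_file_actions_adddup2(&fa, in_fd, STDIN_FILENO);
        posix_spawn_file_actions_adddup2(&fa, out_fd, STDOUT_FILENO);
        posix_spawn_file_actions_adddup2(&fa, err_fd, STDERR_FILENO);

        /* Close every descriptor from 3 upward in the child so nothing leaks. */
        posix_spawn_file_actions_addclosefrom_np(&fa, 3);

        int err = posix_spawn(pid, path, &fa, NULL, argv, environ);

        posix_spawn_file_actions_destroy(&fa);
        return err;
    }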
zygotes
Posted Dec 21, 2024 0:20 UTC (Sat) by josh (subscriber, #17465)

Fun trick you could pull with this, once it has full support for arbitrary io_uring operations: clone a new process, do some initial setup, do a futex wait or blocking read or wait on a uring message, and when that completes, do an execveat of the new process. Now you can have a "pool" of ready-to-start processes, blocked in the kernel, waiting to exec.

More work to do for tracing execs
Posted Dec 21, 2024 1:57 UTC (Sat) by kxxt (subscriber, #172895)

I suppose this means more work to do for tracing execs in the future... On x86_64, tracing execs is as easy as hooking __x64_sys_execve{,at} and __ia32_compat_sys_execve{,at} for 32-bit (execsnoop doesn't handle it, but my tracexec handles it). Of course there is sched_process_exec, but only for successful execs.

Missing a beat
Posted Dec 21, 2024 2:06 UTC (Sat) by marcH (subscriber, #57642)

> The first of those is IORING_OP_CLONE, which causes the creation of a new process to execute any operations that follow in the same chain. In a difference from a full clone() call, though, much of the calling task's context is unavailable to the process created by IORING_OP_CLONE.

And that is the whole point, right?

> Without that context, io_uring operations in the newly created process can no longer be asynchronous; every operation in the chain must complete immediately, or the chain will fail.

I'm afraid I'm missing a beat here. I mean, I miss the... link (pun intended) between "without context" and "asynchronous". Could someone elaborate?

Missing a beat
Posted Dec 21, 2024 3:35 UTC (Sat) by geofft (subscriber, #59789)

My guess, and someone correct me if I'm wrong: once you've called IORING_OP_CLONE, there's no userspace for this process yet, and you can't return from a syscall like io_uring_enter if you don't have a userspace to return to. So all the operations you do in processing the ring have to be operations that are handled synchronously in kernelspace and keep the syscall happening; none of them can be an operation that would cause the kernel to return from the syscall and say "yeah, I'll do this asynchronously," until you've done an exec and loaded a new program into userspace. (And when you exec, you want to start the process from the beginning like normal; you don't want to act like you're returning from an io_uring system call, hence the rule that you can have no further io_uring operations.)

Copyright (c) 2024, Eklektix, Inc. Comments and public postings are copyrighted by their creators. Linux is a registered trademark of Linus Torvalds.