The Unix process API is unreliable and unsafe
Table of Contents
* 1. It's easy for processes to leak
+ 1.1. Flawed solutions
o 1.1.1. Make sure B always cleans up on exit and kills C
o 1.1.2. Use PR_SET_PDEATHSIG to kill C when B exits
o 1.1.3. Always write down the pid of every process you
start, or otherwise coordinate between A and B
o 1.1.4. A should run B inside a container
o 1.1.5. Use process groups or controlling terminals
o 1.1.6. Use Windows 8 nested job objects
* 2. It's impossible to prevent malicious process leaks
+ 2.1. Flawed solutions
o 2.1.1. Run your possibly-malicious process inside a
container or a virtual machine
o 2.1.2. Limit the number of processes that can exist on
the system
* 3. Processes have global, reusable IDs
+ 3.1. Flawed solutions
o 3.1.1. Don't reuse pids, use a UUID instead
o 3.1.2. Only send signals to your own child processes
+ 3.2. Correct solutions
o 3.2.1. Use pidfd
* 4. Process exit is communicated through signals
+ 4.1. Flawed solutions
o 4.1.1. Use signalfd
o 4.1.2. Chain signal handlers
o 4.1.3. Create a standard library for starting children
and have everyone use it
+ 4.2. Correct solutions
o 4.2.1. Use pidfd
* 5. How to fix all these problems
+ 5.1. Problem: It's easy for processes to leak
+ 5.2. Problem: It's impossible to prevent malicious process leaks
+ 5.3. Problem: Processes have global, reusable IDs
+ 5.4. Problem: Process exit is communicated through signals
* 6. How to really fix all these problems in the long term
It is difficult to write software which reliably and safely manages
processes on Unix; this is because of flaws in the API. I know of no
Unix system which has solved these problems, but in this article I'll
be specifically talking about the Linux process API.
When I say "process API", I am talking about the interface for
starting processes, monitoring processes until they exit, and
stopping processes that haven't yet exited.
I won't cover here the various issues with fork; that has already
been discussed enough elsewhere.
So what is so bad about all the other parts of the Linux process API?
1. It's easy for started processes to accidentally "leak", and
linger on the system without anyone responsible for killing them.
2. Conversely, it's difficult or impossible to guarantee that a
buggy or malicious process has not "leaked" some processes.
3. Rather than use secure identifiers for processes (such as a file
descriptor), global process identifiers (pids) separate names
from authority, making it possible to confuse one process for
another.
4. Process events are communicated using signals to the parent
process, making libraries (which can't safely handle signals)
unable to spawn processes, as well as inheriting all the other
problems of Unix signals.
I'll explain each problem, as well as some possible solutions that
are ultimately flawed. At the end of the article, I'll go over my own
solution and how it solves all of these problems.
But first, some motivation. Why should we care about flaws in the
Linux process API? Things work well enough right now, don't they?
Yes, there are many tools like process supervisor daemons and Linux
containers which fix parts of the process API. All of these tools so
far are flawed in one way or another, as I'll explain, but they're
still in wide use and they are adequate solutions when used
carefully.
The real problem is the significant complexity cost of working around
the flaws of the process API. The aforementioned tools have many
features and complex interfaces, and implementing their workarounds
yourself can lead to even more complexity. At the moment, any system
that tries for reliability must pay this complexity cost. But if the
native process API was not so flawed, no such complex hacks would be
needed, and our systems would be both simpler and more robust.
1 It's easy for processes to leak
What do I mean by a "leaked process"?
I mean a process that is running without any other entity on the
system which is both:
1. Responsible for killing the process, and
2. Knows the identity of the process and can kill it precisely with
no risk of collateral damage
As soon as a process is orphaned (that is, its parent dies), it's
leaked, because only the parent of a process is able to safely kill
it. Merely knowing the pid is insufficient to safely kill a process,
as we'll discuss in the section on process ids. So while an orphaned
process might have some other entity on the system with property 1,
that entity can never satisfy property 2.
Once some processes have been leaked, you have two options:
* Hope that they will exit on their own at some point.
* Look at the list of processes, fire off some signals, and hope
you didn't just kill the wrong thing.
Neither is particularly satisfying, so we'd hope that it's hard to
make an orphaned process. But, unfortunately, it's quite easy.
A process is orphaned when its parent process exits. If process A
starts process B, which starts process C, and then process B exits,
process C will be orphaned.
This is as simple as:
sh -c '{ sleep inf & } &'
'sh' is our process A; it forks off another copy of itself to perform
the outer '&', which is our process B; then 'sleep inf' is our
process C.
The parent process B is able to robustly track the lifetime of its
child process C through the mechanisms Linux provides to parent
processes (SIGCHLD and the wait family of system calls), and so
ensure that process C exits. If and when C is orphaned, those
mechanisms are no longer easily usable; they can be used by the init
process, which inherits the orphan, but init isn't typically
accessible to us.
1.1 Flawed solutions
1.1.1 Make sure B always cleans up on exit and kills C
B should just always make sure to kill process C before it exits.
That way no processes will be orphaned, so no processes will be
leaked, and we'll be fine.
Well, what are the possible ways B might exit and need to clean up?
B might choose to exit, possibly by throwing an exception or
panicking. In those cases, it's possible for B to kill process C
immediately before exiting.
Or B might receive a signal. B might be signaled for conventional
reasons, such as a user pressing Ctrl-C, in which case B can still
clean up, as long as the programmer or runtime takes care to catch
every kind of signal.
Or B might be signaled for some more unconventional reasons, such as
a segmentation fault. It's still possible for B to clean up in this
case, but it may be very tricky to do, and the programmer or runtime
may need to take great care to make sure that the pid of C is still
accessible even while handling a segfault.
Or B might receive SIGKILL. Unfortunately, this case prevents this
solution from working. It's not possible for B to clean up when it
receives SIGKILL, so C will be unavoidably leaked.
We might want to say, "never send SIGKILL". But that is impossible,
both for a conventional reason and an ironic reason. The conventional
reason is that B might have a bug, and hang, and SIGKILL might be the
only way to kill it. The ironic reason is that the only way for B to
clean up and exit in guaranteed finite time is for it to SIGKILL its
own children, so that if they have bugs they will not just hang
forever. So B would be SIGKILL'd by its own parent, implementing the
same strategy.
So, in summary, it's not possible to guarantee that B cleans up and
kills C when it exits, because it might be SIGKILL'd. Even in the
case where B isn't SIGKILL'd, it's tricky for a complicated program
to always make sure to kill off any child processes when it exits.
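To make this concrete, here's a minimal sketch in C of the cleanup
strategy (error handling omitted; the sleep command just stands in
for process C). The comments mark exactly where SIGKILL defeats it:

#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t child_pid = -1;            /* pid of C, set after fork */

static void kill_children(void) {
    if (child_pid > 0) {
        kill(child_pid, SIGKILL);       /* kill C... */
        waitpid(child_pid, NULL, 0);    /* ...and reap it */
    }
}

int main(void) {                        /* this is process B */
    atexit(kill_children);   /* runs on exit(), never on SIGKILL */
    child_pid = fork();
    if (child_pid == 0) {               /* this is process C */
        execlp("sleep", "sleep", "inf", (char *)NULL);
        _exit(127);
    }
    /* ... do work; if B is SIGKILL'd here, C leaks ... */
    return 0;                /* normal exit: kill_children runs */
}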
1.1.2 Use PR_SET_PDEATHSIG to kill C when B exits
We can use the Linux-specific feature PR_SET_PDEATHSIG on process C,
to ensure that process C will receive SIGKILL (or another signal)
whenever process B exits for any reason, including if process B exits
uncleanly due to a bug or SIGKILL.
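Here's a minimal sketch of how B might use it (error handling
omitted). The getppid() re-check guards against the race where B
dies between fork() and prctl(), before the death signal is armed:

#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

int main(void) {                          /* this is process B */
    pid_t parent = getpid();
    pid_t child = fork();
    if (child == 0) {                     /* this is process C */
        prctl(PR_SET_PDEATHSIG, SIGKILL); /* die when B dies */
        if (getppid() != parent)          /* B already gone? */
            _exit(1);
        execlp("sleep", "sleep", "inf", (char *)NULL);
        _exit(127);
    }
    /* ... when B exits, for any reason, C receives SIGKILL ... */
    return 0;
}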
The issue is that this only works one level down the tree. If C forks
off its own process D, the death signal will kill off C but not D.
Extending it to work over an entire tree of processes requires that
the entire tree be using PR_SET_PDEATHSIG (and using it correctly).
If we can make that guarantee, this technique will work. But in
practice, most large systems can't make that guarantee since they are
made up of a large number of programs from many different developers.
As one specific example, many applications run subcommands through
Unix shells, which don't use PR_SET_PDEATHSIG.
Even in smaller systems where we control all involved programs, this
technique isn't perfect, since even programs we control can always
have bugs and fail to use PR_SET_PDEATHSIG. We'd prefer a guarantee
that relies only on the program at the root of the tree, and doesn't
require us to reason about and debug all the programs involved.
1.1.3 Always write down the pid of every process you start, or
otherwise coordinate between A and B
B could make sure to always write down the pid of every process it
starts, so that we can at least make an attempt to kill any orphaned
processes, even if that attempt isn't robust. More generally, B could
coordinate with A, and somehow tell A about every process B starts.
Then A (which we might trust to be correctly implemented) can handle
cleaning up the processes that B starts. This will fail if there's a
bug in B, or if B is killed just after starting a process but before
telling A, but perhaps it's good enough?
This has the same flaw as PR_SET_PDEATHSIG, in that it only allows
for avoiding leaks at a single level. Like PR_SET_PDEATHSIG, all
programs involved would need to use our mechanism. And that's
infeasible in practice in any large system.
1.1.4 A should run B inside a container
If A runs B inside a Linux container technology, such as a Docker
container, then no matter how many processes B starts, A will be able
to terminate them all by just stopping the container, and we'll be
fine.
Ignoring the other merits of containers, if we're trying to solve the
problem of "it is too easy for processes to leak", containers have
three main flaws.
1. It's not easy to run a container. Python has a "subprocess.run"
function in its standard library for starting a subprocess.
Python has no "container.run" function in its standard library,
to start a subprocess inside a container, and in the current
container landscape that seems unlikely to change.
Shell scripts make starting processes trivial, but it's almost
unthinkable that, say, bash, would integrate functionality for
starting containers, so that every process is started in a
container. Leaving aside the issues of which container technology
to use, it would be quite complex to implement.
2. Containers require root or running inside a user namespace. The
root requirement obviously can't be satisfied by most users.
Fortunately, it's possible to start a container without being
root by using user namespaces. Unfortunately, user namespaces
introduce a number of quirks, such as breaking gdb (by breaking
ptrace), so they also can't be used by most users.
3. It's pretty heavyweight to require literally every child process
to run in a separate container. Robust usage of pid namespaces
(the relevant part of Linux containers) requires that we start up
an init process for each pid namespace, separate from the other
processes running in the container. This init process will do
nothing but increase the load on the system, and it will prevent
us from directly monitoring the started processes.
So, running every child process in a separate container isn't a
viable solution. We still have no way to easily prevent child
processes from leaking.
1.1.5 Use process groups or controlling terminals
Process groups and controlling terminals are two features which can
be used to terminate a group of processes. Such a group of processes
is usually called a "job", since Unix shells use these features and
use that terminology. When processes start children, they start out
in the same job, and they can all be terminated at once. So if
process A put process B in a job, process A could avoid process C
leaking by terminating the job.
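Here's a minimal sketch of the process group mechanism in C (error
handling omitted; setpgid() is called in both parent and child to
close a startup race). Process A terminates the whole job by passing
a negative pid to kill:

#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {                      /* this is process A */
    pid_t b = fork();
    if (b == 0) {                     /* this is process B */
        setpgid(0, 0);                /* new group, pgid == B's pid */
        execlp("sh", "sh", "-c", "sleep inf & sleep inf", (char *)NULL);
        _exit(127);
    }
    setpgid(b, b);       /* also done in A; one of the two wins */
    /* ... later, terminate the whole job: B and its children ... */
    kill(-b, SIGKILL);   /* negative pid signals the whole group */
    while (waitpid(-1, NULL, 0) > 0)
        ;                /* reap our own children */
    return 0;
}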
Unfortunately, neither of these job mechanisms is nestable. If a
process puts itself or its children into a new process group or gives
itself a new controlling terminal, it completely replaces the old
process group or controlling terminal. So that process will no longer
be terminated when its original job is terminated!
In other words, if process A puts process B in a job, then process B
puts process C in a job, then process B neglects to terminate process
C, process C will no longer be in the job that process A knows about,
so process C will leak!
So, ironically, if a child process tries to use these features to
prevent its own child processes from leaking, it can inadvertently
cause them to leak. This is certainly unsuitable.
1.1.6 Use Windows 8 nested job objects
Windows 8 added support for nested job objects. Child processes (and
all their transitive children) can be associated with a job, and they
will all be terminated when the owner of the job exits (or
deliberately kills them). Child processes can create their own jobs
and assign their own children to those jobs, without interfering with
or being aware of their parent job.
Unfortunately, we're using Linux, not Windows. :)
2 It's impossible to prevent malicious process leaks
What's a "malicious process leak"?
Well, if a "process leak" is a process existing on the system without
someone knowing to kill it, a "malicious process leak" is a process
existing on the system and actively evading being killed.
A process can fork repeatedly to make a thousand copies of itself, or
just fork constantly at all times, leaving the previous processes to
immediately exit, so that its pid is constantly changing and the
latest copy can't be identified and sent a signal. A "fork bomb" is
one example of an attack of this kind.
But note that this doesn't have to be the result of an attack; simple
buggy code can cause this. If you ever program using fork(), you
could easily start forking repeatedly just from a bug.
We would like to be able to put in a moderate amount of work to
completely prevent this kind of issue. Unfortunately, completely
preventing this from happening is very difficult, maybe even
impossible.
2.1 Flawed solutions
2.1.1 Run your possibly-malicious process inside a container or a
virtual machine
If we run our possibly-malicious process inside a container or
virtual machine, then no matter how much it forks and exits, we will
be able to terminate the process by just stopping the container (or
virtual machine).
This will actually work, to a degree. Most of our earlier concerns
(it's too hard and too heavyweight) no longer apply, because in this
section we're happy to have any means at all to prevent the attack.
However, it still requires root access (or the use of unprivileged
user namespaces) to run a container or a virtual machine. So this
solution is not truly general purpose; we can't use this routinely,
every time we create a child process, because our application
certainly should not run with root access in the normal case, nor can
everything run in an unprivileged user namespace.
We can partially get around the need for root access by having a
privileged daemon start processes on our behalf inside a container.
systemd, for example, with its systemd-run API, allows us to request
that systemd start up a process for us on the fly. systemd runs every
process in a separate cgroup (which is the underlying container
mechanism that we would use), so it can protect against the malicious
process leak problem.
But having someone else start a process on our behalf breaks a lot of
traditional Unix features. For example, we can't easily have our
child process inherit stdin/stdout/stderr from us, nor will it
inherit environment variables or any ulimits we've placed on ourself.
The shell, among other applications, is completely dependent on these
features.
Also, this privileged daemon centralizes all the processes we start
on the system. We can't, say, set up a truly isolated environment for
development or integration testing, because we'll still have to go
through the central daemon.
So as a general-purpose mechanism, this is not workable, but it can
work in certain constrained scenarios.
2.1.2 Limit the number of processes that can exist on the system
What if we limit the number of processes that can exist on the
system? Then as the process keeps forking, it will eventually exhaust
the available process space and stop, and in that frozen moment of
tranquility, an already-started process would be able to kill it.
The number of processes that can exist is actually already limited;
there's a maximum pid, and we can't have any more processes than
that. The issue is that as processes exit, possibly due to being
killed by us, their space is usually freed up, and new processes can
be created.
So if the malicious process just keeps forking, it can fill up the
space left by previous processes exiting, and this doesn't help us.
Stricter limits on the number of processes can prevent fork bombs,
but not more general attacks.
However, if we could prevent space from being freed up as processes
exit, the space that the malicious process has to operate in would shrink
and shrink, until finally it is no longer able to fork any more, and
we can kill the last copy. Preventing the reuse of process space
while under possible attack can be done using a technique that I'll
discuss at the end of this article. It's a key part of a robust
solution to the process leaking problem.
3 Processes have global, reusable IDs
A process is identified using its 'pid'. A pid is an integer,
typically in the range 1 to 32768 (the default pid_max on Linux),
which is selected for the process at startup from the pool of
currently unused pids, and which is relinquished back into that pool
when the process exits.
There is a single pool of process IDs on the system. If enough
processes are started and exit, a process ID will be reused.
Pids are mainly used to send signals to processes with the "kill"
system call (which is used for any kind of signal, not just lethal
ones).
Typically, a long-lived process (a "daemon") would write its own pid
into a file, called a "pidfile". Then other processes could send
signals to the daemon by reading that pidfile and using "kill".
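The pattern looks something like this sketch ("/var/run/daemon.pid"
is a hypothetical path; error handling is mostly omitted). As
explained next, the kill() at the end is fundamentally racy:

#include <signal.h>
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/var/run/daemon.pid", "r");
    int pid;
    if (!f || fscanf(f, "%d", &pid) != 1)
        return 1;
    fclose(f);
    /* RACE: the daemon may have exited by now, and its pid may
     * have been reused by a completely unrelated process */
    kill(pid, SIGTERM);
    return 0;
}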
But there is absolutely no guarantee that when you "kill", you are
sending a signal to the right process. If the daemon has exited, and
enough processes have started and stopped since then, the pid in the
daemon's pidfile might point to a completely unrelated process. You
might send a fatal signal to something critically important instead
of the daemon you meant to send it to!
You might try checking the command that the pid is running, or other
details about the process, before killing the process. But that's no
guarantee either - between the time you check and the time you
signal, the process may have died and been replaced.
Fundamentally, any usage of a pid (other than for your own child
processes) is vulnerable to a time-of-check-to-time-of-use (TOCTOU) race
condition. Since pids are the only way to identify a process, this
means any interaction with processes other than your own child
processes is inherently racy.
3.1 Flawed solutions
3.1.1 Don't reuse pids, use a UUID instead
The kernel could identify processes with some kind of truly globally
unique identifier. Then users wouldn't have race conditions when they
try to kill them. Note that users can't implement this solution in
userspace by tagging the processes with a UUID; the UUID must be
kernel-provided to avoid the TOCTOU issues.
This would work, but it would be difficult to retrofit onto an
existing Unix system: Many applications assume that pids are the same
size as 16-bit or 32-bit ints.
We would also pay an efficiency cost, just because of handling a
larger identifier. It would be unusual for an operating system to
provide references to its internal structures with UUIDs, when it can
use more efficient smaller identifiers and provide security through
other means.
3.1.2 Only send signals to your own child processes
When process A starts process B, and then process B exits, process A
is notified. Furthermore, process B leaves a "zombie process" behind
after it exits, which consumes the pid until process A explicitly
acts to get rid of the zombie process. These two features allow
process A to know exactly when it is safe to send signals to B's pid.
So if A stays running for as long as B is running, and only A sends
signals to B, we can have signals without races.
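Here's a minimal sketch of that guarantee (error handling omitted):
until A reaps B with waitpid(), B's pid is held by its zombie and
can't be reused, so the kill() below can never hit an unrelated
process:

#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {                      /* this is process A */
    pid_t b = fork();
    if (b == 0) {                     /* this is process B */
        execlp("sleep", "sleep", "inf", (char *)NULL);
        _exit(127);
    }
    /* safe: even if B has already exited, its zombie pins the pid */
    kill(b, SIGTERM);
    waitpid(b, NULL, 0);  /* reap; only now may the pid be reused */
    return 0;
}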
This works, and is an excellent replacement for pidfiles. But it
doesn't work in all situations.
What if process A exits unexpectedly? Then we are back in the
situation of not being able to kill process B without a race
condition. Furthermore, what if we genuinely want process B to
outlive process A? This is the case whenever we are starting a
long-lived process (a daemon), for example.
To support this, instead of forking off a process, process A could
send a request to a long-lived supervisor daemon to start process B,
as the supervisor daemon's own child. The authors of supervisor
daemons such as systemd or supervisord often urge software developers
not to fork off their own long-lived processes; instead, say the
supervisor daemon authors, we should request that the supervisor
daemon fork off long-lived processes on our behalf.
Unfortunately, that has the same issues as discussed in the section
on preventing malicious process leaks, where we considered having a
privileged daemon create containers on our behalf. We can't easily
have our child process inherit stdin/stdout/stderr from us, nor will
it inherit environment variables or any ulimits we've placed on
ourself. The shell, among other applications, is completely dependent
on these features. And this daemon centralizes the processes we start
on the system, so it's difficult to set up isolated test or
development environments.
Furthermore, even if we have a supervisor daemon starting processes
on our behalf, this leaves a static parent-child hierarchy which
cannot change. The supervisor daemon cannot, for example, start a new
version of itself to upgrade, without careful use of exec, as all of
its child processes will stop being its children. Nor can process A
initially start up process B as process A's child, and then later
decide that process B should live past process A's exit.
What we need is a way to send signals without races, without forcing
a specific parent-child hierarchy. If we can make the parent-child
hierarchy more flexible, it would work well. We will use this
technique in combination with others as part of a full solution at
the end of this article.
3.2 Correct solutions
3.2.1 Use pidfd
pidfd was added to Linux in 2019, and solves the issues with using
pids to identify processes. Read the LWN summary for more
information.
It allows us to break the rigid parent-child hierarchy, and replace
it with a more flexible supervision hierarchy, based on passing file
descriptors around.
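Here's a minimal sketch of signaling through a pidfd (error handling
omitted). I use raw syscall(2), since the pidfd_open and
pidfd_send_signal wrappers only appeared in recent glibc; this
assumes a 5.3 or later kernel:

#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    pid_t b = fork();
    if (b == 0) {
        execlp("sleep", "sleep", "inf", (char *)NULL);
        _exit(127);
    }
    /* the fd names this exact process: even if B exits and its pid
     * is reused, the pidfd never points at the new process */
    int pidfd = syscall(SYS_pidfd_open, b, 0);
    syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0);
    close(pidfd);
    return 0;
}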
4 Process exit is communicated through signals
Process exit is communicated to the parent of a process by SIGCHLD.
If process A starts process B, and then process B exits, process A
will be sent the SIGCHLD signal.
Signals are delivered to the entire process, and only one signal
handler can be registered for each signal.
So if the main function in process A registers a signal handler for
SIGCHLD, and library L1 in process A starts a process B, when process
B exits, the signal handler of the main function in process A will
receive a notification of the exit of a child it never started, and
the library will never be told that its child has exited.
Conversely, if the library L1 registers the signal handler, and the
main function or even another library L2 starts a process B, then
only L1 will be notified when the process exits.
In general, only one part of the program can directly receive
signals. That one part of the program then must forward the signal
around to whatever other components desire to receive signals. If a
library, like glibc, has no interface for receiving forwarded signal
information, then it can't use child processes. This is a major
inconvenience for both the library developer and the user.
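The conflict is easy to demonstrate: installing a SIGCHLD handler
with sigaction() silently replaces whatever handler another component
had installed, and the displaced handler is only visible through
oldact. A minimal sketch:

#include <signal.h>
#include <stdio.h>

static void my_handler(int sig) { (void)sig; /* reap our children */ }

int main(void) {
    struct sigaction sa = { .sa_handler = my_handler }, old;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, &old);
    if (old.sa_handler != SIG_DFL)
        fprintf(stderr, "clobbered someone else's SIGCHLD handler\n");
    return 0;
}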
4.1 Flawed solutions
4.1.1 Use signalfd
While signalfd is certainly a great help in dealing with signals on
Linux, it doesn't actually help deal with the problem of libraries
receiving SIGCHLD. You could use signalfd to wait for the SIGCHLD
signals, but you still then need to forward the signals to each
library.
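For reference, here's a minimal sketch of signalfd-based SIGCHLD
handling (error handling omitted). Notice that the process still
receives one undifferentiated stream of events, which it must
demultiplex and forward to the right library itself:

#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

int main(void) {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    sigprocmask(SIG_BLOCK, &mask, NULL);  /* block before signalfd */
    int sfd = signalfd(-1, &mask, SFD_CLOEXEC);

    struct signalfd_siginfo si;
    read(sfd, &si, sizeof(si));           /* blocks until SIGCHLD */
    /* si.ssi_pid says which child exited; forwarding that to the
     * right library is the part signalfd doesn't solve */
    return 0;
}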
4.1.2 Chain signal handlers
Can't we just have one library's signal handler call the next
library's signal handler?
Rather than explain in this article, I refer the reader elsewhere,
where it's explained that signal handler chaining can't be done
robustly. Libraries are held to a high standard: they must keep
working even in strange scenarios!
4.1.3 Create a standard library for starting children and have
everyone use it
The issue is that multiple libraries want to handle the task of
starting and monitoring children. Can't we just agree on a single
standard library that abstracts over SIGCHLD, and have everyone use
it? We can provide a file descriptor interface, which is increasingly
standard on Linux, and is easy for libraries to use and monitor.
Unfortunately, it would be near impossible to get every other library
that wants to use subprocesses or wants to listen for SIGCHLD to use
this single standard library.
There are already plenty of libraries which provide wrappers around
SIGCHLD/fork/exec, and plenty of code that depends on them. We can't
just have a flag day and switch everything over to a new library all
at once. This becomes even more tricky in high-level languages,
because most languages already come with a higher-level API around
spawning processes.
Still, the idea of providing a file descriptor interface for starting
and monitoring children is a good one. File descriptors can easily be
integrated into an event loop. And a file descriptor can be monitored
by a library without interfering with the rest of the program, using
a library's own private event loop or other mechanisms. We just need
a way to provide that interface that does not interfere with other
libraries in the same process.
4.2 Correct solutions
4.2.1 Use pidfd
pidfd allows a library to launch a process and get a pidfd back, then
use that pidfd to monitor the process.
A file descriptor can be easily integrated into an event loop, as
mentioned in the previous section.
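Here's a minimal sketch (error handling omitted, again using raw
syscall(2) for pidfd_open): the pidfd becomes readable when the
process exits, so any event loop can watch it without ever touching
SIGCHLD:

#define _GNU_SOURCE
#include <poll.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t b = fork();
    if (b == 0) {
        execlp("sleep", "sleep", "1", (char *)NULL);
        _exit(127);
    }
    int pidfd = syscall(SYS_pidfd_open, b, 0);
    struct pollfd p = { .fd = pidfd, .events = POLLIN };
    poll(&p, 1, -1);          /* readable once the child exits */
    waitpid(b, NULL, 0);      /* reap as usual */
    close(pidfd);
    return 0;
}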
5 How to fix all these problems
pidfd is a great solution to the third and fourth problems, but it
doesn't solve the first two problems.
I only know one existing solution that fixes all these problems
without sacrificing flexibility or generality.
Use the C utility supervise to start your processes; for Python, you
can use its associated Python library.
Essentially, we delegate the problem of starting and monitoring child
processes to a small helper program: supervise. And it abstracts away
the fixes for these problems behind a nicer (but still low-level)
interface. For a high-level interface, one can use the Python
library.
5.1 Problem: It's easy for processes to leak
Solution: supervise kills all your descendant processes when you
exit.
supervise is passed a file descriptor to read instructions from on
startup, and monitors that fd throughout its (short and simple)
lifetime. When the parent process exits, the fd will be closed,
supervise will be notified, supervise will kill the descendant
processes, and then supervise will also exit.
But because process lifetime is tied to the lifetime of a fd, it's
still easy to create long-lived processes if you wish; just make sure
the fd outlives your own process.
supervise is able to find all descendant processes by using
PR_SET_CHILD_SUBREAPER, a Linux-specific feature. If process A starts
process B which starts process C, and process B exits, then if
process A has set PR_SET_CHILD_SUBREAPER then process A will become
the new parent of process C. This allows supervise to safely find and
kill all descendant processes.
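Here's a minimal sketch of the reparenting behavior (an illustration,
not supervise's actual code; error handling omitted):

#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    prctl(PR_SET_CHILD_SUBREAPER, 1);   /* we are process A */
    if (fork() == 0) {                  /* process B */
        if (fork() == 0) {              /* process C */
            pause();                    /* C lingers */
            _exit(0);
        }
        _exit(0);                       /* B exits, orphaning C */
    }
    wait(NULL);   /* reap B */
    /* C has been reparented to A, not to init: A will get SIGCHLD
     * for it, can wait() for it, and can safely kill() it */
    return 0;
}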
5.2 Problem: It's impossible to prevent malicious process leaks
Solution: supervise kills all your descendant processes when you
exit, securely and in a guaranteed-to-terminate way.
It does this using the technique mentioned in the "Limit the number
of processes that can exist on the system" section. If we don't free
up pid space as a malicious process forks and exits, eventually the
pid space will be exhausted and the malicious process can be cornered
and killed.
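Here's a sketch of the core primitive, as I understand the approach
(an illustration, not supervise's actual code): waitid() with WNOWAIT
observes an exited child while leaving its zombie in place, so the
dead child's pid stays out of the reuse pool and the attacker's room
to fork only ever shrinks:

#include <sys/wait.h>

/* report one exited child without reaping it; its pid stays pinned
 * by the zombie until we later waitid() without WNOWAIT */
int observe_without_reaping(void) {
    siginfo_t info;
    if (waitid(P_ALL, 0, &info, WEXITED | WNOWAIT) != 0)
        return -1;
    return info.si_pid;
}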
5.3 Problem: Processes have global, reusable IDs
Solution: supervise gives you a file descriptor interface to
signaling a process.
To signal the process, you just write to the file descriptor. File
descriptors are local and unforgeable, so it's not possible for the
file descriptor to suddenly start pointing at a different instance of
supervise, wrapping a different process.
All the descendant processes of supervise will at some point become
its direct children, thanks to PR_SET_CHILD_SUBREAPER, so it can
safely send them all signals using "kill" and cause them to exit. A
supervision hierarchy can thus be maintained without forcing any
specific organization.
And just like all file descriptors, the supervise file descriptors
can be inherited by children or passed over Unix sockets. This allows
a supervision hierarchy to be rearranged at runtime, rather than
forcing a static parent-child hierarchy.
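Here's a minimal sketch of passing such a file descriptor over a
Unix socket with SCM_RIGHTS (sock is assumed to be a connected
AF_UNIX socket; error handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int sock, int fd) {
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;       /* we are transferring an fd */
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0);   /* receiver gets a duplicate */
}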
5.4 Problem: Process exit is communicated through signals
Solution: supervise gives you a file descriptor interface to monitor
a process for exit.
In addition to the file descriptor that supervise reads instructions
from, supervise also accepts a file descriptor to write status
changes to. You can read and monitor this file descriptor to receive
notification of process status changes.
6 How to really fix all these problems in the long term
Of course, supervise is not a long-term solution. Running an
additional helper process for every real process you start is an
annoying, if slight, inconvenience and performance loss. The correct
long-term solution is to actually get this functionality into the
Linux kernel.
pidfd is the obvious basis for a solution, since it solves two of
these four problems. We just need to add a way to fix the process
leaking issues.
The best way to approach this would be to tie the lifetime of the
process to the lifetime of the pidfds pointing to it, as pdfork does
when PD_DAEMON is not set. This could be done with a new
CLONE_TERMINATE_ON_CLOSE flag. Then when all the pidfds pointing to a
process are closed, the process is killed. This happens naturally
when the parent of the process (and any other processes that the
parent sent the pidfd to) have lost interest in the process.
This still allows the process to fork off and leak processes of its
own; that could be addressed by various means, perhaps by using a
seccomp filter to force the process to only start processes using
CLONE_TERMINATE_ON_CLOSE, or by adding a new prctl that is an
inheritable variant of PR_SET_PDEATHSIG. Using a pid namespace is a
major hassle at the moment since it requires a dedicated init
process, but it could also form the basis of a solution if that
requirement was removed.
Hopefully Linux gets these features soon. In the meantime, supervise
provides equivalent functionality in userspace.
Created: 2021-05-08 Sat 19:10