[HN Gopher] The Unix process API is unreliable and unsafe (2021)
___________________________________________________________________
The Unix process API is unreliable and unsafe (2021)
Author : todsacerdoti
Score : 107 points
Date : 2023-03-22 17:41 UTC (5 hours ago)
(HTM) web link (catern.com)
(TXT) w3m dump (catern.com)
| dataflow wrote:
| It seems there isn't even anything written about FD_CLOEXEC and
| its associated race conditions either, as far as I can tell.
| Basically it's impossible to portably spawn a subprocess in a
| safe manner if you don't have sufficient control over all the
| code running in your process, because you might duplicate file
| descriptors into the child that you might not have intended, and
| that can break things in the parent.
| rwmj wrote:
| AIUI the problem there is not FD_CLOEXEC/SOCK_CLOEXEC but the
| possibility that some library might not be using it? (That is
| to say, *_CLOEXEC if used does not have race conditions)
|
| However we usually cope with that by closing all
| unknown/unexpected file descriptors after fork and before exec.
| Linux even has a system call to make that easier:
| https://man7.org/linux/man-pages/man2/close_range.2.html
| dataflow wrote:
| > AIUI the problem there is not FD_CLOEXEC/SOCK_CLOEXEC but
| the possibility that some library might not be using it?
|
| Not exactly. The problem is that you have to be able to set
| it atomically from the creation of the file descriptor.
| Setting it after creation is subject to a race condition
| where a fork occurs in the interim. There's no portable way
| to do that, and people often ignore O_CLOEXEC even when
| there's a platform-dependent way to pass it. (How often do
| you see dup3() called, for example? And how often do you see
| higher-level languages and libraries expose this and force
| callers to make a conscious decision?)
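|
| A minimal Python sketch of the difference, assuming Linux (for
| os.pipe2) and Python 3.4 or later; the helper names are
| hypothetical. The first helper has the kernel set close-on-exec
| when the descriptors are created; the second sets it in a
| separate step, which leaves the window described above if
| another thread forks in between.
|
```python
import fcntl
import os


def pipe_cloexec_atomic():
    """Create a pipe whose ends are close-on-exec from the moment they exist."""
    return os.pipe2(os.O_CLOEXEC)


def set_cloexec_after_the_fact(fd):
    """Racy two-step version: a fork()+exec() in another thread can run
    between the descriptor's creation and the F_SETFD call below."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def is_cloexec(fd):
    """Report whether FD_CLOEXEC is currently set on `fd`."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
```
|
| Python is one of the higher-level languages that did change its
| default here: since 3.4 (PEP 446) the descriptors it creates
| are non-inheritable unless the caller asks otherwise.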
|
| > However we usually cope with that by closing all
| unknown/unexpected file descriptors after fork and before
| exec.
|
| You can't really do that portably (well, maybe unless you
| want to call close() billions of times). And even if/when you
| _can_ do that, you run into the reverse problem, where you
| might close descriptors that were supposed to be duplicated
| into the subprocess but that you didn't know about. (One
| example is when a user performs redirect inside a shell like
| 2>&3 and wants it to work inside a descendant process - you
| don't want to just randomly close FDs you don't recognize.)
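|
| A hedged Python sketch of both halves of that trade-off:
| subprocess closes every descriptor above 2 in the child by
| default (close_fds=True), and pass_fds names the ones that are
| meant to survive. It only helps for descriptors the caller
| knows about, which is exactly the limitation described above.
| The helper name and the child one-liner are made up for
| illustration.
|
```python
import os
import subprocess
import sys


def run_child_with_fd(message):
    """Spawn a child that writes `message` to a pipe it inherits on purpose."""
    read_end, write_end = os.pipe()
    # Hypothetical child: writes argv[2] to the descriptor number in argv[1].
    child_code = "import os, sys; os.write(int(sys.argv[1]), sys.argv[2].encode())"
    try:
        subprocess.run(
            [sys.executable, "-c", child_code, str(write_end), message],
            pass_fds=(write_end,),  # all other fds above 2 are closed in the child
            check=True,
        )
    finally:
        os.close(write_end)  # drop the parent's copy so the read below sees EOF
    with os.fdopen(read_end, "rb") as reader:
        return reader.read().decode()
```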
| deathanatos wrote:
| > _1.1.4 A should run B inside a container_
|
| I think the author knows this, but you don't have to start a
| full-blown container if all you want is to solve the article's
| stated problem of process leaks. Become a new pid NS: point 1,
| the subprocess.run criticism is fixed (it just works); point 2, I
| don't believe a pid NS requires either root or a user NS; and all
| that remains is point 3. It doesn't _require_ you to start a
| separate init, you can _be_ the init, i.e., whatever your
| top-level service is. IIRC, the only two requirements are handling
| SIGTERM (which you should probably already be doing) and reaping
| reparented orphans who then die. But also dumb-init is available?
| The article notes using a separate init, too: "This init process
| will do nothing but increase the load on the system, and it will
| prevent us from directly monitoring the started processes." and
| ... no? dumb-init, in a container I have here that's run for >2
| weeks, has used < 20 ms of CPU time. RSS of 522 KiB. You'll be
| fine. I'm not sure how it "will prevent us from directly
| monitoring the started processes" -- it would live _above_ you in
| the process tree. You'd monitor the started process the same way
| you would any started process.
|
| Edit: ah, crap, I've got it wrong. A new PID NS requires root (or
| user NSes); being a subreaper, I think, maybe does not. But I'm
| not sure being a subreaper is sufficient; you want the subtree
| reaped on the subtree root's death.
|
| (I'm also not sure that the subreaper approach is sufficient: if
| the subreaper itself dies, the processes leak.)
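|
| For reference, a rough Python sketch of becoming a subreaper
| without root, assuming Linux 3.4 or later and calling prctl(2)
| through ctypes. The two constants are copied from
| <linux/prctl.h>; the helper names are hypothetical. It does not
| address the caveat above: the subreaper still has to wait() on
| whatever gets reparented to it, and if it dies the orphans move
| on to init.
|
```python
import ctypes
import os

# Values from <linux/prctl.h>; Linux-only.
PR_SET_CHILD_SUBREAPER = 36
PR_GET_CHILD_SUBREAPER = 37

_libc = ctypes.CDLL(None, use_errno=True)
_ZERO = ctypes.c_ulong(0)


def _check(result):
    """Raise OSError with errno if a prctl() call reported failure."""
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    _check(_libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1), _ZERO, _ZERO, _ZERO))


def is_subreaper():
    """Read back the flag; the kernel stores it through the pointer argument."""
    flag = ctypes.c_int(0)
    _check(_libc.prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag), _ZERO, _ZERO, _ZERO))
    return bool(flag.value)
```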
| mike_hock wrote:
| The subreaper is also gonna have the same footprint as the
| pidns init, and is _more complicated._
|
| It's just as flawed a solution as the other flawed solutions.
| We can accept the subreaper being bug-free as a requirement for
| this workaround to be working, but we can't prevent it from
| being sigkilled.
| jiveturkey wrote:
| Too bad the article doesn't discuss contracts, the Solaris
| solution. As the article is very linux focused, I imagine the
| author is blissfully unaware.
| jamesdutc wrote:
| I recently wrote an autorunner[1] (like Entr[2] and Watchexec[3])
| so I have some recent exposure to this problem. (I will be
| releasing it on Github shortly.) My autorunner allows running
| interactive programmes, so it is very sensitive to lingering
| child processes.
|
| For the purposes of the autorunner, I use approach 1.1.3 ("always
| write down the pid of every process you start, or otherwise
| coordinate between A and B") and leave it to the user to figure
| out what happens if the child process misbehaves with relation to
| any processes it starts.
|
| However, I want to point out that approach 1.1.4 ("A should run B
| inside a container") is easier to do than one might expect, and
| I'd like to plug one of my favourite utilities--Bubblewrap[4].
| The Bubblewrap documentation says "[y]ou are unlikely to use it
| directly from the commandline, although that is possible" but I
| have built some amazing little tools from it.
|
| Try the following invocation: bwrap --ro-bind /
| / --proc /proc --unshare-pid ps
|
| This launches `ps` in a PID namespace with a new `/proc` (since
| `ps` will read from the host proc otherwise) and the root
| filesystem mounted readonly. Any processes within the PID
| namespace should have been created by the immediate command that
| `bwrap` launched. There are also flags `--die-with-parent` and
| `--as-pid-1` which can further reduce runtime overhead. If you
| really need a supervisor process, this can be as simple as a
| `/bin/sh` script that `kill TERM --timeout 1000 KILL` in a loop
| on everything it sees in `ps`.
|
| As you can see, there's a lot you can do with this tool with
| significantly lower overhead than using Docker. It has been my
| goal for some time to extract some of the functionality of
| Bubblewrap into a Zsh extension to allow accessing these
| mechanisms with even lower overhead. I think the creation of
| namespaces is a missing primitive in Linux shells, and being able
| to quickly construct namespaced environments allows for a style
| of safe, robust, simple shell scripting. e.g., if you create a
| mount namespace to run your script, you can actually be looser
| about parameterising file locations (since the namespace can
| ensure everything is exactly where you want it to be.)
|
| [1] https://fosstodon.org/@dontusethsicode/110019380909461936
|
| [2] http://eradman.com/entrproject/
|
| [3] https://watchexec.github.io/
|
| [4] https://github.com/containers/bubblewrap
| jrootabega wrote:
| Looks interesting. Have you needed or found any good ways to
| detach the wrapped code from the terminal where you first
| launch the wrapper? (for security mostly) I haven't found a
| good way to do that with bwrap other than using sudo or su and
| their pty feature. bwrap's --new-session flag didn't play nice
| with interactive programs in my attempts.
| ary wrote:
| This links one of my favorite critiques of API design: 'A fork()
| in the road'
|
| https://www.microsoft.com/en-us/research/uploads/prod/2019/0...
|
| It's very much worth a read.
| 1vuio0pswjnm7 wrote:
| "I only know one existing solution that fixes all these problems
| without sacrificing flexibility or generality.
|
| Use the C utility supervise to start your processes; for Python,
| you can use its associated Python library."
|
| C utility written in 1999. Last updated in 2001. I'm still using
| it every day, not always with multilog and svscan.
| evilotto wrote:
| Is basic fork/exec from a large process still slow or have newer
| apis fixed that?
| wmf wrote:
| I kept expecting Capsicum to step from behind the curtain but no.
| loeg wrote:
| Capsicum is about sandboxing code in the same process, not
| really related to the problem the article is talking about.
| FreeBSD's somewhat related mechanism to Linux pidfd is pdfork /
| pdkill:
| https://man.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2&n...
| dataangel wrote:
| > Shell scripts make starting processes trivial, but it's almost
| unthinkable that, say, bash, would integrate functionality for
| starting containers, so that every process is started in a
| container.
|
| Doooooooo it
| edgyquant wrote:
| Is it me or does this not make sense? Bash glues and pipes
| together commands, has network access etc. Every process being
| a container would require either knowing all commands and being
| able to ensure containers have proper access (even across
| pipes) or that containers were so open as to defeat the
| purpose.
| lozenge wrote:
| The problem is every executable can impersonate the user, it
| has access to do anything the user can do, including deleting
| or encrypting all their files, reading ssh private keys etc.
| Network access is rarely concerning unless the program has
| access to credentials.
| Karellen wrote:
| > The problem is every executable can impersonate the user,
|
| Um, what?
|
| What do you mean by "impersonate" here? What does a process
| that does not impersonate the user look like? Do you just
| mean "executables that run as the user"?
|
| When you log in, and a shell is started that runs as you,
| is that shell impersonating the user?
|
| When you execute commands, as yourself, those commands run
| with your credentials. Because you ran them. Isn't that,
| like, the point?
| dllthomas wrote:
| Typically, any program I run has the totality of my
| (regular user) authority, which may let it do things I
| did not intend.
|
| Related:
|
| https://en.wikipedia.org/wiki/Ambient_authority
|
| https://en.wikipedia.org/wiki/Confused_deputy_problem
|
| https://en.wikipedia.org/wiki/Object-capability_model
| nyrikki wrote:
| Nothing is stopping you from using namespaces, and
| containers are just namespaces with cgroups etc
|
| But containers aren't jails, pid and uid remapping is just
| remapping.
|
| A huge problem is that containers have to drop capabilities on the
| honor system. In the default docker mode, running as root,
| anyone who can launch a container can read from any block
| device if they don't drop the mknod capability as an
| example.
|
| Actually a privileged container can update the bios or even
| load arbitrary kernel modules in the host context or change
| kernel parameters as it is a shared kernel.
|
| I tried to get the docker folks to add a conf option to
| disallow privileged containers, but they refused.
|
| You can run in user mode now but most people want
| persistence and other features that don't allow for that.
|
| The important point is if you assume containers are a
| security feature you are going to have a bad time. Jails
| were bad enough and containers are just one step up from
| chroots as far as security goes.
|
| namespace isolation is the main benefit of containers.
|
| Selinux and apparmor are far more appropriate than
| containers for the security concerns. While I don't
| personally like selinux, apparmor profiles are pretty easy
| to write.
| nyrikki wrote:
| Plus the 'leaks' in the Linux process API are even worse
| as each container may run its own tiny-init
|
| Containers make the first point of the OP far worse by
| adding way more pid namespaces.
| wmf wrote:
| Maybe cgroups would be better than full containers here.
| GauntletWizard wrote:
| Which cgroups? Containers are not actually a thing in
| kernel-land. They're a combination of Process, Network,
| User, and other namespacing.
| wmf wrote:
| No, cgroups are a separate API from namespaces.
| https://man7.org/linux/man-pages/man7/cgroups.7.html
| mattpallissard wrote:
| Done.
| https://pallissard.net/2022/06/27/limiting_application_resou...
|
| TL;DR: two functions, "dispatch", which calls systemd-run, and
| "wrap", which takes a command, a memory limit, and a cpu limit.
| nine_k wrote:
| systemd is not bash. Otherwise indeed true.
| nickdothutton wrote:
| It is after reading pieces like these that I'm reminded of how
| fortunate I am to have had experience of other "serious"
| Operating Systems, used at scale, in complex and sometimes
| unfriendly environments. Namely VAX/VMS. Although some might feel
| the title was a little clickbaity, I enjoyed the article.
| DeathArrow wrote:
| VMS was released for x86, so if you miss it you can give it a
| spin.
|
| https://vmssoftware.com/about/news/2022-07-14-openvms-v92-fo...
| skissane wrote:
| Thus far the x86 port is only available to paying customers.
| x86 hobbyist program is expected very soon now (within the
| next few days/weeks). Until then, the best x86 option for
| hobbyist use is probably running the Alpha version under an
| emulator. (I don't know if any Itanium emulators are
| available.) Or emulated VAX, though OpenVMS for VAX is no longer
| legally available to hobbyists, but not hard to find if you
| don't care about the legalities of it.
| gtirloni wrote:
| Is Fuchsia any better for what the article is concerned about?
|
| https://fuchsia.dev
| sitkack wrote:
| Excellent article! Thanks for posting it. It outlines all the
| problems and then offers a solution with this tool (by the
| author)
|
| https://github.com/catern/supervise
| cryptonector wrote:
| Yup. PIDs are racy unless they belong to your direct child processes
| and you've not reaped them yet. And it goes on.
|
| Windows has a much better process API, except for CreateProcess()
| (the less said about which the better).
|
| One thing I generally do when I have a multi-process program (one
| that starts multiple worker processes, say), is to have a pipe
| with the write end only in the parent process and whose read end
| the children include in their I/O event loops. That way when the
| parent exits the children find out and then they too exit. The
| parent will still try to signal them, but say the parent gets
| `SIGKILL`ed: the children find out and they exit.
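|
| A small Python sketch of that lifeline-pipe pattern, under
| assumed names (spawn_worker and the embedded worker script are
| illustrative, not from any library). The parent keeps the only
| copy of the write end; the worker watches the read end and
| treats EOF as "the parent is gone".
|
```python
import os
import subprocess
import sys

# Hypothetical worker: exits as soon as the lifeline pipe reports EOF.
WORKER_SOURCE = """
import os, select, sys
lifeline = int(sys.argv[1])
while True:
    # A real worker would watch its other descriptors in this call too.
    readable, _, _ = select.select([lifeline], [], [])
    if lifeline in readable and os.read(lifeline, 1) == b"":
        sys.exit(0)  # EOF: no process holds the write end any more
"""


def spawn_worker():
    """Start a worker and return (process, write end of its lifeline)."""
    read_end, write_end = os.pipe()
    worker = subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE, str(read_end)],
        pass_fds=(read_end,),  # the write end is NOT inherited by the worker
    )
    os.close(read_end)  # the parent only needs the write end
    return worker, write_end
```
|
| The parent never writes to the pipe. Closing the write end, or
| dying by any means including SIGKILL, is what wakes the worker.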
| monocasa wrote:
| pidfds solve some of those problems.
| cryptonector wrote:
| Indeed, they do.
|
| One can approximate pidfd in multi-processed programs on OSes
| that lack it, but that's about it. pidfd needs to be first-
| class.
| rand_flip_bit wrote:
| Curious why you think CreateProcess is worse than fork/exec.
| Sure it takes about a dozen parameters but is that really the
| end of the world?!? It's much much easier to use correctly and
| doesn't have nearly as many of the pitfalls as fork/exec.
| Especially in large processes with lots of memory allocated. I
| genuinely don't understand why people dislike it so much.
| jborean93 wrote:
| Most of the complaints I've seen are about the number of args
| and the complexity of calling it vs something simple like
| fork. There are a lot of knobs to turn which you need to be
| explicit about. That's not even getting into the whole
| ProcThreadAttributeList and the myriad of options it exposes.
|
| In saying all that I do prefer the `CreateProcess*` APIs on
| Windows vs the POSIX ones but that might be because I
| understand the former better.
| bolangi wrote:
| Where process supervision is required under unix, you can use
| systemd, the linux-only solution pushed by redhat, or one of the
| small supervision suites such as s6 developed by skarnet.org.
| slondr wrote:
| What happens when s6 crashes, then?
| [deleted]
| aidenn0 wrote:
| Do cgroups solve any of these problems? I was mildly surprised to
| not see them mentioned.
| wmf wrote:
| Where the author talks about containers you can mentally
| substitute cgroups since Linux containers are cgroups +
| namespaces.
| rcoveson wrote:
| That's how I look at it too, but lots of people don't look at
| it that way, hence all the handwaving about "too heavyweight"
| and "seems like overkill" etc.
|
| Largely because of Docker and Kubernetes, many think of a
| container as _all_ of the following:
|
| 1. A cgroup + [all or nearly all of the] unshare-able
| namespaces
|
| 2. A writable, disposable overlay on top of an immutable
| "image", which may be lazily downloaded and extracted
|
| 3. A resource managed by a userspace daemon managed by a
| userspace utility over a socket
|
| 4. Optionally, a seccomp-bpf filter or apparmor profile or
| something
|
| But there's a whole useful spectrum between a vanilla process
| and a Docker container like that. Lots of points on that
| spectrum still feel highly container-ized but aren't really
| much more heavyweight than a vanilla process.
|
| Beyond that, in the point about PID namespaces, the author
| should mention that there are ultra-light-weight init
| implementations that are barely a factor in overhead.
| jwilk wrote:
| They were mentioned in 2.1.1.
| not_enoch_wise wrote:
| you're exciting me
| userbinator wrote:
| Is "unreliable and unsafe" the new "considered harmful"? Because
| it sure feels like that.
| calt wrote:
| I think it's quite a bit more descriptive and objective than
| "considered harmful."
| wang_li wrote:
| This reads like they have a set of requirements and since the
| Unix model doesn't meet their requirements, the Unix model is
| bad. As opposed to it's fine for those who have different
| requirements.
| kccqzy wrote:
| Note that this article is really talking about the general case,
| but in practice a lot of techniques can work if you have narrower
| requirements or if you have more control over what you run.
|
| For example, in 1.1.4 the author talks about why containers are
| not a solution giving three distinct reasons. But if we change
| our perspective a little bit, none of the three reasons are
| blocking. The first is that it's not easy; but `docker run` or
| `podman run` is easy. Even systemd units start with separate
| control groups to allow you to terminate everything at once. The
| second reason was about gdb; when was the last time you used gdb
| in production? If you are using gdb someone is interactively
| using the computer and can be relied upon to clean up processes
| manually. The third reason is that containers are more
| heavyweight, but there's no need to make every subprocess a
| separate container: if multiple processes should be managed as a
| single unit (including the case when we'd want to terminate a
| whole group of processes) they should run in the same container.
|
| So with a slight change of perspective we find the problem easily
| solved. It has trade-offs, but it works well enough in practice
| that only very few purists have a problem with it. Not to diss on
| the author--I think this type of perfectionist thinking is
| illuminating in terms of API design--but pragmatically it's a
| solved problem.
| wahern wrote:
| The same is true for the process group/session + controlling
| terminal solution: the solution doesn't work recursively (can't
| do process management downstream), and it also requires child
| processes to abstain from changing SIGHUP handler or mask, but
| in the vast majority of cases none of those limitations are a
| problem. Combined with POSIX fcntl locks[1] on a PID file, this
| is my go-to generic solution for Unix-portable[2], multiprocess
| daemons. The amount of code required in the supervisor
| component is quite trivial, yet covers almost all of your
| bases.
|
| [1] fcntl locks permit querying the PID of the lock holder, so
| you don't need to write the PID to the file, providing a
| solution to the PID file race and loaded gun dilemmas. (There's
| still a race, but the same race exists with Linux containers,
| and both can be resolved in similar manner--query PID, send
| SIGSTOP, verify PID association, send SIGKILL or SIGCONT.)
|
| [2] One of the crucial behaviors, that the kernel atomically
| sends SIGHUP to all processes in the group if the controlling
| process terminates, isn't guaranteed by POSIX, but it's the
| behavior on all Unix I've tried--AIX, FreeBSD, macOS, Linux,
| NetBSD, OpenBSD, and Solaris.
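|
| The lock-holder query from footnote [1] can be sketched in
| Python roughly as follows. This is an illustration only: the
| struct format assumes the 64-bit Linux layout of `struct flock`
| and would need adjusting on other systems, and lock_holder_pid
| is a made-up name.
|
```python
import fcntl
import os
import struct

# ASSUMPTION: 64-bit Linux layout of `struct flock`:
#   short l_type; short l_whence; off_t l_start; off_t l_len; pid_t l_pid
FLOCK_FORMAT = "hhqqi"


def lock_holder_pid(path):
    """Return the PID holding a conflicting POSIX lock on `path`, or None.

    Meant for use by a process other than the lock holder: F_GETLK never
    reports the caller's own locks, and closing the descriptor below would
    release any POSIX locks the caller itself holds on the file.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # Ask: "could I take a write lock on the whole file?" (l_len 0 = to EOF)
        query = struct.pack(FLOCK_FORMAT, fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)
        reply = fcntl.fcntl(fd, fcntl.F_GETLK, query)
    finally:
        os.close(fd)
    l_type, _whence, _start, _length, l_pid = struct.unpack(FLOCK_FORMAT, reply)
    # The kernel rewrites l_type to F_UNLCK when nothing conflicts.
    return None if l_type == fcntl.F_UNLCK else l_pid
```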
| kwhitefoot wrote:
| That was interesting and clearly written, I wish all such
| articles were as clear.
___________________________________________________________________
(page generated 2023-03-22 23:00 UTC)