[HN Gopher] Fork() is evil; vfork() is goodness; afork() would b...
       ___________________________________________________________________
        
       Fork() is evil; vfork() is goodness; afork() would be better;
       clone() is stupid
        
       Author : __s
       Score  : 237 points
       Date   : 2022-02-28 17:25 UTC (5 hours ago)
        
 (HTM) web link (gist.github.com)
 (TXT) w3m dump (gist.github.com)
        
       | kazinator wrote:
       | Concurrently running dupe currently on front page:
       | https://news.ycombinator.com/item?id=30499169
       | 
       | :) :) :)
        
         | cryptonector wrote:
         | Ha!
        
       | immibis wrote:
       | Another option is to allow the parent to create an empty child
       | process, and then make arbitrary system calls and execute code in
       | the child, like a debugger does. In most cases the last "remote
       | system call" would be exec.
        
         | cryptonector wrote:
         | posix_spawn() essentially is like that, or can be, as an
         | implementation detail.
        
       | infogulch wrote:
       | The dense fog lifts, tree branches part, a ray of light beams
       | down on a pedestal revealing the hidden intentions of the
       | ancients. A plaque states "The operational semantics of the most
       | basic primitives of your operating system are designed to
       | simplify the implementation of shells." You hesitantly lift your
       | eyes to the item presented upon the pedestal, take a pause in
       | respect, then turn away slumped and disappointed but not entirely
       | surprised. As you walk you shake your head trying to evict the
       | after image of a beam of light illuminating a turd.
        
         | ckastner wrote:
         | > _" The operational semantics of the most basic primitives of
         | your operating system are designed to simplify the
         | implementation of shells."_
         | 
         | Yes, but why is this characterized as something negative?
         | 
         | Isn't that the entire point? Operating systems are there to
         | serve user requests, and shells are an interface between user
         | and OS.
         | 
         | Shells simply developed features that users required of them.
        
         | [deleted]
        
       | psanford wrote:
       | I saw a bug once where an application would get way slower on
       | MacOS after calling fork(). Not just temporarily either; many
       | syscalls would continue to run slowly from the call to fork()
       | until the process exited.
       | 
       | Looking on Stack Overflow, I see a few reports of this
       | behavior[0][1].
       | 
       | [0]: https://stackoverflow.com/questions/4411840/memory-access-
       | af...
       | 
       | [1]: https://stackoverflow.com/questions/27932330/why-is-
       | tzset-a-...
        
       | tych0 wrote:
       | The problem with this argument is that the set of programs that
       | just fork() and then exec() is fairly small. Sure, shells are
       | small and do this, but then the article argues that shells are a
       | good use of fork().
       | 
       | In larger programs, you're forking because you need to diverge
       | the work that's going to be done and probably where it's going to
       | be done (maybe you want to create a new pid ns, you need a
       | separate mm because you're going to allocate a bunch, whatever).
       | Maybe the argument is that programs should never do this? I don't
       | buy that. Then there's a lot of string-slinging through exec().
        
         | olliej wrote:
          | The vast majority of programs that fork are doing fork()
          | followed almost immediately by exec(), to the extent that on
          | macOS, for example, after fork() a child process is only
          | really considered safe to call exec(). Pretty much nothing
          | else is considered safe.
        
           | NovemberWhiskey wrote:
            | Yeah; that would be my assumption too. I worked one time on a
            | significant project that benefited from fork() without exec()
            | and it was a monstrous pain - only if you own every single
            | line of code in your project, have centralized resource
            | management, and have no significant library dependencies
            | should you ever consider doing this.
        
             | olliej wrote:
             | Yeah, you can't depend on pthreads or pthread mutexes
             | (they're not defined as being fork safe).
             | 
             | The entirety of Foundation (so presumably anything in
             | Swift) is not fork safe either.
             | 
             | To be clear: "not fork safe" in this case means "severely
             | constrained environment": e.g. you can do things liker
             | limits, set up pipes, etc but good luck with much more. I
             | guess morally similar to the restrictions you have in a
             | signal handler, albeit with different restrictions.
        
         | cryptonector wrote:
         | Oh no, there's tons of ProcessBuilder type APIs in Java,
         | Python, and... every major language you can think of.
         | 
         | The problems with fork() become very apparent in any Java apps
         | that try to run external programs, especially in apps that have
         | many threads and massive heaps and are very busy.
        
         | kllrnohj wrote:
         | > In larger programs, you're forking because you need to
         | diverge the work that's going to be done and probably where
         | it's going to be done
         | 
         | That's usually going to be done with clone() instead, no?
         | You'll likely want to fiddle with the various flags for those
         | usages and are unlikely to be happy with what fork() otherwise
         | does.
        
         | pm215 wrote:
         | That's backwards from my experience, which is that most users
         | of fork() only do "fork; child does small amount of setup, eg
         | closing file descriptors; exec". Shells are one of the few
         | programs that do serious work in the child, because the POSIX
         | shell semantics surface "create a subshell and do..." to the
         | shell user, and then the natural way to implement that when
         | you're evaluating an expression tree is "fork, and let the
         | child process continue evaluating as a long-lived process
         | continuing to execute as the same shell binary". (Depending on
         | what's in that sub-tree of the expression, it might eventually
         | exec, but it equally might not.)
         | 
         | Many years back I worked on an rtos that had no fork(), only a
         | 'spawn new process' primitive (it didn't use an MMU and all
         | processes shared an address space, so fork would have been
         | hard). Most unixy programs were easy to port, because you could
         | just replace the fork-tweak-exec sequence with an appropriate
         | spawn call. The shells (bash, ash I think were the two I looked
         | at) were practically impossible to port -- at any rate, we
         | never found it worth the effort, though I think with a lot of
         | effort and willingness to carry invasive local patches it could
         | have been done.
        
       | lucideer wrote:
       | The good/evil/etc. here seem to be defined exclusively around
       | "performance above all else", and - more specifically -
       | performant primitives over performant application architecture.
       | 
       | It strikes me that performance gains associated with sharing
       | address space & stack are similar to many performance gains:
       | trade-offs. So calling them "good" and "evil" when performance is
       | seemingly your sole goal and interest seems a bit forward.
        
         | cryptonector wrote:
         | In my world we often say things like "X is the moral equivalent
         | of Y" where X and Y are just technologies and, _clearly_ , are
         | morally-neutral things.
         | 
         | Why do we do this? Well, because it adds emphasis, and a dash
         | of humor.
         | 
         | Clearly fork() is neither Good nor Evil. It's morally neutral.
         | It has no moral value whatsoever. But to say "fork() is evil"
         | is to cause the audience to raise their eyebrows -"what, why
         | would you say fork() is evil?!"- and maybe pay attention.
         | 
         | Yes, there is the risk that the audience might react
         | dismissively because fork() obviously is morally-neutral, so
         | any claim that it is "evil" must be vacuous or hyperbolic. It's
         | a risk I chose to take.
         | 
         | Really, it's a rhetorical device. I think it's pretty standard.
         | I didn't create that device myself -- I've seen it used before
         | and I _liked_ it.
        
       | pipeline_peak wrote:
       | Your idea good
       | 
       | Your idea stupid
       | 
       | I'm not woke by any means, idk what it is about low level
       | programming but calling someone's idea "stupid" is a really
       | shitty thing to say.
       | 
       | "He chose to take it personally" is the type of lazy, pseudo-
       | stoic argument I have no interest in reading.
       | 
       | Yes I'm having a morning, lol.
        
         | cryptonector wrote:
         | I answered this here:
         | https://news.ycombinator.com/item?id=30504804
         | 
         | It's a rhetorical device. I didn't expect this to -years later-
         | become a front-page item on HN. I wrote that to share with
         | certain people.
         | 
          | And yes, clone() has some real problems, and if calling it
          | "stupid" pisses off some people but also leads others to want
          | to improve clone() or create a better alternative, then that's
          | fine. If I'd wanted to write an alternative to Linux I'd
         | probably have had to deal with the very, very fine language
         | that Linus and others use on the Linux kernel mailing lists --
         | if you don't like my using the word "stupid", then you really
         | shouldn't look there because you're likely to be very
         | disappointed. Indeed, not only would I have to accept colorful
         | language from reviewers there, I'd probably have to employ some
         | such language myself.
         | 
         | TL;DR: clone() came from Linux, where "stupid" is the least
         | colorful language you'll find, and me calling it "stupid" is
         | just a rhetorical device.
        
       | ridiculous_fish wrote:
        | Amusingly, vfork semantics differ across OSes. This program
        | prints 42 on Linux but 1 on macOS:
        | https://godbolt.org/z/jn7Gaf5Me because on Linux parent and
        | child share an address space.
        
         | butterisgood wrote:
         | I am pretty sure Mac OS doesn't COW fork(), and that the
         | address space is copied. At least it was the last time I
         | looked. FreeBSD and Linux both seem to COW.
         | 
         | Perhaps there's a reason vfork is different too.
        
           | ridiculous_fish wrote:
           | My (very possibly wrong) understanding is that xnu does CoW
           | fork but doesn't overcommit, meaning that memory must be
           | reserved (perhaps in swap) in case the pages need to be
           | duplicated.
           | 
            | There are other complications relating to inheriting Mach ports
           | and the mach_task <-> BSD process "duality" in xnu, which
           | Linux doesn't have. I'd love for someone to chime in who
           | knows more about how this stuff works.
        
         | cryptonector wrote:
         | Unfortunately there was this paper from the 80s titled "vfork()
         | Considered Dangerous", which led to BSDs removing vfork(), and
         | then later it was re-added because that paper was clearly quite
         | wrong. But the news hasn't quite filtered through to Apple, I
         | guess.
        
       | scottlamb wrote:
       | > clone() is stupid ... the clone(2) design, or its maintainers,
       | encourages a proliferation of flags, which means one must
       | constantly pay attention to the possible need to add new flags at
       | existing call sites.
       | 
       | IMHO a bigger problem [2] in practice with clone is that
       | (according to glibc maintainers) once your program calls it, you
       | can't call any glibc function anymore. [1] Essentially the raw
       | syscall is a tool for the libc implementation to use. The libc
       | implementation hasn't provided a wrapper for programs to use
       | which maintains the libc's internal invariants about things like
       | (IIUC) thread-local storage for errno.
       | 
       | The author's aforkx implementation is something that glibc
       | maintainers could (and maybe should) provide, but my
       | understanding is that you can get in trouble by implementing it
       | yourself.
       | 
       | [1] https://github.com/rust-
       | lang/rust/issues/89522#issuecomment-...
       | 
        | [2] editing to add: or at least a more _concrete_ expression of
        | the problem. Wouldn't surprise me if they haven't provided this
        | wrapper in part because the proliferation of flags the author
        | mentioned makes it difficult for them to do so.
        
       | saurik wrote:
       | One use case for fork()--which is used extensively on Android--is
       | to build an expensive template process that can then be
       | replicated for later work, which is exactly what people often
       | want for the behavior with virtual machines. I wrote an article
       | on the history of linking and loading optimizations leading up to
       | how Android handles their "zygote" which touches on this
       | behavior.
       | 
       | http://www.cydiasubstrate.com/id/727f62ed-69d3-4956-86b2-bc0...
        
       | mrob wrote:
       | Fork() is the second worst idea in programming, behind null
       | pointers. Fork() is the reason overcommit exists, which is the
       | reason my web browser crashes if I open too many tabs, and the
       | reason the "safe" Rust programming language leaves software
       | vulnerable to DOS attacks if it uses the standard library. It's a
       | clear example of "worse is worse", and we should have switched to
       | the Microsoft Windows model decades ago.
       | 
       | Here's a paper from Microsoft Research supporting this point of
       | view:
       | 
       | https://www.microsoft.com/en-us/research/uploads/prod/2019/0...
        
         | silon42 wrote:
         | Agreed about overcommit and resulting mess.
        
         | alerighi wrote:
          | Why is overcommit a problem? A program is unlikely to use all
          | the memory that it allocates, or it may only use it at a later
          | time. It would be a waste not to have overcommit: it would
          | mean having a ton of RAM that never gets used, because a lot
          | of programs allocate more RAM than they will probably ever
          | need. And it would be inefficient, costly and error-prone to
          | use dynamic memory allocation for everything.
          | 
          | The cause of your browser crash is not overcommit, it is
          | simply the fact that you don't have enough memory. If you
          | disable overcommit (something you can do on Linux) you would
          | get the same crash earlier, before you had allocated (not
          | necessarily used) 100% of your RAM (because really no software
          | handles the dynamic memory failure condition, i.e. malloc
          | returning NULL, which you can't handle reasonably).
         | 
          | Null pointers are not a mistake: how do you signal the absence
          | of a value otherwise? How do you signal the failure of a
          | function that returns a pointer without having to return a
          | struct with a pointer and an error code (which is inefficient,
          | since the return value doesn't fit in a single register)? NULL
          | makes perfect sense as a value to signal "this pointer doesn't
          | point to something valid".
         | 
          | Microsoft saying that fork() was a mistake... well, of course,
          | because Windows doesn't have it. fork was a good idea, and
          | that is the reason it's still used these days. Of course
          | nowadays there are evolutions: on Linux there is the clone
          | system call (fork is deprecated and kept for compatibility
          | reasons; the glibc fork is implemented with the clone system
          | call). But the concept of creating a process by cloning the
          | resources of the parent always seemed very elegant to me.
         | 
          | In reality fork is something that (if I remember correctly; I
          | don't have that much experience programming on Windows)
          | doesn't exist on Windows, and the only way to create a new
          | process of the same program is to launch the executable again
          | and pass the parameters on the command line, which is not
          | great for efficiency at all, and also has its problems (for
          | example if the executable was deleted, renamed, etc. while the
          | program was running). Also, Windows has no concept of exec
          | either, though I think it can be emulated in software (while
          | fork can't).
         | 
          | To me it makes perfect sense to separate the concepts of
          | creating a new process (fork/clone) and loading an executable
          | from disk (exec). It gives a lot of flexibility, at a cost
          | that is not that high (and there are alternatives to avoid
          | it, such as vfork, variations of the clone system call, or
          | directly higher-level APIs such as posix_spawn).
        
           | mrob wrote:
           | >Null pointers are not a mistake
           | 
           | The inventor, Tony Hoare, famously called them his "billion-
           | dollar mistake". The better way to do it is with nullable
           | types (which could internally represent null as 0 as a
           | performance optimization). This is something Rust gets right.
        
             | alerighi wrote:
              | Nullable types... they have the same problems as null
              | pointers: if you don't handle the case where they are
              | null, the program will crash; and if you do handle it,
              | you could have handled it for null pointers too. Well,
              | they have a nicer syntax, and that's it. How much Rust
              | code is full of `.unwrap()` because programmers are lazy
              | and don't want to check each Option to see if it's valid?
              | Or simply don't care, since having the program crash on
              | an unexpected condition is not the end of the world.
        
           | initplus wrote:
           | I think much of the confusion around nulls stems from the
           | fact that in mainstream languages pointers are overloaded for
           | two purposes: for passing values by reference, and for
           | optionality.
           | 
           | Nearly every pointer bug is caused by the programmer wanting
           | one of these two properties, and not considering the
           | consequences of the other.
           | 
           | Non-nullable references and pass-by-value optionals can
           | replace many usages of pointers.
        
             | alerighi wrote:
              | Yes, and those are just two usages of pointers. The fact
              | is that, whatever you call it - null pointer, nullable
              | reference, optional - you have to put into the language a
              | concept of "a reference that may not refer to a valid
              | object".
        
         | nqzero wrote:
         | unpopular opinion: null pointers (in at least java and c) are
         | the single greatest metaphor in software development, and are
         | the CS analog to the invention of zero
        
         | im3w1l wrote:
         | There was an article about exceptions the other day that
         | lamented that exceptions are high latency because the
         | exceptional path will be paged out. I would assume overcommit
         | is to blame for that too.
        
           | mort96 wrote:
           | Why would you assume that..?
        
           | oconnor663 wrote:
           | That's probably a caching issue, and caching issues are a
           | fact of life for the foreseeable future. (Could also be a
           | disk swap issue, but probably not.)
        
         | nick_ wrote:
         | Interesting take. If you don't mind explaining, what is the MS
         | Windows model in in this context?
        
           | mrob wrote:
           | You opt into inheriting specific contexts from the parent,
           | instead of copying everything by default:
           | 
           | https://docs.microsoft.com/en-
           | us/windows/win32/api/processth...
        
             | manwe150 wrote:
             | More importantly, all syscalls also take a target process
             | as an argument, making the Windows version both simpler and
             | more powerful than can be done with fork. Spawn is also a
             | lot slower on Windows, but that is an implementation issue.
        
               | kllrnohj wrote:
               | > Spawn is also a lot slower on Windows, but that is an
               | implementation issue.
               | 
               | afaik _most_ of that slowdown is because malware scanners
               | (including Windows Defender) hook spawn to do blocking
                | verification of what to launch. Which is an issue also
                | present on e.g. macOS, and why it's also kinda slow to
               | launch new processes (and can be subject to extreme
               | latencies): https://www.engadget.com/macos-slow-apps-
               | launching-221445977...
               | 
               | Which is yes an implementation problem, but also a
               | problem that potentially changes/impacts the design. Like
               | maybe it'd make sense to get a handle to a pre-verified
               | process so that repeated spawns of it don't need to hit
               | that path (for eg. something like Make or Ninja that just
               | spam the same executable over and over and over again).
               | Or the kernel/trusted module needs to in some way be
               | involved & can recognize that an executable was already
               | scanned & doesn't need to be re-scanned.
        
               | [deleted]
        
           | indrora wrote:
           | Windows doesn't have fork as you know it. It has a POSIX-ish
           | fork-alike for compliance, but under the hood it's
           | CreateThread[0] with some Magic.
           | 
           | in Windows, you create the thread with CreateThread, then are
           | passed back a handle to that thread. You then can query the
           | state of the thread using GetExitCodeThread[1] or if you need
           | to wait for the thread to finish, you call
           | WaitForSingleObject [2] with an Infinite timeout
           | 
           | Aside: WaitForSingleObject is how you track _a bunch_ of
           | stuff: semaphores, mutexes, processes, events, timers, etc.
           | 
           | The flipside of this is that Windows processes are buckets of
           | handles: a Process object maintains a series of handles to
           | (threads, files, sockets, WMI meters, etc), one of which
           | happens to be the main thread. Once the main thread exits,
           | the system goes back and cleans up (as it can) the rest of
           | the threads. This is why sometimes you can get zombie'd
           | processes holding onto a stuck thread.
           | 
           | This is also how it's a very cheap operation to interrogate
           | what's going on in a process ala Process Explorer.
           | 
           | If I had to describe the difference between Windows and Linux
           | at a process model level, I have to back up to the
           | fundamental difference between the Linux and Windows
            | _programming_ models: Linux is a kernel that has to hide
           | its inner workings for its safety and security, passing
           | wrapped versions of structures back and forth through the
           | kernel-userspace boundary; Windows is a kernel that considers
           | each portion of its core separated, isolated through ACLs,
           | and where a handle to something can be passed around without
           | worry. The windows ABI has been so fundamentally stable over
           | 30 years now because so much of it is built around
           | controlling object handles (which are allowed to change under
            | the hood) rather than manipulation of kernel primitives
           | through syscalls.
           | 
           | Early WinNT was very restrictive and eased up a bit as
           | development continued so that win9x software would run on it
           | under the VDM. Since then, most windows software insecurities
           | are the result of people making assumptions about what will
           | or won't happen with a particular object's ACL.
           | 
           | There's a great overview of windows programming over at [3].
           | It covers primarily Win32, but gets into the NT kernel
           | primitives and how it works.
           | 
           | A lot of work has gone into making Windows an object-oriented
           | kernel; where Linux has been looking at C11 as a "next step"
           | and considering if Rust makes sense as a kernel component,
           | Windows likely has leftovers of Midori and Singularity [4]
            | lingering in it that have gone on to be used for core
           | functionality where it makes sense.
           | 
           | [0] https://docs.microsoft.com/en-
           | us/windows/win32/api/processth... [1]
           | https://docs.microsoft.com/en-
           | us/windows/win32/api/processth... [2]
           | https://docs.microsoft.com/en-
           | us/windows/win32/api/synchapi/... [3]
           | https://www.tenouk.com/cnwin32tutorials.html [4]
           | https://www.microsoft.com/en-
           | us/research/project/singularity...
        
         | cryptonector wrote:
          | Overcommit exists any time you can have a debugger anyway.
          | 
          | fork() was a brilliant way to make Unix development easy in
          | the 70s: it made it trivial to move a lot of development
          | activity out of the kernel and into user-land.
         | 
         | But with it came problems that only became apparent much later.
        
         | oconnor663 wrote:
         | > the reason the "safe" Rust programming language leaves
         | software vulnerable to DOS attacks if it uses the standard
         | library
         | 
         | Linux overcommitment is often cited as an argument for the
         | "panic on OOM" design of the allocating parts of the Rust
         | standard library, and it's an important part of the story. But
         | I think even if the Linux defaults were different, Rust would
         | still have gone with the same design. For example, here's Herb
         | Sutter (who works for Microsoft) arguing that C++ would benefit
         | from aborting on allocation failure:
         | https://youtu.be/ARYP83yNAWk?t=3510. The argument is that the
         | vast majority of allocations in the vast majority of programs
         | don't have any reasonable options for handling an alloc failure
         | besides aborting. For languages like C++ and Rust, which want
         | to support large, high-level applications in addition to low-
         | level stuff, making programmers litter their code with explicit
         | aborts next to every allocation would be really painful.
         | 
          | I think it's very interesting that Zig has gone the opposite
          | direction. It could be that writing big applications with lots
          | of allocs ends up feeling cumbersome in Zig, or it could be
          | that they bend the curve. Fingers crossed.
        
       | JoeAltmaier wrote:
       | The whole idea of fork is strange - the design pattern of "child
       | process is executing exactly where the parent process is
       | executing" is foreign to me. Don't we want to direct where the
       | child process is executing? Like, when creating a thread? Why is
       | fork() so conceptually orthogonal to that? Is there a good
       | reason? A historical reason?
       | 
       | I don't find fork() to be obvious or useful or natural. I work
       | hard to never do it.
        
         | zamalek wrote:
         | It's a leaky abstraction and everything it does can be done
         | manually, and possibly better. It exists purely because, at
         | some point in the past, threads didn't exist.
         | 
         | If you design your program without fork, you'll probably end up
         | with a cleaner and faster solution. Some things are best
         | forgotten or never learned in the first place.
        
           | alerighi wrote:
           | A thread is not the same thing of a process. There are
           | situations where you are fine with a thread, other where you
           | need a process.
        
           | _flux wrote:
           | Can it though?
           | 
           | The beauty of (v)fork(+exec) is that it doesn't need a new
           | interface for configuring the environment in whichever way
           | you want before the other process starts. Instead you get to
           | use the exact same means of modifying the environment to your
           | needs, and once it's done, you can call exec and the new
           | process inherits those things.
           | 
           | I mean, just look at the interface of posix_spawn.
           | 
           | I grant though that this isn't without its problems
           | (including performance) and IMO e.g. FD_CLOEXEC is one
            | example of how those problems can be patched up. It's like
            | the reverse problem: the implicit interface is too wide, and
            | then you need to come up with all these ways to be explicit
            | about some things.
        
         | shadowofneptune wrote:
         | fork()-exec() separation indeed exists for historical reasons:
         | https://www.bell-labs.com/usr/dmr/www/hist.html
         | 
         | Search for the phrase "Process control in its modern form was
         | designed and implemented within a couple of days."
        
         | kccqzy wrote:
         | If you want the child to start executing some other code but
         | you have fork(), it's easy to do it yourself by calling that
         | function.
         | 
          | But on the other hand, if you do want the child to execute
          | code at the same place as the parent, but a hypothetical
          | fork() replacement asks you to provide a function pointer, it
          | would be a bit more complicated.
        
         | alerighi wrote:
          | It makes creating processes easy, once you understand how it
          | works:
          | 
          |     while (1) {
          |         int client_socket = accept(socket, &client_addr,
          |                                    &client_len);
          |         if (client_socket >= 0) {
          |             pid_t pid = fork();
          |             if (pid < 0) {
          |                 // handle error
          |             } else if (pid == 0) {
          |                 // child: serve this client, then exit so it
          |                 // doesn't fall back into the accept loop
          |                 handle_connection(client_socket, &client_addr);
          |                 _exit(0);
          |             }
          |             close(client_socket);  // parent's copy of the fd
          |         } else {
          |             // handle error
          |         }
          |     }
          | 
          | No need to do complex things to start a new process, having to
          | pass arguments to it in some way, etc.
        
       | monocasa wrote:
       | Hard disagree to most of this.
       | 
       | fork(2) makes a lot more sense when you realize its heritage. It
       | came from a land before Unix supported full MMUs. In this model,
       | to still have per process address spaces and preemptive
       | multitasking on what was essentially a PC-DOS level of hardware,
       | the kernel would checkpoint the memory for a process, slurp it
       | all out to dectape or some such, and load in the memory for
        | whatever the scheduler wanted to run next. Its simplicity of
        | being process-checkpoint-based wasn't a reaction to Windows-
        | style calls (which wouldn't exist for almost a couple of
        | decades), but instead to mainframe process-spawning
        | abominations like JCL. The idea "you probably want most of
        | what you have, so force a checkpoint, copy the checkpoint into
        | a new slot, and continue separately from both checkpoints" was
        | soooo much better than JCL and its tomes of incantations to do
        | just about anything.
       | 
       | vfork(2) is an abomination. Even when the child returns, the
       | parent now has a heavily modified stack if the child didn't
       | immediately exec(). All of those bugs that causes are super fun
        | to chase, lemme tell you. AFAIC, about the only valid use for
        | vfork now is nommu systems, where fork() is incredibly
        | expensive compared to what is generally expected.
       | 
       | clone(2) is great. Start from a checkpoint like fork, but instead
       | of semantically copying everything, optionally share or not based
       | on a bitmask. Share a tgid, virtual address space, and FD table?
       | You just made a thread. Share nothing? You just made a process.
       | It's the most 'mechanism, not policy' way I've seen to do context
       | creation outside of maybe the l4 variants and the exokernels.
        | This isn't an old holdover; this is how threads work today:
        | processes spawned that happen to share resources. Modern archs on
       | linux don't even have a fork(2) syscall; it all happens through
       | clone(2). Even vfork is clone set to share virtual address space
       | and nothing else that fork wouldn't share. Namespaces are a way
       | to opt into not sharing resources that normally fork _would_
       | share.
       | 
       | And I don't see what afork gets you that clone doesn't, except
       | afork isn't as general.
        
       | [deleted]
        
       | sys_64738 wrote:
       | The intent of fork() is to start a new process in its own address
       | space. That *fork() variations that run in the SAME address space
       | are confusing. A use case today for fork() might also be
       | sandboxing apps. Certainly I expect browsers use this approach to
       | spawn unique pages. But generally fork() is very specific from my
       | recollection.
        
         | cryptonector wrote:
         | > The intent of fork() is to start a new process in its own
         | address space.
         | 
         | True!
         | 
          | > That *fork() variations that run in the SAME address space
          | are confusing.
          | 
          | Why is it confusing? They are distinct system calls, with
          | different semantics. They are also sufficiently similar that
          | they are similarly named. But there's nothing confusing
          | about their _semantics_. vfork() is _not_ harder to use than
          | fork() -- it's just subtly different.
         | 
         | > A use case today for fork() might also be sandboxing apps.
         | Certainly I expect browsers use this approach to spawn unique
         | pages.
         | 
         | I wouldn't expect that. Sandboxing is a large and complex
         | topic.
        
       | cryptonector wrote:
       | Well, I'm surprised to see this on the front page, let alone as
       | #1. Ask me anything.
       | 
       | EDIT: Also, don't miss @NobodyXu's comment[0] on my gist, and
       | don't miss @NobodyXu's aspawn[1].                 [0] https://gist.gith
       | ub.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234?permalink_co
       | mment_id=3467980#gistcomment-3467980        [1]
       | https://github.com/NobodyXu/aspawn/
        
         | Lerc wrote:
         | Since you said anything... This is not strictly related to the
         | article but your expertise seems to be in the right area.
         | 
         | I have a process that executes actions for users, at the moment
         | that process runs as root until it receives a token indicating
         | an accepted user, then it fork()s and the fork changes to the
         | UID of the user before executing the action.
         | 
         | Is there a better way? I hadn't actually heard of vfork()
         | before reading this article. I'm guessing maybe you could do a
         | threaded server model where each thread vfork()s. I'm not
         | really aware what happens when threads and forks combine. Does
         | the v/fork() branch get trimmed down to just that one thread?
         | If so what happens to the other thread stacks? It feels like a
         | can of worms.
        
           | cryptonector wrote:
           | If the parent is threaded, then yes, vfork() will be better.
           | You could also use posix_spawn().
           | 
           | As to "becoming a user", that's a tough one. There are no
           | standard tools for this on Unix. The most correct way to do
           | it would be to use PAM in the child. See su(1) and sudo(1),
           | and how they do it.
           | 
           | > I'm not really aware what happens when threads and forks
           | combine. Does the v/fork() branch get trimmed down to just
           | that one thread? If so what happens to the other thread
           | stacks? It feels like a can of worms.
           | 
           | Yes, fork() only copies the calling thread. The other
           | threads' stacks also get copied (because, well, you might
           | have pointers into them, who knows), but there will only be
           | one thread in the child process.
           | 
           | vfork() also creates only one thread in the child.
           | 
           | There used to be a forkall() on Solaris that created a child
           | with copies of all the threads in the parent. That system
           | call was a spectacularly bad idea that existed only to help
           | daemonize: the parent would do everything to start the
           | service, then it would forkall(), and on the parent side it
           | would exit() (or maybe _exit()). That is, the idea is that
           | the parent would not finish daemonizing (i.e., exit) until
           | the child (or grandchild) was truly ready. However, there's
           | no way to make forkall() remotely safe, and there's a much
           | better way to achieve the same effect of not completing
           | daemonization until the child (or grandchild) is fully ready.
           | 
           | In fact, the daemonization pattern of not exiting the parent
           | until the child (or grandchild) is ready is very important,
           | especially in the SMF / systemd world. I've implemented the
           | correct pattern many times now, starting in 2005 when project
            | Greenline (SMF) delivered into OS/Net. It's this: instead
            | of calling daemon(), you need a function that calls pipe()
            | and then fork() or vfork(). On the parent side it then
            | calls read() on the read end of the pipe; on the child
            | side it returns immediately so the child can do the rest
            | of the setup work, and finally the child writes one byte
            | into the write side of the pipe to tell the parent it's
            | ready, so the parent can exit.
        
         | ahmedalsudani wrote:
         | No questions yet as I am yet to read ... but I can already
         | comment and say grade A title.
        
           | cryptonector wrote:
           | It's a bit opinionated. It's meant to get a reaction, but
           | also to have meaningful and thought-provoking content, and I
           | think it's correct in the main too. Anyways, hope you and
           | others enjoy it.
        
             | ahmedalsudani wrote:
             | That was a great read. Thank you for writing it up; I
             | learned quite a few things!
             | 
             | Especially appreciated the OS minutiae and opinionated
             | commentary (... and the doc vs reality observation in
             | Linux's vfork).
             | 
             | The piece lives up to the great title :)
        
             | disgruntledphd2 wrote:
             | What do you mean by zones/jails and why are they better
             | than containers?
        
               | cryptonector wrote:
               | Zones -> Solaris/Illumos Zones
               | 
               | Jails -> BSD jails
               | 
               | They're software VMs. It's a lot like containers, yes.
               | 
               | The problem with containers is that the construction
               | toolkit for them is subtractive ("start by cloning my
               | environment, then remove / replace various namespaces"),
               | while the construction toolkit for zones/jails is
               | additive ("start with an empty universe, and add
               | namespaces or share them with the parent").
               | 
               | Constructing containers subtractively means that every
               | time there's a new kind of namespace to virtualize, you
               | have to update all container-creating tools or risk a
               | security vulnerability.
               | 
               | Constructing containers additively from an empty universe
               | means that every time there's a new kind of namespace to
               | virtualize, you have to update all container-creating
               | tools or risk not getting sharing that you want (i.e.,
               | breakage).
               | 
               | I'm placing a higher value on security. Maybe that's a
               | bad choice. It's not like breaking is a good thing -- it
               | might be just as bad as creating a security
               | vulnerability.
        
               | ape4 wrote:
               | Yes if we starting again today, we wouldn't do containers
               | as they are now.
        
       | mywacaday wrote:
       | "I won't bother explaining what fork(2) is -- if you're reading
       | this, I assume you know.", If that applied to everything I looked
       | at from HN I'd read precious little.
        
         | cryptonector wrote:
         | I didn't write it for HN. It wasn't a paper to publish in some
         | Computer Science journal. It was just a github gist. If you
         | don't get the subject, it's not for you. I might well write a
         | paper now based on it, and then it might be a good read for
         | you, but I still won't be writing it for you, but for people
         | who are interested in the topic. The intended audience is
         | small, expert on the matter, and probably even more opinionated
         | than I am.
        
       | throwaway984393 wrote:
       | I don't think containers should be like jails. Containers should
       | be more like chroots than they are now.
       | 
       | Have you ever tried to run a modern X/whatever app with 3D
       | graphics and audio and DBUS and God knows what else in a
       | container and get it to show up on your desktop? It's a fucking
       | nightmare. I spent over a week trying to get 1Password to run in
       | a container. Somebody decided containers had to be "secure", even
       | though they don't actually exist as a single concept and security
       | was never their primary purpose. If instead containers were used
       | _only to isolate filesystem dependencies_ , we could actually
       | pretend containers were like normal applications and treat them
       | with the same lack of security concern that all the rest of our
       | non-containerized programs are.
       | 
       | Firecracker is the correct abstraction for isolation: a micro-VM.
       | That is the model you want if you want to run an app securely
       | (not to mention reliably, as it can come with its own kernel,
       | rather than needing you to run a compatible host kernel).
        
       | aylmao wrote:
       | Meta comment: Github Gist seems to be great for blogging. Yeah,
       | the UI is not very blog-specific, but it has all the useful
       | features, and then some: markdown, comments, hosting, an index of
       | all posts, some measure of popularity (stars), a very detailed
       | edit history, etc.
       | 
       | All without having to pay or setup anything yourself.
        
         | JoshTriplett wrote:
         | Unfortunately, there's no way to turn _off_ comments on a Gist,
          | which makes it not a viable replacement for anyone who
          | doesn't want to spend a lot of time processing and
          | moderating comments.
        
       | jph wrote:
        | Is it a fair approach to implement first with fork() because
        | of its memory protection, then optimize, guided by benchmarks,
        | potentially using vfork() for speed? Benchmark areas could
        | include synchronous locks, copy-on-write memory, stack
        | sharing, etc.
       | 
       | What are the good practices of security tradeoffs of fork() vs.
       | vfork() especially in terms of ease of writing correct code? I'd
       | thought that fork() + exec() tends to favor thinking about
       | clearer separation/isolation. For example I've written small
       | daemons using fork() + exec() because it seems safe and easy to
       | do at the start.
        
         | yakubin wrote:
         | In short, fork() mixes poorly with multi-threaded code (and has
         | some security footguns like needing to explicitly unshare
         | elements of environment which may be sensitive, such as file
         | descriptors (suddenly you need to know all the file descriptors
         | used in the whole program from a single place in code)). Here
         | is a well-written comment about fork() from David Chisnall:
         | <https://lobste.rs/s/cowy6y/fork_road_2019#c_zec42d>
         | 
         | Additionally, the fork()+exec() idiom practically forces OS
         | designers into a corner where they simply have to implement
         | Copy-on-Write for virtual memory pages, or otherwise the whole
         | userspace using this idiom is going to be terribly slow.
         | Without the fork()+exec() idiom you don't need CoW to be
         | efficient.
        
           | jerf wrote:
           | Fork mixes so poorly with multithreaded code that a lot of
           | modern languages that are built from the beginning with
           | threads of one sort or another in mind, like Go, simply won't
           | let you do it. There is no binding to fork in the standard
           | library.
           | 
           | I think you could bash it together yourself with raw
           | syscalls, because that can't really be stopped once you have
           | a syscall interface, but basically the Go runtime is built
           | around assuming it won't be forked. I have no idea what would
           | happen to even a "single threaded" Go program if you forked
           | it, and I have no intention of finding out. The lowest level
           | option given in the syscall package is ForkExec:
           | https://pkg.go.dev/syscall#ForkExec And this is a package
           | that will, if you want, create new event loops outside of the
           | Go runtime's control, set up network connections outside of
           | the runtime's control, and go behind the runtime's back in a
           | variety of other ways... but not this one. If you want this,
           | you'll be looking up numbers yourself and using the raw
           | Syscall or RawSyscall functions.
        
             | pcwalton wrote:
             | > I have no idea what would happen to even a "single
             | threaded" Go program if you forked it, and I have no
             | intention of finding out.
             | 
             | I'm not an expert on Go internals, but the GC in Go is
             | multithreaded, so I would assume forking will kill the GC.
             | Better hope it's not holding any mutexes.
        
           | kevincox wrote:
            | TL;DR if another thread is holding a lock when you fork,
            | that lock will be stuck locked in the child, but the
            | thread that was holding it no longer exists.
           | 
           | So if your multi-threaded program uses malloc you may fork
           | while a global allocation lock is being held and you won't be
           | able to use malloc or free in the child (thread-local caches
           | aside).
           | 
           | There are other problems but this is the basic idea. To be
           | fork-safe you need to allow any thread to just disappear (or
           | halt forever) at any point in your program.
        
             | kazinator wrote:
             | malloc has to guard its locks against fork, probably using
             | pthread_atfork, or some lower level internal API related to
             | that.
             | 
             | The problem with pthread_atfork is third party libs.
             | 
             | YOU will use it in YOUR code. The C library will correctly
             | use it in its code. But you have no assurance that any
             | other libraries are doing the right things with their
             | locks.
        
               | dgrunwald wrote:
               | Your "third party libs" includes system libraries like
               | libdl.
               | 
               | We had a Python process using both threads (for stuff
               | like background downloads, where the GIL doesn't hurt)
               | and multiprocessing (for CPU-intensive work), and found
               | that on Linux, the child process sometimes deadlocks in
               | libdl (which Python uses to import extension modules).
               | 
               | The fix was to use
               | `multiprocessing.set_start_method('spawn')` so that
               | Python doesn't use fork().
        
               | kazinator wrote:
               | libdl is a component of glibc; that needs to be debugged.
        
               | kllrnohj wrote:
                | The more stuff that piles on using pthread_atfork, the
                | more it also contributes to fork() being unnecessarily
                | slow for the specific combination of fork+exec.
        
               | kazinator wrote:
               | Right, and so POSIX "fixed" that by standardizing
               | posix_spawn. Thus fork is now mainly for those scenarios
               | in which exec is not called, plus traditional coding that
               | is portable to old systems.
        
               | mark_undoio wrote:
               | Also if, for any reason, you end up doing a `fork()`
               | syscall directly rather than via libc you'll still have a
               | problem as appropriate cleanup won't happen.
               | 
               | Of course, the best answer to that is usually going to be
               | "don't do that"!
        
           | kazinator wrote:
           | fork came first; it's POSIX threads that is a bolted on piece
           | of clunk that mixes badly with fork, signal handlers, chdir,
           | ...
        
       | londons_explore wrote:
       | I was always disappointed by the performance of fork()/clone().
       | 
       | CompSci class told me it was a very cheap operation, because all
       | the actual memory is copy-on-write, so its a great way to do all
       | kinds of things.
       | 
       | But the reality is that duplicating huge page tables, and
       | hundreds of file handles is very slow. Like 10's of milliseconds
       | slow for a big process.
       | 
       | And then the process runs slowly for a long time after that
       | because every memory access ends up causing lots of faults and
       | page copying.
       | 
       | I think my CompSci class lied to me... it might _seem_ cheap and
       | a neat thing to do, but the reality is there are very few
       | usecases where it makes sense.
        
         | mark_undoio wrote:
         | Agreed that these costs can be larger than is perhaps implied
         | in compsci classes (though it's possible that they've changed
         | their message since I took them!)
         | 
         | I suppose it is still essentially free for some common uses -
         | e.g. if a shell uses `fork()` rather than one of the
         | alternatives it's unlikely to have a very big address space, so
         | it'll still be fast.
         | 
         | My experience has been that big processes - 100+GB - which are
         | now pretty reasonable in size really do show some human-
         | perceptible latency for forking. At least tens of milliseconds
         | matches my experience (I wouldn't be surprised to see higher).
         | This is really jarring when you're used to thinking of it as
         | cost-free.
         | 
         | The slowdown afterwards, resulting from copy-on-write, is
         | especially noticeable if (for instance) your process has a high
         | memory dirtying rate. Simulators that rapidly write to a large
         | array in memory are a good example here.
         | 
         | When you really need `fork()` semantics this could all still be
         | acceptable - but I think some projects do ban the use of
         | `fork()` within a program to avoid unexpected costs. If you
         | really have a big process that needs to start workers I guess
         | it might be worth having a small daemon specifically for doing
         | that.
        
           | cryptonector wrote:
            | Right, shells are not threaded and they tend to have small
            | resident set sizes. Even in shells though, there's no reason
           | not to use vfork(), and if you have a tight loop over
           | starting a bunch of child processes, you might as well use
           | it. Though, in a shell, you do need fork() in order to
           | _trivially_ implement sub-shells.
           | 
           | fork() is most problematic for things like Java.
        
         | cryptonector wrote:
          | Copy-on-write is _supposed_ to be cheap, but in fact it's not.
         | MMU/TLB manipulations are very slow. Page faults are slow. So
         | the common thing now is to just copy the entire resident set
         | size (well, the writable pages in it), and if that is large,
         | that too is slow.
        
         | smasher164 wrote:
         | Also, mandating copy-on-write as an implementation strategy is
          | a huge burden to place on the host. Now you've made the
          | amount of memory a process is using unquantifiable.
        
           | immibis wrote:
           | You also mandate a system complex enough to have an MMU.
        
           | vgel wrote:
           | It's not necessarily unquantifiable -- the kernel can count
           | the not-yet-copied pages pessimistically as allocated memory,
           | triggering OOM allocation failures if the amount of
           | _potential_ memory usage is greater than RAM. IIUC, this is
           | how Linux vm.overcommit_memory[1] mode 2 works, if
           | overcommit_ratio = 100.
           | 
           | However, if an application is written to assume that it can
           | fork a ton and rely on COW to not trigger OOM, it obviously
           | won't work under mode 2.
           | 
           | [1] https://www.kernel.org/doc/Documentation/vm/overcommit-
           | accou...
           | 
           | > 2 - Don't overcommit. The total address space commit for
           | the system is not permitted to exceed swap + a configurable
           | amount (default is 50%) of physical RAM.
           | 
           | > Depending on the amount you use, in most situations this
           | means a process will not be killed while accessing pages but
           | will receive errors on memory allocation as appropriate.
           | 
           | > Useful for applications that want to guarantee their memory
           | allocations will be available in the future without having to
           | initialize every page.
        
             | smasher164 wrote:
             | You're right, "unquantifiable" was the wrong word here. I
             | meant, a program has no real way of predicting/reacting to
             | OOM. I didn't realize mode 2 with overcommit_ratio = 100
             | behaved that way, thanks for sharing.
        
               | vgel wrote:
                | Yeah I think in a practical sense you're right, since
                | AFAIK using mode 2 is fairly rare because most software
                | assumes overcommit, and even if a program _is_ written
                | with an understanding that malloc can return NULL,
                | it's in the sense of
                | 
                |     if (!(ptr = malloc(...))) { exit(1); }
        
           | cryptonector wrote:
           | POSIX doesn't require that fork() be implemented using copy-
           | on-write techniques. An implementation is free to copy all of
           | the parent's writable address space.
        
         | vgel wrote:
         | CS classes (and, far too often, professional programmers) talk
         | about computers like they're just faster PDP-11s with
         | fundamentally the same performance characteristics.
        
       | albertzeyer wrote:
       | We had the case that some library we were using (OpenBLAS) used
        | pthread_atfork. Unfortunately, the atfork handler was buggy in
        | certain situations involving multiple threads and caused a
        | crash. This was annoying because we basically did not need fork
       | at all but just fork+exec (for various other libraries spawning
       | sub processes), where those atfork handlers would not be
       | relevant.
       | 
        | Our solution was to override pthread_atfork to ignore any
        | registered functions, and in case that is not enough, to also
        | override fork itself to do the syscall directly without
        | calling the atfork handlers.
       | 
       | https://github.com/tensorflow/tensorflow/issues/13802
       | https://github.com/xianyi/OpenBLAS/issues/240
       | https://trac.sagemath.org/ticket/22021
       | https://bugs.python.org/issue31814
       | https://stackoverflow.com/questions/46845496/ld-preload-and-...
       | https://stackoverflow.com/questions/46810597/forkexec-withou...
        
       | tiffanyh wrote:
        | Slightly off topic: how does Erlang handle this? Isn't it
        | known for having extremely fast & cheap process spawning baked
        | in (with isolation)?
        
         | __s wrote:
         | m:n threads (aka green threads)
         | https://twitter.com/joeerl/status/1010485913393254401
        
           | tiffanyh wrote:
           | I realize that tweet is from the authority himself but am I
           | mistaken in my understanding ...
           | 
           | I thought green threads share memory but Erlang processes do
           | NOT share memory, which is what makes Erlang so unique.
           | 
           | Did Erlang create a so called "green process"? If so, why
           | can't this model be implemented in the kernel?
        
             | masklinn wrote:
             | > I thought green threads share memory but Erlang processes
             | do NOT share memory, which is what makes Erlang so unique.
             | 
             | Erlang processes don't share memory because the language
             | and vm don't give primitives which let you do it. They all
             | exist within the same address space (e.g. large binaries
             | are reference-counted and stored on a shared heap,
             | excluding clustering obviously).
             | 
             | > Did Erlang create a so called "green process"?
             | 
             | Yes.
             | 
             | > If so, why can't this model be implemented in the kernel?
             | 
             | Because erlang processes are not an antagonistic model, and
             | the _language_ restricts the ability to attack the VM
             | (kinda, I'm sure you could write NIFs to fuck up
             | everything, you just don't have any reason to as an
             | application developer).
        
         | butterisgood wrote:
         | Erlang processes aren't unix processes. They're more like
         | coroutines.
        
         | jerf wrote:
         | In another comment, I observe how Go doesn't even have a
         | binding to fork.
         | 
         | Erlang is another example of that. There is no standard library
         | binding to the fork function. If someone were to bash one into
         | a NIF, I have no idea what would happen to the resulting
         | processes, but there's no good that can come of it. (To use
         | Star Trek, think less good and evil Kirk and more "What we got
         | back, didn't live long... fortunately.") Despite the
         | terminology, all Erlang processes are green threads in a single
         | OS process.
        
           | dragonwriter wrote:
           | > Despite the terminology, all Erlang processes are green
           | threads in a single OS process.
           | 
           | The main Erlang runtime uses an M:N Erlang:native process
           | model, not an N:1. So Erlang processes are like green threads
           | (they are called processes instead of threads because they
           | are shared-nothing), but not in a single process.
        
           | tiffanyh wrote:
           | I mentioned this somewhere else but I thought Erlang does NOT
           | share memory.
           | 
           | Doesn't that make Erlang a bit unique. It was the ability to
           | spawn a new process extremely fast AND also have memory
           | isolation. This combination is what the OP was wanting to
           | achieve.
        
             | jerf wrote:
              | Erlang _mostly_ doesn't share memory between its
             | Erlang processes, but it does this by making it so there's
             | simply no way, at the Erlang level, of even writing code
             | that refers to the memory in another Erlang process. It's
             | an Erlang-level thing, not an OS-level thing.
             | 
             | If you write a NIF in C, it can do whatever it wants within
             | that process.
             | 
             | The BEAM VM itself will share references to large binaries.
             | Erlang, at the language level, declares those to be
             | immutable so "sharing" doesn't matter. As an optimization,
             | the VM could choose to convert some of your immutable
             | operations into mutation-based ones, but if it does that,
             | it's responsible for making the correct copies so you can't
             | witness this at the Erlang level.
             | 
             | The Erlang spawn function spawns a new _Erlang_ process. It
             | _does not_ spawn a new OS process. While BEAM may run in
             | multiple OS processes, per dragonwriter, the spawn function
             | certainly isn't what starts them. The VM would.
             | 
             | So, you cannot spawn a new Erlang process, then set its
             | UID, priority, current directory, and all that other state
             | that _OS processes_ have, because an Erlang process is not
             | an OS process. If the user wants to fork for some reason
             | beyond simply running a program, because they want to
             | change the OS process attributes, Erlang is not a viable
             | choice.
             | 
             | Erlang is not unique in that sense. It runs as a normal OS
             | process. What abilities it has are implemented within that
             | sandbox, no different than the JVM or a browser hosting a
             | Javascript VM.
        
         | masklinn wrote:
         | User-space scheduler and processes. The isolation is a
         | property of the VM and language primitives (you just don't
         | get any way to share stuff, kinda).
         | 
         | Also Erlang is known for cheap and plentiful processes, not for
         | being fast. It's fast enough but it's no speed demon.
        
           | tiffanyh wrote:
           | My reference to "fast" was in the context of creating a new
           | process due to the OP post talking about how long fork/etc
           | can take. Not in reference to executing code itself.
        
             | masklinn wrote:
             | In that sense it's fast in the same way e.g.
             | coroutines(/goroutines) are fast: it's just the erlang
             | scheduler performing some allocation (possibly from a
             | freelist) and initialisation. Avoiding the kernel having to
             | set things up and the related context switches makes for
             | much better performance.
        
       | mark_undoio wrote:
       | The code I currently work on actually has a use of `clone` with
       | the `CLONE_VM` flag to create something that isn't a thread.
       | Since `CLONE_VM` will share the entire address space with the
       | child (you know, like a thread does!) a very reasonable response
       | would be "WAT?!"
       | 
       | What led us here was a need to create an additional thread within
       | an existing process's address space but in a way that was non-
       | disruptive - to the rest of the process it shouldn't really
       | appear to exist.
       | 
       | We achieved this by using `CLONE_VM` (and a handful of other
       | flags) to give the new "thread-like" entity access to the whole
       | address space. _But_ , we omitted `CLONE_THREAD`, as if we were
       | making a new process. The new "thread-like" entity would not
       | technically be part of the same thread group but would live in
       | the same address space.
       | 
       | We also used two chained `clone()` calls (with the intermediate
       | exiting, like when you daemonise) so that the new "thread-like"
       | wouldn't be a child of the original process.
       | 
       | All this existed before I joined; it's just really cool that it
       | works. I've never encountered such a non-standard use of clone
       | before, but it was the right tool for this particular job!
        
         | scottlamb wrote:
         | > What led us here was a need to create an additional thread
         | within an existing process's address space but in a way that
         | was non-disruptive - to the rest of the process it shouldn't
         | really appear to exist.
         | 
         | I'm curious to hear more. What's its purpose?
        
           | mark_undoio wrote:
           | > I'm curious to hear more. What's its purpose?
           | 
           | Sure! I'll try to illustrate the general idea, though I'm
           | taking liberties with a few of the details to keep things
           | simple(r).
           | 
           | Our software (see https://undo.io) does record and replay
           | (including the full set of Time Travel Debug stuff -
           | executing backwards, etc) of Linux processes. Conceptually
           | that's similar to `rr` (see https://rr-project.org/) - the
           | differences probably aren't relevant here.
           | 
           | We're using `ptrace` as part of monitoring process behaviour
           | (we also have in-process instrumentation). This reflects our
           | origins in building a debugger - but it's also because
           | `ptrace` is just very powerful for monitoring a process /
           | thread. It is a very challenging API to work with, though.
           | 
           | One feature / quirk of `ptrace` is that you can't really do
           | anything useful with a traced thread that's currently running
           | - including peeking its memory. So if a program we're
           | recording is just getting along with its day we can't just
           | examine it whenever we want.
           | 
           | First choice is just to avoid messing with the process but
           | sometimes we really do need to interact with it. We _could_
           | just interrupt a thread, use `ptrace` to examine it, then
           | start it up again. But there's a problem - in the corners of
           | Linux kernel behaviour there's a risk that this will have a
           | program-visible side effect. Specifically, you might cause a
           | syscall restart not to happen.
           | 
           | So when we're recording a real process we need something
           | that:
           | 
           | * acts like a thread in the process - so we can peek / poke
           |   its memory, etc via ptrace
           | 
           | * is always in a known, quiescent state - so that we can use
           |   ptrace on it whenever we want
           | 
           | * doesn't impact the behaviour of the process it's "in" - so
           |   we don't affect the process we're trying to record
           | 
           | * doesn't cause SIGCHLD to be sent to the process we're
           |   recording when it does stuff - so we don't affect the
           |   process we're trying to record
           | 
           | Our solution is double clone + magic flags. There are other
           | points in the solution space (manage without, handle the
           | syscall restarting problem, ...) but this seems to be a
           | pretty good tradeoff.
        
           | kccqzy wrote:
           | Maybe some kind of snapshotting for an in-memory database?
        
       | mattgreenrocks wrote:
       | > Long ago, I, like many Unix fans, thought that fork(2) and the
       | fork-exec process spawning model were the greatest thing, and
       | that Windows sucked for only having _exec*() and _spawn*(), the
       | last being a Windows-ism.
       | 
       | I appreciate this quite a bit. Vocal Unix proponents tend to
       | believe that anything Unix does is automatically better than
       | Windows, sometimes without even knowing what the Windows analogue
       | is. Programming in both is necessary to have an informed opinion
       | on this subject.
       | 
       | The one thing I miss most on Unix: the unified model of HANDLEs
       | that enables you to WaitForMultipleObjects() with almost any
       | system primitive you could want, such as an event with a socket
       | (blocking I/O + a shutdown notification) in one call. On Unix, a
       | flavor of select() tends to be the base primitive for waiting on
       | things to happen, which means you end up writing adapter code for
       | file descriptors to other resources, or need something like
       | eventfd.
       | 
       | Things I don't miss from Windows at all: wchar_t everywhere. :)
        
         | ogazitt wrote:
         | Having written server software that had to work in both places,
         | I always loved the simplicity of fork(2) / vfork(2) relative to
         | Windows CreateProcess. Threading models in Win32 were always a
         | pain. Which only got worse with COM (remember apartment
         | threading? rental threading? ugh)
         | 
         | Back in the 90's, processes had smaller memory footprint, and
         | every UNIX my software supported had COW optimizations. So the
         | difference between fork(2) and vfork(2) was not very large in
         | practice. Often, the TCP handshake behind the accept(2) call
         | was of more concern than how long it would take fork(2) to
         | complete. Of course, bandwidth has increased by a factor of
         | 1000 since then, so considerations have changed.
        
         | AnIdiotOnTheNet wrote:
         | UCS-2 seemed like a good(ish) idea at the time when Unicode's
         | scope didn't include every possible human concept represented
         | in icon form and UTF-8 hadn't yet been spec'd on a napkin by
         | the first adults to bother thinking about the problem.
        
           | cryptonector wrote:
           | Quite true. One of the things Windows got very wrong was
           | UCS-2 and, later, UTF-16. So did JavaScript.
        
           | xiaq wrote:
           | Even in 1989, it should have been clear that 16 bits were not
           | enough to encode all of the Chinese characters, let alone
           | encoding all the human scripts. Unicode today encodes 92,865
           | Chinese characters
           | (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs).
           | 
           | The only reason anybody would think UCS-2 was a good idea
           | was that they did not consult a single Chinese or Japanese
           | scholar on Chinese characters.
        
         | monocasa wrote:
         | These decisions are all older than Windows and weren't made
         | in reaction to it. They were a reaction to the awful
         | mainframe ways of spawning processes, like using JCL.
         | 
         | We've sort of come back to that with kubernetes yaml files to
         | describe how to launch an executable in a specific env and
         | all of the resources it needs. The lineage can be traced
         | explicitly: the Borg paper references mainframes and
         | knowingly calls the language that would be replaced by
         | kubernetes's yaml files 'BCL' instead of z/OS's JCL.
        
         | cryptonector wrote:
         | WIN32 got a few things very right:
         | 
         |     - SIDs
         |     - access tokens (like struct cred / cred_t in Unix
         |       kernels, but exposed as a first-class type to
         |       user-land)
         |     - security descriptors (like owner + group + mode_t +
         |       ACL in Unix land, but as a first-class type)
         |     - HANDLEs, as you say
         |     - HANDLEs for processes
         | 
         | Many other things, Windows got wrong. But the above are far
         | superior to what Unix has to offer.
        
           | al2o3cr wrote:
           | I'd be curious how many of those derive from NT's VMS roots -
           | for instance:
           | 
           | http://lxmi.mi.infn.it/~calcolo/OpenVMS/ssb71/6346/6346p004..
           | ..
        
             | cryptonector wrote:
             | Most of them, as far as I know.
        
         | marwis wrote:
         | Is there any difference between Windows HANDLE and Linux file
         | descriptor? Aren't they both just indexes into a table of
         | objects managed by the kernel?
        
           | cryptonector wrote:
           | HANDLE values are opaque, and generally not reused. Imagine
           | an implementation like this:
           | 
           |     typedef struct HANDLE_s {
           |         uintptr_t ptr;
           |         uintptr_t verifier;
           |     } HANDLE;
           | 
           | where `ptr` might be an index into a table (much like a file
           | descriptor) or maybe a pointer in kernel-land (dangerous
           | sounding!) and `verifier` is some sort of value that can be
           | used by the kernel to validate the `ptr` before
           | "dereferencing" it.
           | 
           | On Unix the semantics of file descriptors are dangerous.
           | EBADF can be a symptom of a very dangerous bug where some
           | thread closed a still-in-use FD, then an open gets the same
           | FD and now maybe you get file corruption. This particular type
           | of bug doesn't happen with HANDLEs.
        
             | marwis wrote:
             | Gotcha. But it looks like file descriptors could be made
             | almost as safe by avoiding index reuse. Is there any
             | reason why it is not done? Hashtable too costly vs array?
        
               | cryptonector wrote:
               | File descriptor numbers have to be "small" -- that's part
               | of their semantics. To ensure this, the kernel is
               | supposed to always allocate the smallest available FD
               | number. A lot of code assumes that FDs are "small" like
               | this. Threaded code can't assume that "no FD numbers less
               | than some number are available", but all code on Unix can
               | assume that generally the used FD number space is dense.
               | Even single-threaded code can't assume that "no FD
               | numbers less than some number are available" because of
               | libraries, but still, the assumption that the used FD
               | number space is dense does get made. This basically
               | forces the reuse of FDs to be a thing that happens.
               | 
               | For example, the traditional implementations of FD_SET()
               | and related macros for select(3) assume that FDs are
               | <1024.
               | 
               | Mind you, aside from select(), not much might break from
               | doing away with the FDs-are-small constraint. Still, even
               | so, they'd better be 64-bit ints if you want to be safe.
               | 
               | HANDLEs are just better.
        
       | lisper wrote:
       | There is something about fork which I have never understood.
       | Maybe someone here can explain it to me.
       | 
       | Why would anyone ever want fork as a primitive? It seems to me
       | that what you really want is a combination of fork and exec
       | because 99% of the time you immediately call exec after fork (at
       | least that's what I do 99% of the time when I use fork). If you
       | _know_ that you 're going to call exec immediately after fork,
       | then all the issues of dealing with the (potentially large)
       | address space of the parent just evaporate because the child
       | process is just going to immediately discard it all.
       | 
       | So why is there not a fork-exec combo? And why has it not
       | replaced fork for 99% of use cases?
       | 
       | And as long as I'm asking stupid questions, why would anyone ever
       | use vfork? If the child shares the parent's address space and
       | uses the same stack as the parent, and the parent has to block,
       | how is that different from a function call (other than being more
       | expensive)?
       | 
       | None of this makes sense to me.
        
         | [deleted]
        
         | kazinator wrote:
         | > So why is there not a fork-exec combo?
         | 
         | posix_spawn
         | 
         | > Why would anyone ever want fork as a primitive?
         | 
         | With fork you can very easily write a server like mini_httpd:
         | 
         | https://acme.com/software/mini_httpd/
         | 
         | Or, in Unix shells:
         | 
         |     # function1 and function2 are shell functions
         |     $ function1 | grep foo | function2
         | 
         | here, the shell must fork a process (without exec) to run one
         | of these functions.
         | 
         | For instance function1 might run in a fork, the grep is a fork
         | and exec of course, and function2 could be in the shell's
         | primary process.
         | 
         | In the POSIX shell language, fork is so tightly integrated that
         | you can access it just by parenthesizing commands:
         | 
         |     $ (cd /path/to/whatever; command) && other command
         | 
         | Everything in the parentheses is a sub-process; the effect of
         | the cd, and any variable assignments, are lost (whether
         | exported to the environment or not).
         | 
         | In Lisp terms, fork makes _everything_ dynamically scoped, and
         | rebinds it in the child's context: except for inherited
         | resources like signal handlers and file descriptors.
         | 
         | Imagine every memory location having *earmuffs* like a defvar,
         | and being bound to its current value by a giant _let_ , and
         | imagine that being blindingly efficient to do thanks to VM
         | hardware.
        
         | dzaima wrote:
         | You might want to move around some file descriptors if you
         | don't want the child process to inherit your
         | stdin/stdout/stderr (e.g. if you want to read the stdout of the
         | process you launched, or give it some stdin).
         | 
         | And there does exist such a fork-exec combo - posix_spawn. It
         | allows adding some "commands" of what file descriptor
         | operations to do between the fork & exec before they're ever
         | done, among some other things. But, as the article mentions,
         | using it is annoying - you have to invoke various
         | posix_spawn_file_actions_* functions, instead of the regular C
         | functions you'd use.
        
         | outworlder wrote:
         | > 99% of the time you immediately call exec after fork
         | 
         | What about forking servers? listen() and then immediately
         | fork() to handle the inbound connection? Those don't need exec.
         | 
         | Also daemons. It's a common pattern to ditch permissions and
         | then fork(), as per the old "Linux Daemon Writing HOWTO".
        
           | lisper wrote:
           | Do people really do that? It sounds like a huge DOS
           | vulnerability to me.
        
         | emmelaich wrote:
         | Well, fork() is simple. No args, simple semantics.
         | 
         | Flexibility; you can set up pipes.
         | 
         | > _why is there not a fork-exec combo_
         | 
         | There is, the spawn calls mentioned.
        
         | jcranmer wrote:
         | `fork` is a classic example, as others have mentioned, of
         | something that was implemented because it was [at the time]
         | easy rather than because it was a good design. In the decades
         | since, we've found there are issues that are caused by the
         | semantics of fork, especially if the most common subsequent
         | system call is `exec`.
         | 
         | If you're designing an OS from scratch, support for `fork` and
         | `exec` as separate system calls is not what you want. Instead,
         | you'd be likely to describe something in terms of a process
         | creation system call, which will have eleventy billion
         | parameters governing all of the attributes of the spawned
         | process.
         | 
         | POSIX specifies a fork+exec combo called posix_spawn. This is
         | actually used a fair amount, but the reason it isn't used more
         | is because it doesn't support all of the eleventy-billion
         | parameters governing all of the attributes of the spawned
         | process. Instead, these parameters are usually set by calling
         | system calls that change these parameters between fork and
         | exec. These system calls might, for example, change the root
         | directory of a process or attach a debugger. Neither of these
         | are supported by posix_spawn, which only allows the common
         | operations of changing the file descriptors or resetting the
         | signal mask in the list of actions to do.
         | 
         | And this suggests why you might want vfork: vfork allows you to
         | write something that looks like posix_spawn: you get to fork,
         | do your new-process-attribute-setting-flags, and then exec to
         | the new process image, all while being able to report errors in
         | the same memory space.
        
         | boring_twenties wrote:
         | Splitting fork and exec allows you to do stuff before calling
         | exec, for example redirecting file descriptors (like
         | stdin/out/err), creating a pipe, modifying the child's
         | environment, and so on.
        
           | loeg wrote:
           | (This is particularly useful for shells.)
        
         | 10000truths wrote:
         | Because there are many, many use cases where you _don't_ want
         | to call exec() immediately after fork().
         | 
         | Want to constrain memory usage or CPU time of an arbitrary
         | child process? You have to call setrlimit() before exec().
         | Privilege separation? Call setuid() before exec(). Sandbox an
         | untrusted child process in some way? Call seccomp() (or your OS
         | equivalent) before exec(). And so on and so forth. Any time you
         | want to change what OS resources the child process will have
         | access to, you'll need to do some set-up work before invoking
         | exec().
        
           | wongarsu wrote:
           | Windows solves this by adding a bunch of optional parameters
           | to CreateProcess, as well as having two more variants
           | (CreateProcessAsUser and CreateProcessWithLogon). Some of the
           | arguments are complicated enough that they have helper
           | functions to construct them.
           | 
           | I like the more composable fork()->modify->exec() approach of
           | unix, but I wouldn't call either of them really elegant.
        
             | ChrisSD wrote:
             | A third way is to grant the parent process access to the
             | child such that they can use the child process handle to
             | "remotely" set restrictions, write memory, start a thread,
             | etc.
        
             | notriddle wrote:
             | That's one option, yes.
             | 
             | The one I've favored while reading these arguments has been
             | the "suspended process" model. The primitives are CREATE(),
             | which takes an executable as a parameter and returns the
             | PID of a paused process, and START(), which allows the
             | process to actually run.
             | 
             | Unix already has the concept of a paused executable, after
             | all.
             | 
             | This model also requires all the process-mutation syscalls,
             | like setrlimit(), to accept a PID as a parameter, but
             | prlimit() wound up being created anyway, because the
             | ability to mutate an already-running process is useful.
        
           | univspl wrote:
           | To me this feels like a call for more powerful language
           | primitives, i.e. a way to specify some action to take to "set
           | up" the child process that's more explicit and readable than
           | one special call behaving in a particularly odd way. I'm imagining
           | closures with some kind of Rust-like move semantics, but not
           | entirely sure.
           | 
           | (if we're speaking in terms of greenfield implementation of
           | OS features)
        
             | lisper wrote:
             | Yeah, this. Why not mkprocess/exec instead of fork/exec?
        
             | retbull wrote:
             | Builder patterns for primitives? I think that seems super
             | cool but then aren't you just building a new language?
        
           | LeifCarrotson wrote:
           | But my child processes are not arbitrary or untrusted,
           | they're hard-coded and written by me!
           | 
           | I'm not writing a shell, I'm writing an application!
        
         | xioxox wrote:
         | I use fork a lot in my Python science programs. It's really
         | great - you can stick it in a loop and get immediate
         | parallelism. It's much better than multiprocessing, etc, as you
         | keep the state from just before the fork happened, so you can
         | share huge data structures between the processes, without
         | having to process the same data again or duplicate them. I've
         | even written a module for processing things in forked
         | processes: https://pypi.org/project/forkqueue/
        
         | khaledh wrote:
         | From "Operating Systems: Three Easy Pieces" chapter on "Process
         | API" (section 5.4 "Why? Motivating The API") [1]:
         | ... the separation of fork() and exec() is essential in
         | building a UNIX shell, because it lets the shell run code
         | after the call to fork() but before the call to exec(); this
         | code can alter the environment of the about-to-be-run
         | program, and thus enables a variety of interesting features
         | to be readily built.
         | 
         | ...
         | 
         | The separation of fork() and exec() allows the shell to do a
         | whole bunch of useful things rather easily. For example:
         | 
         |     prompt> wc p3.c > newfile.txt
         | 
         | In the example above, the output of the program wc is
         | redirected into the output file newfile.txt (the greater-than
         | sign is how said redirection is indicated). The way the shell
         | accomplishes this task is quite simple: when the child is
         | created, before calling exec(), the shell closes standard
         | output and opens the file newfile.txt. By doing so, any
         | output from the soon-to-be-running program wc is sent to the
         | file instead of the screen.
         | 
         | [1] https://pages.cs.wisc.edu/~remzi/OSTEP/cpu-api.pdf
        
         | Animats wrote:
         | Because "fork" was easy to implement in UNIX on the PDP-11.
         | 
         | The original implementation was for a machine with very limited
         | memory. So fork worked by swapping out the process. But then,
         | instead of releasing the in-memory copy, the kernel duplicated
         | the process table entry. So there were now two copies of the
         | process, one in memory and one swapped out. Both were runnable,
         | even if there wasn't enough memory for both to fit at once.
         | Both executed onward from there.
         | 
         | And that's why "fork" exists. It was a cram job to fit in a
         | machine with a small address space.
        
         | garaetjjte wrote:
         | >So why is there not a fork-exec combo?
         | 
         | There is, posix_spawn.
        
         | shadowofneptune wrote:
         | Dennis Ritchie addresses this in a history of early Unix:
         | https://www.bell-labs.com/usr/dmr/www/hist.html
         | 
         | "Process control in its modern form was designed and
         | implemented within a couple of days. It is astonishing how
         | easily it fitted into the existing system; at the same time it
         | is easy to see how some of the slightly unusual features of the
         | design are present precisely because they represented small,
         | easily-coded changes to what existed. A good example is the
         | separation of the fork and exec functions. The most common
         | model for the creation of new processes involves specifying a
         | program for the process to execute; in Unix, a forked process
         | continues to run the same program as its parent until it
         | performs an explicit exec. The separation of the functions is
         | certainly not unique to Unix, and in fact it was present in the
         | Berkeley time-sharing system [2], which was well-known to
         | Thompson. Still, it seems reasonable to suppose that it exists
         | in Unix mainly because of the ease with which fork could be
         | implemented without changing much else."
        
           | lisper wrote:
           | OK, but why has it not been replaced with something better
           | in the intervening 50 years? There have been a lot of
           | improvements to unix since 1970. Why not this?
        
             | cryptonector wrote:
             | It was! vfork() was added to BSD because fork() sucks.
             | 
             | But then someone very opinionated wrote "vfork() Considered
             | Dangerous" and too many people accepted that incorrect
             | conclusion.
        
             | kelnos wrote:
             | It was; ~20 years ago we got posix_spawn(3).
        
         | pm215 wrote:
         | There is exactly a fork-exec combo like that: it's called
         | posix_spawn():
         | https://man7.org/linux/man-pages/man3/posix_spawn.3.html
         | 
         | I think the reason for fork() and exec() as primitives goes
         | back to the early days of Unix design philosophy. Unix tends to
         | favour "easy and simple for the OS to implement" rather than
         | "convenient for user processes to use". (For another example of
         | that, see the mess around EINTR.) fork() in early unix was not
         | a lot of code, and splitting into fork/exec means two simple
         | syscalls rather than needing a lot of extra fiddly parameters
         | to set up things like file descriptors for the child.
         | 
         | There's a bit on this in "The Evolution of the UNIX Time-
         | Sharing System" at
         | https://www.bell-labs.com/usr/dmr/www/hist.html --
         | "The separation of the
         | functions is certainly not unique to Unix, and in fact it was
         | present in the Berkeley time-sharing system [2], which was
         | well-known to Thompson. Still, it seems reasonable to suppose
         | that it exists in Unix mainly because of the ease with which
         | fork could be implemented without changing much else." It says
         | the initial fork syscall only needed 27 lines of assembly
         | code...
         | 
         | (Edit: I see while I was typing that other commenters also
         | noted both the existence of posix_spawn and that quote...)
        
           | lisper wrote:
           | > Unix tends to favour "easy and simple for the OS to
           | implement"
           | 
           | Well, yeah, but the whole problem here, it seems to me, is
           | that fork is _not_ simple to implement precisely _because_ it
           | combines the creation of the kernel data structures required
           | for a process with the actual initiation of the process. Why
           | not mkprocess, which creates a suspended process that has to
           | be started with a separate call to exec? That way you never
           | have to worry about all the hairy issues that arise from
           | having to copy the parent's process memory state.
        
             | pm215 wrote:
             | It was simple specifically for the people writing it at the
             | time. We know this, because they've helpfully told us so
             | :-) It might or might not have been harder than a different
             | approach for some other programmers writing some other OS
             | running on different hardware, but the accidents of history
             | mean we got the APIs designed by Thompson, Ritchie, et al,
             | and so we get what they personally found easy for their
             | PDP7/PDP11 OS...
        
             | cryptonector wrote:
              | fork() was trivial to implement back then. It became non-
              | trivial later, as RAM sizes and resident set sizes grew.
        
         | slaymaker1907 wrote:
         | I think it's actually a pretty useful primitive for doing
         | multiprocessing. Unlike threading, you have a completely
         | separate memory space both for avoiding data races and
         | performance (memory allocators still aren't perfect and weird
         | stuff can happen with cache lines). Unlike exec after fork or
         | anything equivalent, you still get to share things like file
         | descriptors and read only memory for convenience.
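       A minimal sketch of the fork()-as-multiprocessing pattern described
       above (the helper name fork_worker and the squaring workload are
       illustrative, not from the thread): the child inherits the pipe's
       write end, a shared file descriptor, but gets its own copy of
       memory, so its work can't race with the parent's.

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative sketch: run a computation in a forked child, which
 * inherits the pipe's write end (a shared file descriptor) but has
 * its own copy of the parent's memory, so there are no data races.
 * Returns 0 on success with the child's result in `out`. */
static int fork_worker(int input, char *out, size_t outlen)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: do the work, write the answer back, exit. */
        close(fds[0]);
        char buf[32];
        int n = snprintf(buf, sizeof buf, "%d", input * input);
        write(fds[1], buf, (size_t)n);
        _exit(0);
    }

    /* Parent: read the child's result, then reap it. */
    close(fds[1]);
    ssize_t n = read(fds[0], out, outlen - 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    if (n < 0)
        return -1;
    out[n] = '\0';
    return 0;
}
```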
        
         | cryptonector wrote:
         | > Why would anyone ever want fork as a primitive?
         | 
         | > So why is there not a fork-exec combo?
         | 
         | There are so many variations to what you can do with fork+exec
         | that designing a suitable "fork-exec combo" API is really
         | difficult, so any attempts tend to yield a fairly limited API
         | or a very difficult-to-use API, and that ends up being very
         | limiting to its consumers.
         | 
         | On the flip side, fork()+exec() made early Unix development
         | very easy by... avoiding the need to design and implement a
         | complex spawn API in kernel-land.
         | 
         | Nowadays there are spawn APIs. On Unix that would be
         | posix_spawn().
         | 
         | > And as long as I'm asking stupid questions, why would anyone
         | ever use vfork? If the child shares the parent's address space
         | and uses the same stack as the parent, and the parent has to
         | block, how is that different from a function call (other than
         | being more expensive)?
         | 
         | (Not a stupid question.)
         | 
         | You'd use vfork() only to finish setting up the child side
         | before it execs, and the reason you'd use vfork() instead of
         | fork() is that vfork()'s semantics permit a very high
         | performance implementation while fork()'s semantics necessarily
         | preclude a high performance implementation altogether.
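       A sketch of the vfork()-then-exec pattern just described (the
       helper name spawn_and_wait is illustrative, not from the thread).
       The child borrows the parent's address space, so it may do nothing
       between vfork() and exec beyond calling _exit() on failure:

```c
#include <errno.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative vfork()+exec helper: the child shares the parent's
 * address space until exec, so it must do nothing but exec or _exit.
 * Returns the child's exit status, or -1 on error. */
static int spawn_and_wait(const char *path, char *const argv[],
                          char *const envp[])
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(path, argv, envp);  /* replaces the address space */
        _exit(127);                /* exec failed */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)        /* retry only on interruption */
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```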
        
         | indrora wrote:
         | > why would anyone ever want fork as a primitive
         | 
         | Long ago in the far away land of UNIX, fork was a primitive
         | because the primary use of fork was to do more work on the
          | system. You were likely one of three or four other people at
          | any given moment vying for CPU time, and it wasn't uncommon to
         | see loads of 11 on a typical university UNIX system.
         | 
         | > so why is there not a fork-exec combo
         | 
         | you're looking for system(3). Turns out, most people
         | waitpid(fork()). Windows explicitly handles this situation with
         | CreateProcess[0] which does a way better job of it than POSIX
         | does (which, IMO, is the standard for most of the win32 API,
         | but that's a whole can of worms I won't get into).
         | 
         | > why would anyone ever use vfork?
         | 
         | Small shells, tools that need the scheduling weight of "another
         | process" but not for long, etc. See also, waitpid(fork()).
         | 
         | When you have something with MASSIVE page tables, you don't
         | want to spend the time copying the whole thing over. There's a
         | huge overhead to that.
         | 
         | [0] https://docs.microsoft.com/en-
         | us/windows/win32/api/processth...
        
           | anderskaseorg wrote:
           | system(3) is not a good alternative because it indirects
           | through the shell, which adds the overhead of launching the
           | shell as well as the danger of misinterpreting shell
           | metacharacters in the command if you aren't meticulous about
           | escaping them correctly.
        
         | kelnos wrote:
         | > _Why would anyone ever want fork as a primitive? It seems to
         | me that what you really want is a combination of fork and exec
         | because 99% of the time you immediately call exec after fork
          | (at least that's what I do 99% of the time when I use fork)._
         | 
         | If you eliminate fork, then what do you do for those 1% of
          | cases where you actually _do_ need it? I agree that it's
         | uncommon, but I have written code before that calls fork() but
         | then does not exec().
         | 
         | > _So why is there not a fork-exec combo?_
         | 
         | There is; it's called posix_spawn(3).
         | 
         | > _And why has it not replaced fork for 99% of use cases?_
         | 
         | Even though it's been around for about 20 years, it's still
         | newer than fork+exec, so I assume a) many people just don't
         | know about it, or b) people still want to go for maximum
         | compatibility with old systems that may not have it, even if
         | that's a little silly.
        
         | duskwuff wrote:
         | > Why would anyone ever want fork as a primitive?
         | 
         | fork() without exec() can make sense in the context of a
         | process-per-connection application server (like SSH). I've also
         | used it quite effectively as a threading alternative in some
         | scripting languages.
         | 
         | > So why is there not a fork-exec combo?
         | 
         | There is; it's called posix_spawn(). Like a lot of POSIX APIs,
         | it's kind of overcomplicated, but it does solve a lot of the
         | problems with fork/exec.
         | 
         | > And as long as I'm asking stupid questions, why would anyone
         | ever use vfork?
         | 
         | For processes with a very large address space, fork() can be an
         | expensive operation. vfork() avoids that, so long as you can
         | guarantee that it'll immediately be followed by an exec().
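       A minimal sketch of the posix_spawn() "fork-exec combo" mentioned
       above (the wrapper name run_with_spawn is illustrative, not from
       the thread); NULL file actions and attributes mean the child
       simply inherits the parent's descriptors:

```c
#include <spawn.h>
#include <stddef.h>
#include <sys/wait.h>

extern char **environ;

/* Illustrative wrapper: spawn `path` via posix_spawn() (no explicit
 * fork in user code), wait for it, and return its exit status or -1. */
static int run_with_spawn(const char *path, char *const argv[])
{
    pid_t pid;
    /* NULL file_actions and attrp: child inherits fds and signal
     * dispositions from the parent unchanged. */
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (err != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```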
        
           | kazinator wrote:
           | fork with copy-on-write semantics avoids copying the whole
           | address space. It does have to copy some data structures that
            | manage virtual memory and maybe the first level of the paging
            | structure (page directory or whatever).
        
             | cryptonector wrote:
             | copy-on-write == slow when called from threaded processes
             | with large resident set sizes.
        
               | bogomipz wrote:
               | Can you elaborate on this? I understand why copying a
               | large address space might be slow but how or why does the
               | number of threads in a process affects this? Is it
               | scheduling?
        
               | cryptonector wrote:
               | Copy-on-write means twiddling with the MMU, and TLB
               | updates across cores ("TLB shootdowns") can be very
               | expensive. If the process is not threaded, then the OS
               | could make sure to schedule the child and parent on the
               | same CPU to avoid needing TLB shootdowns, but if it's
               | threaded, forget about it.
        
       | evmar wrote:
        | In Ninja, which needs to spawn a lot of subprocesses but is
        | otherwise not especially large in memory and which doesn't use
       | threads, we moved from fork to posix_spawn (which is the "I want
       | fork+exec immediately, please do the smartest thing you can"
       | wrapper) because it performed better on OS X and Solaris:
       | 
       | https://github.com/ninja-build/ninja/commit/89587196705f54af...
        
         | ridiculous_fish wrote:
         | posix_spawn also outperforms fork on Linux under more recent
         | glibc and musl, which can use vfork under the hood.
         | https://twitter.com/ridiculous_fish/status/12328893907639336...
        
       | ismaildonmez wrote:
       | Microsoft Research has a paper about the very same issue (2019):
       | https://www.microsoft.com/en-us/research/publication/a-fork-...
        
         | georgia_peach wrote:
          | That paper smacks of a Chesterton Fence. They haven't come up
          | with a tested replacement for many of the use cases, i.e.:
          | 
          |     These designs are not yet general enough to cover all
          |     the use-cases outlined above, but perhaps can serve as
          |     a starting point...
          | 
          | yet bullet #1 in the next paragraph is
          | 
          |     Deprecate Fork
         | 
         | I think this is a case of security guys being upset about fork
         | gumming-up their experiments. I don't really care about their
         | experiments. The security regime for the past 20 years may have
         | bought us a little more security against eastern bloc hackers,
         | but it hasn't done squat to protect us from Apple, Google, &
         | Microsoft! I have never had a virus de-rail my computing life
         | as much as the automatic Windows 10 upgrade. Robert Morris got
         | 400 hours community service for a relatively benign worm. If
         | that's the penalty scale, Redmond should get actual time in the
         | slammer for Cortana, forced Windows Update, and adding
         | telemetry to Calculator.
        
         | cryptonector wrote:
         | It's a very good paper, yeah. I will link it from the gist.
        
       ___________________________________________________________________
       (page generated 2022-02-28 23:00 UTC)