[HN Gopher] Fork() is evil; vfork() is goodness; afork() would b...
___________________________________________________________________
Fork() is evil; vfork() is goodness; afork() would be better;
clone() is stupid
Author : __s
Score : 237 points
Date : 2022-02-28 17:25 UTC (5 hours ago)
(HTM) web link (gist.github.com)
(TXT) w3m dump (gist.github.com)
| kazinator wrote:
| Concurrently running dupe currently on front page:
| https://news.ycombinator.com/item?id=30499169
|
| :) :) :)
| cryptonector wrote:
| Ha!
| immibis wrote:
| Another option is to allow the parent to create an empty child
| process, and then make arbitrary system calls and execute code in
| the child, like a debugger does. In most cases the last "remote
| system call" would be exec.
| cryptonector wrote:
| posix_spawn() essentially is like that, or can be, as an
| implementation detail.
| infogulch wrote:
| The dense fog lifts, tree branches part, a ray of light beams
| down on a pedestal revealing the hidden intentions of the
| ancients. A plaque states "The operational semantics of the most
| basic primitives of your operating system are designed to
| simplify the implementation of shells." You hesitantly lift your
| eyes to the item presented upon the pedestal, take a pause in
| respect, then turn away slumped and disappointed but not entirely
| surprised. As you walk you shake your head trying to evict the
| after image of a beam of light illuminating a turd.
| ckastner wrote:
| > _"The operational semantics of the most basic primitives of
| your operating system are designed to simplify the
| implementation of shells."_
|
| Yes, but why is this characterized as something negative?
|
| Isn't that the entire point? Operating systems are there to
| serve user requests, and shells are an interface between user
| and OS.
|
| Shells simply developed features that users required of them.
| [deleted]
| psanford wrote:
| I saw a bug once where an application would get way slower on
| MacOS after calling fork(). Not just temporarily either; many
| syscalls would continue to run slowly from the call to fork()
| until the process exited.
|
| Looking on Stack Overflow, I see a few reports of this
| behavior[0][1].
|
| [0]: https://stackoverflow.com/questions/4411840/memory-access-
| af...
|
| [1]: https://stackoverflow.com/questions/27932330/why-is-
| tzset-a-...
| tych0 wrote:
| The problem with this argument is that the set of programs that
| just fork() and then exec() is fairly small. Sure, shells are
| small and do this, but then the article argues that shells are a
| good use of fork().
|
| In larger programs, you're forking because you need to diverge
| the work that's going to be done and probably where it's going to
| be done (maybe you want to create a new pid ns, you need a
| separate mm because you're going to allocate a bunch, whatever).
| Maybe the argument is that programs should never do this? I don't
| buy that. Then there's a lot of string-slinging through exec().
| olliej wrote:
| The vast majority of programs that fork are doing fork()
| followed almost immediately by exec(), to the extent that on
| macOS for example a process is only really considered safe for
| exec() after fork() happens. Pretty much nothing else is
| considered safe.
| NovemberWhiskey wrote:
| Yeah; that would be my assumption too. I worked one time on a
| significant project that benefit from fork() without exec()
| and it was a monstrous pain - only if you own every single
| line of code in your project, have centralized resource
| management, and have no significant library dependencies
| should you ever consider doing this.
| olliej wrote:
| Yeah, you can't depend on pthreads or pthread mutexes
| (they're not defined as being fork safe).
|
| The entirety of Foundation (so presumably anything in
| Swift) is not fork safe either.
|
| To be clear: "not fork safe" in this case means "severely
| constrained environment": e.g. you can do things like set
| resource limits, set up pipes, etc., but good luck with much
| more. It's morally similar to the restrictions you have in a
| signal handler, albeit with a different set of restrictions.
| cryptonector wrote:
| Oh no, there's tons of ProcessBuilder type APIs in Java,
| Python, and... every major language you can think of.
|
| The problems with fork() become very apparent in any Java apps
| that try to run external programs, especially in apps that have
| many threads and massive heaps and are very busy.
| kllrnohj wrote:
| > In larger programs, you're forking because you need to
| diverge the work that's going to be done and probably where
| it's going to be done
|
| That's usually going to be done with clone() instead, no?
| You'll likely want to fiddle with the various flags for those
| usages and are unlikely to be happy with what fork() otherwise
| does.
| pm215 wrote:
| That's backwards from my experience, which is that most users
| of fork() only do "fork; child does small amount of setup, eg
| closing file descriptors; exec". Shells are one of the few
| programs that do serious work in the child, because the POSIX
| shell semantics surface "create a subshell and do..." to the
| shell user, and then the natural way to implement that when
| you're evaluating an expression tree is "fork, and let the
| child process continue evaluating as a long-lived process
| continuing to execute as the same shell binary". (Depending on
| what's in that sub-tree of the expression, it might eventually
| exec, but it equally might not.)
|
| Many years back I worked on an rtos that had no fork(), only a
| 'spawn new process' primitive (it didn't use an MMU and all
| processes shared an address space, so fork would have been
| hard). Most unixy programs were easy to port, because you could
| just replace the fork-tweak-exec sequence with an appropriate
| spawn call. The shells (bash, ash I think were the two I looked
| at) were practically impossible to port -- at any rate, we
| never found it worth the effort, though I think with a lot of
| effort and willingness to carry invasive local patches it could
| have been done.
| lucideer wrote:
| The good/evil/etc. here seem to be defined exclusively around
| "performance above all else", and - more specifically -
| performant primitives over performant application architecture.
|
| It strikes me that performance gains associated with sharing
| address space & stack are similar to many performance gains:
| trade-offs. So calling them "good" and "evil" when performance is
| seemingly your sole goal and interest seems a bit forward.
| cryptonector wrote:
| In my world we often say things like "X is the moral equivalent
| of Y" where X and Y are just technologies and, _clearly_, are
| morally-neutral things.
|
| Why do we do this? Well, because it adds emphasis, and a dash
| of humor.
|
| Clearly fork() is neither Good nor Evil. It's morally neutral.
| It has no moral value whatsoever. But to say "fork() is evil"
| is to cause the audience to raise their eyebrows -"what, why
| would you say fork() is evil?!"- and maybe pay attention.
|
| Yes, there is the risk that the audience might react
| dismissively because fork() obviously is morally-neutral, so
| any claim that it is "evil" must be vacuous or hyperbolic. It's
| a risk I chose to take.
|
| Really, it's a rhetorical device. I think it's pretty standard.
| I didn't create that device myself -- I've seen it used before
| and I _liked_ it.
| pipeline_peak wrote:
| Your idea good
|
| Your idea stupid
|
| I'm not woke by any means, idk what it is about low level
| programming but calling someone's idea "stupid" is a really
| shitty thing to say.
|
| "He chose to take it personally" is the type of lazy, pseudo-
| stoic argument I have no interest in reading.
|
| Yes I'm having a morning, lol.
| cryptonector wrote:
| I answered this here:
| https://news.ycombinator.com/item?id=30504804
|
| It's a rhetorical device. I didn't expect this to -years later-
| become a front-page item on HN. I wrote that to share with
| certain people.
|
| And yes, clone() has some real problems, and if calling it
| "stupid" pisses off some people but maybe also leads others to
| want to improve clone() or create a better alternative, then
| that's fine. If I'd wanted to write an alternative to Linux I'd
| probably have had to deal with the very, very fine language
| that Linus and others use on the Linux kernel mailing lists --
| if you don't like my using the word "stupid", then you really
| shouldn't look there because you're likely to be very
| disappointed. Indeed, not only would I have to accept colorful
| language from reviewers there, I'd probably have to employ some
| such language myself.
|
| TL;DR: clone() came from Linux, where "stupid" is the least
| colorful language you'll find, and me calling it "stupid" is
| just a rhetorical device.
| ridiculous_fish wrote:
| Amusingly, vfork() semantics differ across OSes. This program
| prints 42 on Linux but 1 on Mac: https://godbolt.org/z/jn7Gaf5Me
| because on Linux parent and child share an address space.
| butterisgood wrote:
| I am pretty sure Mac OS doesn't COW fork(), and that the
| address space is copied. At least it was the last time I
| looked. FreeBSD and Linux both seem to COW.
|
| Perhaps there's a reason vfork is different too.
| ridiculous_fish wrote:
| My (very possibly wrong) understanding is that xnu does CoW
| fork but doesn't overcommit, meaning that memory must be
| reserved (perhaps in swap) in case the pages need to be
| duplicated.
|
| There's other complications relating to inheriting Mach ports
| and the mach_task <-> BSD process "duality" in xnu, which
| Linux doesn't have. I'd love for someone to chime in who
| knows more about how this stuff works.
| cryptonector wrote:
| Unfortunately there was this paper from the 80s titled "vfork()
| Considered Dangerous", which led to BSDs removing vfork(), and
| then later it was re-added because that paper was clearly quite
| wrong. But the news hasn't quite filtered through to Apple, I
| guess.
| scottlamb wrote:
| > clone() is stupid ... the clone(2) design, or its maintainers,
| encourages a proliferation of flags, which means one must
| constantly pay attention to the possible need to add new flags at
| existing call sites.
|
| IMHO a bigger problem [2] in practice with clone is that
| (according to glibc maintainers) once your program calls it, you
| can't call any glibc function anymore. [1] Essentially the raw
| syscall is a tool for the libc implementation to use. The libc
| implementation hasn't provided a wrapper for programs to use
| which maintains the libc's internal invariants about things like
| (IIUC) thread-local storage for errno.
|
| The author's aforkx implementation is something that glibc
| maintainers could (and maybe should) provide, but my
| understanding is that you can get in trouble by implementing it
| yourself.
|
| [1] https://github.com/rust-
| lang/rust/issues/89522#issuecomment-...
|
| [2] editing to add: or at least a more _concrete_ expression of
| the problem. Wouldn't surprise me if they haven't provided this
| wrapper in part because the proliferation the author mentioned
| makes it difficult for them to do so.
| saurik wrote:
| One use case for fork()--which is used extensively on Android--is
| to build an expensive template process that can then be
| replicated for later work, which is exactly what people often
| want for the behavior with virtual machines. I wrote an article
| on the history of linking and loading optimizations leading up to
| how Android handles their "zygote" which touches on this
| behavior.
|
| http://www.cydiasubstrate.com/id/727f62ed-69d3-4956-86b2-bc0...
| mrob wrote:
| Fork() is the second worst idea in programming, behind null
| pointers. Fork() is the reason overcommit exists, which is the
| reason my web browser crashes if I open too many tabs, and the
| reason the "safe" Rust programming language leaves software
| vulnerable to DOS attacks if it uses the standard library. It's a
| clear example of "worse is worse", and we should have switched to
| the Microsoft Windows model decades ago.
|
| Here's a paper from Microsoft Research supporting this point of
| view:
|
| https://www.microsoft.com/en-us/research/uploads/prod/2019/0...
| silon42 wrote:
| Agreed about overcommit and resulting mess.
| alerighi wrote:
| Why is overcommit a problem? A program is unlikely to use all
| the memory that it allocates, or it will use it only at a later
| time. Not having overcommit would be a waste: it would mean
| having a ton of RAM that never gets used, because a lot of
| programs allocate more RAM than they will probably ever need.
| And it would be inefficient, costly and error-prone to use
| dynamic memory allocation for everything.
|
| The cause of your browser crash is not overcommit; it is simply
| the fact that you don't have enough memory. If you disable
| overcommit (something you can do on Linux) you would see the
| same crash earlier, before you had allocated (not necessarily
| used) 100% of your RAM (because really no software handles the
| dynamic-memory failure condition, i.e. malloc returning NULL,
| which you can't handle reasonably anyway).
|
| Null pointers are not a mistake: how else do you signal the
| absence of a value? How do you signal the failure of a function
| that returns a pointer, without having to return a struct with
| a pointer and an error code (which is inefficient, since the
| return value doesn't fit in a single register)? NULL makes
| perfect sense as a value to signal "this pointer doesn't point
| to something valid".
|
| Microsoft saying that fork() was a mistake... well, of course,
| because Windows doesn't have it. fork was a good idea, and that
| is the reason it's still used these days. Of course nowadays
| there are evolutions: on Linux there is the clone system call
| (fork is deprecated and kept for compatibility reasons; the
| glibc fork is implemented with the clone system call). But the
| concept of creating a process by cloning the resources of the
| parent always seemed very elegant to me.
|
| In reality fork is something that (if I remember correctly; I
| don't have that much experience programming on Windows) doesn't
| exist on Windows, and the only way to create a new process of
| the same program is to launch the executable again and pass
| parameters on the command line, which is not great for
| efficiency at all, and can also have its problems (for example,
| the executable was deleted or renamed while the program was
| running). Windows doesn't have the concept of exec either,
| though I think it can be emulated in software (while fork
| can't).
|
| To me it makes perfect sense to separate the concepts of
| creating a new process (fork/clone) and loading an executable
| from disk (exec). It gives a lot of flexibility, at a cost that
| is not that high (and there are alternatives to avoid it, such
| as vfork, variations of the clone system call, or directly a
| higher-level API such as posix_spawn).
| mrob wrote:
| >Null pointers are not a mistake
|
| The inventor, Tony Hoare, famously called them his "billion-
| dollar mistake". The better way to do it is with nullable
| types (which could internally represent null as 0 as a
| performance optimization). This is something Rust gets right.
| alerighi wrote:
| Nullable types... they have the same problems as null
| pointers: if you don't care about handling the case they
| are null the program will crash, if you handle it, you can
| handle it also for null pointers. Well, they have a nicer
| syntax, and that's it. How much Rust code is full of
| `.unwrap()` because programmers are lazy and don't want to
| check each optional to see if it's valid? Or simply don't
| care about it, since having the program crash on an
| unexpected condition is not the end of the world.
| initplus wrote:
| I think much of the confusion around nulls stems from the
| fact that in mainstream languages pointers are overloaded for
| two purposes: for passing values by reference, and for
| optionality.
|
| Nearly every pointer bug is caused by the programmer wanting
| one of these two properties, and not considering the
| consequences of the other.
|
| Non-nullable references and pass-by-value optionals can
| replace many usages of pointers.
| alerighi wrote:
| Yes, and they are just two usages of pointers. The fact is
| that, whatever you call it, null pointer, nullable
| reference, optional, you have to put in a language a
| concept of "reference to an object that can reference a non
| valid object".
| nqzero wrote:
| unpopular opinion: null pointers (in at least java and c) are
| the single greatest metaphor in software development, and are
| the CS analog to the invention of zero
| im3w1l wrote:
| There was an article about exceptions the other day that
| lamented that exceptions are high latency because the
| exceptional path will be paged out. I would assume overcommit
| is to blame for that too.
| mort96 wrote:
| Why would you assume that..?
| oconnor663 wrote:
| That's probably a caching issue, and caching issues are a
| fact of life for the foreseeable future. (Could also be a
| disk swap issue, but probably not.)
| nick_ wrote:
| Interesting take. If you don't mind explaining, what is the MS
| Windows model in in this context?
| mrob wrote:
| You opt into inheriting specific contexts from the parent,
| instead of copying everything by default:
|
| https://docs.microsoft.com/en-
| us/windows/win32/api/processth...
| manwe150 wrote:
| More importantly, all syscalls also take a target process
| as an argument, making the Windows version both simpler and
| more powerful than can be done with fork. Spawn is also a
| lot slower on Windows, but that is an implementation issue.
| kllrnohj wrote:
| > Spawn is also a lot slower on Windows, but that is an
| implementation issue.
|
| afaik _most_ of that slowdown is because malware scanners
| (including Windows Defender) hook spawn to do blocking
| verification of what to launch. Which is an issue also
| present on eg. MacOS, and why it's also kinda slow to
| launch new processes (and can be subject to extreme
| latencies): https://www.engadget.com/macos-slow-apps-
| launching-221445977...
|
| Which is yes an implementation problem, but also a
| problem that potentially changes/impacts the design. Like
| maybe it'd make sense to get a handle to a pre-verified
| process so that repeated spawns of it don't need to hit
| that path (for eg. something like Make or Ninja that just
| spam the same executable over and over and over again).
| Or the kernel/trusted module needs to in some way be
| involved & can recognize that an executable was already
| scanned & doesn't need to be re-scanned.
| [deleted]
| indrora wrote:
| Windows doesn't have fork as you know it. It has a POSIX-ish
| fork-alike for compliance, but under the hood it's
| CreateThread[0] with some Magic.
|
| In Windows, you create the thread with CreateThread and are
| passed back a handle to that thread. You can then query the
| state of the thread using GetExitCodeThread[1], or, if you
| need to wait for the thread to finish, you call
| WaitForSingleObject[2] with an infinite timeout.
|
| Aside: WaitForSingleObject is how you track _a bunch_ of
| stuff: semaphores, mutexes, processes, events, timers, etc.
|
| The flipside of this is that Windows processes are buckets of
| handles: a Process object maintains a series of handles to
| (threads, files, sockets, WMI meters, etc), one of which
| happens to be the main thread. Once the main thread exits,
| the system goes back and cleans up (as it can) the rest of
| the threads. This is why sometimes you can get zombie'd
| processes holding onto a stuck thread.
|
| This is also how it's a very cheap operation to interrogate
| what's going on in a process ala Process Explorer.
|
| If I had to describe the difference between Windows and Linux
| at a process-model level, I have to back up to the
| fundamental difference between the Linux and Windows
| _programming_ models: Linux is a kernel that has to hide
| its inner workings for its safety and security, passing
| wrapped versions of structures back and forth through the
| kernel-userspace boundary; Windows is a kernel that considers
| each portion of its core separated, isolated through ACLs,
| and where a handle to something can be passed around without
| worry. The Windows ABI has been so fundamentally stable over
| 30 years now because so much of it is built around
| controlling object handles (which are allowed to change under
| the hood) rather than manipulation of kernel primitives
| through syscalls.
|
| Early WinNT was very restrictive and eased up a bit as
| development continued so that win9x software would run on it
| under the VDM. Since then, most windows software insecurities
| are the result of people making assumptions about what will
| or won't happen with a particular object's ACL.
|
| There's a great overview of windows programming over at [3].
| It covers primarily Win32, but gets into the NT kernel
| primitives and how it works.
|
| A lot of work has gone into making Windows an object-oriented
| kernel; where Linux has been looking at C11 as a "next step"
| and considering if Rust makes sense as a kernel component,
| Windows likely has leftovers of Midori and Singularity [4]
| lingering in it that have gone onto be used for core
| functionality where it makes sense.
|
| [0] https://docs.microsoft.com/en-
| us/windows/win32/api/processth... [1]
| https://docs.microsoft.com/en-
| us/windows/win32/api/processth... [2]
| https://docs.microsoft.com/en-
| us/windows/win32/api/synchapi/... [3]
| https://www.tenouk.com/cnwin32tutorials.html [4]
| https://www.microsoft.com/en-
| us/research/project/singularity...
| cryptonector wrote:
| Overcommit exists any time you can have a debugger anyway.
|
| fork() was a brilliant way to make Unix development easy in the
| 70s: it made it trivial to move a lot of development activity
| out of the kernel and into user-land.
|
| But with it came problems that only became apparent much later.
| oconnor663 wrote:
| > the reason the "safe" Rust programming language leaves
| software vulnerable to DOS attacks if it uses the standard
| library
|
| Linux overcommitment is often cited as an argument for the
| "panic on OOM" design of the allocating parts of the Rust
| standard library, and it's an important part of the story. But
| I think even if the Linux defaults were different, Rust would
| still have gone with the same design. For example, here's Herb
| Sutter (who works for Microsoft) arguing that C++ would benefit
| from aborting on allocation failure:
| https://youtu.be/ARYP83yNAWk?t=3510. The argument is that the
| vast majority of allocations in the vast majority of programs
| don't have any reasonable options for handling an alloc failure
| besides aborting. For languages like C++ and Rust, which want
| to support large, high-level applications in addition to low-
| level stuff, making programmers litter their code with explicit
| aborts next to every allocation would be really painful.
|
| I think it's very interesting that Zig has gone the opposite
| direction. It could be that writing big applications with lots
| of allocs ends up feeling cumbersome in Zig, or it could be
| that they bend the curve. Fingers crossed.
| JoeAltmaier wrote:
| The whole idea of fork is strange - the design pattern of "child
| process is executing exactly where the parent process is
| executing" is foreign to me. Don't we want to direct where the
| child process is executing? Like, when creating a thread? Why is
| fork() so conceptually orthogonal to that? Is there a good
| reason? A historical reason?
|
| I don't find fork() to be obvious or useful or natural. I work
| hard to never do it.
| zamalek wrote:
| It's a leaky abstraction and everything it does can be done
| manually, and possibly better. It exists purely because, at
| some point in the past, threads didn't exist.
|
| If you design your program without fork, you'll probably end up
| with a cleaner and faster solution. Some things are best
| forgotten or never learned in the first place.
| alerighi wrote:
| A thread is not the same thing as a process. There are
| situations where a thread is fine, and others where you
| need a process.
| _flux wrote:
| Can it though?
|
| The beauty of (v)fork(+exec) is that it doesn't need a new
| interface for configuring the environment in whichever way
| you want before the other process starts. Instead you get to
| use the exact same means of modifying the environment to your
| needs, and once it's done, you can call exec and the new
| process inherits those things.
|
| I mean, just look at the interface of posix_spawn.
|
| I grant though that this isn't without its problems
| (including performance), and IMO e.g. FD_CLOEXEC is one
| example of how those problems can be patched up. It's like
| the reverse problem: the implicit interface is too wide,
| and then you need to come up with all these ways to be
| explicit about some things.
| shadowofneptune wrote:
| fork()-exec() separation indeed exists for historical reasons:
| https://www.bell-labs.com/usr/dmr/www/hist.html
|
| Search for the phrase "Process control in its modern form was
| designed and implemented within a couple of days."
| kccqzy wrote:
| If you want the child to start executing some other code but
| you have fork(), it's easy to do it yourself by calling that
| function.
|
| But on the other hand, if you do want the child to execute code
| at the same place as the parent, but a hypothetical fork() asks
| you to provide a function pointer, it would be a bit more
| complicated.
| alerighi wrote:
| It makes creating processes easy to me, when you did understand
| how it works: while (1) { int
| client_socket = accept(socket, &client_addr, &client_len);
| if (client_socket > 0) { pid_t pid = fork();
| if (pid < 0) { // handle error
| } if (pid == 0) {
| handle_connection(client_socket, &client_addr);
| } } else { // handle error
| } }
|
| No need to do complex things to start a new process, having to
| pass argument to it in some way, etc.
| monocasa wrote:
| Hard disagree to most of this.
|
| fork(2) makes a lot more sense when you realize its heritage. It
| came from a land before Unix supported full MMUs. In this model,
| to still have per process address spaces and preemptive
| multitasking on what was essentially a PC-DOS level of hardware,
| the kernel would checkpoint the memory for a process, slurp it
| all out to DECtape or some such, and load in the memory for
| whatever the scheduler wanted to run next. Its simplicity of
| being process-checkpoint based wasn't a reaction to Windows
| style calls (which wouldn't exist for almost a couple of
| decades), but instead to mainframe process-spawning
| abominations like JCL. The idea "you probably want most of
| what you have, so force a checkpoint, copy the checkpoint into
| a new slot, and continue separately from both checkpoints" was
| soooo much better than JCL and its tomes of incantations to do
| just about anything.
|
| vfork(2) is an abomination. Even when the child returns, the
| parent now has a heavily modified stack if the child didn't
| immediately exec(). All of the bugs that causes are super fun
| to chase, lemme tell you. AFAIC, about the only valid use for
| vfork now is nommu systems, where fork() is incredibly
| expensive compared to what is generally expected.
|
| clone(2) is great. Start from a checkpoint like fork, but instead
| of semantically copying everything, optionally share or not based
| on a bitmask. Share a tgid, virtual address space, and FD table?
| You just made a thread. Share nothing? You just made a process.
| It's the most 'mechanism, not policy' way I've seen to do context
| creation outside of maybe the L4 variants and the exokernels.
| This isn't an old holdover; this is how threads work today:
| processes spawned that happen to share resources. Modern archs on
| linux don't even have a fork(2) syscall; it all happens through
| clone(2). Even vfork is clone set to share virtual address space
| and nothing else that fork wouldn't share. Namespaces are a way
| to opt into not sharing resources that normally fork _would_
| share.
|
| And I don't see what afork gets you that clone doesn't, except
| afork isn't as general.
| [deleted]
| sys_64738 wrote:
| The intent of fork() is to start a new process in its own address
| space. That *fork() variations that run in the SAME address space
| are confusing. A use case today for fork() might also be
| sandboxing apps. Certainly I expect browsers use this approach to
| spawn unique pages. But generally fork() is very specific from my
| recollection.
| cryptonector wrote:
| > The intent of fork() is to start a new process in its own
| address space.
|
| True!
|
| > That *fork() variations that run in the SAME address space
| are confusing.
|
| Why is it confusing? They are distinct and different system
| calls, with different semantics. They are also sufficiently
| similar that they are also similarly named. But there's nothing
| confusing about their _semantics_. vfork() is *not* harder to
| use than fork() -- it's just subtly different.
|
| > A use case today for fork() might also be sandboxing apps.
| Certainly I expect browsers use this approach to spawn unique
| pages.
|
| I wouldn't expect that. Sandboxing is a large and complex
| topic.
| cryptonector wrote:
| Well, I'm surprised to see this on the front page, let alone as
| #1. Ask me anything.
|
| EDIT: Also, don't miss @NobodyXu's comment on my gist, and don't
| miss @NobodyXu's aspawn[1].
|
| [0] https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234?permalink_comment_id=3467980#gistcomment-3467980
| [1] https://github.com/NobodyXu/aspawn/
| Lerc wrote:
| Since you said anything... This is not strictly related to the
| article but your expertise seems to be in the right area.
|
| I have a process that executes actions for users, at the moment
| that process runs as root until it receives a token indicating
| an accepted user, then it fork()s and the fork changes to the
| UID of the user before executing the action.
|
| Is there a better way? I hadn't actually heard of vfork()
| before reading this article. I'm guessing maybe you could do a
| threaded server model where each thread vfork()s. I'm not
| really aware what happens when threads and forks combine. Does
| the v/fork() branch get trimmed down to just that one thread?
| If so what happens to the other thread stacks? It feels like a
| can of worms.
| cryptonector wrote:
| If the parent is threaded, then yes, vfork() will be better.
| You could also use posix_spawn().
|
| As to "becoming a user", that's a tough one. There are no
| standard tools for this on Unix. The most correct way to do
| it would be to use PAM in the child. See su(1) and sudo(1),
| and how they do it.
|
| > I'm not really aware what happens when threads and forks
| combine. Does the v/fork() branch get trimmed down to just
| that one thread? If so what happens to the other thread
| stacks? It feels like a can of worms.
|
| Yes, fork() only copies the calling thread. The other
| threads' stacks also get copied (because, well, you might
| have pointers into them, who knows), but there will only be
| one thread in the child process.
|
| vfork() also creates only one thread in the child.
|
| There used to be a forkall() on Solaris that created a child
| with copies of all the threads in the parent. That system
| call was a spectacularly bad idea that existed only to help
| daemonize: the parent would do everything to start the
| service, then it would forkall(), and on the parent side it
| would exit() (or maybe _exit()). That is, the idea is that
| the parent would not finish daemonizing (i.e., exit) until
| the child (or grandchild) was truly ready. However, there's
| no way to make forkall() remotely safe, and there's a much
| better way to achieve the same effect of not completing
| daemonization until the child (or grandchild) is fully ready.
|
| In fact, the daemonization pattern of not exiting the parent
| until the child (or grandchild) is ready is very important,
| especially in the SMF / systemd world. I've implemented the
| correct pattern many times now, starting in 2005 when project
| Greenline (SMF) delivered into OS/Net. It's this: instead of
| calling daemon(), you need a function that calls pipe() and then
| fork() (or vfork()). On the parent side it calls read() on the
| read end of the pipe; on the child side it returns immediately so
| the child can do the rest of the setup work. When that work is
| done, the child writes one byte into the write side of the pipe
| to tell the parent it's ready, and the parent can finally exit.
| ahmedalsudani wrote:
| No questions yet as I am yet to read ... but I can already
| comment and say grade A title.
| cryptonector wrote:
| It's a bit opinionated. It's meant to get a reaction, but
| also to have meaningful and thought-provoking content, and I
| think it's correct in the main too. Anyways, hope you and
| others enjoy it.
| ahmedalsudani wrote:
| That was a great read. Thank you for writing it up; I
| learned quite a few things!
|
| Especially appreciated the OS minutiae and opinionated
| commentary (... and the doc vs reality observation in
| Linux's vfork).
|
| The piece lives up to the great title :)
| disgruntledphd2 wrote:
| What do you mean by zones/jails and why are they better
| than containers?
| cryptonector wrote:
| Zones -> Solaris/Illumos Zones
|
| Jails -> BSD jails
|
| They're software VMs. It's a lot like containers, yes.
|
| The problem with containers is that the construction
| toolkit for them is subtractive ("start by cloning my
| environment, then remove / replace various namespaces"),
| while the construction toolkit for zones/jails is
| additive ("start with an empty universe, and add
| namespaces or share them with the parent").
|
| Constructing containers subtractively means that every
| time there's a new kind of namespace to virtualize, you
| have to update all container-creating tools or risk a
| security vulnerability.
|
| Constructing containers additively from an empty universe
| means that every time there's a new kind of namespace to
| virtualize, you have to update all container-creating
| tools or risk not getting sharing that you want (i.e.,
| breakage).
|
| I'm placing a higher value on security. Maybe that's a
| bad choice. It's not like breaking is a good thing -- it
| might be just as bad as creating a security
| vulnerability.
| ape4 wrote:
| Yes, if we were starting again today, we wouldn't do containers
| as they are now.
| mywacaday wrote:
| "I won't bother explaining what fork(2) is -- if you're reading
| this, I assume you know.", If that applied to everything I looked
| at from HN I'd read precious little.
| cryptonector wrote:
| I didn't write it for HN. It wasn't a paper to publish in some
| Computer Science journal. It was just a github gist. If you
| don't get the subject, it's not for you. I might well write a
| paper now based on it, and then it might be a good read for
| you, but I still won't be writing it for you, but for people
| who are interested in the topic. The intended audience is
| small, expert on the matter, and probably even more opinionated
| than I am.
| throwaway984393 wrote:
| I don't think containers should be like jails. Containers should
| be more like chroots than they are now.
|
| Have you ever tried to run a modern X/whatever app with 3D
| graphics and audio and DBUS and God knows what else in a
| container and get it to show up on your desktop? It's a fucking
| nightmare. I spent over a week trying to get 1Password to run in
| a container. Somebody decided containers had to be "secure", even
| though they don't actually exist as a single concept and security
| was never their primary purpose. If instead containers were used
| _only to isolate filesystem dependencies_, we could actually
| pretend containers were normal applications and treat them with
| the same lack of security concern that we apply to all the rest
| of our non-containerized programs.
|
| Firecracker is the correct abstraction for isolation: a micro-VM.
| That is the model you want if you want to run an app securely
| (not to mention reliably, as it can come with its own kernel,
| rather than needing you to run a compatible host kernel).
| aylmao wrote:
| Meta comment: Github Gist seems to be great for blogging. Yeah,
| the UI is not very blog-specific, but it has all the useful
| features, and then some: markdown, comments, hosting, an index of
| all posts, some measure of popularity (stars), a very detailed
| edit history, etc.
|
| All without having to pay for or set up anything yourself.
| JoshTriplett wrote:
| Unfortunately, there's no way to turn _off_ comments on a Gist,
| which makes it not a viable replacement for anyone who doesn't
| want to spend a lot of time processing and moderating comments.
| jph wrote:
| Is it a fair point to implement first with fork() because of
| memory protection, then optimize by using benchmarks and
| potentially vfork() for speed? Benchmark areas can look at
| synchronous locks, copy-on-write memory, stack sharing, etc.
|
| What are the good practices of security tradeoffs of fork() vs.
| vfork() especially in terms of ease of writing correct code? I'd
| thought that fork() + exec() tends to favor thinking about
| clearer separation/isolation. For example I've written small
| daemons using fork() + exec() because it seems safe and easy to
| do at the start.
| yakubin wrote:
| In short, fork() mixes poorly with multi-threaded code (and has
| some security footguns like needing to explicitly unshare
| elements of environment which may be sensitive, such as file
| descriptors (suddenly you need to know all the file descriptors
| used in the whole program from a single place in code)). Here
| is a well-written comment about fork() from David Chisnall:
| <https://lobste.rs/s/cowy6y/fork_road_2019#c_zec42d>
|
| Additionally, the fork()+exec() idiom practically forces OS
| designers into a corner where they simply have to implement
| Copy-on-Write for virtual memory pages, or otherwise the whole
| userspace using this idiom is going to be terribly slow.
| Without the fork()+exec() idiom you don't need CoW to be
| efficient.
| jerf wrote:
| Fork mixes so poorly with multithreaded code that a lot of
| modern languages that are built from the beginning with
| threads of one sort or another in mind, like Go, simply won't
| let you do it. There is no binding to fork in the standard
| library.
|
| I think you could bash it together yourself with raw
| syscalls, because that can't really be stopped once you have
| a syscall interface, but basically the Go runtime is built
| around assuming it won't be forked. I have no idea what would
| happen to even a "single threaded" Go program if you forked
| it, and I have no intention of finding out. The lowest level
| option given in the syscall package is ForkExec:
| https://pkg.go.dev/syscall#ForkExec And this is a package
| that will, if you want, create new event loops outside of the
| Go runtime's control, set up network connections outside of
| the runtime's control, and go behind the runtime's back in a
| variety of other ways... but not this one. If you want this,
| you'll be looking up numbers yourself and using the raw
| Syscall or RawSyscall functions.
| pcwalton wrote:
| > I have no idea what would happen to even a "single
| threaded" Go program if you forked it, and I have no
| intention of finding out.
|
| I'm not an expert on Go internals, but the GC in Go is
| multithreaded, so I would assume forking will kill the GC.
| Better hope it's not holding any mutexes.
| kevincox wrote:
| TL;DR if another thread is holding a lock when you fork that
| lock will be stuck locked in the child, but that thread that
| was using that lock no longer exists.
|
| So if your multi-threaded program uses malloc you may fork
| while a global allocation lock is being held and you won't be
| able to use malloc or free in the child (thread-local caches
| aside).
|
| There are other problems but this is the basic idea. To be
| fork-safe you need to allow any thread to just disappear (or
| halt forever) at any point in your program.
| kazinator wrote:
| malloc has to guard its locks against fork, probably using
| pthread_atfork, or some lower level internal API related to
| that.
|
| The problem with pthread_atfork is third party libs.
|
| YOU will use it in YOUR code. The C library will correctly
| use it in its code. But you have no assurance that any
| other libraries are doing the right things with their
| locks.
| dgrunwald wrote:
| Your "third party libs" includes system libraries like
| libdl.
|
| We had a Python process using both threads (for stuff
| like background downloads, where the GIL doesn't hurt)
| and multiprocessing (for CPU-intensive work), and found
| that on Linux, the child process sometimes deadlocks in
| libdl (which Python uses to import extension modules).
|
| The fix was to use
| `multiprocessing.set_start_method('spawn')` so that
| Python doesn't use fork().
| kazinator wrote:
| libdl is a component of glibc; that needs to be debugged.
| kllrnohj wrote:
| The more stuff that piles on using pthread_atfork, the more it
| contributes to fork() being unnecessarily slow for the specific
| combination of fork+exec.
| kazinator wrote:
| Right, and so POSIX "fixed" that by standardizing
| posix_spawn. Thus fork is now mainly for those scenarios
| in which exec is not called, plus traditional coding that
| is portable to old systems.
| mark_undoio wrote:
| Also if, for any reason, you end up doing a `fork()`
| syscall directly rather than via libc you'll still have a
| problem as appropriate cleanup won't happen.
|
| Of course, the best answer to that is usually going to be
| "don't do that"!
| kazinator wrote:
| fork came first; it's POSIX threads that is a bolted-on piece
| of clunk that mixes badly with fork, signal handlers, chdir,
| ...
| londons_explore wrote:
| I was always disappointed by the performance of fork()/clone().
|
| CompSci class told me it was a very cheap operation, because all
| the actual memory is copy-on-write, so its a great way to do all
| kinds of things.
|
| But the reality is that duplicating huge page tables and
| hundreds of file handles is very slow: tens of milliseconds
| for a big process.
|
| And then the process runs slowly for a long time after that
| because every memory access ends up causing lots of faults and
| page copying.
|
| I think my CompSci class lied to me... it might _seem_ cheap and
| a neat thing to do, but the reality is there are very few
| usecases where it makes sense.
| mark_undoio wrote:
| Agreed that these costs can be larger than is perhaps implied
| in compsci classes (though it's possible that they've changed
| their message since I took them!)
|
| I suppose it is still essentially free for some common uses -
| e.g. if a shell uses `fork()` rather than one of the
| alternatives it's unlikely to have a very big address space, so
| it'll still be fast.
|
| My experience has been that big processes - 100+GB - which are
| now pretty reasonable in size really do show some human-
| perceptible latency for forking. At least tens of milliseconds
| matches my experience (I wouldn't be surprised to see higher).
| This is really jarring when you're used to thinking of it as
| cost-free.
|
| The slowdown afterwards, resulting from copy-on-write, is
| especially noticeable if (for instance) your process has a high
| memory dirtying rate. Simulators that rapidly write to a large
| array in memory are a good example here.
|
| When you really need `fork()` semantics this could all still be
| acceptable - but I think some projects do ban the use of
| `fork()` within a program to avoid unexpected costs. If you
| really have a big process that needs to start workers I guess
| it might be worth having a small daemon specifically for doing
| that.
| cryptonector wrote:
| Right, shells are not threaded and they tend to have small
| resident set sizes. Even in shells though, there's no reason
| not to use vfork(), and if you have a tight loop over
| starting a bunch of child processes, you might as well use
| it. Though, in a shell, you do need fork() in order to
| _trivially_ implement sub-shells.
|
| fork() is most problematic for things like Java.
| cryptonector wrote:
| Copy-on-write is _supposed_ to be cheap, but in fact it's not.
| MMU/TLB manipulations are very slow. Page faults are slow. So
| the common thing now is to just copy the entire resident set
| size (well, the writable pages in it), and if that is large,
| that too is slow.
| smasher164 wrote:
| Also, mandating copy-on-write as an implementation strategy is
| a huge burden to place on the host. Now you've made the amount
| of memory a process is using unquantifiable.
| immibis wrote:
| You also mandate a system complex enough to have an MMU.
| vgel wrote:
| It's not necessarily unquantifiable -- the kernel can count
| the not-yet-copied pages pessimistically as allocated memory,
| triggering OOM allocation failures if the amount of
| _potential_ memory usage is greater than RAM. IIUC, this is
| how Linux vm.overcommit_memory[1] mode 2 works, if
| overcommit_ratio = 100.
|
| However, if an application is written to assume that it can
| fork a ton and rely on COW to not trigger OOM, it obviously
| won't work under mode 2.
|
| [1] https://www.kernel.org/doc/Documentation/vm/overcommit-
| accou...
|
| > 2 - Don't overcommit. The total address space commit for
| the system is not permitted to exceed swap + a configurable
| amount (default is 50%) of physical RAM.
|
| > Depending on the amount you use, in most situations this
| means a process will not be killed while accessing pages but
| will receive errors on memory allocation as appropriate.
|
| > Useful for applications that want to guarantee their memory
| allocations will be available in the future without having to
| initialize every page.
| smasher164 wrote:
| You're right, "unquantifiable" was the wrong word here. I
| meant, a program has no real way of predicting/reacting to
| OOM. I didn't realize mode 2 with overcommit_ratio = 100
| behaved that way, thanks for sharing.
| vgel wrote:
| Yeah, I think in a practical sense you're right, since
| AFAIK using mode 2 is fairly rare because most software
| assumes overcommit, and even if a program _is_ written
| with an understanding that malloc can return NULL, it's in
| the sense of:
|     if (!(ptr = malloc(...))) { exit(1); }
| cryptonector wrote:
| POSIX doesn't require that fork() be implemented using copy-
| on-write techniques. An implementation is free to copy all of
| the parent's writable address space.
| vgel wrote:
| CS classes (and, far too often, professional programmers) talk
| about computers like they're just faster PDP-11s with
| fundamentally the same performance characteristics.
| albertzeyer wrote:
| We had the case that some library we were using (OpenBLAS) used
| pthread_atfork. Unfortunately, the atfork handler was buggy
| in certain situations involving multiple threads and caused a
| crash. This was annoying because we basically did not need fork
| at all but just fork+exec (for various other libraries spawning
| sub processes), where those atfork handlers would not be
| relevant.
|
| Our solution was to override pthread_atfork to ignore any
| registered functions, and in case that is not enough, to also
| override fork itself to do the syscall directly without calling
| the atfork handlers.
|
| https://github.com/tensorflow/tensorflow/issues/13802
| https://github.com/xianyi/OpenBLAS/issues/240
| https://trac.sagemath.org/ticket/22021
| https://bugs.python.org/issue31814
| https://stackoverflow.com/questions/46845496/ld-preload-and-...
| https://stackoverflow.com/questions/46810597/forkexec-withou...
| tiffanyh wrote:
| Slightly off topic: how does Erlang handle this? Isn't it
| known for having extremely fast & cheap process spawning baked in
| (with isolation)?
| __s wrote:
| m:n threads (aka green threads)
| https://twitter.com/joeerl/status/1010485913393254401
| tiffanyh wrote:
| I realize that tweet is from the authority himself but am I
| mistaken in my understanding ...
|
| I thought green threads share memory but Erlang processes do
| NOT share memory, which is what makes Erlang so unique.
|
| Did Erlang create a so called "green process"? If so, why
| can't this model be implemented in the kernel?
| masklinn wrote:
| > I thought green threads share memory but Erlang processes
| do NOT share memory, which is what makes Erlang so unique.
|
| Erlang processes don't share memory because the language
| and vm don't give primitives which let you do it. They all
| exist within the same address space (e.g. large binaries
| are reference-counted and stored on a shared heap,
| excluding clustering obviously).
|
| > Did Erlang create a so called "green process"?
|
| Yes.
|
| > If so, why can't this model be implemented in the kernel?
|
| Because erlang processes are not an antagonistic model, and
| the _language_ restricts the ability to attack the VM
| (kinda, I'm sure you could write NIFs to fuck up
| everything, you just don't have any reason to as an
| application developer).
| butterisgood wrote:
| Erlang processes aren't unix processes. They're more like
| coroutines.
| jerf wrote:
| In another comment, I observe how Go doesn't even have a
| binding to fork.
|
| Erlang is another example of that. There is no standard library
| binding to the fork function. If someone were to bash one into
| a NIF, I have no idea what would happen to the resulting
| processes, but there's no good that can come of it. (To use
| Star Trek, think less good and evil Kirk and more "What we got
| back, didn't live long... fortunately.") Despite the
| terminology, all Erlang processes are green threads in a single
| OS process.
| dragonwriter wrote:
| > Despite the terminology, all Erlang processes are green
| threads in a single OS process.
|
| The main Erlang runtime uses an M:N Erlang:native process
| model, not an N:1. So Erlang processes are like green threads
| (they are called processes instead of threads because they
| are shared-nothing), but not in a single process.
| tiffanyh wrote:
| I mentioned this somewhere else but I thought Erlang does NOT
| share memory.
|
| Doesn't that make Erlang a bit unique? It has the ability to
| spawn a new process extremely fast AND also have memory
| isolation. This combination is what the OP was wanting to
| achieve.
| jerf wrote:
| Erlang _mostly_ doesn't share memory between its
| Erlang processes, but it does this by making it so there's
| simply no way, at the Erlang level, of even writing code
| that refers to the memory in another Erlang process. It's
| an Erlang-level thing, not an OS-level thing.
|
| If you write a NIF in C, it can do whatever it wants within
| that process.
|
| The BEAM VM itself will share references to large binaries.
| Erlang, at the language level, declares those to be
| immutable so "sharing" doesn't matter. As an optimization,
| the VM could choose to convert some of your immutable
| operations into mutation-based ones, but if it does that,
| it's responsible for making the correct copies so you can't
| witness this at the Erlang level.
|
| The Erlang spawn function spawns a new _Erlang_ process. It
| _does not_ spawn a new OS process. While BEAM may run in
| multiple OS processes per dragonwriter, the spawn function
| certainly isn't what starts them. The VM would.
|
| So, you can not spawn a new Erlang process, then set its
| UID, priority, current directory, and all that other state
| that _OS processes_ have, because an Erlang process is not
| an OS process. If the user wants to fork for some reason
| beyond simply running a program simply, because they want
| to change the OS process attributes for some reason, Erlang
| is not a viable choice.
|
| Erlang is not unique in that sense. It runs as a normal OS
| process. What abilities it has are implemented within that
| sandbox, no different than the JVM or a browser hosting a
| Javascript VM.
| masklinn wrote:
| User-space scheduler and processes. The isolation is a
| property of the VM and langage primitives (you just don't get
| any way to share stuff, kinda).
|
| Also Erlang is known for cheap and plentiful processes, not for
| being fast. It's fast enough but it's no speed demon.
| tiffanyh wrote:
| My reference to "fast" was in the context of creating a new
| process due to the OP post talking about how long fork/etc
| can take. Not in reference to executing code itself.
| masklinn wrote:
| In that sense it's fast in the same way e.g.
| coroutines(/goroutines) are fast: it's just the erlang
| scheduler performing some allocation (possibly from a
| freelist) and initialisation. Avoiding the kernel having to
| set things up and the related context switches makes for
| much better performance.
| mark_undoio wrote:
| The code I currently work on actually has a use of `clone` with
| the `CLONE_VM` flag to create something that isn't a thread.
| Since `CLONE_VM` will share the entire address space with the
| child (you know, like a thread does!) a very reasonable response
| would be "WAT?!"
|
| What led us here was a need to create an additional thread within
| an existing process's address space but in a way that was non-
| disruptive - to the rest of the process it shouldn't really
| appear to exist.
|
| We achieved this by using `CLONE_VM` (and a handful of other
| flags) to give the new "thread-like" entity access to the whole
| address space. _But_ , we omitted `CLONE_THREAD`, as if we were
| making a new process. The new "thread-like" entity would not
| technically be part of the same thread group but would live in
| the same address space.
|
| We also used two chained `clone()` calls (with the intermediate
| exiting, like when you daemonise) so that the new "thread-like"
| wouldn't be a child of the original process.
|
| All this existed before I joined, it's just really cool that it
| works. I've never encountered such a non-standard use of clone
| before but it was the right tool for this particular job!
| scottlamb wrote:
| > What led us here was a need to create an additional thread
| within an existing process's address space but in a way that
| was non-disruptive - to the rest of the process it shouldn't
| really appear to exist.
|
| I'm curious to hear more. What's its purpose?
| mark_undoio wrote:
| > I'm curious to hear more. What's its purpose?
|
| Sure! I'll try to illustrate the general idea, though I'm
| taking liberties with a few of the details to keen things
| simple(r).
|
| Our software (see https://undo.io) does record and replay
| (including the full set of Time Travel Debug stuff -
| executing backwards, etc) of Linux processes. Conceptually
| that's similar to `rr` (see https://rr-project.org/) - the
| differences probably aren't relevant here.
|
| We're using `ptrace` as part of monitoring process behaviour
| (we also have in-process instrumentation). This reflects our
| origins in building a debugger - but it's also because
| `ptrace` is just very powerful for monitoring a process /
| thread. It is a very challenging API to work with, though.
|
| One feature / quirk of `ptrace` is that you can't really do
| anything useful with a traced thread that's currently running
| - including peeking its memory. So if a program we're
| recording is just getting along with its day we can't just
| examine it whenever we want.
|
| First choice is just to avoid messing with the process but
| sometimes we really do need to interact with it. We _could_
| just interrupt a thread, use `ptrace` to examine it, then
| start it up again. But there 's a problem - in the corners of
| Linux kernel behaviour there's a risk that this will have a
| program-visible side effect. Specifically, you might cause a
| syscall restart not to happen.
|
| So when we're recording a real process we need something
| that:
|
| * acts like a thread in the process - so we can peek / poke its
| memory, etc via ptrace
| * is always in a known, quiescent state - so that we can use
| ptrace on it whenever we want
| * doesn't impact the behaviour of the process it's "in" - so we
| don't affect the process we're trying to record
| * doesn't cause SIGCHLD to be sent to the process we're recording
| when it does stuff - so we don't affect the process we're trying
| to record
|
| Our solution is double clone + magic flags. There are other
| points in the solution space (manage without, handle the
| syscall restarting problem, ...) but this seems to be a
| pretty good tradeoff.
| kccqzy wrote:
| Maybe some kind of snapshotting for an in-memory database?
| mattgreenrocks wrote:
| > Long ago, I, like many Unix fans, thought that fork(2) and the
| fork-exec process spawning model were the greatest thing, and the
| Windows sucked for only having _exec*() and _spawn*(), the last
| being a Windows-ism.
|
| I appreciate this quite a bit. Vocal Unix proponents tend to
| believe that anything Unix does is automatically better than
| Windows, sometimes without even knowing what the Windows analogue
| is. Programming in both is necessary to have an informed opinion
| on this subject.
|
| The one thing I miss most on Unix: the unified model of HANDLEs
| that enables you to WaitOnMultipleObjects() with almost any
| system primitive you could want, such as an event with a socket
| (blocking I/O + a shutdown notification) in one call. On Unix, a
| flavor of select() tends to be the base primitive for waiting on
| things to happen, which means you end up writing adapter code for
| file descriptors to other resources, or need something like
| eventfd.
|
| Things I don't miss from Windows at all: wchar_t everywhere. :)
| ogazitt wrote:
| Having written server software that had to work in both places,
| I always loved the simplicity of fork(2) / vfork(2) relative to
| Windows CreateProcess. Threading models in Win32 were always a
| pain. Which only got worse with COM (remember apartment
| threading? rental threading? ugh)
|
| Back in the 90's, processes had smaller memory footprint, and
| every UNIX my software supported had COW optimizations. So the
| difference between fork(2) and vfork(2) was not very large in
| practice. Often, the TCP handshake behind the accept(2) call
| was of more concern than how long it would take fork(2) to
| complete. Of course, bandwidth has increased by a factor of
| 1000 since then, so considerations have changed.
| AnIdiotOnTheNet wrote:
| UCS-2 seemed like a good(ish) idea at the time when Unicode's
| scope didn't include every possible human concept represented
| in icon form and UTF-8 hadn't yet been spec'd on a napkin by
| the first adults to bother thinking about the problem.
| cryptonector wrote:
| Quite true. One of the things Windows got very wrong was
| UCS-2 and, later, UTF-16. So did JavaScript.
| xiaq wrote:
| Even in 1989, it should have been clear that 16 bits were not
| enough to encode all of the Chinese characters, let alone
| encoding all the human scripts. Unicode today encodes 92,865
| Chinese characters
| (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs).
|
| The only reason anybody would think of UCS-2 was a good idea
| was that they did not consult a single Chinese or Japanese
| scholar on Chinese characters.
| monocasa wrote:
| These decisions are all older than Windows and weren't made in
| reaction to it. They're a reaction to the awful mainframe ways
| of spawning processes, like using JCL.
|
| We've sort of come back to that with kubernetes yaml files to
| describe how to launch an executable in a specific env and all
| of the resources it needs. The lineage can be traced explicitly:
| the Borg paper references mainframes and knowingly calls the
| language that would be replaced by kubernetes's YAML files
| 'BCL' instead of z/OS's JCL.
| cryptonector wrote:
| WIN32 got a few things very right:
|     - SIDs
|     - access tokens (like struct cred / cred_t in Unix kernels,
|       but exposed as a first-class type to user-land)
|     - security descriptors (like owner + group mode_t + ACL in
|       Unix land, but as a first-class type)
|     - HANDLEs, as you say
|     - HANDLEs for processes
|
| Many other things, Windows got wrong. But the above are far
| superior to what Unix has to offer.
| al2o3cr wrote:
| I'd be curious how many of those derive from NT's VMS roots -
| for instance:
|
| http://lxmi.mi.infn.it/~calcolo/OpenVMS/ssb71/6346/6346p004..
| ..
| cryptonector wrote:
| Most of them, as far as I know.
| marwis wrote:
| Is there any difference between Windows HANDLE and Linux file
| descriptor? Aren't they both just indexes into a table of
| objects managed by the kernel?
| cryptonector wrote:
| HANDLE values are opaque, and generally not reused. Imagine
| an implementation like this:
|     typedef struct HANDLE_s {
|         uintptr_t ptr;
|         uintptr_t verifier;
|     } HANDLE;
|
| where `ptr` might be an index into a table (much like a file
| descriptor) or maybe a pointer in kernel-land (dangerous
| sounding!) and `verifier` is some sort of value that can be
| used by the kernel to validate the `ptr` before
| "dereferencing" it.
|
| On Unix the semantics of file descriptors are dangerous.
| EBADF can be a symptom of a very dangerous bug where some
| thread closed a still-in-use FD then a open gets the same FD
| and now maybe you get file corruption. This particular type
| of bug doesn't happen with HANDLEs.
| marwis wrote:
| Gotcha. But it looks like file descriptors could be made
| almost as safe by avoiding index reuse. Is there any reason
| why it is not done? Hashtable too costly vs array?
| cryptonector wrote:
| File descriptor numbers have to be "small" -- that's part
| of their semantics. To ensure this, the kernel is
| supposed to always allocate the smallest available FD
| number. A lot of code assumes that FDs are "small" like
| this. Threaded code can't assume that "no FD numbers less
| than some number are available", but all code on Unix can
| assume that generally the used FD number space is dense.
| Even single-threaded code can't assume that "no FD
| numbers less than some number are available" because of
| libraries, but still, the assumption that the used FD
| number space is dense does get made. This basically
| forces the reuse of FDs to be a thing that happens.
|
| For example, the traditional implementations of FD_SET()
| and related macros for select(3) assume that FDs are
| <1024.
|
| Mind you, aside from select(), not much might break from
| doing away with the FDs-are-small constraint. Still, even
| so, they'd better be 64-bit ints if you want to be safe.
|
| HANDLEs are just better.
| lisper wrote:
| There is something about fork which I have never understood.
| Maybe someone here can explain it to me.
|
| Why would anyone ever want fork as a primitive? It seems to me
| that what you really want is a combination of fork and exec
| because 99% of the time you immediately call exec after fork (at
| least that's what I do 99% of the time when I use fork). If you
| _know_ that you 're going to call exec immediately after fork,
| then all the issues of dealing with the (potentially large)
| address space of the parent just evaporate because the child
| process is just going to immediately discard it all.
|
| So why is there not a fork-exec combo? And why has it not
| replaced fork for 99% of use cases?
|
| And as long as I'm asking stupid questions, why would anyone ever
| use vfork? If the child shares the parent's address space and
| uses the same stack as the parent, and the parent has to block,
| how is that different from a function call (other than being more
| expensive)?
|
| None of this makes sense to me.
| [deleted]
| kazinator wrote:
| > So why is there not a fork-exec combo?
|
| posix_spawn
|
| > Why would anyone ever want fork as a primitive?
|
| With fork you can very easily write a server like mini_httpd:
|
| https://acme.com/software/mini_httpd/
|
| Or, in Unix shells:
|         # function1 and function2 are shell functions
|         $ function1 | grep foo | function2
|
| here, the shell must fork a process (without exec) to run one
| of these functions.
|
| For instance function1 might run in a fork, the grep is a fork
| and exec of course, and function2 could be in the shell's
| primary process.
|
| In the POSIX shell language, fork is so tightly integrated that
| you can access it just by parenthesizing commands:
| $ (cd /path/to/whatever; command) && other command
|
| Everything in the parentheses is a sub-process; the effect of
| the cd, and any variable assignments, are lost (whether
| exported to the environment or not).
|
| In Lisp terms, fork makes _everything_ dynamically scoped, and
| rebinds it in the child's context: except for inherited
| resources like signal handlers and file descriptors.
|
| Imagine every memory location having *earmuffs* like a defvar,
| and being bound to its current value by a giant _let_, and
| imagine that being blindingly efficient to do thanks to VM
| hardware.
| dzaima wrote:
| You might want to move around some file descriptors if you
| don't want the child process to inherit your
| stdin/stdout/stderr (e.g. if you want to read the stdout of the
| process you launched, or give it some stdin).
|
| And there does exist such a fork-exec combo - posix_spawn. It
| allows adding some "commands" of what file descriptor
| operations to do between the fork & exec before they're ever
| done, among some other things. But, as the article mentions,
| using it is annoying - you have to invoke various
| posix_spawn_file_actions_* functions, instead of the regular C
| functions you'd use.
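Python's os.posix_spawn exposes the same file-actions list as the C API, which makes its shape easy to see. A sketch of "read the stdout of the process you launched" (assumes /bin/echo exists, which it does on ordinary Linux systems):

```python
import os

r, w = os.pipe()

# Spawn /bin/echo with its stdout wired into our pipe. The tuples play
# the role of the posix_spawn_file_actions_* calls in C: they are
# replayed in the child between the implicit fork and the exec.
pid = os.posix_spawn(
    "/bin/echo", ["echo", "hello"], os.environ,
    file_actions=[
        (os.POSIX_SPAWN_DUP2, w, 1),   # child: dup2(w, 1) before exec
        (os.POSIX_SPAWN_CLOSE, r),     # child: close unused read end
    ],
)

os.close(w)                            # parent keeps only the read end
output = os.read(r, 1024)
os.close(r)
os.waitpid(pid, 0)
print(output)
```

The annoyance the comment mentions is visible here: the dup2 and close are described as data to be replayed later, rather than written as ordinary calls between fork() and exec().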
| outworlder wrote:
| > 99% of the time you immediately call exec after fork
|
| What about forking servers? listen() and then immediately
| fork() to handle the inbound connection? Those don't need exec.
|
| Also daemons. It's a common pattern to ditch permissions and
| then fork(), as per the old "Linux Daemon Writing HOWTO".
| lisper wrote:
| Do people really do that? It sounds like a huge DoS
| vulnerability to me.
| emmelaich wrote:
| Well, fork() is simple. No args, simple semantics.
|
| Flexibility; you can set up pipes.
|
| > _why is there not a fork-exec combo_
|
| There is, the spawn calls mentioned.
| jcranmer wrote:
| `fork` is a classic example, as others have mentioned, of
| something that was implemented because it was [at the time]
| easy rather than because it was a good design. In the decades
| since, we've found there are issues that are caused by the
| semantics of fork, especially if the most common subsequent
| system call is `exec`.
|
| If you're designing an OS from scratch, support for `fork` and
| `exec` as separate system calls is not what you want. Instead,
| you'd be likely to describe something in terms of a process
| creation system call, which will have eleventy billion
| parameters governing all of the attributes of the spawned
| process.
|
| POSIX specifies a fork+exec combo called posix_spawn. This is
| actually used a fair amount, but the reason it isn't used more
| is because it doesn't support all of the eleventy-billion
| parameters governing all of the attributes of the spawned
| process. Instead, these parameters are usually set by calling
| system calls that change these parameters between fork and
| exec. These system calls might, for example, change the root
| directory of a process or attach a debugger. Neither of these
| are supported by posix_spawn, which only allows the common
| operations of changing the file descriptors or resetting the
| signal mask in the list of actions to do.
|
| And this suggests why you might want vfork: vfork allows you to
| write something that looks like posix_spawn: you get to fork,
| do your new-process-attribute-setting-flags, and then exec to
| the new process image, all while being able to report errors in
| the same memory space.
| boring_twenties wrote:
| Splitting fork and exec allows you to do stuff before calling
| exec, for example redirecting file descriptors (like
| stdin/out/err), creating a pipe, modifying the child's
| environment, and so on.
| loeg wrote:
| (This is particularly useful for shells.)
| 10000truths wrote:
| Because there are many, many use cases where you _don't_ want
| to call exec() immediately after fork().
|
| Want to constrain memory usage or CPU time of an arbitrary
| child process? You have to call setrlimit() before exec().
| Privilege separation? Call setuid() before exec(). Sandbox an
| untrusted child process in some way? Call seccomp() (or your OS
| equivalent) before exec(). And so on and so forth. Any time you
| want to change what OS resources the child process will have
| access to, you'll need to do some set-up work before invoking
| exec().
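The restrict-then-exec pattern the comment describes can be sketched in a few lines (Python; the specific limits and the use of `true` as the child program are illustrative choices, not from the comment):

```python
import os
import resource

pid = os.fork()
if pid == 0:
    # Child: clamp resources, then replace the process image. Neither
    # setrlimit nor setuid/seccomp fits in posix_spawn's file-actions
    # list, which is exactly why the fork/exec gap is useful.
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))       # 2s of CPU
    resource.setrlimit(resource.RLIMIT_NOFILE, (64, 64))  # 64 open fds
    os.execvp("true", ["true"])
    os._exit(127)                                         # exec failed

_, status = os.waitpid(pid, 0)
code = os.WEXITSTATUS(status)
print("child exit code:", code)
```

The child inherits the tightened limits across exec(), so the new program runs sandboxed from its first instruction.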
| wongarsu wrote:
| Windows solves this by adding a bunch of optional parameters
| to CreateProcess, as well as having two more variants
| (CreateProcessAsUser and CreateProcessWithLogon). Some of the
| arguments are complicated enough that they have helper
| functions to construct them.
|
| I like the more composable fork()->modify->exec() approach of
| unix, but I wouldn't call either of them really elegant.
| ChrisSD wrote:
| A third way is to grant the parent process access to the
| child such that they can use the child process handle to
| "remotely" set restrictions, write memory, start a thread,
| etc.
| notriddle wrote:
| That's one option, yes.
|
| The one I've favored while reading these arguments has been
| the "suspended process" model. The primitives are CREATE(),
| which takes an executable as a parameter and returns the
| PID of a paused process, and START(), which allows the
| process to actually run.
|
| Unix already has the concept of a paused process, after
| all.
|
| This model also requires all the process-mutation syscalls,
| like setrlimit(), to accept a PID as a parameter, but
| prlimit() wound up being created anyway, because the
| ability to mutate an already-running process is useful.
| univspl wrote:
| To me this feels like a call for more powerful language
| primitives. i.e. a way to specify some action to take to "set
| up" the child process that's more explicit and readable than
| one special call behaving in a particularly odd way. I'm imagining
| closures with some kind of Rust-like move semantics, but not
| entirely sure.
|
| (if we're speaking in terms of greenfield implementation of
| OS features)
| lisper wrote:
| Yeah, this. Why not mkprocess/exec instead of fork/exec?
| retbull wrote:
| Builder patterns for primitives? I think that seems super
| cool but then aren't you just building a new language?
| LeifCarrotson wrote:
| But my child processes are not arbitrary or untrusted,
| they're hard-coded and written by me!
|
| I'm not writing a shell, I'm writing an application!
| xioxox wrote:
| I use fork a lot in my Python science programs. It's really
| great - you can stick it in a loop and get immediate
| parallelism. It's much better than multiprocessing, etc, as you
| keep the state from just before the fork happened, so you can
| share huge data structures between the processes, without
| having to process the same data again or duplicate them. I've
| even written a module for processing things in forked
| processes: https://pypi.org/project/forkqueue/
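The fork-in-a-loop pattern (the pattern, not the forkqueue module itself) can be sketched like this — children inherit the already-built data via copy-on-write, so nothing is pickled or re-read:

```python
import os

data = list(range(100_000))   # built once; children see it via fork

pids = []
for i in range(4):
    pid = os.fork()
    if pid == 0:
        # Child: works on its own slice of `data` without copying it;
        # copy-on-write shares the parent's pages until written.
        total = sum(data[i::4])
        os._exit(0 if total >= 0 else 1)
    pids.append(pid)

codes = [os.WEXITSTATUS(os.waitpid(p, 0)[1]) for p in pids]
print(codes)
```

A real version would send results back over a pipe rather than an exit code; this sketch only shows why the shared pre-fork state makes the approach attractive.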
| khaledh wrote:
| From "Operating Systems: Three Easy Pieces" chapter on "Process
| API" (section 5.4 "Why? Motivating The API") [1]:
|
|         ... the separation of fork() and exec() is essential in
|         building a UNIX shell, because it lets the shell run
|         code after the call to fork() but before the call to
|         exec(); this code can alter the environment of the
|         about-to-be-run program, and thus enables a variety of
|         interesting features to be readily built. ... The
|         separation of fork() and exec() allows the shell to do
|         a whole bunch of useful things rather easily. For
|         example:
|
|             prompt> wc p3.c > newfile.txt
|
|         In the example above, the output of the program wc is
|         redirected into the output file newfile.txt (the
|         greater-than sign is how said redirection is
|         indicated). The way the shell accomplishes this task is
|         quite simple: when the child is created, before calling
|         exec(), the shell closes standard output and opens the
|         file newfile.txt. By doing so, any output from the
|         soon-to-be-running program wc is sent to the file
|         instead of the screen.
|
| [1] https://pages.cs.wisc.edu/~remzi/OSTEP/cpu-api.pdf
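The redirection trick that quote describes can be sketched directly (Python standing in for the shell's C, and `echo` standing in for wc; the filename is just the one from the quoted example):

```python
import os

pid = os.fork()
if pid == 0:
    # Child: the "shell" repoints fd 1 (stdout) at the file before
    # exec, as it would for `wc p3.c > newfile.txt`. The exec'd
    # program never knows its output was redirected.
    fd = os.open("newfile.txt",
                 os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.dup2(fd, 1)
    os.close(fd)
    os.execvp("echo", ["echo", "redirected"])
    os._exit(127)              # only reached if exec failed

os.waitpid(pid, 0)
contents = open("newfile.txt").read()
print(contents, end="")
```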
| Animats wrote:
| Because "fork" was easy to implement in UNIX on the PDP-11.
|
| The original implementation was for a machine with very limited
| memory. So fork worked by swapping out the process. But then,
| instead of releasing the in-memory copy, the kernel duplicated
| the process table entry. So there were now two copies of the
| process, one in memory and one swapped out. Both were runnable,
| even if there wasn't enough memory for both to fit at once.
| Both executed onward from there.
|
| And that's why "fork" exists. It was a cram job to fit in a
| machine with a small address space.
| garaetjjte wrote:
| >So why is there not a fork-exec combo?
|
| There is, posix_spawn.
| shadowofneptune wrote:
| Dennis Ritchie addresses this in a history of early Unix:
| https://www.bell-labs.com/usr/dmr/www/hist.html
|
| "Process control in its modern form was designed and
| implemented within a couple of days. It is astonishing how
| easily it fitted into the existing system; at the same time it
| is easy to see how some of the slightly unusual features of the
| design are present precisely because they represented small,
| easily-coded changes to what existed. A good example is the
| separation of the fork and exec functions. The most common
| model for the creation of new processes involves specifying a
| program for the process to execute; in Unix, a forked process
| continues to run the same program as its parent until it
| performs an explicit exec. The separation of the functions is
| certainly not unique to Unix, and in fact it was present in the
| Berkeley time-sharing system [2], which was well-known to
| Thompson. Still, it seems reasonable to suppose that it exists
| in Unix mainly because of the ease with which fork could be
| implemented without changing much else."
| lisper wrote:
| OK, but why has it not been replaced with something better in
| the intervening 50 years? There have been a lot of
| improvements to unix since 1970. Why not this?
| cryptonector wrote:
| It was! vfork() was added to BSD because fork() sucks.
|
| But then someone very opinionated wrote "vfork() Considered
| Dangerous" and too many people accepted that incorrect
| conclusion.
| kelnos wrote:
| It was; ~20 years ago we got posix_spawn(3).
| pm215 wrote:
| There is exactly a fork-exec combo like that: it's called
| posix_spawn(): https://man7.org/linux/man-
| pages/man3/posix_spawn.3.html
|
| I think the reason for fork() and exec() as primitives goes
| back to the early days of Unix design philosophy. Unix tends to
| favour "easy and simple for the OS to implement" rather than
| "convenient for user processes to use". (For another example of
| that, see the mess around EINTR.) fork() in early unix was not
| a lot of code, and splitting into fork/exec means two simple
| syscalls rather than needing a lot of extra fiddly parameters
| to set up things like file descriptors for the child.
|
| There's a bit on this in "The Evolution of the UNIX Time-
| Sharing System" at https://www.bell-
| labs.com/usr/dmr/www/hist.html -- "The separation of the
| functions is certainly not unique to Unix, and in fact it was
| present in the Berkeley time-sharing system [2], which was
| well-known to Thompson. Still, it seems reasonable to suppose
| that it exists in Unix mainly because of the ease with which
| fork could be implemented without changing much else." It says
| the initial fork syscall only needed 27 lines of assembly
| code...
|
| (Edit: I see while I was typing that other commenters also
| noted both the existence of posix_spawn and that quote...)
| lisper wrote:
| > Unix tends to favour "easy and simple for the OS to
| implement"
|
| Well, yeah, but the whole problem here, it seems to me, is
| that fork is _not_ simple to implement precisely _because_ it
| combines the creation of the kernel data structures required
| for a process with the actual initiation of the process. Why
| not mkprocess, which creates a suspended process that has to
| be started with a separate call to exec? That way you never
| have to worry about all the hairy issues that arise from
| having to copy the parent's process memory state.
| pm215 wrote:
| It was simple specifically for the people writing it at the
| time. We know this, because they've helpfully told us so
| :-) It might or might not have been harder than a different
| approach for some other programmers writing some other OS
| running on different hardware, but the accidents of history
| mean we got the APIs designed by Thompson, Ritchie, et al,
| and so we get what they personally found easy for their
| PDP7/PDP11 OS...
| cryptonector wrote:
| fork() was trivial to implement back then. It became non-
| trivial later, as RAM sizes and resident set sizes grew.
| slaymaker1907 wrote:
| I think it's actually a pretty useful primitive for doing
| multiprocessing. Unlike threading, you have a completely
| separate memory space both for avoiding data races and
| performance (memory allocators still aren't perfect and weird
| stuff can happen with cache lines). Unlike exec after fork or
| anything equivalent, you still get to share things like file
| descriptors and read only memory for convenience.
| cryptonector wrote:
| > Why would anyone ever want fork as a primitive?
|
| > So why is there not a fork-exec combo?
|
| There are so many variations to what you can do with fork+exec
| that designing a suitable "fork-exec combo" API is really
| difficult, so any attempts tend to yield a fairly limited API
| or a very difficult-to-use API, and that ends up being very
| limiting to its consumers.
|
| On the flip side, fork()+exec() made early Unix development
| very easy by... avoiding the need to design and implement a
| complex spawn API in kernel-land.
|
| Nowadays there are spawn APIs. On Unix that would be
| posix_spawn().
|
| > And as long as I'm asking stupid questions, why would anyone
| ever use vfork? If the child shares the parent's address space
| and uses the same stack as the parent, and the parent has to
| block, how is that different from a function call (other than
| being more expensive)?
|
| (Not a stupid question.)
|
| You'd use vfork() only to finish setting up the child side
| before it execs, and the reason you'd use vfork() instead of
| fork() is that vfork()'s semantics permit a very high
| performance implementation while fork()'s semantics necessarily
| preclude a high performance implementation altogether.
| indrora wrote:
| > why would anyone ever want fork as a primitive
|
| Long ago in the far away land of UNIX, fork was a primitive
| because the primary use of fork was to do more work on the
| system. You were likely one of three or four people vying for
| CPU time at any given moment, and it wasn't uncommon to see
| loads of 11 on a typical university UNIX system.
|
| > so why is there not a fork-exec combo
|
| you're looking for system(3). Turns out, most people
| waitpid(fork()). Windows explicitly handles this situation with
| CreateProcess[0] which does a way better job of it than POSIX
| does (which, IMO, is the standard for most of the win32 API,
| but that's a whole can of worms I won't get into).
|
| > why would anyone ever use vfork?
|
| Small shells, tools that need the scheduling weight of "another
| process" but not for long, etc. See also, waitpid(fork()).
|
| When you have something with MASSIVE page tables, you don't
| want to spend the time copying the whole thing over. There's a
| huge overhead to that.
|
| [0] https://docs.microsoft.com/en-
| us/windows/win32/api/processth...
| anderskaseorg wrote:
| system(3) is not a good alternative because it indirects
| through the shell, which adds the overhead of launching the
| shell as well as the danger of misinterpreting shell
| metacharacters in the command if you aren't meticulous about
| escaping them correctly.
| kelnos wrote:
| > _Why would anyone ever want fork as a primitive? It seems to
| me that what you really want is a combination of fork and exec
| because 99% of the time you immediately call exec after fork
| (at least that's what I do 99% of the time when I use fork)._
|
| If you eliminate fork, then what do you do for those 1% of
| cases where you actually _do_ need it? I agree that it's
| uncommon, but I have written code before that calls fork() but
| then does not exec().
|
| > _So why is there not a fork-exec combo?_
|
| There is; it's called posix_spawn(3).
|
| > _And why has it not replaced fork for 99% of use cases?_
|
| Even though it's been around for about 20 years, it's still
| newer than fork+exec, so I assume a) many people just don't
| know about it, or b) people still want to go for maximum
| compatibility with old systems that may not have it, even if
| that's a little silly.
| duskwuff wrote:
| > Why would anyone ever want fork as a primitive?
|
| fork() without exec() can make sense in the context of a
| process-per-connection application server (like SSH). I've also
| used it quite effectively as a threading alternative in some
| scripting languages.
|
| > So why is there not a fork-exec combo?
|
| There is; it's called posix_spawn(). Like a lot of POSIX APIs,
| it's kind of overcomplicated, but it does solve a lot of the
| problems with fork/exec.
|
| > And as long as I'm asking stupid questions, why would anyone
| ever use vfork?
|
| For processes with a very large address space, fork() can be an
| expensive operation. vfork() avoids that, so long as you can
| guarantee that it'll immediately be followed by an exec().
| kazinator wrote:
| fork with copy-on-write semantics avoids copying the whole
| address space. It does have to copy some data structures that
| manage virtual memory and maybe the first level of the paging
| structure(page directory or whatever).
| cryptonector wrote:
| copy-on-write == slow when called from threaded processes
| with large resident set sizes.
| bogomipz wrote:
| Can you elaborate on this? I understand why copying a
| large address space might be slow but how or why does the
| number of threads in a process affects this? Is it
| scheduling?
| cryptonector wrote:
| Copy-on-write means twiddling with the MMU, and TLB
| updates across cores ("TLB shootdowns") can be very
| expensive. If the process is not threaded, then the OS
| could make sure to schedule the child and parent on the
| same CPU to avoid needing TLB shootdowns, but if it's
| threaded, forget about it.
| evmar wrote:
| In Ninja, which needs to spawn a lot of subprocesses but is
| otherwise not especially large in memory and which doesn't use
| threads, we moved from fork to posix_spawn (which is the "I want
| fork+exec immediately, please do the smartest thing you can"
| wrapper) because it performed better on OS X and Solaris:
|
| https://github.com/ninja-build/ninja/commit/89587196705f54af...
| ridiculous_fish wrote:
| posix_spawn also outperforms fork on Linux under more recent
| glibc and musl, which can use vfork under the hood.
| https://twitter.com/ridiculous_fish/status/12328893907639336...
| ismaildonmez wrote:
| Microsoft Research has a paper about the very same issue (2019):
| https://www.microsoft.com/en-us/research/publication/a-fork-...
| georgia_peach wrote:
| That paper smacks of a Chesterton Fence. They haven't come up
| with a tested replacement for many of the use cases, i.e.:
|
|         These designs are not yet general enough to cover all
|         the use-cases outlined above, but perhaps can serve as
|         a starting point...
|
| yet bullet #1 in the next paragraph is
|
|         Deprecate Fork
|
| I think this is a case of security guys being upset about fork
| gumming-up their experiments. I don't really care about their
| experiments. The security regime for the past 20 years may have
| bought us a little more security against eastern bloc hackers,
| but it hasn't done squat to protect us from Apple, Google, &
| Microsoft! I have never had a virus de-rail my computing life
| as much as the automatic Windows 10 upgrade. Robert Morris got
| 400 hours community service for a relatively benign worm. If
| that's the penalty scale, Redmond should get actual time in the
| slammer for Cortana, forced Windows Update, and adding
| telemetry to Calculator.
| cryptonector wrote:
| It's a very good paper, yeah. I will link it from the gist.
___________________________________________________________________
(page generated 2022-02-28 23:00 UTC)