[HN Gopher] Why is spawning a new process in Node so slow?
       ___________________________________________________________________
        
       Why is spawning a new process in Node so slow?
        
       Author : maxmcd
       Score  : 50 points
       Date   : 2024-07-26 23:07 UTC (3 days ago)
        
 (HTM) web link (blog.val.town)
 (TXT) w3m dump (blog.val.town)
        
       | efilife wrote:
       | Nice article, definitely wasn't expecting some of its results.
       | Was good to see Node finally beat Deno near the end of the
       | article.
       | 
       | As a heads up, the author confuses _its_ and _it's_. Makes the
       | article look unprofessional. https://youryoure.com/?its here's
       | how to differentiate
        
         | maxmcd wrote:
         | Thanks for noting it. I do indeed mix those two up all the
         | time. Pushed a fix!
        
           | codr7 wrote:
           | Don't we all :)
           | 
           | It's easy to forget what a big deal being semi fluent in
           | multiple languages is.
        
       | jart wrote:
       | It's important to specify which OS is being used because spawning
       | goes a lot faster on Linux which has vfork(). If 40 spawns per
       | second is the Linux speed, then I don't even want to know what
       | that looks like on Darwin and OpenBSD. With Cosmopolitan Libc, 20
       | spawns per second is about the speed I get when simulating fork()
       | on Windows. What makes fork() slow for large programs like Node
       | is that fork() needs to lock and hold every single mutex that
       | exists in a process while it happens. See the IEEE POSIX.1 notes
       | surrounding pthread_atfork(). So yeah, that means everything is
       | de facto blocked until the spawn completes, regardless of the
       | thread that's doing it. Using the separate process to spawn is a
       | smart idea. Especially if you can talk to it via a process shared
       | condition variable.
        
         | convolvatron wrote:
         | vfork() was introduced in 3.0 BSD
        
         | amelius wrote:
         | > What makes fork() slow for large programs like Node is that
         | fork() needs to lock and hold every single mutex that exists in
         | a process while it happens
         | 
         | Are there any circumstances where you'd want to fork a multi-
         | threaded process?
        
           | toast0 wrote:
           | forking is useful if you need to call other code when you
           | can't or don't want to link that code into your own process.
           | Pretty useful for things like ffmpeg where you wouldn't want
           | that in your process space (and you may want to put more than
           | just a single fork/exec between your code and theirs in that
           | case), but also handy for "simple" os integrations where it's
           | easier to run an existing utility than find a library or
           | interface more directly yourself.
           | 
           | Downsides in forking from a large multi-threaded process make
           | it pretty common to add something in early init that forks a
           | process (or several) that remain small and can be commanded
           | to fork at runtime. On modern systems, it's a very small cost
           | if you don't actually fork, and the small added cost of
           | intermediating fork/exec pays off the first time you don't
           | have to wait for hundreds or thousands or millions of fds to
           | be closed in the child process during the process. Of course,
           | if your parent process remains small with few opened
           | filehandles, maybe it is a measurable negative, but it's
           | still likely to be very small.
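
The "small helper forked early in init" pattern described above can be
sketched in Python. This is only an illustration under assumptions:
the SpawnHelper class, its JSON-lines protocol, and the socketpair
channel are hypothetical choices, not something from the article or
from Node. The helper is forked before the parent grows, and later
spawn requests are sent to it instead of forking the large process.

```python
import json
import os
import socket
import subprocess
import sys


def _serve(sock):
    """Helper side: read one JSON argv per line, run it, reply with JSON."""
    reader, writer = sock.makefile("r"), sock.makefile("w")
    for line in iter(reader.readline, ""):  # "" means the parent hung up
        done = subprocess.run(json.loads(line), capture_output=True, text=True)
        reply = {"code": done.returncode, "stdout": done.stdout}
        writer.write(json.dumps(reply) + "\n")
        writer.flush()


class SpawnHelper:
    """Hypothetical wrapper: fork a small helper once, spawn through it later."""

    def __init__(self):
        self._sock, helper_end = socket.socketpair()
        self._pid = os.fork()
        if self._pid == 0:
            # Child: stays small and only services spawn requests.
            self._sock.close()
            try:
                _serve(helper_end)
            finally:
                os._exit(0)  # never fall back into the parent's code
        helper_end.close()
        self._reader = self._sock.makefile("r")

    def run(self, argv):
        """Ask the helper to run argv; returns {"code": ..., "stdout": ...}."""
        self._sock.sendall((json.dumps(argv) + "\n").encode())
        return json.loads(self._reader.readline())

    def close(self):
        """Hang up so the helper's read loop ends, then reap it."""
        self._sock.shutdown(socket.SHUT_RDWR)
        self._reader.close()
        self._sock.close()
        os.waitpid(self._pid, 0)


if __name__ == "__main__":
    helper = SpawnHelper()  # create this before the parent gets large
    print(helper.run([sys.executable, "-c", "print('hello from child')"]))
    helper.close()
```

The request/reply round trip is the "small added cost of
intermediating" mentioned above; a real implementation would also need
to handle stderr, errors in the helper, and concurrent requests.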
        
         | maxmcd wrote:
         | Ah, I will add it to the post, an accidental omission. This was
         | all on `Linux 6.8.0-35-generic`.
        
         | maxmcd wrote:
         | Do you have a sense for why things were so much faster in Go
         | and Rust if they are both using a single process? I have also
         | been experimenting with napi-rs which uses
         | https://nodejs.org/api/n-api.html and allows you to run Rust in
         | threads alongside the Node process. Performance with that
         | approach has been very good, so I have wondered about the
         | within-process limitations.
        
         | saurik wrote:
         | If you want to spawn a new process on Darwin you really want to
         | reach for posix_spawn, not any variant of fork. Using this API
         | should also do whatever is optimal on Linux, which still
         | doesn't seem to be merely calling vfork, as you instead can use
         | clone--with CLONE_VFORK, sure--to limit the work to CLONE_VM
         | (per the man page for glibc's posix_spawn). (Of course, this
         | API would also be the perfect fit for use on Windows, but I
         | don't think MinGW or Cygwin implement it, for reasons which are
         | beyond me, given how horrible simulating fork is in that OS;
         | but, I believe gnulib provides it, hopefully using
         | CreateProcess on Win32, for all of these platforms that are
         | missing out... and, they really are "missing out", as I feel
         | most developers--including me--end up incorrectly dealing with
         | the issues surrounding multi-threaded fork.)
        
           | o11c wrote:
           | > If you want to spawn a new process on Darwin you really
           | want to reach for posix_spawn
           | 
           | Assuming you can avoid the bugs and version-sensitivities.
           | There are a _lot_ of them on various platforms, and my gut
           | says they're more prominent on Darwin since alternatives are
           | unrecommended.
           | 
           | The dup2/cloexec mess is the one that sticks in my mind but
           | it's far from the only one.
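
For readers who want to try posix_spawn without writing C, Python
exposes it as os.posix_spawn (Unix only). The sketch below is an
illustration, not what Node does: the spawn_and_capture helper is a
hypothetical name, and it assumes the pipe's write end is not already
fd 1. It uses file actions to wire the child's stdout to a pipe, which
is the kind of dup2 plumbing the comments above refer to.

```python
import os
import sys


def spawn_and_capture(argv):
    """Run argv via posix_spawn; return (exit_code, stdout_bytes)."""
    read_fd, write_fd = os.pipe()  # both ends are close-on-exec by default
    file_actions = [
        # In the child: make the pipe's write end its stdout.
        (os.POSIX_SPAWN_DUP2, write_fd, 1),
        # In the child: explicitly close the read end.
        (os.POSIX_SPAWN_CLOSE, read_fd),
    ]
    try:
        pid = os.posix_spawn(argv[0], argv, os.environ,
                             file_actions=file_actions)
    except OSError:
        os.close(read_fd)
        raise
    finally:
        # The parent must drop its write end, or read() never sees EOF.
        os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        output = reader.read()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), output


if __name__ == "__main__":
    code, out = spawn_and_capture([sys.executable, "-c", "print('spawned')"])
    print(code, out)
```

Which system call this ends up using (vfork, clone, or something else)
is up to the platform's libc, so the performance characteristics
discussed in this thread are not guaranteed by the sketch.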
        
         | nullindividual wrote:
         | To be fair for Windows, you shouldn't emulate fork() but rather
         | spawn new threads, which Windows can do in less than a
         | millisecond. This should give you much quicker results, but
         | comes with other complexities, especially for UN*X-ported apps.
         | Given you emulated fork(), you probably know this.
        
           | Spivak wrote:
           | But those aren't really the same? I mean yes they are both
           | ways to acquire additional units of execution but if you're
           | reaching for subprocesses it's likely specifically because
           | it's something you can't do in threads*.
           | 
           | * Or can do but would be horrifying, like trying to run a
           | compiled binary by mmaping it into your address space.
        
         | o11c wrote:
         | > What makes fork() slow for large programs like Node is that
         | fork() needs to lock and hold every single mutex that exists in
         | a process while it happens
         | 
         | It really doesn't; if it does that's a bug to fix. It only
         | requires that the child process does not call any function that
         | is not async-signal-safe if threads exist, and end with a call
         | to exec or _exit. In C, this is usually pretty easy to arrange;
         | at most the difficulty is if you want to support arbitrary
         | file-descriptor mappings. However, it is impossible from
         | languages like Javascript (or Python, etc.).
         | 
         | pthread_atfork() is really only useful to call if you _know_
         | you're using multiple processes and not using threads at all.
         | _Fork() is supposed to be a thing now (to avoid any historical
         | atfork handlers) but can't be relied on to exist.
         | 
         | posix_spawn() has the unfortunate problem of being buggy and
         | version-dependent when various options are used. It's probably
         | fine if you don't need the options (for example, if you exec a
         | shim first), or if you know how recent a platform you can use.
         | 
         | vfork() on Linux is mostly an optimization (for scheduling or
         | for when overcommit is disabled), but compiler optimizations
         | can make it buggy (though given that /bin/sh etc. use it, the
         | compiler probably won't break it too much if you call it in the
         | same way). For best results, it's advisable to either call
         | clone (with a separate stack directly), or use architecture-
         | specific assembly to call `vfork` in a way the compiler can't
         | mess up.
         | 
         | vfork() on many other platforms is basically just an alias for
         | fork() (or maybe _Fork() ?), but I haven't investigated the
         | details much.
         | 
         | fork() is fine if your parent process isn't big enough that
         | you're worried about memory accounting though, and you haven't
         | added braindead atfork handlers.
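
The atfork handler pattern debated in this subthread can be
illustrated in Python, where os.register_at_fork plays a role similar
to pthread_atfork() for os.fork(). This is a sketch under assumptions:
state_lock and the function names are hypothetical, and it says
nothing about what Node or libuv actually register. It shows why such
handlers make fork wait: the "before" hook has to obtain the lock
before the fork can proceed.

```python
import os
import threading

state_lock = threading.Lock()  # hypothetical lock guarding shared state


def _before_fork():
    # Runs in the forking thread just before fork(); blocks until the
    # lock is free, so no other thread holds it at the moment of the fork.
    state_lock.acquire()


def _after_fork():
    # Runs after fork() in both parent and child; releases the lock so
    # each process sees it unlocked.
    state_lock.release()


os.register_at_fork(before=_before_fork,
                    after_in_parent=_after_fork,
                    after_in_child=_after_fork)


def child_can_take_lock():
    """Fork once and report whether the child found the lock usable."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            usable = state_lock.acquire(blocking=False)
            os.write(write_fd, b"1" if usable else b"0")
        finally:
            os._exit(0)  # child exits without running parent cleanup
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer == b"1"


if __name__ == "__main__":
    print(child_can_take_lock())
```

With many such locks registered, every fork pays for acquiring all of
them, which is the cost one side of this discussion attributes to
large processes; a child that only calls exec or _exit does not need
the handlers at all.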
        
       | thomasfromcdnjs wrote:
       | Thanks for the super interesting read.
       | 
       | Did you happen to look at how the load on the 8 cores looked at
       | any given time?
        
         | maxmcd wrote:
         | Both Go and Rust had CPU near 100% on all cores. Most of the
         | Deno and Bun workloads were at around 80%, some of the slower
         | Node benchmarks were around 40-60% and that increased over
         | time, I think most of the fastest benchmarks were seeing very
         | good CPU utilization across all runs.
         | 
         | Sorry, this is all from memory of just having `btop` open for
         | some of the benchmarks. Maybe in a future post I will dig into
         | CPU utilization more.
        
       | Denvercoder9 wrote:
       | I'm curious if anyone has any insights into the answer to the
       | titular question. The article, while certainly interesting,
       | mostly discusses workarounds, but doesn't really dive into a root
       | cause analysis.
        
         | maxmcd wrote:
         | Yes, apologies for the less than satisfying conclusion. I am
         | planning a part 2 of this post where I hope to actually answer
         | the question.
         | 
         | In the meantime the discussion on lobste.rs includes some
         | lower-level speculation:
         | https://lobste.rs/s/tr8ozm/why_is_spawning_new_process_node_...
        
       | yu3zhou4 wrote:
       | A really nice read, thanks for posting. I'm curious why it's
       | not as fast as Go, and whether it's possible to speed it up to
       | Go's level.
        
       | throwitaway1123 wrote:
       | One addendum I would add to the section on node:cluster is that
       | Deno hasn't actually implemented the cluster module yet (it's
       | just a stub), so using it with Deno is pure overhead [1]. Also,
       | there's ongoing work to add node:cluster to Bun [2].
       | 
       | [1] https://docs.deno.com/runtime/manual/node/compatibility/
       | 
       | [2] https://github.com/oven-sh/bun/pull/11492
        
       | ambicapter wrote:
       | > Turns out, that if you use a Unix socket and the filename
       | starts with a null byte, like \0foo the socket will not exist on
       | the filesystem and it'll be automatically removed when no longer
       | used. Weird! Cool!
       | 
       | Is that...intentional on Unix's part? Seems kind of a weird thing
       | to implement.
        
         | spc476 wrote:
         | I believe that's a Linux thing.
        
         | o11c wrote:
         | It's very useful since "how do I atomically delete a stale
         | socket file" is not actually an easy thing to do.
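
The null-byte behaviour quoted above is the Linux abstract socket
namespace. A minimal Python sketch, assuming a Linux host; the
abstract_ping function and the socket name are hypothetical. It binds
a Unix socket whose address starts with "\0", confirms that no file
appears on disk, and sends one message through it.

```python
import os
import socket


def abstract_ping(name):
    """Bind an abstract Unix socket, send b"ping" through it, return it."""
    address = "\0" + name  # leading NUL selects the abstract namespace
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(address)
        server.listen(1)
        # Nothing is created on the filesystem for an abstract address.
        assert not os.path.exists(name)
        client.connect(address)  # completes against the listen backlog
        conn, _ = server.accept()
        with conn:
            client.sendall(b"ping")
            return conn.recv(4)
    finally:
        client.close()
        server.close()  # the name is released when the socket is closed


if __name__ == "__main__":
    unique = "hn-demo-%d" % os.getpid()
    print(abstract_ping(unique))
    print(abstract_ping(unique))  # rebinding works: no stale file to delete
```

Because the name disappears with the last socket that uses it, there
is no stale socket file to clean up after a crash, which is the
convenience the last comment points out.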
        
       | cryptonector wrote:
       | The problem is `fork()`. Use `vfork()` or `posix_spawn()`.
        
       ___________________________________________________________________
       (page generated 2024-07-30 23:00 UTC)