[HN Gopher] Why is spawning a new process in Node so slow?
___________________________________________________________________
Why is spawning a new process in Node so slow?
Author : maxmcd
Score : 50 points
Date : 2024-07-26 23:07 UTC (3 days ago)
(HTM) web link (blog.val.town)
(TXT) w3m dump (blog.val.town)
| efilife wrote:
| Nice article, definitely wasn't expecting some of its results.
| Was good to see Node finally beat Deno near the end of the
| article.
|
| As a heads up, the author confuses _its_ and _it's_. Makes the
| article look unprofessional. https://youryoure.com/?its here's
| how to differentiate
| maxmcd wrote:
| Thanks for noting it. I do indeed mix those two up all the
| time. Pushed a fix!
| codr7 wrote:
| Don't we all :)
|
| It's easy to forget what a big deal being semi-fluent in
| multiple languages is.
| jart wrote:
| It's important to specify which OS is being used because spawning
| goes a lot faster on Linux which has vfork(). If 40 spawns per
| second is the Linux speed, then I don't even want to know what
| that looks like on Darwin and OpenBSD. With Cosmopolitan Libc, 20
| spawns per second is about the speed I get when simulating fork()
| on Windows. What makes fork() slow for large programs like Node
| is that fork() needs to lock and hold every single mutex that
| exists in a process while it happens. See the IEEE POSIX.1 notes
| surrounding pthread_atfork(). So yeah, that means everything is
| de facto blocked until the spawn completes, regardless of the
| thread that's doing it. Using the separate process to spawn is a
| smart idea. Especially if you can talk to it via a process shared
| condition variable.
| convolvatron wrote:
| vfork() was introduced in 3.0 BSD
| amelius wrote:
| > What makes fork() slow for large programs like Node is that
| fork() needs to lock and hold every single mutex that exists in
| a process while it happens
|
| Are there any circumstances where you'd want to fork a multi-
| threaded process?
| toast0 wrote:
| forking is useful if you need to call other code when you
| can't or don't want to link that code into your own process.
| Pretty useful for things like ffmpeg where you wouldn't want
| that in your process space (and you may want to put more than
| just a single fork/exec between your code and theirs in that
| case), but also handy for "simple" os integrations where it's
| easier to run an existing utility than find a library or
| interface more directly yourself.
|
| Downsides in forking from a large multi-threaded process make
| it pretty common to add something in early init that forks a
| process (or several) that remain small and can be commanded
| to fork at runtime. On modern systems, it's a very small cost
| if you don't actually fork, and the small added cost of
| intermediating fork/exec pays off the first time you don't
| have to wait for hundreds or thousands or millions of fds to
| be closed in the child process during the process. Of course,
| if your parent process remains small with few opened
| filehandles, maybe it is a measurable negative, but it's
| still likely to be very small.
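The helper-process pattern described above can be sketched roughly as follows. This is a minimal illustration, not anything from the article: the `SpawnHelper` class, the `HELPER_SOURCE` string, and the line-delimited JSON protocol are all hypothetical, and the helper is started with `subprocess.Popen` rather than a raw `fork()`. The idea is only that a small process is started early and is later asked to do the spawning, so the large parent never forks itself.

```python
import json
import subprocess
import sys

# Source of the small helper process (hypothetical protocol): it reads one
# JSON-encoded argv per line, runs that command, and replies with one JSON line.
HELPER_SOURCE = r"""
import json, subprocess, sys
for line in sys.stdin:
    argv = json.loads(line)
    # stdin is redirected so the spawned command cannot consume our command pipe.
    done = subprocess.run(argv, stdin=subprocess.DEVNULL,
                          capture_output=True, text=True)
    reply = {"returncode": done.returncode, "stdout": done.stdout}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


class SpawnHelper:
    """Hypothetical wrapper: create it early, while the parent is still small."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", HELPER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def run(self, argv):
        # Ask the helper to spawn argv on our behalf, then wait for its reply.
        self.proc.stdin.write(json.dumps(argv) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def close(self):
        # Closing stdin ends the helper's read loop, so it exits on its own.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

A caller would create one `SpawnHelper()` during startup and call `helper.run([...])` whenever it needs a subprocess. Each request blocks until the command finishes, so a real version would likely want several helpers or an asynchronous protocol.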
| maxmcd wrote:
| Ah, I will add it to the post, an accidental omission. This was
| all on `Linux 6.8.0-35-generic`.
| maxmcd wrote:
| Do you have a sense for why things were so much faster in Go
| and Rust if they are both using a single process? I have also
| been experimenting with napi-rs which uses
| https://nodejs.org/api/n-api.html and allows you to run Rust in
| threads alongside the Node process. Performance with that
| approach has been very good, so I have wondered about the
| within-process limitations.
| saurik wrote:
| If you want to spawn a new process on Darwin you really want to
| reach for posix_spawn, not any variant of fork. Using this API
| should also do whatever is optimal on Linux, which still
| doesn't seem to be merely calling vfork, as you instead can use
| clone--with CLONE_VFORK, sure--to limit the work to CLONE_VM
| (per the man page for glibc's posix_spawn). (Of course, this
| API would also be the perfect fit for use on Windows, but I
| don't think MinGW or Cygwin implement it, for reasons which are
| beyond me, given how horrible simulating fork is in that OS;
| but, I believe gnulib provides it, hopefully using
| CreateProcess on Win32, for all of these platforms that are
| missing out... and, they really are "missing out", as I feel
| most developers--including me--end up incorrectly dealing with
| the issues surrounding multi-threaded fork.)
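For readers who have not used the API, here is a small sketch of calling posix_spawn from Python through `os.posix_spawn`, which wraps the C library function. The `spawn_and_wait` helper is a hypothetical name, `os.waitstatus_to_exitcode` assumes Python 3.9 or later, and what happens underneath (vfork, clone, or something else) depends on the platform's C library, so this shows the calling shape only.

```python
import os
import sys


def spawn_and_wait(argv):
    """Run argv via posix_spawn and return the child's exit code."""
    # argv[0] must be a real path here: os.posix_spawn does no PATH search
    # (os.posix_spawnp is the variant that does).
    pid = os.posix_spawn(argv[0], argv, os.environ)
    # Reap the child and convert the raw wait status into an exit code.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    code = spawn_and_wait([sys.executable, "-c", "raise SystemExit(3)"])
    print("child exited with", code)
```

No file actions or spawn attributes are passed in this sketch, so it stays on the plain path where only a program, arguments, and environment are supplied.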
| o11c wrote:
| > If you want to spawn a new process on Darwin you really
| want to reach for posix_spawn
|
| Assuming you can avoid the bugs and version-sensitivities.
| There are a _lot_ of them on various platforms, and my gut
| says they're more prominent on Darwin since alternatives are
| unrecommended.
|
| The dup2/cloexec mess is the one that sticks in my mind but
| it's far from the only one.
| nullindividual wrote:
| To be fair for Windows, you shouldn't emulate fork() but rather
| spawn new threads, which Windows can do in less than a
| millisecond. This should give you much quicker results, but
| comes with other complexities, especially for UN*X-ported apps.
| Given you emulated fork(), you probably know this.
| Spivak wrote:
| But those aren't really the same? I mean yes they are both
| ways to acquire additional units of execution but if you're
| reaching for subprocesses it's likely specifically because
| it's something you can't do in threads*.
|
| * Or can do but would be horrifying, like trying to run a
| compiled binary by mmaping it into your address space.
| o11c wrote:
| > What makes fork() slow for large programs like Node is that
| fork() needs to lock and hold every single mutex that exists in
| a process while it happens
|
| It really doesn't; if it does that's a bug to fix. It only
| requires that the child process does not call any function that
| is not async-signal-safe if threads exist, and end with a call
| to exec or _exit. In C, this is usually pretty easy to arrange;
| at most the difficulty is if you want to support arbitrary
| file-descriptor mappings. However, it is impossible from
| languages like Javascript (or Python, etc.).
|
| pthread_atfork() is really only useful to call if you _know_
| you're using multiple processes and not using threads at all.
| _Fork() is supposed to be a thing now (to avoid any historical
| atfork handlers) but can't be relied on to exist.
|
| posix_spawn() has the unfortunate problem of being buggy and
| version-dependent when various options are used. It's probably
| fine if you don't need the options (for example, if you exec a
| shim first), or if you know how recent a platform you can use.
|
| vfork() on Linux is mostly an optimization (for scheduling or
| for when overcommit is disabled), but compiler optimizations
| can make it buggy (though given that /bin/sh etc. use it,
| compiler probably won't break it too much if you call it in the
| same way). For best results, it's advisable to either call
| clone (with a separate stack directly), or use architecture-
| specific assembly to call `vfork` in a way the compiler can't
| mess up.
|
| vfork() on many other platforms is basically just an alias for
| fork() (or maybe _Fork() ?), but I haven't investigated the
| details much.
|
| fork() is fine if your parent process isn't big enough that
| you're worried about memory accounting though, and you haven't
| added braindead atfork handlers.
| thomasfromcdnjs wrote:
| Thanks for the super interesting read.
|
| Did you happen to look at how the load on the 8 cores looked at
| any given time?
| maxmcd wrote:
| Both Go and Rust had CPU near 100% on all cores. Most of the
| Deno and Bun workloads were at around 80%, some of the slower
| Node benchmarks were around 40-60% and that increased over
| time, I think most of the fastest benchmarks were seeing very
| good CPU utilization across all runs.
|
| Sorry, this is all from memory of just having `btop` open for
| some of the benchmarks. Maybe in a future post I will dig into
| CPU utilization more.
| Denvercoder9 wrote:
| I'm curious if anyone has any insights into the answer to the
| titular question. The article, while certainly interesting,
| mostly discusses workarounds, but doesn't really dive into a root
| cause analysis.
| maxmcd wrote:
| Yes, apologies for the less than satisfying conclusion. I am
| planning a part 2 of this post where I hope to actually answer
| the question.
|
| In the meantime the discussion on lobste.rs includes some
| lower-level speculation:
| https://lobste.rs/s/tr8ozm/why_is_spawning_new_process_node_...
| yu3zhou4 wrote:
| A really nice read, thanks for posting. I'm curious why it's
| not as fast as Go, and whether it's possible to speed it up to
| Go's level.
| throwitaway1123 wrote:
| One addendum I would add to the section on node:cluster is that
| Deno hasn't actually implemented the cluster module yet (it's
| just a stub), so using it with Deno is pure overhead [1]. Also,
| there's ongoing work to add node:cluster to Bun [2].
|
| [1] https://docs.deno.com/runtime/manual/node/compatibility/
|
| [2] https://github.com/oven-sh/bun/pull/11492
| ambicapter wrote:
| > Turns out, that if you use a Unix socket and the filename
| starts with a null byte, like \0foo the socket will not exist on
| the filesystem and it'll be automatically removed when no longer
| used. Weird! Cool!
|
| Is that...intentional on Unix's part? Seems kind of a weird thing
| to implement.
| spc476 wrote:
| I believe that's a Linux thing.
| o11c wrote:
| It's very useful since "how do I atomically delete a stale
| socket file" is not actually an easy thing to do.
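The abstract-socket behavior discussed above can be tried with a short sketch like the one below. It assumes Linux, as noted in the thread; the address and the `abstract_echo_once` function are made-up names for illustration. The leading NUL byte in the address is what selects the abstract namespace, so no socket file is created and there is nothing stale to delete afterwards.

```python
import os
import socket

# Hypothetical name; the leading NUL byte selects Linux's abstract namespace.
# The pid is included only to avoid clashing with another running copy.
ADDRESS = "\0example-abstract-socket-%d" % os.getpid()


def abstract_echo_once(address=ADDRESS):
    """Bind an abstract socket, send one message through it, return what arrived."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(address)  # no file is created on disk for this address
    server.listen(1)

    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(address)  # completes against the listen backlog
    conn, _ = server.accept()

    client.sendall(b"ping")
    data = conn.recv(4)

    for sock in (conn, client, server):
        sock.close()  # the name is released once the sockets are closed
    return data
```

Calling the function twice in a row with the same address works without any cleanup step, because the name is released when the sockets are closed. With a path-based socket, the second bind would fail until the leftover file was unlinked.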
| cryptonector wrote:
| The problem is `fork()`. Use `vfork()` or `posix_spawn()`.
___________________________________________________________________
(page generated 2024-07-30 23:00 UTC)