[HN Gopher] Way too many ways to wait on a child process with a ...
       ___________________________________________________________________
        
       Way too many ways to wait on a child process with a timeout
        
       Author : broken_broken_
       Score  : 83 points
       Date   : 2024-11-10 23:01 UTC (2 days ago)
        
 (HTM) web link (gaultier.github.io)
 (TXT) w3m dump (gaultier.github.io)
        
       | nf3 wrote:
       | FWIW io_uring does have support for waitid.
       | 
       | https://www.man7.org/linux/man-pages/man3/io_uring_prep_wait...
        
         | broken_broken_ wrote:
         | Many thanks! I have added it to the article in due form now.
        
         | EdSchouten wrote:
         | An interesting aspect of waitid is that it allows you to access
         | the full exit code of the process (i.e., the entire int instead
         | of just the bottom 8 bits).
         | 
         | Unfortunately, many operating systems implement waitid() on top
         | of one of the older APIs, meaning the top bits get lost
         | regardless...
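         | 
         | A minimal sketch of reading the status through waitid(2),
         | assuming a POSIX system. exit_status_via_waitid is a
         | hypothetical helper name, not something from the article.
         | si_status is a full int, but whether the upper bits survive
         | depends on the kernel:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, then return
 * the status that waitid() reports in si_status. Returns -1 on failure
 * or if the child did not exit normally. */
int exit_status_via_waitid(int code) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) _exit(code);

    siginfo_t info;
    if (waitid(P_PID, (id_t)pid, &info, WEXITED) < 0) return -1;
    if (info.si_code != CLD_EXITED) return -1; /* killed by a signal, etc. */

    /* si_status is an int, unlike the 8 bits that WEXITSTATUS() extracts. */
    return info.si_status;
}
```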
        
       | xchip wrote:
       | Thanks for this great article, it is going to be very useful for
       | my project. I am currently developing an open source Android
       | native app that invokes rsync when a file gets closed (i.e., when
       | you take a picture)
       | 
       | https://github.com/aguaviva/Syncy
        
       | nasretdinov wrote:
       | So many ways and no-one mentioned threads..?
       | 
       | Edit: by threads I mean creating a new thread to wait for the
       | process, and then kill the process after a certain timeout if the
       | process hasn't terminated. I guess I'm spoiled by Go...
        
         | zbentley wrote:
         | The threading approach is roughly:
         | 
         | 1. Start a thread
         | 
         | 2. That thread starts a child process and signals "started" by
         | storing its PID somewhere globally-visible (and hopefully
         | atomic/lock-protected).
         | 
         | 3. The thread then blocks in wait(2), taking advantage of its
         | non-main-thread-ness to avoid some signals and optionally
         | masking/ignoring some more.
         | 
         | 4. When the process exits, the thread can write
         | exitstatus/"completed" to the globally-visible state next to
         | PID. The thread then exits.
         | 
         | 5. External observers wait for the process with a timeout by
         | attempting to join the thread with a timeout. If the timeout
         | occurs, they can access the globally-visible PID and send a
         | signal to it.
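         | 
         | The numbered steps above can be sketched roughly as follows.
         | This is an illustration under assumptions, not the article's
         | code: all names are hypothetical, and because a timed join
         | (pthread_timedjoin_np) is a non-portable extension, the sketch
         | swaps in a condition variable with pthread_cond_timedwait to
         | get the same "join with a timeout" effect.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Shared state between the waiter thread and the observer (hypothetical). */
struct child_state {
    char *const *argv;        /* command to run */
    pthread_mutex_t lock;
    pthread_cond_t done_cv;
    pid_t pid;                /* 0 until fork() returns, -1 if it failed */
    int done;                 /* set once the child has been reaped */
    int status;               /* wait status, valid once done is set */
};

/* Steps 2-4: start the child, publish its PID, block in waitpid, report. */
static void *waiter(void *arg) {
    struct child_state *st = arg;
    pid_t pid = fork();
    if (pid == 0) {
        execvp(st->argv[0], st->argv);
        _exit(127);           /* exec failed */
    }
    pthread_mutex_lock(&st->lock);
    st->pid = pid;
    pthread_mutex_unlock(&st->lock);

    int status = 0;
    if (pid > 0)
        while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}

    pthread_mutex_lock(&st->lock);
    st->status = status;
    st->done = 1;
    pthread_cond_signal(&st->done_cv);
    pthread_mutex_unlock(&st->lock);
    return NULL;
}

/* Step 5: returns 0 if the child finished in time, 1 if it was killed
 * after the timeout, -1 on error. */
int run_with_thread_timeout(char *const argv[], int timeout_s, int *status) {
    struct child_state st = { .argv = argv };
    pthread_mutex_init(&st.lock, NULL);
    pthread_cond_init(&st.done_cv, NULL);

    pthread_t tid;
    if (pthread_create(&tid, NULL, waiter, &st) != 0) return -1;

    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_s;

    int timed_out = 0;
    pthread_mutex_lock(&st.lock);
    while (!st.done) {
        int rc = pthread_cond_timedwait(&st.done_cv, &st.lock, &deadline);
        if (rc != 0 && !st.done) {   /* ETIMEDOUT or an unexpected error */
            timed_out = 1;
            /* PID-reuse window: the waiter may already have reaped the
             * child and be blocked on the lock, so this PID could in
             * principle belong to another process by now. */
            if (st.pid > 0) kill(st.pid, SIGKILL);
            break;
        }
    }
    pthread_mutex_unlock(&st.lock);

    pthread_join(tid, NULL);         /* the waiter reaps the killed child */
    pthread_mutex_destroy(&st.lock);
    pthread_cond_destroy(&st.done_cv);
    if (st.pid < 0) return -1;       /* fork failed */
    *status = st.status;
    return timed_out;
}
```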
         | 
         | This is missing from the article (EDIT: it has since been
         | added, thanks!). That doesn't mean it's a good solution on many
         | platforms. It's more costly in resources (thread stack), more
         | code than most of the listed options, vulnerable to PID-reuse
         | problems that can cause a killsignal to go to the wrong
         | process, likely plays poorly with spawning methods that request
         | a SIGCHLD be sent to the parent on exit (and plays poorly with
         | signals in general if any customization is needed there), and
         | is probably often slower than most of TFA's alternatives as
         | well, both due to syscall count and pessimal thread/scheduler
         | switching conditions. Additionally, it multiplexes/composes to
         | large numbers of processes poorly and with a high resource
         | cost.
         | 
         | EDIT: Golang's version of this is less bad than described
         | above, but not perfect. Go's spawning infrastructure mitigates
         | resource cost (goroutines/segmented stacks are not as heavy as
         | threads), is vulnerable to PID-reuse (as are most platforms'
         | operations in this area), addresses the SIGCHLD risk through
         | the runtime and signal channels, and mitigates slowness with a
         | very good scheduler. For multiplexing, I would assume (but I
         | have not verified) that the Go runtime is internally using
         | pidfds/kqueue where supported. Where not supported, I would
         | assume Go is internally tracking spawn requests through its
         | stdlib, handling SIGCHLD, and has a single global routine
         | calling wait(2) without a specific PID, waking goroutines
         | waiting on a watched PID when it comes out of the call to
         | wait(2).
        
           | broken_broken_ wrote:
           | Thanks for the suggestion, I have added a short section about
           | threads.
        
           | nasretdinov wrote:
           | Thanks. I believe that Go indeed _could_ use those APIs to
           | wait for the child more efficiently if they chose to, but the
           | current implementation suggests that they're just calling
           | wait4() in a separate thread: https://cs.opensource.google/go
           | /go/+/refs/tags/go1.23.3:src/...
           | 
           | To be fair, in Go process spawning is very inefficient to
           | begin with, since it requires lots of runtime coordination to
           | not mess with the threads/goroutines state during fork, so
           | running wait4() in a separate thread (although the thread can
           | be re-used afterwards) is not the biggest concern here.
        
       | machine_coffee wrote:
       | Lol, the author's thought process mirrored mine as I read the
       | article. As I was reading I was thinking, 'doesn't kqueue support
       | that?'... and then a section on kqueue. Then I was thinking to
       | myself, 'so how does the Linux implementation do it then?'... was
       | just about to start trawling the source code when 'A
       | parenthesis..'
       | 
       | Great article. Sorry to say though, Windows does manage all this
       | in a more consistent way - but I guess they had the benefit of a
       | clean slate.
        
         | silon42 wrote:
         | signalfd / process descriptors are the Windows-style
         | mechanism... what is missing are a few things like 'spawn' that
         | returns a fd directly (eliminating races...)
        
           | blibble wrote:
           | there is no race from the parent
           | 
           | the pid will not be reused until you either handle sigchld or
           | wait
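           | 
           | A small sketch of that claim, assuming a POSIX system with
           | the default SIGCHLD disposition (the helper name is made
           | up): a child that has exited but has not been reaped is a
           | zombie, and its PID still answers kill(pid, 0).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper. Returns 1 if an exited-but-unreaped child still
 * holds its PID, 0 if not, -1 on error. */
int zombie_pid_still_signalable(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) _exit(0);

    /* WNOWAIT blocks until the child has exited but leaves it unreaped. */
    siginfo_t info;
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) < 0) return -1;

    /* Signal 0 sends nothing; it only checks that the PID exists. */
    int held = (kill(pid, 0) == 0);

    /* Only now is the PID released for reuse. */
    waitpid(pid, NULL, 0);
    return held;
}
```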
        
       | JackSlateur wrote:
       | What is the meaning of this code?
       | 
       |     void on_sigchld(int sig) { (void)sig; }
        
         | naruhodo wrote:
         | If it's C code, that is the way to suppress a compiler warning
         | about sig being unused. In C++ you can omit (or comment-out)
         | the parameter name, e.g.:
         | 
         |     // C++
         |     void on_sigchld(int /*sig*/) {}
        
           | JackSlateur wrote:
           | Thank you
           | 
           | So this would be a way that predates C23's maybe_unused
           | attribute [1]
           | 
           | Nice trick
           | 
           | [1] https://en.cppreference.com/w/c/language/attributes/maybe
           | _un...
        
           | kevin_thibedeau wrote:
           | That is K&R C syntax, supported up to C18. The solution to
           | tools emitting unwanted diagnostics is not to appease them
           | with pointless cruft but to shut off the diagnostic:
           | -Wall -Wextra -Wno-unused-parameter
        
       | moron123 wrote:
       | Parenting 101
        
       | eduction wrote:
       | He mentions Bryan Cantrill in there and I can't resist posting
       | his famous epoll/kqueue rant:
       | 
       | https://youtu.be/l6XQUciI-Sc?t=3643
       | 
       | I know this is related but maybe someone smarter than me can
       | explain how closely it relates (or doesn't) to this issue which
       | seems more general (iirc Cantrill was talking about fs events not
       | child processes generally)
        
       | o11c wrote:
       | > Because the Linux kernel coalesces SIGCHLD (and other signals),
       | the only way to reliably determine if a monitored process has
       | exited, is to loop through all PIDs registered by any kqueue when
       | we receive a SIGCHLD. This involves many calls to waitid(2) and
       | may have a negative performance impact.
       | 
       | This is somewhat wrong. To speed things up in the happy case
       | (where we are the only part of the program that is spawning
       | children), you can just do a `WNOHANG` wait for _any_ child
       | first, and check if it's one of the children we care about. Only
       | if it's an unknown child do you have to do the full loop (of
       | course, if you only have a couple of children the loop may be
       | better).
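       | 
       | A minimal sketch of that fast path, assuming a flat array of
       | watched PIDs (reap_any and the two return codes are hypothetical
       | names, not libkqueue's). Note that reaping an unknown child
       | consumes its status, which is why this only suits the happy case:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { NONE_EXITED = -1, UNKNOWN_CHILD = -2 };

/* Reap at most one exited child without blocking.
 * Returns its index in `watched`, NONE_EXITED if no child has exited
 * (or there are no children), or UNKNOWN_CHILD if the reaped child is
 * not in the table, in which case the caller falls back to the full
 * per-PID loop. */
int reap_any(const pid_t *watched, int n, int *status) {
    pid_t pid = waitpid(-1, status, WNOHANG);
    if (pid <= 0) return NONE_EXITED;

    for (int i = 0; i < n; i++)
        if (watched[i] == pid) return i;
    return UNKNOWN_CHILD;
}
```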
        
       | akira2501 wrote:
       | > I would prefer extending poll to support things other than file
       | descriptors, instead of converting everything to a file descriptor
       | to be able to use poll.
       | 
       | Why? The ability to block on these descriptors as a one off
       | rather than wrapping into a poll makes them extremely useful and
       | avoids the race issues that exist with signal handlers and other
       | non-blocking mechanisms.
       | 
       | signalfd, timerfd, eventfd, userfaultfd, pidfd are all great
       | applications of this strategy.
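       | 
       | A rough sketch of the one-off blocking style with a pidfd,
       | assuming Linux 5.3 or later. wait_with_pidfd is a hypothetical
       | helper, the raw syscall is used in case the libc has no
       | pidfd_open wrapper, and the fallback syscall number is an
       | assumption that holds on most architectures:

```c
#define _GNU_SOURCE            /* for syscall() */
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434     /* assumption: older headers, common ABI */
#endif

/* Returns 1 if the child exited within timeout_ms (and was reaped),
 * 0 on timeout, -1 on error with errno set. */
int wait_with_pidfd(pid_t pid, int timeout_ms, int *status) {
    int fd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (fd < 0) return -1;

    /* A pidfd becomes readable once the process has terminated. */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;
    do {
        rc = poll(&pfd, 1, timeout_ms); /* a retry restarts the full timeout */
    } while (rc < 0 && errno == EINTR);

    int saved = errno;
    close(fd);
    errno = saved;

    if (rc <= 0) return rc;             /* 0: timeout, -1: poll error */
    if (waitpid(pid, status, 0) < 0) return -1;
    return 1;
}
```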
        
       | greggyb wrote:
       | Not so much about timeouts, but related in that it is based
       | around managing child processes:
       | 
       | The lineage of tools descending from daemontools for service
       | management is worth exploring:
       | 
       | daemontools: http://cr.yp.to/daemontools.html
       | 
       | runit: https://smarden.org/runit/
       | 
       | s6: https://skarnet.org/software/s6/
       | 
       | dinit: https://davmac.org/projects/dinit/
        
       | adrianmonk wrote:
       | Tenth Approach: fork() two processes.
       | 
       | Child 1 exec()s the command.
       | 
       | Child 2 does this:
       | 
       |     signal(SIGALRM, alarm_handler);
       |     alarm(timeout_length);
       |     pause();
       |     exit(0);
       | 
       | Start both children, then call wait(), which blocks until _any_
       | child exits and returns the pid of the child that exited. If it's
       | the command child, then your command finished. If it's the
       | other child, then the timeout expired.
       | 
       | Now that one child has exited, kill() the other child with
       | SIGTERM and reap it by calling wait() again.
       | 
       | All of this assumes you'll only have these two children going,
       | but if you're writing a small exponential backoff command retry
       | utility, that should be OK.
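       | 
       | Put together, the approach might look like the sketch below. It
       | assumes, as stated above, that these are the only two children
       | of the process. run_with_timer_child is a hypothetical name, and
       | the timeout must be non-zero because alarm(0) cancels the alarm:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Empty handler: its only job is to make pause() return on SIGALRM. */
static void alarm_handler(int sig) { (void)sig; }

/* Returns 0 if the command finished first, 1 if the timeout expired
 * first, -1 on error. */
int run_with_timer_child(char *const argv[], unsigned timeout_s) {
    assert(timeout_s > 0);

    pid_t cmd = fork();
    if (cmd < 0) return -1;
    if (cmd == 0) {                 /* child 1: the command */
        execvp(argv[0], argv);
        _exit(127);
    }

    pid_t timer = fork();
    if (timer < 0) {
        kill(cmd, SIGTERM);
        waitpid(cmd, NULL, 0);
        return -1;
    }
    if (timer == 0) {               /* child 2: the timer */
        signal(SIGALRM, alarm_handler);
        alarm(timeout_s);
        pause();
        _exit(0);
    }

    /* Block until either child exits. */
    pid_t first;
    do {
        first = wait(NULL);
    } while (first < 0 && errno == EINTR);
    if (first < 0) return -1;

    /* Terminate and reap whichever child is still running. */
    pid_t other = (first == cmd) ? timer : cmd;
    kill(other, SIGTERM);
    waitpid(other, NULL, 0);
    return first == cmd ? 0 : 1;
}
```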
        
       ___________________________________________________________________
       (page generated 2024-11-13 23:01 UTC)