[HN Gopher] Debugging a FUSE deadlock in the Linux kernel
       ___________________________________________________________________
        
       Debugging a FUSE deadlock in the Linux kernel
        
       Author : andsoitis
       Score  : 80 points
       Date   : 2023-05-19 19:28 UTC (3 hours ago)
        
 (HTM) web link (netflixtechblog.com)
 (TXT) w3m dump (netflixtechblog.com)
        
       | nyanpasu64 wrote:
       | I'm reminded of Bedrock Linux hanging during sleep because Linux
       | sleeps the FUSE daemon while a process is waiting on FUSE and
       | unable to be suspended for sleep:
       | https://news.ycombinator.com/item?id=34583495
        
       | kayson wrote:
       | TIL about `ps awwfux`. Great way to remember it
        
         | chatmasta wrote:
         | See explainshell [0], a great resource, for the full
         | explanation of these flags.
         | 
         | (Another one of my favorites is cat -vet)
         | 
         | [0] https://explainshell.com/explain?cmd=ps+awwfux
        
       | loeg wrote:
       | This article is mostly about rediscovering the distinction
       | between interruptible and non-interruptible waits.
       | 
       | At the end there is some sort of deadlock between namespace
       | teardown, signal delivery, and FUSE, but... it isn't articulated
       | in a way that is super comprehensible to me. The kernel flushes
       | open files on exit and also kills things in the namespace on
       | exit. But that means the race condition was always hittable if you
       | killed the FUSE daemon at the wrong time relative to the FUSE
       | client shutdown? It's not totally obvious to me why this would
       | impact other non-FUSE filesystems.
       | 
       | Signal delivery and multithreaded process teardown in the kernel
       | is certainly tricky, and it's really easy to get these weird edge
       | cases wrong.
        
         | sargun wrote:
         | I don't understand why wants_signal returns false on PF_EXITING
         | even if the signal is SIGKILL (and from the kernel). Shouldn't
         | it still wake up the process, so it can get out of the flush?
         | 
         | I am curious: if you were to walk over every PID in the pid
         | namespace after zap_pid_ns_processes has sent its signals, and
         | perform wake-ups, would it break out of the `wait_event` loop?
         | 
         | Btw, this class of weirdness with FUSE isn't that unusual.
        
           | tych0 wrote:
           | Author here (hi Sargun), it's not really about rediscovering
           | killable vs. unkillable waits, and any confusion is probably
           | a result of my poor writing.
           | 
           | The crux of it is that once you've called exit_signals() from
           | do_exit(), signals will not get delivered. So if you
           | subsequently use the kernel's completions or other wait code,
           | you will not get the signal from zap_pid_ns_processes(), so
           | you don't know to wake up and exit.
           | 
           | There's a test case here if people want to play around:
           | https://github.com/tych0/kernel-utils/tree/master/fuse2
        
             | loeg wrote:
             | Should processes not be able to wait after exit_signals?
             | That seems like a plausible invariant.
        
               | tych0 wrote:
               | I think they definitely should not. I've considered
               | sending a patch that adds a WARN() or some syzkaller test
               | for it or something, especially now that I've seen it in
               | other filesystems.
        
               | loeg wrote:
               | Makes sense to me.
        
               | avianlyric wrote:
               | I think that's the point. Currently doing that will
               | potentially result in a deadlock.
        
               | loeg wrote:
               | Well, only if the wait is for userspace or a remote
               | resource, right? Regular disks are sometimes considered
               | infallible (or at least, the IO will time out eventually
               | in the generic SCSI logic) and might be ok to wait on.
               | 
               | To generalize a bit, I think the problem is doing any
               | sort of interruptible wait -- because we can no longer be
               | interrupted. Uninterruptible waits aren't any different
               | without signal delivery. I might be oversimplifying,
               | though.
        
             | mjevans wrote:
             | It sounds like exit_signals() is being called too early,
             | and based on the test case linked this might be a library
             | issue rather than a code or kernel issue?
             | 
             | Edit: Reading the article it's more clear this happens in
             | kernel's:
             | 
             |     do_exit() {
             |         ...
             |         exit_signals(tsk); /* sets PF_EXITING */
             |         ...
             |         exit_files(tsk);
             | 
             | Would a better solution not be to exit_signals(tsk); later
             | in do_exit() after all possible signal sources are
             | exhausted?
        
               | loeg wrote:
               | > It sounds like exit_signals() is being called too early
               | 
               | Or zap_pid_ns too late, yeah.
        
             | sargun wrote:
             | Hi Tycho!
             | 
             | I'm glad you inherited this :).
             | 
             | Oh, I wasn't suggesting that it was about killable vs.
             | unkillable.
             | 
             | Couple of things: 1. Should prepare_to_wait_event check if
             | the task is in PF_EXITING, and if so, refuse to wait unless
             | a specific flag is provided? I'd be curious if you just add
             | a kprobe to prepare_to_wait_event that checks for
             | PF_EXITING, how many cases are valid?
             | 
             | 2. Following this:
             | 
             |     zap_pid_ns_processes -> __fatal_signal_pending(task)
             |       group_send_sig_info
             |         do_send_sig_info
             |           send_signal_locked
             |             __send_signal_locked -> (jump to out_set)
             |               sigaddset // It has the pending signal here
             |               ....
             |               complete_signal
             | 
             | Shouldn't it wake up, even if it's in PF_EXITING? That
             | would trigger a reassessment of the condition, and then
             | the `__fatal_signal_pending` check would make it return
             | -ERESTARTSYS.
             | 
             | One note, in the post:
             | 
             |     # grep Pnd /proc/1544574/status
             |     SigPnd: 0000000000000000
             |     ShdPnd: 0000000000000100
             | 
             | > Viewing process status this way, you can see 0x100 (i.e.
             | the 9th bit is set) under SigPnd, which is the signal
             | number corresponding to SIGKILL.
             | 
             | Shouldn't it be "ShdPnd"?
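
The bitmask reading discussed above can be checked with a short sketch. It assumes the usual Linux convention that signal number n is stored in bit n-1 of the hex masks shown in /proc/<pid>/status. The helper names `pending_signals` and `signal_name` are hypothetical and chosen only for illustration.

```python
import signal


def pending_signals(mask_hex: str) -> list[int]:
    """Return the signal numbers set in a SigPnd/ShdPnd hex mask."""
    mask = int(mask_hex, 16)
    # Assumption: bit n-1 (counting from the least significant bit)
    # stands for signal number n.
    return [bit + 1 for bit in range(mask.bit_length()) if mask & (1 << bit)]


def signal_name(signo: int) -> str:
    """Best-effort name lookup; falls back to the bare number."""
    try:
        return signal.Signals(signo).name
    except ValueError:
        return f"signal {signo}"


# The two values quoted from the post.
for field, mask in (("SigPnd", "0000000000000000"),
                    ("ShdPnd", "0000000000000100")):
    print(field, [signal_name(s) for s in pending_signals(mask)])
```

With the values quoted from the post, only the ShdPnd mask decodes to a non-empty list, and the single entry is signal 9, which is SIGKILL on Linux.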
        
               | tych0 wrote:
               | > Couple of things: 1. Should prepare_to_wait_event check
               | if the task is in PF_EXITING, and if so, refuse to wait
               | unless a specific flag is provided? I'd be curious if you
               | just add a kprobe to prepare_to_wait_event that checks
               | for PF_EXITING, how many cases are valid?
               | 
               | I would argue they're all invalid if PF_EXITING is
               | present. Maybe I should send a patch to WARN() and see
               | how much I get yelled at.
               | 
               | > Shouldn't it wake up, even if in its in PF_EXITING,
               | that would trigger as reassessment of the condition, and
               | then the `__fatal_signal_pending` check would make it
               | return -ERESTARTSYS.
               | 
               | No, because the signal doesn't get delivered by
               | complete_signal(). wants_signal() returns false if
               | PF_EXITING is set. (Another maybe-interesting thing would
               | be to just delete that check.) Or am I misunderstanding
               | you?
               | 
               | > Shouldn't it be "ShdPnd"
               | 
               | derp, fixed, thanks.
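
The behaviour described in this exchange can be illustrated with a toy userspace model. This is a sketch under stated assumptions and not kernel code: the `Task` class and `send_signal` helper are hypothetical, `wants_signal` here keeps only the two checks mentioned in the thread (the real kernel function has more), and the `PF_EXITING` value is included only so the flag test looks familiar.

```python
from dataclasses import dataclass

PF_EXITING = 0x00000004  # assumed flag value; only the bit test matters here
SIGKILL = 9


@dataclass
class Task:
    """Hypothetical stand-in for a kernel task."""
    name: str
    flags: int = 0
    woken: bool = False


def wants_signal(sig: int, task: Task) -> bool:
    # Simplified model: the PF_EXITING check comes first, so even
    # SIGKILL is refused once the task has started exiting.
    if task.flags & PF_EXITING:
        return False
    if sig == SIGKILL:
        return True
    return True  # other checks of the real function are omitted


def send_signal(sig: int, tasks: list, shared_pending: set):
    """Record the signal as pending, then try to find a task to wake."""
    shared_pending.add(sig)          # models sigaddset()
    for task in tasks:               # models the search in complete_signal()
        if wants_signal(sig, task):
            task.woken = True
            return task
    return None                      # pending, but nobody was woken


exiting = Task("flusher", flags=PF_EXITING)
pending = set()
woken = send_signal(SIGKILL, [exiting], pending)
print(SIGKILL in pending, woken, exiting.woken)
```

In this model the signal ends up in the pending set, which matches the ShdPnd observation, yet the exiting task is never woken, so a wait that depends on being woken by the signal would not return.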
        
               | tych0 wrote:
               | > Or am I misunderstanding you?
               | 
               | Oh, I see, you're suggesting exactly,
               | 
               | > (Another maybe-interesting thing would be to just
               | delete that check.)
               | 
               | I agree.
        
       | ndesaulniers wrote:
       | I thought Netflix used a BSD?
       | https://papers.freebsd.org/2019/fosdem/looney-netflix_and_fr...
        
         | prpl wrote:
         | I thought <bigcorp> used <xyz> is always a funny
         | question/assertion.
         | 
         | Bigcorps are large and diverse. In this case - this seems to be
         | user/desktop facing as it's a fuse module for studio assets.
        
         | eatonphil wrote:
         | They use FreeBSD for the CDN but I think their application
         | servers use Linux.
         | 
         | https://netflixtechblog.com/linux-performance-analysis-in-60...
        
         | loeg wrote:
         | Netflix uses FreeBSD for their dataplane and AWS Linux for
         | control plane, at least as of ~2020.
        
       ___________________________________________________________________
       (page generated 2023-05-19 23:00 UTC)