[HN Gopher] Debugging a FUSE deadlock in the Linux kernel
___________________________________________________________________
Debugging a FUSE deadlock in the Linux kernel
Author : andsoitis
Score : 80 points
Date : 2023-05-19 19:28 UTC (3 hours ago)
(HTM) web link (netflixtechblog.com)
| nyanpasu64 wrote:
| I'm reminded of Bedrock Linux hanging during sleep because Linux
| sleeps the FUSE daemon while a process is waiting on FUSE and
| unable to be suspended for sleep:
| https://news.ycombinator.com/item?id=34583495
| kayson wrote:
| TIL about `ps awwfux`. Great way to remember it
| chatmasta wrote:
| See explainshell [0], a great resource, for the full
| explanation of these flags.
|
| (Another one of my favorites is cat -vet)
|
| [0] https://explainshell.com/explain?cmd=ps+awwfux
| loeg wrote:
| This article is mostly about rediscovering the distinction
| between interruptible and non-interruptible waits.
|
| At the end there is some sort of deadlock between namespace
| teardown, signal delivery, and FUSE, but... it isn't articulated
| in a way that is super comprehensible to me. The kernel flushes
| open files on exit and also kills things in the namespace on
| exit. But that means the race condition was always hittable if you
| killed the FUSE daemon at the wrong time relative to the FUSE
| client shutdown? It's not totally obvious to me why this would
| impact other non-FUSE filesystems.
|
| Signal delivery and multithreaded process teardown in the kernel
| is certainly tricky, and it's really easy to get these weird edge
| cases wrong.
| sargun wrote:
| I don't understand why wants_signal returns false on PF_EXITING
| even if the signal is SIGKILL (and from the kernel). Shouldn't it
| still wake up the process, so it can get out of the flush?
|
| I am curious, if you were just to walk over every PID in the
| pid namespace after sending zap_pid_ns_processes, and perform
| wake ups, would it break out of the `wait_event` loop?
|
| Btw, this class of weirdness with FUSE isn't that unusual.
| tych0 wrote:
| Author here (hi Sargun), it's not really about rediscovering
| killable vs. unkillable waits, and any confusion is probably
| a result of my poor writing.
|
| The crux of it is that once you've called exit_signals() from
| do_exit(), signals will not get delivered. So if you
| subsequently use the kernel's completions or other wait code,
| you will not get the signal from zap_pid_ns_processes(), so
| you don't know to wake up and exit.
|
| There's a test case here if people want to play around:
| https://github.com/tych0/kernel-utils/tree/master/fuse2
| loeg wrote:
| Should processes not be able to wait after exit_signals?
| That seems like a plausible invariant.
| tych0 wrote:
| I think they definitely should not. I've considered
| sending a patch that adds a WARN() or some syzkaller test
| for it or something, especially now that I've seen it in
| other filesystems.
| loeg wrote:
| Makes sense to me.
| avianlyric wrote:
| I think that's the point. Currently doing that will
| potentially result in a deadlock.
| loeg wrote:
| Well, only if the wait is for userspace or a remote
| resource, right? Regular disks are sometimes considered
| infallible (or at least, the IO will time out eventually
| in the generic SCSI logic) and might be ok to wait on.
|
| To generalize a bit, I think the problem is doing any
| sort of interruptible wait -- because we can no longer be
| interrupted. Uninterruptible waits aren't any different
| without signal delivery. I might be oversimplifying,
| though.
| mjevans wrote:
| It sounds like exit_signals() is being called too early,
| and based on the test case linked this might be a library
| issue rather than a code or kernel issue?
|
| Edit: Reading the article it's more clear this happens in
| kernel's:
|
|     do_exit() {
|         ...
|         exit_signals(tsk); /* sets PF_EXITING */
|         ...
|         exit_files(tsk);
|
| Would a better solution not be to call exit_signals(tsk) later
| in do_exit(), after all possible signal sources are
| exhausted?
| loeg wrote:
| > It sounds like exit_signals() is being called too early
|
| Or zap_pid_ns too late, yeah.
| sargun wrote:
| Hi Tycho!
|
| I'm glad you inherited this :).
|
| Oh, I wasn't suggesting that it was about killable vs.
| unkillable.
|
| Couple of things:
|
| 1. Should prepare_to_wait_event check if the task is in
| PF_EXITING, and if so, refuse to wait unless a specific flag
| is provided? I'd be curious if you just add a kprobe to
| prepare_to_wait_event that checks for PF_EXITING, how many
| cases are valid?
|
| 2. Following this:
|
|     zap_pid_ns_processes ->
|       __fatal_signal_pending(task)
|       group_send_sig_info
|         do_send_sig_info
|           send_signal_locked
|             __send_signal_locked -> (jump to out_set)
|               sigaddset // It has the pending signal here
|               ....
|               complete_signal
|
| Shouldn't it wake up even if it's in PF_EXITING? That would
| trigger a reassessment of the condition, and then the
| `__fatal_signal_pending` check would make it return
| -ERESTARTSYS.
|
| One note, in the post:
|
|     # grep Pnd /proc/1544574/status
|     SigPnd: 0000000000000000
|     ShdPnd: 0000000000000100
|
| > Viewing process status this way, you can see 0x100 (i.e.
| the 9th bit is set) under SigPnd, which is the signal
| number corresponding to SIGKILL.
|
| Shouldn't it be "ShdPnd"?
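The mask arithmetic in the comment above can be checked with a
short Python sketch. It assumes only the convention that bit
(n - 1) of a /proc/<pid>/status signal mask stands for signal
number n; the helper name decode_sigmask is hypothetical and is
not taken from the post.

```python
def decode_sigmask(hex_mask: str) -> list[int]:
    """Return the signal numbers whose bits are set in a hex mask.

    Assumes the /proc/<pid>/status convention that bit (n - 1)
    represents signal number n. The function name is hypothetical.
    """
    mask = int(hex_mask, 16)
    # Masks in the status file are 64 bits wide, so check signals 1..64.
    return [n for n in range(1, 65) if mask & (1 << (n - 1))]


# Values copied from the grep output quoted in the comment above.
sig_pnd = decode_sigmask("0000000000000000")
shd_pnd = decode_sigmask("0000000000000100")

print("SigPnd:", sig_pnd)  # SigPnd: []
print("ShdPnd:", shd_pnd)  # ShdPnd: [9]
```

Signal 9 is SIGKILL on Linux, so the set bit sits in the ShdPnd
line and the SigPnd line is empty, which matches the correction
suggested in the comment.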
| tych0 wrote:
| > Couple of things: 1. Should prepare_to_wait_event check
| if the task is in PF_EXITING, and if so, refuse to wait
| unless a specific flag is provided? I'd be curious if you
| just add a kprobe to prepare_to_wait_event that checks
| for PF_EXITING, how many cases are valid?
|
| I would argue they're all invalid if PF_EXITING is
| present. Maybe I should send a patch to WARN() and see
| how much I get yelled at.
|
| > Shouldn't it wake up even if it's in PF_EXITING? That
| would trigger a reassessment of the condition, and then
| the `__fatal_signal_pending` check would make it return
| -ERESTARTSYS.
|
| No, because the signal doesn't get delivered by
| complete_signal(). wants_signal() returns false if
| PF_EXITING is set. (Another maybe-interesting thing would
| be to just delete that check.) Or am I misunderstanding
| you?
|
| > Shouldn't it be "ShdPnd"
|
| derp, fixed, thanks.
| tych0 wrote:
| > Or am I misunderstanding you?
|
| Oh, I see, you're suggesting exactly,
|
| > (Another maybe-interesting thing would be to just
| delete that check.)
|
| I agree.
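The behavior discussed in this sub-thread (the signal is recorded
as pending, but the exiting task is never woken) can be
illustrated with a toy Python model. This is a sketch, not kernel
code: ToyTask, its fields, and the simplified wants_signal() and
send_signal() are hypothetical stand-ins, and the real
wants_signal() applies further checks that are omitted here.

```python
from dataclasses import dataclass, field

SIGKILL = 9


@dataclass
class ToyTask:
    """Hypothetical stand-in for a task; not the kernel's task_struct."""
    exiting: bool = False                      # plays the role of PF_EXITING
    blocked: set = field(default_factory=set)  # blocked signal numbers
    pending: set = field(default_factory=set)  # recorded signal numbers
    woken: bool = False


def wants_signal(sig: int, task: ToyTask) -> bool:
    """Simplified model of the check described in the thread."""
    if sig in task.blocked:
        return False
    if task.exiting:
        # As described above: an exiting task is refused, even for SIGKILL.
        return False
    # The real function has more conditions for other signals; this toy
    # accepts anything that reaches this point.
    return True


def send_signal(sig: int, task: ToyTask) -> None:
    """Record the signal, then wake the task only if it wants the signal."""
    task.pending.add(sig)        # models sigaddset: always recorded
    if wants_signal(sig, task):  # models complete_signal choosing a target
        task.woken = True


running = ToyTask()
exiting = ToyTask(exiting=True)
for task in (running, exiting):
    send_signal(SIGKILL, task)

print(running.pending, running.woken)  # {9} True
print(exiting.pending, exiting.woken)  # {9} False
```

In this model the exiting task ends up with SIGKILL pending but is
never woken, which is the state the ShdPnd output in the post
shows. Deleting the `task.exiting` check, as floated above, would
make both tasks wake in the toy; whether that is safe in the real
kernel is a separate question.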
| ndesaulniers wrote:
| I thought Netflix used a BSD?
| https://papers.freebsd.org/2019/fosdem/looney-netflix_and_fr...
| prpl wrote:
| "I thought <bigcorp> used <xyz>" is always a funny
| question/assertion.
|
| Bigcorps are large and diverse. In this case, this seems to be
| user/desktop-facing, as it's a FUSE module for studio assets.
| eatonphil wrote:
| They use FreeBSD for the CDN but I think their application
| servers use Linux.
|
| https://netflixtechblog.com/linux-performance-analysis-in-60...
| loeg wrote:
| Netflix uses FreeBSD for their dataplane and AWS Linux for
| control plane, at least as of ~2020.
___________________________________________________________________
(page generated 2023-05-19 23:00 UTC)