[HN Gopher] The case of the UI thread that hung in a kernel call
       ___________________________________________________________________
        
       The case of the UI thread that hung in a kernel call
        
       Author : luu
       Score  : 81 points
       Date   : 2025-04-15 17:13 UTC (5 hours ago)
        
 (HTM) web link (devblogs.microsoft.com)
        
       | simscitizen wrote:
       | Oh I've debugged this before. Native memory allocator had a
       | scavenge function which suspended all other threads. Managed
       | language runtime had a stop the world phase which suspended all
       | mutator threads. They ran at about the same time and ended up
       | suspending each other. To fix this you need to enforce some sort
       | of hierarchy or mutual exclusion for suspension requests.
       | 
       | > Why you should never suspend a thread in your own process.
       | 
       | This sounds like a good general principle but suspending threads
       | in your own process is kind of necessary for e.g. many GC
       | algorithms. Now imagine multiple of those runtimes running in the
       | same process.
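
One way to read "mutual exclusion for suspension requests" is a single process-wide gate that any component must hold before it suspends other threads. The sketch below is only a Python stand-in: `SuspensionArbiter` and `run_suspender` are hypothetical names, and since Python cannot actually suspend threads, the critical section just records who is inside the gate.

```python
import threading


class SuspensionArbiter:
    """Hypothetical process-wide gate for suspend-all requests."""

    def __init__(self):
        self._gate = threading.Lock()
        self.active = 0      # requesters currently inside the gate
        self.max_active = 0  # highest overlap ever observed

    def __enter__(self):
        self._gate.acquire()
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        return self

    def __exit__(self, exc_type, exc, tb):
        self.active -= 1
        self._gate.release()
        return False


def run_suspender(arbiter, name, rounds, log):
    for _ in range(rounds):
        with arbiter:
            # Stand-in for: suspend other threads, do the work, resume them.
            log.append(name)


arbiter = SuspensionArbiter()
log = []
workers = [
    threading.Thread(target=run_suspender, args=(arbiter, name, 1000, log))
    for name in ("allocator-scavenge", "gc-stop-the-world")
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because a requester cannot start suspending anyone until it owns the gate, the current owner is never frozen halfway through its own suspend/resume cycle. This only helps if every runtime in the process agrees to use the same gate, which is exactly the hard part when the runtimes come from different vendors.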
        
         | hyperpape wrote:
         | > suspending threads in your own process is kind of necessary
         | for e.g. many GC algorithms
         | 
         | I think this is typically done by having the compiler/runtime
         | insert safepoints, which cooperatively yield at specified
         | points to allow the GC to run without mutator threads being
         | active. Done correctly, this shouldn't be subject to the
         | problem the original post highlighted, because it doesn't rely
         | on the OS's ability to suspend threads when they aren't
         | expecting it.
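
A rough illustration of the safepoint idea, assuming a toy runtime where mutators poll explicitly (real runtimes have the compiler insert the polls). `SafepointCoordinator`, `poll`, `mutator_exit`, and `stop_the_world` are hypothetical names, not any real runtime's API.

```python
import threading
import time


class SafepointCoordinator:
    """Toy cooperative stop-the-world: mutators park themselves at polls."""

    def __init__(self, mutator_count):
        self._cond = threading.Condition()
        self._mutators = mutator_count  # mutators that have not exited yet
        self._parked = 0                # mutators waiting inside poll()
        self._stop_requested = False

    def poll(self):
        """Called by a mutator at a point where it holds no runtime locks."""
        with self._cond:
            if not self._stop_requested:
                return
            self._parked += 1
            self._cond.notify_all()
            while self._stop_requested:
                self._cond.wait()
            self._parked -= 1

    def mutator_exit(self):
        with self._cond:
            self._mutators -= 1
            self._cond.notify_all()

    def stop_the_world(self, action):
        with self._cond:
            self._stop_requested = True
            # Wait until every live mutator has parked at a safepoint.
            while self._parked < self._mutators:
                self._cond.wait()
            try:
                return action()
            finally:
                self._stop_requested = False
                self._cond.notify_all()


def mutator(coord, counters, idx, iterations):
    try:
        for _ in range(iterations):
            counters[idx] += 1
            coord.poll()
    finally:
        coord.mutator_exit()


ITERATIONS = 20000
counters = [0, 0, 0]
coord = SafepointCoordinator(len(counters))


def observe():
    # With the world stopped, the counters must not move.
    before = list(counters)
    time.sleep(0.01)
    return before == list(counters)


threads = [
    threading.Thread(target=mutator, args=(coord, counters, i, ITERATIONS))
    for i in range(len(counters))
]
for t in threads:
    t.start()
observations = [coord.stop_the_world(observe) for _ in range(3)]
for t in threads:
    t.join()
```

The point of the design is that a mutator only ever stops inside `poll()`, a place it chose, so it cannot be frozen while holding a lock the collector needs.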
        
       | ot wrote:
       | On Linux you'd do this by sending a signal to the thread you want
       | to analyze, and then the signal handler would take the stack
       | trace and send it back to the watchdog.
       | 
       | The tricky part is ensuring that the signal handler code is
       | async-signal-safe (which pretty much boils down to "ensure you're
       | not acquiring any locks and be careful about reentrant code"),
       | but at least that only has to be verified for a self-contained
       | small function.
       | 
       | Is there anything similar to signals on Windows?
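
A small sketch of the signal-based approach, assuming a Unix system and CPython. The names `hung_ui_loop`, `watchdog`, and `on_sigusr1` are hypothetical. Note one difference from the C case described above: CPython runs Python-level handlers in the main thread between bytecodes, so the handler here is not in a true async-signal context. It still avoids taking ordinary locks (`queue.SimpleQueue.put` is documented as reentrant) in case it interrupts code that holds one.

```python
import queue
import signal
import threading
import time
import traceback

traces = queue.SimpleQueue()  # put() is safe to call from a signal handler


def on_sigusr1(signum, frame):
    # 'frame' is the main thread's frame at the moment it was interrupted.
    traces.put(traceback.format_stack(frame))


def watchdog(target_ident, done, result):
    try:
        time.sleep(0.05)  # pretend we noticed the UI thread is unresponsive
        signal.pthread_kill(target_ident, signal.SIGUSR1)
        result.append(traces.get(timeout=5))
    finally:
        done.set()  # let the "hung" thread go either way


def hung_ui_loop(done):
    # Stand-in for a UI thread stuck in some long-running call.
    while not done.is_set():
        time.sleep(0.01)


signal.signal(signal.SIGUSR1, on_sigusr1)
done = threading.Event()
result = []
dog = threading.Thread(
    target=watchdog, args=(threading.main_thread().ident, done, result)
)
dog.start()
hung_ui_loop(done)
dog.join()
stack_text = "".join(result[0]) if result else ""
```

The target thread reports on itself and is never suspended by another thread, which is why this avoids the deadlock from the article.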
        
         | dblohm7 wrote:
         | The closest thing is a special APC enqueued via QueueUserAPC2
         | [1], but that's relatively new functionality in user-mode.
         | 
         | [1] https://learn.microsoft.com/en-us/windows/win32/api/processt...
        
           | jvert wrote:
           | Or SetThreadContext() if you want to be hardcore. (not
           | recommended)
        
       | zavec wrote:
       | I knew from seeing a title like that on microsoft.com that it was
       | going to be a Raymond Chen post! He writes fascinating stuff.
        
         | eyelidlessness wrote:
         | I thought the same thing. It's usually content that's well
         | outside my areas of familiarity, often even outside my areas of
         | interest. But I usually find _his writing_ interesting enough
         | to read through anyway, and clear enough that I can usually
         | follow it even without familiarity with the subject matter.
        
       | pitterpatter wrote:
       | Reminds me of a hang in the Settings UI that was because it would
       | get stuck on an RPC call to some service.
       | 
       | Why was the service holding things up? Because it was waiting on
       | acquiring a lock held by one of its other threads.
       | 
       | What was that other thread doing? It was deadlocked because it
       | tried to recursively acquire an exclusive srwlock (exactly what
       | the docs say will happen if you try).
       | 
       | Why was it even trying to reacquire said lock? Ultimately because
       | of a buffer overrun that ended up overwriting some important
       | structures.
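
The self-deadlock in the middle of that chain is easy to reproduce with any non-reentrant lock. The sketch below uses Python's `threading.Lock` as an analogy for an exclusive SRW lock, not the Windows API itself; `try_recursive_acquire` is a hypothetical helper, and the timeout exists only so the demo returns instead of hanging forever.

```python
import threading


def try_recursive_acquire(lock, timeout=0.2):
    """Hold 'lock', then try to take it again from the same thread."""
    with lock:
        got_it = lock.acquire(timeout=timeout)
        if got_it:
            lock.release()
        return got_it


# Non-reentrant lock: the second acquire can never succeed while we hold it.
plain_result = try_recursive_acquire(threading.Lock())

# Reentrant lock: the owning thread may acquire it again.
reentrant_result = try_recursive_acquire(threading.RLock())
```

Without the timeout, the first call would block forever, and every other thread waiting on that lock would be stuck behind it.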
        
       | rat87 wrote:
       | Reminds me of a bug that would bluescreen Windows when I stopped
       | debugging in Visual Studio while it was in the middle of calling
       | the native Ping from C#.
        
         | bob1029 wrote:
         | I've been able to get managed code to BSOD my machine by simply
         | having a lot of thread instances that are aggressively
         | communicating with each other (i.e., via Channel<T>). It's
         | probably more of a hardware thing than a software thing. My
         | Spotify fails to keep the audio buffer filled when I've got it
         | fully saturated. I feel like the kernel occasionally panics
         | when something doesn't resolve fast enough with regard to
         | threads across core complexes.
        
       | brcmthrowaway wrote:
       | Can this happen with Grand Central Dispatch ?
        
         | immibis wrote:
         | did... did you understand what the bug was?
        
       | markus_zhang wrote:
       | Although I understand nothing from these posts, reading Raymond's
       | posts somehow always calms my inner struggles.
       | 
       | Just curious, is this customer a game studio? I have never done
       | any serious system programming but the gist feels like one.
        
         | ajkjk wrote:
         | I would guess it's something corporate. They can afford to
         | pause the UI and ship debugging traces home more than a
         | real-time game might.
        
           | delusional wrote:
           | I'd actually expect a customer facing program more. Corporate
           | software wouldn't care that the UI hung, you're getting paid
           | to sit there and look at it.
        
             | tedunangst wrote:
             | The banker trying to close a deal isn't paid by the hour.
        
             | immibis wrote:
             | Unless the user's boss complained to the programmer's boss
        
       | boxed wrote:
       | I had a support issue once at a well known and big US defense
       | firm. We got kernel hangs consistently in kernel space from
       | normal user-level code. Crazy shit. I opened a support issue
       | which eventually got closed because we used an old compiler. Fun
       | times.
        
       | makz wrote:
       | Looking at the title, at first I thought "uh?", but then I saw
       | microsoft and it made sense.
        
       | frabona wrote:
       | Such a clean breakdown. "Don't suspend your own threads" should
       | be tattooed on every Windows dev's arm at this point
        
       ___________________________________________________________________
       (page generated 2025-04-15 23:00 UTC)