[HN Gopher] The case of the UI thread that hung in a kernel call
       ___________________________________________________________________
        
       The case of the UI thread that hung in a kernel call
        
       Author : luu
       Score  : 136 points
       Date   : 2025-04-15 17:13 UTC (23 hours ago)
        
 (HTM) web link (devblogs.microsoft.com)
 (TXT) w3m dump (devblogs.microsoft.com)
        
       | simscitizen wrote:
       | Oh I've debugged this before. Native memory allocator had a
       | scavenge function which suspended all other threads. Managed
       | language runtime had a stop the world phase which suspended all
       | mutator threads. They ran at about the same time and ended up
       | suspending each other. To fix this you need to enforce some sort
       | of hierarchy or mutual exclusion for suspension requests.
       | 
       | > Why you should never suspend a thread in your own process.
       | 
       | This sounds like a good general principle but suspending threads
       | in your own process is kind of necessary for e.g. many GC
       | algorithms. Now imagine multiple of those runtimes running in the
       | same process.
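
One way to picture the "mutual exclusion for suspension requests" fix described above is a single process-wide lock that every component must hold before it suspends anything. The following is a minimal Python simulation, not code from either runtime: `SUSPENSION_LOCK`, `stop_the_world`, and `run_demo` are hypothetical names, and the actual suspend and resume steps are only logged.

```python
import threading

# Hypothetical process-wide lock. Every component that wants to suspend
# other threads (allocator scavenger, GC, ...) must hold it first.
SUSPENSION_LOCK = threading.Lock()

_stats_lock = threading.Lock()
_active = 0        # suspenders currently inside their stop-the-world phase
_max_overlap = 0   # highest value _active ever reached


def stop_the_world(name, log):
    """Simulated suspend/work/resume cycle, serialized by SUSPENSION_LOCK."""
    global _active, _max_overlap
    with SUSPENSION_LOCK:
        with _stats_lock:
            _active += 1
            _max_overlap = max(_max_overlap, _active)
        log.append(f"{name}: suspend others")
        # ... scavenge or collect here while the other threads are stopped ...
        log.append(f"{name}: resume others")
        with _stats_lock:
            _active -= 1


def run_demo():
    """Run two competing suspenders and report whether they ever overlapped."""
    log = []
    workers = [
        threading.Thread(target=stop_the_world, args=(name, log))
        for name in ("allocator", "gc")
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return log, _max_overlap
```

The hard part in practice is that independent runtimes in one process would all have to agree on the same lock, which this sketch simply assumes.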
        
         | hyperpape wrote:
         | > suspending threads in your own process is kind of necessary
         | for e.g. many GC algorithms
         | 
         | I think this is typically done by having the compiler/runtime
         | insert safepoints, which cooperatively yield at specified
         | points to allow the GC to run without mutator threads being
         | active. Done correctly, this shouldn't be subject to the
         | problem the original post highlighted, because it doesn't rely
         | on the OS's ability to suspend threads when they aren't
         | expecting it.
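
The cooperative safepoint idea described above can be sketched in a few lines. This is an illustrative Python model only: `SafepointCoordinator`, `poll`, `retire`, and `run_demo` are hypothetical names, and a real runtime would have the compiler emit the poll rather than have the program call it by hand.

```python
import threading
import time


class SafepointCoordinator:
    """Mutators park themselves at poll(); nothing is suspended from outside."""

    def __init__(self, mutator_count):
        self._cond = threading.Condition()
        self._stop_requested = False
        self._parked = 0
        self._mutators = mutator_count

    def poll(self):
        """Safepoint: returns at once unless a stop has been requested."""
        with self._cond:
            if not self._stop_requested:
                return
            self._parked += 1
            self._cond.notify_all()
            while self._stop_requested:
                self._cond.wait()
            self._parked -= 1

    def retire(self):
        """Called by a mutator that is about to exit."""
        with self._cond:
            self._mutators -= 1
            self._cond.notify_all()

    def stop_the_world(self, work):
        """Wait until every live mutator is parked, then run work()."""
        with self._cond:
            self._stop_requested = True
            while self._parked < self._mutators:
                self._cond.wait()
            try:
                return work()
            finally:
                self._stop_requested = False
                self._cond.notify_all()


def run_demo(mutators=3, iterations=2000):
    coord = SafepointCoordinator(mutators)
    counters = [0] * mutators

    def mutate(i):
        for _ in range(iterations):
            counters[i] += 1
            coord.poll()
        coord.retire()

    def snapshot():
        before = list(counters)
        time.sleep(0.01)  # a mutator that was not parked would advance here
        return before == list(counters)

    threads = [threading.Thread(target=mutate, args=(i,)) for i in range(mutators)]
    for t in threads:
        t.start()
    stable = coord.stop_the_world(snapshot)
    for t in threads:
        t.join()
    return stable, counters
```

Because each mutator stops only inside `poll()`, it can never be frozen while holding a lock the collector needs, provided it does not call `poll()` with such a lock held.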
        
           | achierius wrote:
           | This is a good approach but can be tricky. E.g. what if your
           | thread spends a lot of time in a tight loop, e.g. doing a big
           | inlined matmul kernel? Since you never hit a function call
           | you don't get safepoints that way -- you can add them to the
           | back-edge of every loop, but that can be a bit unappetizing
           | from a performance perspective.
        
             | chipsa wrote:
             | If you don't create any GC-able objects in the loop, why
             | would you need to call the GC? And if you are, that should
             | involve a function call.
             | 
             | And if you do need to call the GC, you could manually
             | insert function calls every x loop iterations.
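
A rough sketch of the "poll every x iterations" compromise discussed above, in Python for illustration. `sum_squares`, `poll`, and `poll_interval` are hypothetical names; `poll` stands in for whatever safepoint check the runtime would use.

```python
def sum_squares(n, poll, poll_interval=1024):
    """Tight loop that reaches a safepoint only once per poll_interval iterations."""
    total = 0
    for i in range(n):
        total += i * i
        # Amortized back-edge check: one cheap test per iteration,
        # one real safepoint call per poll_interval iterations.
        if i % poll_interval == poll_interval - 1:
            poll()
    return total
```

The interval trades loop overhead against how long a stop request might wait for this thread to arrive at a safepoint.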
        
         | MarkSweep wrote:
         | > suspending threads in your own process is kind of necessary
         | for e.g. many GC algorithms
         | 
         | True. Maybe the more precise rule is "only suspend threads for
         | a short amount of time and don't acquire any locks while doing
         | it"?
         | 
         | The way the .NET runtime follows this rule is it only suspends
         | threads for a very short time. After suspending, the thread is
         | immediately resumed if it is not running managed code (in a random
         | native library or syscall). If the thread is running managed
         | code, the thread is hijacked by replacing either the
         | instruction pointer or the return address with the address of
         | a function that will wait for the GC to finish. The thread is
         | then immediately resumed. See the details here:
         | 
         | https://github.com/dotnet/runtime/blob/main/docs/design/core...
         | 
         | > Now imagine multiple of those runtimes running in the same
         | process.
         | 
         | Can that possibly reliably work? Sounds messy.
        
       | ot wrote:
       | On Linux you'd do this by sending a signal to the thread you want
       | to analyze, and then the signal handler would take the stack
       | trace and send it back to the watchdog.
       | 
       | The tricky part is ensuring that the signal handler code is
       | async-signal-safe (which pretty much boils down to "ensure you're
       | not acquiring any locks and be careful about reentrant code"),
       | but at least that only has to be verified for a self-contained
       | small function.
       | 
       | Is there anything similar to signals on Windows?
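
The signal-based capture described above can be approximated in Python on Unix-like systems. This is only a sketch under stated assumptions: CPython runs signal handlers in the main thread, so it captures the main thread's stack, and it does not model the async-signal-safety constraints a C handler would face. `on_sigusr1`, `busy_loop`, and `capture_main_thread_stack` are hypothetical names.

```python
import signal
import threading
import time
import traceback

captured = []  # stack traces recorded by the handler


def on_sigusr1(signum, frame):
    # `frame` is the Python frame that was executing when the signal arrived.
    captured.append([f.name for f in traceback.extract_stack(frame)])


def busy_loop(deadline):
    """Stand-in for the work being watched; spins until a trace is captured."""
    while not captured and time.monotonic() < deadline:
        pass


def capture_main_thread_stack(timeout=5.0):
    """A watchdog thread signals the main thread; the handler records its stack."""
    captured.clear()
    previous = signal.signal(signal.SIGUSR1, on_sigusr1)
    main_id = threading.main_thread().ident
    watchdog = threading.Thread(
        target=lambda: signal.pthread_kill(main_id, signal.SIGUSR1)
    )
    try:
        watchdog.start()
        busy_loop(time.monotonic() + timeout)
        watchdog.join()
    finally:
        signal.signal(signal.SIGUSR1, previous)
    return captured[0] if captured else None
```

Note that the watched thread is never suspended from outside here: it records its own stack and carries on.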
        
         | dblohm7 wrote:
         | The closest thing is a special APC enqueued via QueueUserAPC2
         | [1], but that's relatively new functionality in user-mode.
         | 
         | [1] https://learn.microsoft.com/en-us/windows/win32/api/processt...
        
           | jvert wrote:
           | Or SetThreadContext() if you want to be hardcore. (not
           | recommended)
        
             | manwe150 wrote:
             | Why not recommended? As far as things close to signals go,
             | this is how you implement signals in user land on Windows
             | (along with pause/resume thread). You can even take locks
             | later during the process, as long as you also took them
             | before sending the signal (same exact restrictions as fork
             | actually, but unfortunately atfork hooks are not accessible
             | and often full of fork-unsafe data race and deadlock
             | implementation bugs themselves in my experience with all
             | the popular libc)
        
           | dwattttt wrote:
           | The 2 implies an older API: its predecessor, QueueUserAPC, has
           | been around since the XP days.
           | 
           | The older API is less like signals and more like cooperative
           | scheduling in that it waits for the target thread to be in an
           | "alertable" state before it runs (the thread executes a sleep
           | or a wait for something)
        
       | zavec wrote:
       | I knew from seeing a title like that on microsoft.com that it was
       | going to be a Raymond Chen post! He writes fascinating stuff.
        
         | eyelidlessness wrote:
         | I thought the same thing. It's usually content that's well
         | outside my areas of familiarity, often even outside my areas of
         | interest. But I usually find _his writing_ interesting enough
         | to read through anyway, and clear enough that I can usually
         | follow it even without familiarity with the subject matter.
        
         | billforsternz wrote:
         | I had the same thought too. I wonder if this is his role at
         | Microsoft now? Kind of a human institutional knowledge
         | repository, plus a kind of brand ambassador to the developer
         | community, plus mentor to younger engineers, plus chronicler.
         | 
         | I hope he keeps going, no doubt he could choose to finish up
         | whenever he wants to.
        
         | ryao wrote:
         | I had the same thought. I imagine the percentage of hacker news
         | links to microsoft.com that are Raymond Chen links is high.
        
       | pitterpatter wrote:
       | Reminds me of a hang in the Settings UI that was because it would
       | get stuck on an RPC call to some service.
       | 
       | Why was the service holding things up? Because it was waiting on
       | acquiring a lock held by one of its other threads.
       | 
       | What was that other thread doing? It was deadlocked because it
       | tried to recursively acquire an exclusive srwlock (exactly what
       | the docs say will happen if you try).
       | 
       | Why was it even trying to reacquire said lock? Ultimately because
       | of a buffer overrun that ended up overwriting some important
       | structures.
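
The self-deadlock on a non-recursive exclusive lock is easy to reproduce in miniature. Python's `threading.Lock` is not an SRW lock, so treat this as an analogy only; `try_recursive_acquire` is a hypothetical helper, and a timeout stands in for the hang.

```python
import threading


def try_recursive_acquire(lock, timeout=0.1):
    """Acquire `lock`, then try to acquire it again from the same thread.

    Returns True if the second acquire succeeds, False if it times out
    (which, without the timeout, would be a permanent hang).
    """
    with lock:
        got_it = lock.acquire(timeout=timeout)
        if got_it:
            lock.release()
        return got_it
```

With a plain `threading.Lock()` the inner acquire times out, while a reentrant `threading.RLock()` lets it through.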
        
       | rat87 wrote:
       | Reminds me of a bug that would bluescreen Windows if I stopped
       | Visual Studio debugging while it was in the middle of calling the
       | native Ping from C#
        
         | bob1029 wrote:
         | I've been able to get managed code to BSOD my machine by simply
         | having a lot of thread instances that are aggressively
         | communicating with each other (i.e., via Channel<T>). It's
         | probably more of a hardware thing than a software thing. My
         | Spotify fails to keep the audio buffer filled when I've got it
         | fully saturated. I feel like the kernel occasionally panics
         | when something doesn't resolve fast enough with regard to
         | threads across core complexes.
        
       | brcmthrowaway wrote:
       | Can this happen with Grand Central Dispatch ?
        
         | immibis wrote:
         | did... did you understand what the bug was?
        
         | saagarjha wrote:
         | This is a complicated question. If you "suspend" a GCD queue
         | using the traditional APIs then it will happen between block
         | execution, which is unlikely to cause problems, because people
         | do not typically take locks between different items. But if you
         | suspend the thread that backs the queue (using thread_suspend)
         | you will definitely run into problems unless you're really
         | careful.
        
       | markus_zhang wrote:
       | Although I understand nothing from these posts, reading Raymond's
       | posts somehow always "tranquilizes" my inner struggles.
       | 
       | Just curious, is this customer a game studio? I have never done
       | any serious system programming but the gist feels like one.
        
         | ajkjk wrote:
         | I would guess it's something corporate. They can afford to
           | pause the UI and ship debugging traces home more than a
           | real-time game might.
        
           | delusional wrote:
           | I'd actually expect a customer-facing program more. Corporate
           | software wouldn't care that the UI hung, you're getting paid
           | to sit there and look at it.
        
             | tedunangst wrote:
             | The banker trying to close a deal isn't paid by the hour.
        
             | immibis wrote:
             | Unless the user's boss complained to the programmer's boss
        
             | skissane wrote:
             | > Corporate software wouldn't care that the UI hung, you're
             | getting paid to sit there and look at it.
             | 
             | The article says the thread had been hung for 5 hours. And
             | if you understand the root cause, once it entered into the
             | hung state, then absent some rather dramatic intervention
             | (e.g. manually resuming the suspended UI thread), it would
             | remain hung indefinitely.
             | 
             | The proper solution, as Raymond Chen notes, is to move the
             | monitoring thread into a separate process, which would avoid
             | this deadlock.
        
           | saagarjha wrote:
           | Suspending threads is generally not that expensive,
           | especially if you don't do it very often. Like, it's not
           | free, and don't do it every frame, but even if it takes
           | a millisecond (wildly overestimated) that's fine if you don't
           | do it very often. Even if you're hitting a 120 Hz deadline.
        
       | boxed wrote:
       | I had a support issue once at a well known and big US defense
       | firm. We got kernel hangs consistently in kernel space from
       | normal user-level code. Crazy shit. I opened a support issue
       | which eventually got closed because we used an old compiler. Fun
       | times.
        
         | saagarjha wrote:
         | An old compiler that was...miscompiling the kernel? It's hard
         | to imagine any other situation that would be a valid reason to
         | close the bug.
        
       | makz wrote:
       | Looking at the title, at first I thought "uh?", but then I saw
       | microsoft and it made sense.
        
       | frabona wrote:
       | Such a clean breakdown. "Don't suspend your own threads" should
       | be tattooed on every Windows dev's arm at this point
        
       | baruchthescribe wrote:
       | >Naturally, a suspended UI thread is going to manifest itself as
       | a hang.
       | 
       | The correct terminology is 'stopped responding' Raymond. You need
       | to consult the style guide.
        
       | ryao wrote:
       | Who are these customers that get developer support from Microsoft
       | engineering teams?
        
         | zoogeny wrote:
         | I worked on a team that did. We had a monthly call with a MS
         | rep and access to devs working on the platform features we were
         | working on (for MS Teams specifically). It is probably more
         | common than you think.
        
         | tgv wrote:
         | I worked for a small shop that provided something MS
         | couldn't/wouldn't, but which was essential for their
         | international business anyway. So we too had engineering
         | support.
        
         | qingcharles wrote:
         | It's expensive. Really expensive. I remember a major bank
         | calling me and my buddy's 2-man consultancy team and telling me
         | they had spent a small fortune on whatever the top-level access
         | to MS developers is, to get some outdated MS COM component to
         | interface with .NET, and MS had failed.
         | 
         | (We charged ~$20K and estimated two weeks. We had it working in
         | two hours.)
        
           | Robin_Message wrote:
           | I gotta ask, did you spend a week sucking your teeth after
           | that, or did you hand it to them and say "hey, you're paying
           | for expertise _and_ we got it to you faster than we
           | estimated"?
        
             | orthoxerox wrote:
             | The correct way is to send the customer the almost-final
             | version and wait for the bug report. This way you show how
             | quickly you can tackle the problem but don't make the task
             | look too easy.
        
         | mikaraento wrote:
         | I remember being able to file support cases just by buying one
         | for a couple of hundred dollars. They'd also promise that if it
         | turned out to be a bug in the product the fee would be
         | refunded.
         | 
         | (My case wasn't solved. It was something about variable delays
         | in getting packets off the network and into userspace but we
         | never got to the bottom of it).
        
       | saagarjha wrote:
       | > If you want to suspend a thread and capture stacks from it,
       | you'll have to do it from another process, so that you don't
       | deadlock with the thread you suspended.
       | 
       | Unfortunately sometimes you don't have the luxury of being able
       | to do this (e.g. on iOS, especially pre-MetricKit). We shipped
       | one such implementation in the Twitter app (which was still there
       | last I checked) and as far as I can tell it's safe but mostly by
       | accident: I didn't want to pause things for very long, so the
       | code just suspends the thread, grabs register state, then writes
       | the backtrace to a stack buffer before resuming. I originally
       | wanted to grab traces without suspending the process, which is
       | something you can actually "do" because getting register state
       | doesn't require suspension and you need to put guards on your
       | frame decoding anyway ("is this address I am about to dereference
       | actually in the stack?"). But unfortunately after thinking about
       | it I added the suspension back because trying to collect a trace
       | from a running thread could give you a fragmented backtrace as it
       | modifies it out from under you.
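
The "guards on your frame decoding" mentioned above might look roughly like this. It is a Python model over a byte snapshot, not the code that shipped: `walk_frames` is a hypothetical name, and the layout (each frame stores the saved frame pointer followed by the return address, 8 bytes each, little-endian, stack growing downward) is an assumption chosen for illustration.

```python
import struct

WORD = 8  # assumed pointer size in bytes


def walk_frames(memory, base, fp, max_frames=64):
    """Walk a frame-pointer chain inside `memory`, a snapshot of the stack
    region that starts at address `base`. Returns the return addresses found.
    """
    lo, hi = base, base + len(memory)
    trace = []
    while len(trace) < max_frames:
        # Guard: both words we are about to read must be aligned and
        # lie entirely inside the known stack bounds.
        if fp % WORD != 0 or fp < lo or fp + 2 * WORD > hi:
            break
        saved_fp, return_address = struct.unpack_from("<QQ", memory, fp - base)
        trace.append(return_address)
        # Guard: callers live at higher addresses; anything else means the
        # chain ended or was corrupted, and following it could loop forever.
        if saved_fp <= fp:
            break
        fp = saved_fp
    return trace
```

These checks keep the walker from dereferencing garbage, but as the comment notes they cannot make a trace from a running thread consistent, which is why the suspension went back in.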
        
       | Permik wrote:
       | I have the weirdest hunch that the customer in question was Valve
       | :D
        
       ___________________________________________________________________
       (page generated 2025-04-16 17:03 UTC)