[HN Gopher] The case of the UI thread that hung in a kernel call
___________________________________________________________________
The case of the UI thread that hung in a kernel call
Author : luu
Score : 136 points
Date : 2025-04-15 17:13 UTC (23 hours ago)
(HTM) web link (devblogs.microsoft.com)
(TXT) w3m dump (devblogs.microsoft.com)
| simscitizen wrote:
| Oh I've debugged this before. Native memory allocator had a
| scavenge function which suspended all other threads. Managed
| language runtime had a stop the world phase which suspended all
| mutator threads. They ran at about the same time and ended up
| suspending each other. To fix this you need to enforce some sort
| of hierarchy or mutual exclusion for suspension requests.
|
| > Why you should never suspend a thread in your own process.
|
| This sounds like a good general principle but suspending threads
| in your own process is kind of necessary for e.g. many GC
| algorithms. Now imagine multiple of those runtimes running in the
| same process.
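|
| A minimal sketch of the "mutual exclusion for suspension requests"
| idea, written as a Python model rather than real thread suspension.
| The names stop_the_world and _suspension_lock are hypothetical; the
| only point is that every component takes one process-wide lock
| before it starts suspending anything, so two stop-the-world phases
| cannot overlap:

```python
import threading
from contextlib import contextmanager

# Hypothetical process-wide arbiter: anything that wants to suspend threads
# must hold this lock first, so two "stop the world" phases never overlap
# and therefore can never suspend each other.
_suspension_lock = threading.Lock()
_active = 0       # suspenders currently inside a stop-the-world phase
max_active = 0    # high-water mark, kept only for the demonstration


@contextmanager
def stop_the_world(owner: str):
    global _active, max_active
    with _suspension_lock:
        _active += 1
        max_active = max(max_active, _active)
        try:
            # A real runtime would suspend, inspect, and resume threads here.
            yield owner
        finally:
            _active -= 1


def run_demo(rounds: int = 200) -> int:
    """Run two competing suspenders and report the peak overlap observed."""
    def suspender(owner: str) -> None:
        for _ in range(rounds):
            with stop_the_world(owner):
                pass

    threads = [threading.Thread(target=suspender, args=(name,))
               for name in ("allocator-scavenger", "managed-gc")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max_active
```

| This only helps if every suspender in the process agrees to use the
| same lock, which is exactly the hard part when the runtimes come
| from different vendors.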
| hyperpape wrote:
| > suspending threads in your own process is kind of necessary
| for e.g. many GC algorithms
|
| I think this is typically done by having the compiler/runtime
| insert safepoints, which cooperatively yield at specified
| points to allow the GC to run without mutator threads being
| active. Done correctly, this shouldn't be subject to the
| problem the original post highlighted, because it doesn't rely
| on the OS's ability to suspend threads when they aren't
| expecting it.
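|
| A rough Python model of cooperative safepoints, under the
| assumption that the compiler has already inserted safepoint() calls
| at places where a thread's state is consistent. SafepointCoordinator
| and its methods are hypothetical names. Real runtimes poll with a
| much cheaper check than taking a lock:

```python
import threading


class SafepointCoordinator:
    """Cooperative stop-the-world: mutators park themselves at safepoints."""

    def __init__(self, mutator_count: int):
        self._cond = threading.Condition()
        self._mutators = mutator_count   # mutators that have not exited yet
        self._parked = 0                 # mutators waiting at a safepoint
        self._stop_requested = False

    def safepoint(self) -> None:
        """Called by mutators only where their state is consistent."""
        with self._cond:
            if not self._stop_requested:
                return
            self._parked += 1
            self._cond.notify_all()      # tell the collector we are parked
            while self._stop_requested:
                self._cond.wait()
            self._parked -= 1

    def mutator_exit(self) -> None:
        with self._cond:
            self._mutators -= 1
            self._cond.notify_all()

    def collect(self, gc_work):
        """Request a stop, wait until every live mutator parks, run gc_work."""
        with self._cond:
            self._stop_requested = True
            while self._parked < self._mutators:
                self._cond.wait()
            try:
                return gc_work()
            finally:
                self._stop_requested = False
                self._cond.notify_all()  # release the parked mutators


def run_demo(mutators: int = 4, steps: int = 2000) -> bool:
    coord = SafepointCoordinator(mutators)
    counters = [0] * mutators

    def mutate(i: int) -> None:
        for _ in range(steps):
            counters[i] += 1     # stand-in for mutator work
            coord.safepoint()    # poll assumed to be inserted by the compiler
        coord.mutator_exit()

    threads = [threading.Thread(target=mutate, args=(i,))
               for i in range(mutators)]
    for t in threads:
        t.start()

    def gc_work() -> bool:
        before = list(counters)
        threading.Event().wait(0.01)   # nothing should move while we "collect"
        return before == counters

    quiescent = all(coord.collect(gc_work) for _ in range(5))
    for t in threads:
        t.join()
    return quiescent
```

| No thread is ever stopped at an arbitrary instruction here, so a
| parked mutator cannot be holding a lock the collector needs unless
| the safepoint was placed inside a critical section.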
| achierius wrote:
| This is a good approach but can be tricky. E.g. what if your
| thread spends a lot of time in a tight loop, e.g. doing a big
| inlined matmul kernel? Since you never hit a function call
| you don't get safepoints that way -- you can add them to the
| back-edge of every loop, but that can be a bit unappetizing
| from a performance perspective.
| chipsa wrote:
| If you don't create any GC-able objects in the loop, why
| would you need to call the GC? And if you are, that should
| involve a function call.
|
| And if you do need to call the GC, you could manually
| insert function calls every x loop iterations.
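|
| As a sketch of the "every x loop iterations" suggestion, the poll
| can be amortized so the hot loop pays for one modulo check per
| iteration and one call per poll_every iterations. The function name
| and the poll_every default are illustrative assumptions:

```python
from typing import Callable, Iterable


def sum_with_safepoints(values: Iterable[float],
                        safepoint: Callable[[], None],
                        poll_every: int = 1024) -> float:
    """Sum values, calling safepoint() once per poll_every iterations."""
    total = 0.0
    for i, v in enumerate(values, start=1):
        total += v                  # stand-in for the hot inner computation
        if i % poll_every == 0:     # amortized poll on the loop back-edge
            safepoint()
    return total
```

| A larger poll_every lowers the overhead but lengthens the worst-case
| wait before the thread reaches a safepoint.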
| MarkSweep wrote:
| > suspending threads in your own process is kind of necessary
| for e.g. many GC algorithms
|
| True. Maybe the more precise rule is "only suspend threads for
| a short amount of time and don't acquire any locks while doing
| it"?
|
| The way the .NET runtime follows this rule is it only suspends
| threads for a very short time. After suspending, the thread is
| immediately resumed if it is not running managed code (in a random
| native library or syscall). If the thread is running managed
| code, the thread is hijacked by replacing either the
| instruction pointer or the return address with the address of
| a function that will wait for the GC to finish. The thread is
| then immediately resumed. See the details here:
|
| https://github.com/dotnet/runtime/blob/main/docs/design/core...
|
| > Now imagine multiple of those runtimes running in the same
| process.
|
| Can that possibly reliably work? Sounds messy.
| ot wrote:
| On Linux you'd do this by sending a signal to the thread you want
| to analyze, and then the signal handler would take the stack
| trace and send it back to the watchdog.
|
| The tricky part is ensuring that the signal handler code is
| async-signal-safe (which pretty much boils down to "ensure you're
| not acquiring any locks and be careful about reentrant code"),
| but at least that only has to be verified for a self-contained
| small function.
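|
| A Unix-only Python analog of that pattern, assuming the thread
| being watched is the main thread. The helper names are
| hypothetical. CPython sidesteps most async-signal-safety concerns
| because its C-level handler only records the signal and the Python
| function runs later on the main thread; a C implementation would
| have to restrict the handler as described above:

```python
import signal
import threading
import time
import traceback

captured = []   # stack traces recorded by the handler


def _dump_stack(signum, frame):
    # `frame` is the main thread's current frame at the time of interruption.
    captured.append(traceback.format_stack(frame))


def install_stack_dump_handler() -> None:
    # Must be called from the main thread (a CPython requirement).
    signal.signal(signal.SIGUSR1, _dump_stack)


def request_main_thread_stack() -> None:
    # Called by the watchdog thread: direct the signal at the main thread.
    signal.pthread_kill(threading.main_thread().ident, signal.SIGUSR1)


def busy_ui_work(seconds: float) -> None:
    """Stand-in for a UI thread that is busy and not pumping messages."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass
```

| Unlike suspending the thread from outside, the target keeps running
| and reports on itself, so the watchdog never holds it frozen while
| it owns a lock.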
|
| Is there anything similar to signals on Windows?
| dblohm7 wrote:
| The closest thing is a special APC enqueued via QueueUserAPC2
| [1], but that's relatively new functionality in user-mode.
|
| [1] https://learn.microsoft.com/en-us/windows/win32/api/processt...
| jvert wrote:
| Or SetThreadContext() if you want to be hardcore. (not
| recommended)
| manwe150 wrote:
| Why not recommended? As far as things close to signals go,
| this is how you implement signals in user land on Windows
| (along with pause/resume thread). You can even take locks
| later during the process, as long as you also took them
| before sending the signal (same exact restrictions as fork
| actually, but unfortunately atfork hooks are not accessible
| and often full of fork-unsafe data race and deadlock
| implementation bugs themselves in my experience with all
| the popular libc)
| dwattttt wrote:
| The 2 implies an older API, its predecessor QueueUserAPC has
| been around since the XP days.
|
| The older API is less like signals and more like cooperative
| scheduling in that it waits for the target thread to be in an
| "alertable" state before it runs (the thread executes a sleep
| or a wait for something)
| zavec wrote:
| I knew from seeing a title like that on microsoft.com that it was
| going to be a Raymond Chen post! He writes fascinating stuff.
| eyelidlessness wrote:
| I thought the same thing. It's usually content that's well
| outside my areas of familiarity, often even outside my areas of
| interest. But I usually find _his writing_ interesting enough
| to read through anyway, and clear enough that I can usually
| follow it even without familiarity with the subject matter.
| billforsternz wrote:
| I had the same thought too. I wonder if this is his role at
| Microsoft now? Kind of a human institutional knowledge
| repository, plus a kind of brand ambassador to the developer
| community, plus mentor to younger engineers, plus chronicler.
|
| I hope he keeps going, no doubt he could choose to finish up
| whenever he wants to.
| ryao wrote:
| I had the same thought. I imagine the percentage of Hacker News
| links to microsoft.com that are Raymond Chen links is high.
| pitterpatter wrote:
| Reminds me of a hang in the Settings UI that was because it would
| get stuck on an RPC call to some service.
|
| Why was the service holding things up? Because it was waiting on
| acquiring a lock held by one of its other threads.
|
| What was that other thread doing? It was deadlocked because it
| tried to recursively acquire an exclusive srwlock (exactly what
| the docs say will happen if you try).
|
| Why was it even trying to reacquire said lock? Ultimately because
| of a buffer overrun that ended up overwriting some important
| structures.
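|
| The recursive-acquire step can be illustrated with an analog:
| Python's threading.Lock is also non-reentrant, so a second acquire
| from the owning thread never succeeds. This is not the SRW lock API;
| the timeout below stands in for what would otherwise be an
| indefinite hang, and try_recursive_acquire is a hypothetical helper:

```python
import threading


def try_recursive_acquire(lock, timeout: float = 0.1) -> bool:
    """Acquire `lock`, then try to acquire it again from the same thread."""
    with lock:
        got_it_again = lock.acquire(timeout=timeout)
        if got_it_again:
            lock.release()   # undo the nested acquire
        return got_it_again
```

| With threading.Lock() the nested acquire times out; with
| threading.RLock(), which tracks its owner, it succeeds.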
| rat87 wrote:
| Reminds me of a bug that would bluescreen Windows if I stopped
| Visual Studio debugging if it was in the middle of calling the
| native Ping from C#
| bob1029 wrote:
| I've been able to get managed code to BSOD my machine by simply
| having a lot of thread instances that are aggressively
| communicating with each other (i.e., via Channel<T>). It's
| probably more of a hardware thing than a software thing. My
| Spotify fails to keep the audio buffer filled when I've got it
| fully saturated. I feel like the kernel occasionally panics
| when something doesn't resolve fast enough with regard to
| threads across core complexes.
| brcmthrowaway wrote:
| Can this happen with Grand Central Dispatch ?
| immibis wrote:
| did... did you understand what the bug was?
| saagarjha wrote:
| This is a complicated question. If you "suspend" a GCD queue
| using the traditional APIs then it will happen between block
| execution, which is unlikely to cause problems, because people
| do not typically take locks between different items. But if you
| suspend the thread that backs the queue (using thread_suspend)
| you will definitely run into problems unless you're really
| careful.
| markus_zhang wrote:
| Although I understand nothing from these posts, reading Raymond's
| posts somehow always "tranquils" my inner struggles.
|
| Just curious, is this customer a game studio? I have never done
| any serious system programming but the gist feels like one.
| ajkjk wrote:
| I would guess it's something corporate. They can afford to
| pause the UI and ship debugging traces home more than a real-
| time game might.
| delusional wrote:
| I'd actually expect a customer-facing program more. Corporate
| software wouldn't care that the UI hung, you're getting paid
| to sit there and look at it.
| tedunangst wrote:
| The banker trying to close a deal isn't paid by the hour.
| immibis wrote:
| Unless the user's boss complained to the programmer's boss
| skissane wrote:
| > Corporate software wouldn't care that the UI hung, you're
| getting paid to sit there and look at it.
|
| The article says the thread had been hung for 5 hours. And
| if you understand the root cause, once it entered into the
| hung state, then absent some rather dramatic intervention
| (e.g. manually resuming the suspended UI thread), it would
| remain hung indefinitely.
|
| The proper solution, as Raymond Chen notes, is to move the
| monitoring thread into a separate process, that would avoid
| this deadlock.
| saagarjha wrote:
| Suspending threads is generally not that expensive,
| especially if you don't do it very often. Like, it's not
| free, and don't do it every frame, but even if it takes a
| millisecond (wildly overestimated) that's fine if you don't
| do it very often. Even if you're hitting a 120 Hz deadline.
| boxed wrote:
| I had a support issue once at a well known and big US defense
| firm. We got kernel hangs consistently in kernel space from
| normal user-level code. Crazy shit. I opened a support issue
| which eventually got closed because we used an old compiler. Fun
| times.
| saagarjha wrote:
| An old compiler that was...miscompiling the kernel? It's hard
| to imagine any other situation that would be a valid reason to
| close the bug.
| makz wrote:
| Looking at the title, at first I thought "uh?", but then I saw
| microsoft and it made sense.
| frabona wrote:
| Such a clean breakdown. "Don't suspend your own threads" should
| be tattooed on every Windows dev's arm at this point
| baruchthescribe wrote:
| >Naturally, a suspended UI thread is going to manifest itself as
| a hang.
|
| The correct terminology is 'stopped responding' Raymond. You need
| to consult the style guide.
| ryao wrote:
| Who are these customers that get developer support from Microsoft
| engineering teams?
| zoogeny wrote:
| I worked on a team that did. We had a monthly call with a MS
| rep and access to devs working on the platform features we were
| working on (for MS Teams specifically). It is probably more
| common than you think.
| tgv wrote:
| I worked for a small shop that provided something MS
| couldn't/wouldn't, but which was essential for their
| international business anyway. So we too had engineering
| support.
| qingcharles wrote:
| It's expensive. Really expensive. I remember a major bank
| calling me and my buddy's 2-man consultancy team and telling me
| they had spent a small fortune on whatever the top-level access
| to MS developers is, to get some outdated MS COM component to
| interface with .NET, and MS had failed.
|
| (We charged ~$20K and estimated two weeks. We had it working in
| two hours.)
| Robin_Message wrote:
| I gotta ask, did you spend a week sucking your teeth after
| that, or did you hand it to them and say "hey, you're paying
| for expertise _and_ we got it to you faster than we estimated"?
| orthoxerox wrote:
| The correct way is to send the customer the almost-final
| version and wait for the bug report. This way you show how
| quickly you can tackle the problem but don't make the task
| look too easy.
| mikaraento wrote:
| I remember being able to file support cases just by buying one
| for a couple of hundred dollars. They'd also promise that if it
| turned out to be a bug in the product the fee would be
| refunded.
|
| (My case wasn't solved. It was something about variable delays
| in getting packets off the network and into userspace but we
| never got to the bottom of it).
| saagarjha wrote:
| > If you want to suspend a thread and capture stacks from it,
| you'll have to do it from another process, so that you don't
| deadlock with the thread you suspended.
|
| Unfortunately sometimes you don't have the luxury of being able
| to do this (e.g. on iOS, especially pre-MetricKit). We shipped
| one such implementation in the Twitter app (which was still there
| last I checked) and as far as I can tell it's safe but mostly by
| accident: I didn't want to pause things for very long, so the
| code just suspends the thread, grabs register state, then writes
| the backtrace to a stack buffer before resuming. I originally
| wanted to grab traces without suspending the process, which is
| something you can actually "do" because getting register state
| doesn't require suspension and you need to put guards on your
| frame decoding anyway ("is this address I am about to dereference
| actually in the stack?"). But unfortunately after thinking about
| it I added the suspension back because trying to collect a trace
| from a running thread could give you a fragmented backtrace as it
| modifies it out from under you.
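|
| A small model of the "guards on your frame decoding" idea, using a
| byte buffer in place of real thread memory. It assumes 64-bit
| little-endian frame records of (saved frame pointer, return
| address); walk_frames and make_demo_stack are hypothetical helpers,
| not the code that shipped:

```python
import struct

WORD = 8  # assumed 64-bit words


def walk_frames(stack: bytes, base_addr: int, fp: int, max_frames: int = 64):
    """Follow saved-frame-pointer links, never reading outside the stack."""
    lo, hi = base_addr, base_addr + len(stack)
    return_addrs = []
    while len(return_addrs) < max_frames:
        # Guard: the whole frame record must be aligned and inside the stack.
        if fp % WORD or fp < lo or fp + 2 * WORD > hi:
            break
        saved_fp, ret = struct.unpack_from("<QQ", stack, fp - lo)
        return_addrs.append(ret)
        # Guard: callers live at higher addresses, so a link that does not
        # increase means the chain ended, is corrupt, or changed under us.
        if saved_fp <= fp:
            break
        fp = saved_fp
    return return_addrs


def make_demo_stack(base_addr: int = 0x1000, corrupt: bool = False) -> bytes:
    """Three chained frame records; optionally corrupt the first link."""
    first_link = 0xDEADBEE8 if corrupt else base_addr + 0x10
    records = [(first_link, 0xAAA), (base_addr + 0x20, 0xBBB), (0, 0xCCC)]
    packed = b"".join(struct.pack("<QQ", fp, ret) for fp, ret in records)
    return packed + bytes(16)   # padding so the buffer is 64 bytes
```

| The guards keep the walker from faulting on a bad pointer, but they
| cannot tell a stale link from a valid one, which is why suspending
| the thread first still gives a more trustworthy trace.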
| Permik wrote:
| I have the weirdest hunch that the customer in question was Valve
| :D
___________________________________________________________________
(page generated 2025-04-16 17:03 UTC)