[HN Gopher] Debugging memory corruption: who the hell writes "2"...
       ___________________________________________________________________
        
       Debugging memory corruption: who the hell writes "2" into my stack?
       (2016)
        
       Author : pierremenard
       Score  : 295 points
       Date   : 2024-12-28 18:06 UTC (3 days ago)
        
 (HTM) web link (unity.com)
 (TXT) w3m dump (unity.com)
        
       | saagarjha wrote:
       | Scary. I assume standard memory corruption detection tools would
       | also have trouble finding this, as the write is coming from
       | outside the application itself...
        
         | loeg wrote:
         | Yeah. Not tripping a page fault on modifying readonly
         | (userspace) pages both makes it hard for userspace tools but
         | also paints a pretty specific picture of where the write is
         | coming from.
         | 
         | I'm actually not sure if Linux would handle this in the same
         | way or what. Plausibly it sees the same leaf page tables as
          | user space, trips a fault, and _doesn't_ scribble the pages
         | anyway. Maybe Windows translates the user-provided virtual
         | address to a physical address (or other kernel mapping that
         | happens to have write permission) upon registration.
        
       | rramadass wrote:
        | A story for the ages. That is some hardcore debugging involving
        | everything: user land, system calls, the kernel, disassembly, etc.
        
       | glandium wrote:
       | I wonder if Time Travel Debugging would have helped narrow it
       | down.
        
         | chris_wot wrote:
         | How? Throwing an exception would prevent this wouldn't it?
        
           | ben-schaaf wrote:
           | When the assertion on the stack sentinel was reached they
           | could have watched the value and then reverse continued,
           | which in theory would reveal the APC causing the issue - or
           | at least the instruction writing the value. Not sure how well
           | reverse debugging works on Windows though, I'm only familiar
           | with rr.
        
         | loeg wrote:
         | I don't think any reverse debugging system can step the kernel
         | backwards to this degree, unless they're doing something really
         | clever (slow) with virtual machines and snapshots.
        
           | dzaima wrote:
            | While it doesn't allow stepping in the kernel, a large part
            | of rr is indeed intercepting everything the kernel may do
            | and re-implementing its actions, recording all the changes
            | it makes to memory etc. (for Linux, of course, not Windows).
            | With that, an asynchronous kernel write would necessarily
            | end up in the recording as a statement of what the kernel
            | writes at the given point in time, which a debugger could
            | deterministically reason about. (Of course this relies on
            | the recording system reimplementing things accurately
            | enough, but that's at least possible.)
        
             | Veserv wrote:
             | You are correct. A time travel debugging solution that
             | supports recording the relevant system call side effects
             | would handle this. In fact, this system call is likely just
             | rewriting the program counter register and maybe a few
             | others, so it would likely be very easy to support if you
             | could hook the relevant kernel operations which may or may
             | not be possible in Windows.
             | 
             | The replay system would also be unlikely to pose a problem.
             | Replay systems usually just encode and replay the side
             | effects, so there is no need to "reimplement" the
             | operations. So, if you did some wacky system call, but all
             | it did is write 0x2 to a memory location, M, you
             | effectively just record: "at time T we issued a system call
             | that wrote 0x2 to M". Then, when you get to simulated time
             | T in the replay, you do not reissue the wacky system call,
             | you just write 0x2 to M and call it a day.
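The side-effect log described above can be sketched in a few lines. This is a toy illustration, not how rr or TTD is actually implemented: a byte array stands in for process memory, and the "syscall" is an ordinary function whose effects get diffed into a log.

```rust
// Toy record/replay: during recording, log each side effect as
// (time, address, bytes); during replay, apply the log instead of
// re-running the "syscall", so replay is deterministic.
#[derive(Clone)]
struct Effect {
    time: u64,
    addr: usize,
    data: Vec<u8>,
}

// Stand-in for the "wacky system call" that writes 0x2 somewhere.
fn wacky_syscall(mem: &mut [u8], addr: usize) {
    mem[addr] = 0x2;
}

fn record(mem: &mut [u8]) -> Vec<Effect> {
    let before = mem.to_vec();
    wacky_syscall(mem, 5);
    // Diff memory to capture what the call wrote. (A real recorder
    // knows each syscall's output ranges instead of diffing everything.)
    let mut log = Vec::new();
    for (i, (&b, &a)) in before.iter().zip(mem.iter()).enumerate() {
        if b != a {
            log.push(Effect { time: 1, addr: i, data: vec![a] });
        }
    }
    log
}

fn replay(mem: &mut [u8], log: &[Effect]) {
    for e in log {
        // At simulated time e.time: apply the recorded write; do NOT
        // reissue the original syscall.
        mem[e.addr..e.addr + e.data.len()].copy_from_slice(&e.data);
    }
}

fn main() {
    let mut rec_mem = vec![0u8; 16];
    let log = record(&mut rec_mem);
    let mut replay_mem = vec![0u8; 16];
    replay(&mut replay_mem, &log);
    assert_eq!(rec_mem, replay_mem);
    println!("replayed write: mem[5] = {:#x}", replay_mem[5]);
}
```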
        
               | loeg wrote:
               | This system call returned and then asynchronously wrote
               | to memory some time later. How does the replay system
               | even know the write happened, without scanning all
               | memory? It can't generally. With knowledge of the
               | specific call it could put just that address on a to-be-
               | scanned list to wait for completion, but it still needs
               | to periodically poll the memory. It is far more
               | complicated to record than a synchronous syscall.
        
               | Veserv wrote:
               | You hook the kernel write. That is why I said hook the
               | relevant kernel operations.
               | 
               | The primary complexity is actually in creating a
               | consistent timeline with respect to parallel asynchronous
               | writes. Record-Replay systems like rr usually just
               | serialize multithreaded execution during recording to
               | avoid such problems. You could also do so by just
               | serializing the executing thread and the parallel
               | asynchronous write by stopping execution of the thread
               | while the write occurs.
               | 
               | Again, not really sure if that would be possible in
               | Windows, but there is nothing particularly mechanically
               | hard about doing this. It is just a question of whether
               | it matches the abstractions and hooks Windows uses and
               | supports.
        
               | dzaima wrote:
               | I don't think rr hooks actual kernel writes, but rather
               | just has hard-coded information on each syscall of how to
               | compute what regions of memory it may modify, reading
               | those on recording and writing on replay.
               | 
               | As such, for an asynchronous kernel write you'd want to
               | set up the kernel to never mutate recordee memory,
               | instead having it modify recorder-local memory, which the
               | recorder can then copy over to the real process whenever,
               | and get to record when it happens while at it (see
               | https://github.com/rr-debugger/rr/issues/2613). But such
               | can introduce large delays, thereby changing execution
               | characteristics (if not make things appear to happen in a
               | different order than the kernel would, if done
               | improperly). And you still need the recording system to
               | have accurately implemented the forwarding of whatever
               | edge-case of the asynchronous operation you hit.
               | 
               | And, if done as just that, you'd still hit the problem
               | encountered in the article of it looking like unrelated
               | code changes the memory (whereas with synchronous
               | syscalls you'd at least see the mutation happening on a
               | syscall instruction). So you'd want some extra injected
               | recordee instruction(s) to present separation of recordee
               | actions from asynchronous kernel ones. As a sibling
               | comment notes, rr as-is doesn't handle any asynchronous
               | kernel write cases (though it's certainly not entirely
               | impossible to).
        
             | loeg wrote:
             | rr does not record AIO or io_uring operations today,
             | because recording syscalls with async behavior is
             | challenging.
             | 
             | Maybe Windows TTD records async NtDeviceIoControlFile
              | accurately, maybe it doesn't; I don't know.
        
           | mark_undoio wrote:
            | It looks like they didn't actually need to step the kernel
            | in the end - it just helped to understand the bug (which I'd
            | say was in user space - injecting an exception into select()
            | and thus preventing it from exiting normally - even though a
            | kernel behaviour was involved in how the bug manifested).
           | 
           | The time travel debugging available with WinDbg should be
           | able to wind back to the point of corruption - that'd
           | probably have taken a few days off the initial realisation
           | that an async change to the stack was causing the problem.
           | 
           | There'd still be another reasoning step required to
           | understand why that happened - but you would be able to step
            | back in time e.g. to when this buffer was _previously_ used
            | on the stack to see how select() was submitting it to the
            | kernel.
           | 
           | In fact, a data breakpoint / watchpoint could likely have
           | taken you back from the corruption to the previous valid use,
           | which may have been the missing piece.
        
       | lionkor wrote:
       | Very wild bug. I feel like this is some kind of a worst-case
       | "exceptions bad" lesson, but I've only been doing systems level
       | programming for a couple of years so I'm probably talking out my
       | ass.
        
         | usrnm wrote:
          | I don't think the lesson here is "exceptions are bad"; the
          | same kind of bug can easily be made without using exceptions.
        
           | usrnm wrote:
            | Another thing to note is that exactly the same bug can be
            | made in Rust or Go, both of which officially don't have
            | exceptions. They both, of course, do have exceptions and
            | just call them by a different name.
        
             | LegionMammal978 wrote:
             | As it happens, Rust 1.81 recently added a feature where it
             | aborts by default when you attempt to unwind from a
             | function with the "C" ABI [0], which mitigates this issue
             | for most FFI use cases.
             | 
             | Of course, it's not uncommon to find unsafe Rust code that
             | misbehaves badly when something panics, which is yet
             | another of its hazards that I wish were better documented.
             | 
             | In this case, I'd put down the lesson as "exceptions in C++
             | are very dangerous if you're coming out of C code", since
             | they can cause undocumented and unexpected behavior that
             | ultimately led to the use-after-return issue here.
             | 
             | [0] https://blog.rust-
             | lang.org/2024/09/05/Rust-1.81.0.html#abort...
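The boundary Rust 1.81 enforces can be shown with a small sketch: if a panic would otherwise escape an `extern "C"` function, the process now aborts, so an FFI callback has to contain unwinding itself. The callback below is hypothetical; catching the panic lets it report failure through a C-style return value instead.

```rust
use std::panic::catch_unwind;

// A callback we might hand to C code. Letting a panic unwind out of an
// extern "C" fn aborts the process (since Rust 1.81), so we catch it
// at the boundary and translate it into an error code.
extern "C" fn callback(x: i32) -> i32 {
    let result = catch_unwind(|| {
        if x < 0 {
            panic!("negative input");
        }
        x * 2
    });
    match result {
        Ok(v) => v,
        Err(_) => -1, // report failure the C way: a sentinel return value
    }
}

fn main() {
    // Silence the default panic message for the demo.
    std::panic::set_hook(Box::new(|_| {}));
    // Pretend C code invokes the callback:
    println!("{}", callback(21)); // 42
    println!("{}", callback(-1)); // -1, instead of an abort
}
```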
        
             | samatman wrote:
             | Exceptions aren't bad because of the name. Exceptions are
             | bad because lower stack frames unwind higher ones, without
             | giving those frames any opportunity to do something sane
             | with any captured resources. Calling it panic/recover
             | doesn't help anything.
        
           | IX-103 wrote:
           | I'm not so sure. The bug was that when an exception occurred
           | while select was blocked then select did not properly clean
           | up after itself. But no code in select actually dealt with
           | exceptions at all, so handling it doesn't really make sense.
           | 
           | Without exceptions the possible control flow is entirely
           | explicit. It would have at least been obvious that cleanup
           | wasn't properly handled in the select function for all cases.
        
         | wat10000 wrote:
         | This experienced systems programmer agrees with you 100%. This
         | is an exceptionally bad case, but even in more normal
         | circumstances, C++ exceptions are outrageously dangerous.
         | Correct behavior requires the cooperation of not just the
         | thrower and catcher, but everything in between. And there are
         | basically no guardrails to enforce that. If you throw through
         | C++ code that hasn't been made exception safe, it just goes.
         | You can throw through code written in plain C, which doesn't
         | even _have_ exceptions.
         | 
         | It's probably feasible to use them if you draw a tight boundary
         | around the exception-using code, use RAII without fail inside
         | that boundary to ensure everything cleans up properly, and make
         | sure all code that might be called from the outside has
         | try/catch blocks. (And, obviously, don't trigger async calls to
         | throw from the middle of someone else's function!).
         | 
         | I find it a lot easier to avoid them entirely. Error handling
         | is a little more annoying, but it's worth it.
        
       | mhogomchungu wrote:
       | Raymond Cheng faced a similar situation here:
       | https://devblogs.microsoft.com/oldnewthing/20240927-00/?p=11...
       | 
       | The problem boils down to usage of stack memory after the memory
       | is given to somebody else.
        
         | musjleman wrote:
         | > The problem boils down to usage of stack memory after the
         | memory is given to somebody else.
         | 
          | While this isn't incorrect, in this case the problem seems
          | to be caused by stack unwinding without the consent of lower
          | frames, rather than a willful bug where the callee forgets
          | about the ownership.
        
           | layer8 wrote:
           | Yes, it's the consequence of throwing exceptions through
           | exception-unaware code, which is a problem when said code
           | needs to perform some cleanup logic before returning, like
           | releasing resources.
        
       | hun3 wrote:
       | (2016)
        
         | saagarjha wrote:
         | Added
        
       | mgaunard wrote:
       | Set a hardware breakpoint and you'll know immediately. That's
       | what he eventually did, but he should have done so sooner.
       | 
       | Then obviously, cancelling an operation is always tricky business
       | with lifetime due to asynchronicity. My approach is to always
       | design my APIs with synchronous cancel semantics, which is
       | sometimes tricky to implement. Many common libraries don't do it
       | right.
        
         | saagarjha wrote:
         | Hardware breakpoints don't work if the kernel is doing the
         | writes, because the kernel won't let you enable them globally
         | so they trigger outside of your program.
        
         | machine_coffee wrote:
         | Also surprised an async completion was writing to the stack.
         | You should normally pass a heap buffer to these functions and
         | keep it alive e.g for the lifetime of the object being watched.
        
           | muststopmyths wrote:
           | It's not an async completion. The call is synchronous.
           | 
           | Windows allows some synchronous calls to be interrupted by
           | another thread to run an APC if the called thread is in an
           | "alertable wait" state. The interrupted thread then returns
           | to the blocking call, so the pointers in the call are
           | expected to be valid.
           | 
            | Edit 2: I should clarify that the thread returns to the
            | blocking call, which then exits with WAIT_IO_COMPLETION
            | status. So you have to retry it, but the stack context is
            | expected to be safe.
           | 
            | APC is an "Asynchronous Procedure Call", which is
            | asynchronous to the calling thread in that it may or may
            | not get run, and may run at some future time.
           | 
           | (https://learn.microsoft.com/en-
           | us/windows/win32/sync/asynchr...)
           | 
           | There are very limited things you are supposed to do in an
           | APC, but these are poorly documented and need one to think
           | carefully about what is happening when a thread is executing
           | in a stack frame and you interrupt it with this horrorshow.
           | 
           | Win32 API is a plethora of footguns. For the uninitiated it
           | can be like playing Minesweeper with code. Or like that scene
           | in Galaxy Quest where the hammers are coming at you at random
           | times as you try to cross a hallway.
           | 
           | A lot of it was designed by people who, I think, would call
           | one stupid for holding it wrong.
           | 
           | I suppose it's a relic of the late 80s and 90s when you
           | crawled on broken glass because there was no other way to get
           | to the other side.
           | 
           | You learn a lot of the underlying systems this way, but these
           | days people need to get shit done and move on with their
           | lives.
           | 
            | Us olds are left behind, staring nostalgically at our
            | mangled feet while we yell at people to get off our lawns.
        
           | loeg wrote:
           | select() (written in C, a language without exceptions) is
           | synchronous, its authors just (reasonably) did not expect an
           | exception to be thrown in the middle of it invoking a
           | blocking syscall. The algorithm was correct in the absence of
           | a language feature C simply does not have and that is
           | relatively surprising (you don't expect syscalls to throw in
           | C++ either).
        
         | alexvitkov wrote:
          | He mentioned in the article that the corruption happens at a
          | seemingly random spot in the middle of a large buffer, and
          | you can only have hardware breakpoints on 4 addresses on
          | x86-64.
        
           | quotemstr wrote:
           | Reproduce the corruption under rr. Replay the rr trace.
           | Replay is totally deterministic, so you can just seek to the
           | end of the trace, set a hardware breakpoint on the damaged
           | stack location, and reverse-continue until you find the
           | culprit.
        
             | IshKebab wrote:
             | Sure let me just run `rr` on Windows...
        
             | pm215 wrote:
             | I would certainly try with a reverse debugger if I had one,
             | but where the repro instructions are "run this big complex
             | interactive program for 10 minutes" I wouldn't be super
             | confident about successfully recording a repro. At least in
             | my experience with rr the slowdown is enough to make that
             | painful, especially if you need to do multiple "chaos mode"
             | runs to get a timing sensitive bug to trigger. It might
             | still be worth spending time trying to get a faster repro
             | case to make reverse debug a bit more tractable.
        
             | zorgmonkey wrote:
              | rr only works on Linux, and the release of Windows TTD
              | was after this blog post was published. Also, the huge
              | slowdown from time travel debuggers can sometimes make
              | tricky bugs like this much harder to reproduce.
        
         | PhiSchle wrote:
         | You state this like an obvious fact, but it is only obvious if
         | you either heard of something like this, or you've been through
         | it.
         | 
         | From that point on I am sure he knew to do that. What's obvious
         | to you can also just be your experience.
        
       | AshleysBrain wrote:
        | Would memory safe languages avoid these kinds of problems? It
        | seems like a good example of a nightmare bug from memory
        | corruption - 5 days to fix, and the author alluding to it
        | keeping them up at night, is pretty strong motivation to avoid
        | memory unsafety IMO.
        
         | saagarjha wrote:
         | No*. This is one of the bugs that traditional memory safety
         | would not fix, because the issue crosses privilege boundaries
         | in a way that the language can't protect against.
         | 
         | *This could, in theory, be caught by fancy hardware strategies
         | like capabilities. But those are somewhat more esoteric.
        
           | quotemstr wrote:
           | Safe code definitely won't have this sort of problem. Any
           | code that could invoke a system call to scribble on arbitrary
           | memory is by definition unsafe.
        
             | saagarjha wrote:
             | That's basically all code
        
               | quotemstr wrote:
               | No it isn't. You can write safe file IO in Rust despite
               | the read and write system calls being unsafe.
        
               | saagarjha wrote:
               | I take it you are not familiar with the classic Rust meme
               | of opening /proc/self/mem and using it to completely
               | wreck your program?
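For the unfamiliar, the meme goes roughly like this (Linux-only, and deliberately pathological): a program with no `unsafe` blocks can still overwrite its own memory, because `/proc/self/mem` lets the kernel do the scribbling on its behalf.

```rust
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};

// Write one byte at an arbitrary address in our own address space via
// the kernel's /proc/self/mem interface - no `unsafe` required.
fn scribble(addr: u64, byte: u8) -> std::io::Result<()> {
    let mut mem = OpenOptions::new().write(true).open("/proc/self/mem")?;
    mem.seek(SeekFrom::Start(addr))?;
    mem.write_all(&[byte])
}

fn main() {
    let x: u8 = 1;
    scribble(&x as *const u8 as u64, 2).expect("Linux only");
    // Read through an opaque reference so the optimizer can't just
    // assume x is still 1.
    let seen = *std::hint::black_box(&x);
    println!("x = {}", seen); // 2, even though safe Rust never wrote to x
}
```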
        
               | IshKebab wrote:
               | That's obviously outside the scope of the language's
               | safety model, and it would be quite hard to do that
               | _accidentally_.
        
               | saagarjha wrote:
               | That is exactly my point, though: system calls are
               | completely outside the scope of a language's safety
               | model. You can say, well /proc/self/mem is stupid (it is)
               | and our file wrappers for read and write are safe
               | (...most languages have at least one), but the
               | fundamental problem remains that you can't just expect to
               | make system calls without that being implicitly unsafe.
               | In the extreme the syscall itself cannot be done safely,
               | with no possible safe wrapper around it. My point is that
               | if you are calling these Windows APIs you can't do it
               | safely from any language; Rust won't magically start
               | yelling at you that the kernel still expects you to keep
               | the buffer alive. You can design your own wrapper around
               | it and try to match the kernel's requirements but you can
               | do that in a lot of languages, and that's kind of missing
               | the point.
        
               | loeg wrote:
               | Right. And of course, it's not just Windows. For example
               | the Linux syscall aio_read() similarly registers a user
               | address with the kernel for later, asynchronous writing
               | (by the kernel). (And I'm sure you get similar lifetime
               | issues with io_uring operations.)
        
               | IshKebab wrote:
                | The bug was not because a system call was involved. It
                | was a multi-threaded lifetime issue, which is completely
                | within Rust's safety model.
               | 
               | To put it another way, you can design a safe wrapper
               | around this in Rust, but you can't in C++.
        
               | saagarjha wrote:
               | No. The kernel has no idea what your lifetimes are.
               | There's nothing stopping a buggy Rust implementation from
               | handing out a pointer for the syscall (...an unsafe
               | operation!) and then accidentally dropping the owner. To
               | userspace there are no more references and this code is
               | fine. The problem is the kernel doesn't care what you
               | think, and it has a blank check to write where it wants.
        
               | IshKebab wrote:
               | That's no different to FFI with any C code. There's
               | nothing unique to this being a kernel or a syscall. There
               | are plenty of C libraries that behave in a similar way
               | and can be safely wrapped with Rust by adding the
               | lifetime requirements.
        
           | kibwen wrote:
           | To elaborate, the problem here is that it looks like the OS
           | API itself is fundamentally unsafe: it's taking a pointer to
           | a memory location and then blindly writing into it, expecting
           | that it's still valid without actually doing any sort of
           | verification. You could imagine an OS providing a safe API
           | instead (with possible performance implications depending on
           | the exact approach used), and if your OS API was written in
           | e.g. Rust then this unsafe version of the API would be marked
           | as `unsafe` with the documented invariant "the caller must
           | ensure that the pointer remains valid".
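A minimal sketch of that split, with all names invented and the kernel's deferred write simulated by an ordinary function: the raw binding is `unsafe` and documents its invariant, while a safe wrapper ties the registration to a borrow of the buffer, so the borrow checker itself enforces "the pointer remains valid".

```rust
use std::marker::PhantomData;

// Hypothetical raw binding. The "kernel" keeps the pointer and writes
// through it later, so validity is the caller's problem.
struct RawRegistration {
    ptr: *mut u8,
}

/// # Safety
/// The memory behind `ptr` must stay valid until `complete` runs.
unsafe fn register_buffer(ptr: *mut u8) -> RawRegistration {
    RawRegistration { ptr }
}

fn complete(reg: RawRegistration) {
    // Simulate the kernel's deferred write (the real one is async).
    unsafe { *reg.ptr = 2 };
}

// Safe wrapper: holds a mutable borrow of the buffer for as long as
// the registration is alive, so the buffer can't be dropped or reused.
struct Registration<'a> {
    raw: RawRegistration,
    _buf: PhantomData<&'a mut [u8]>,
}

fn register<'a>(buf: &'a mut [u8]) -> Registration<'a> {
    Registration {
        raw: unsafe { register_buffer(buf.as_mut_ptr()) },
        _buf: PhantomData,
    }
}

impl<'a> Registration<'a> {
    fn wait(self) {
        complete(self.raw);
    }
}

fn main() {
    let mut buf = [0u8; 8];
    let reg = register(&mut buf);
    // buf[0] = 1; // rejected: `buf` is still borrowed by `reg`
    reg.wait(); // consuming the registration ends the borrow
    assert_eq!(buf[0], 2);
    println!("kernel wrote {}", buf[0]);
}
```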
        
             | jpc0 wrote:
              | Seeing as Rust has no stable ABI and likely never will,
              | how would you provide the API in Rust - and in Go, .NET,
              | Swift, Java, and whatever other language you add -
              | without doing exactly what Win32 does and going through
              | C, which has a stable ABI to tie into all those other
              | languages?
        
               | pornel wrote:
               | Rust ecosystem solves that by providing packages that are
               | thin wrappers around underlying APIs. It's very similar
               | to providing an .h file with extra type information,
               | except it's an .rs file.
               | 
               | Correctness of the Rust wrapper can't be checked by the
               | compiler, just like correctness of C headers is
               | unchecked, and it just has to match the actual underlying
               | ABI.
               | 
              | The task of making a safe API wrapper can be relatively
              | simple, because you don't have to take into consideration
              | the safety of the application as a whole; you only need
              | to translate the requirements of individual APIs into
              | Rust's safety requirements, function by function. In this
              | case you would need to be aware that the function call
              | may unwind, so whether someone making a dedicated safe
              | API for it would have thought of that is only speculation.
        
               | jpc0 wrote:
                | I seem to remember a Linux kernel dev quitting, with
                | not being able to specify exactly what you say this
                | wrapper should abide by being a contributing factor.
               | 
               | If those specifications were written down clearly enough
               | then this dev wouldn't have needed to spend 5 days
               | debugging this since he spent a significant amount of
               | time reading the documentation to find any errors they
               | are making that is mentioned in the documentation.
               | 
                | And don't say that they can actually just read the Rust
                | code and check, since, well, I can't read low-level
                | Rust code or see how any of the annotations can
                | interact with each other.
               | 
                | A single line of Rust code could easily need several
                | paragraphs of written documentation so that someone not
                | familiar with what Rust is specifying will actually
                | understand what it entails.
               | 
               | This is part of why Rust is difficult, you have to nail
               | down the specification and a small change to the
               | specification causes broad changes to the codebase. The
               | same might need to happen in C, but many times it
               | doesn't.
        
             | wat10000 wrote:
             | What would this safe API look like? The only thing I can
             | think of would be to have the kernel allocate memory in the
             | process and return that pointer, rather than having the
             | caller provide a buffer. Performance would be painful. Is
             | there a faster way that preserves safety?
        
               | LorenPechtel wrote:
               | No allocation--it returns the address of a buffer in a
               | pool. Of course this permits a resource leak. It's a
               | problem with no real solution.
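One shape such an API could take, sketched with a thread standing in for the kernel (everything here is invented for illustration): the submission takes ownership of a heap buffer and only hands it back on completion, so there is no window where both sides can touch it.

```rust
use std::thread;

// Ownership-transfer async write: the caller gives up the buffer at
// submit() and gets it back from wait(), so use-after-submit is a
// compile error rather than memory corruption.
struct Completion {
    handle: thread::JoinHandle<Vec<u8>>,
}

fn submit(mut buf: Vec<u8>) -> Completion {
    let handle = thread::spawn(move || {
        buf[0] = 2; // the "kernel" writes at some later time
        buf // ownership returns to the caller on completion
    });
    Completion { handle }
}

impl Completion {
    fn wait(self) -> Vec<u8> {
        self.handle.join().expect("writer panicked")
    }
}

fn main() {
    let buf = vec![0u8; 64];
    let completion = submit(buf);
    // `buf` has been moved into submit(); using it here would not compile.
    let buf = completion.wait();
    assert_eq!(buf[0], 2);
    println!("buffer returned, first byte = {}", buf[0]);
}
```

The cost is a heap allocation and a handoff per operation, which is roughly the performance concern raised above; io_uring-style APIs make a similar ownership bargain.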
        
         | IshKebab wrote:
         | Yes memory safe languages would absolutely help here. In Rust
         | you would get a compile time error about the destination
         | variable not living long enough.
         | 
         | This sort of stuff is why any productivity arguments for C++
         | over Rust are bullshit. Sure you spend a little more time
         | writing lifetime annotations, but in return you avoid spending
         | 5 days debugging one memory corruption bug.
        
         | layer8 wrote:
         | Depends. The underlying issue for this bug is that the code
         | involved crosses language boundaries (the Windows kernel and
         | win32 libraries written in C and the application in C++). The
         | code where the lifetime failure occurs is Windows code, not
         | application code. However, the Windows code is correct in the
         | context of the C language. The error is caused by an APC that
         | calls exception-throwing C++ code, being pushed onto the
         | waiting-in-C thread. This is a case of language-agnostic OS
         | mechanisms conflicting with language-specific stack unwinding
         | mechanisms.
         | 
         | This could only be made safe by the OS somehow imposing safety
         | mechanisms on the binary level, or by wrapping all OS APIs into
         | APIs of the safe language, where the wrappers have to take care
         | to ensure both the guarantees implied by the language and the
         | assumptions made by the OS APIs. (Writing the OS itself in a
         | memory-safe language isn't sufficient, for one because it very
         | likely will still require some amount of "unsafe" code, and
         | furthermore because one would still want to allow applications
         | written in a different language, which even if it also is
         | memory-safe, would need memory-correct wrappers/adapters.)
         | 
         | This is similar to the distinction between memory-safe
         | languages like Rust where the safety is established primarily
         | on the source level, not on the binary level, and memory-safe
         | runtimes like the CLR (.NET) and the JVM.
        
           | jpc0 wrote:
           | > the Windows kernel and win32 libraries written in C and the
           | application in C++
           | 
           | To my knowledge the kernel and win32 is in fact written in
           | C++ and only the interface has C linkage and follows C norms.
           | 
              | So this error occurred going C++ > C > C++, never mind
              | languages with different memory protection mechanisms
              | like Rust > C > C++.
        
             | rurban wrote:
             | No, the windows kernel is written in pure C.
        
             | layer8 wrote:
             | It's an unholy combination of C, C++, and Microsoft
             | extensions at worst. But apart possibly from some COM-
             | related DLLs, the spirit is clearly C, and C++ exceptions
             | are generally not expected. (There may be use of SEH in
             | some parts.)
             | 
             | Of course, you can write C++ without exception safety too,
             | but "C++ as a better C" and exception-safe C++ are
             | effectively like two different languages.
        
         | muststopmyths wrote:
         | In this specific type of Win32 API case, I can think of a way
         | to make this safe.
         | 
         | It would involve looking at the function pointer in
         | QueueUserAPC and making sure the function being called doesn't
         | mess with the stack frame being executed on.
         | 
         | This function will run in the context of the called thread, in
          | _that_ thread's stack. NOT in the calling thread.
         | 
         | It's a weird execution mode where you're allowed to hijack a
         | blocked thread and run some code in its context.
         | 
         | Don't know enough about Rust or the like to say if that's
         | something that could be done in the language with
         | attributes/annotations for a function, but it seems plausible.
        
           | LegionMammal978 wrote:
           | Nothing in C can prevent your function from being abnormally
           | unwound through (whether it's via C++ exceptions or via C
           | longjmp()). The only real fix is "don't use C++ exceptions
           | unless you're 100% sure that the code in between is
           | exception-safe (and don't use C longjmp() at all outside of
           | controlled scenarios)".
        
           | loeg wrote:
           | Perhaps simpler would be to just not unwind C++ exceptions
           | through non-C++ stack frames and abort instead. (You'd run
           | into these crashes at development time, debugging them would
           | be pretty obvious, and it'd never release like this.) This
           | might not be viable on Windows, though, where there is a lot
           | of both C++ and legacy C code.
        
             | charrondev wrote:
              | As I understand it, this was recently stabilized in Rust
              | and is now the default behaviour.
             | 
             | https://blog.rust-
             | lang.org/2024/09/05/Rust-1.81.0.html#abort...
             | 
             | You have to explicitly opt into unwinding like this now
             | otherwise the program will abort.
        
         | rramadass wrote:
         | No. The problem was in the architecture of the asynchronous api
          | w.r.t. the kernel. The last line of the article states: _Lesson
         | learnt, folks: do not throw exceptions out of asynchronous
         | procedures if you're inside a system call!_
        
           | LorenPechtel wrote:
           | More generally:
           | 
           | 1) The top level of an async routine should have a handler
           | that catches all exceptions and dies if it catches one.
           | 
           | 2) If you have a resource you have a cleanup routine for it.
        
       | alexvitkov wrote:
       | > The project was quite big (although far from the largest ones);
       | it took 40 minutes to build on my machine.
       | 
       | A bit tangential, but I've been crying about the insane Unity
       | project build times for years now, and about how they've taken
       | zero steps to fix them and are instead trying their hardest to
       | sell you cloud builds. Glad to see them having to suffer through
       | what they're inflicting on us for once!
       | 
       | Regardless, very good writeup, and yet another reason to never
       | ever under any conditions use exceptions.
        
         | noitpmeder wrote:
         | Would a ccache or similar help alleviate the pain?
        
         | yard2010 wrote:
         | This poor human being doesn't deserve to pay the price for the
         | shitty middle management actions though :(
        
       | danaris wrote:
       | _WSPSelect_ : 'Twas I who wrote "2" to your stack! And I would've
       | gotten away with it, too, if it weren't for that meddling kernel
       | debugger!
        
         | greenbit wrote:
         | .. and their hardware breakpoint!
        
       | bjornsing wrote:
       | Back in the day I used to consult for Sony Ericsson. We had like
       | 5000 engineers writing C code that ran as a single executable in
       | a single address space(!). Memory corruption was rampant. So
       | rampant in fact that when we finally got an MMU it took months
       | before we could turn it on in release builds, because there were
       | so many memory corruption bugs even in the released product. The
       | software just wouldn't work unless it could overwrite memory here
       | and there.
        
         | farmdve wrote:
         | I remember back in the day of the Sony Satio U1, the last
          | Symbian v5 phone, the software was horrendous (screen tearing,
         | random OS freezes) and later, the phone abandoned. I think it
         | was afterwards that Sony and Ericsson split?
        
       | hrtk wrote:
       | I don't know windows programming and this was a very interesting
       | (nightmare-ish) post.
       | 
       | I had a few questions I asked ChatGPT to understand better:
       | https://chatgpt.com/share/677411f9-b8a0-8013-8724-8cdff8dc4d...
       | 
       | Very interesting insights about low level programming in general
        
         | rramadass wrote:
         | That's a pretty good use case for ChatGPT. Do you do this
         | often? And if so, are your results specific to debugging
         | consistently good?
        
       | dahart wrote:
       | It's been more than a decade since I worked in games, but for my
       | entire games career, any and all use of C++ exceptions was
       | strictly disallowed due to other horror stories. I was still bit
       | in a very similar way by someone's C++ copy constructor trickery
       | - a crash that only happened in a release build after playing the
       | game for a while, with a stack corruption. Like the author, for
       | me this was one of the hardest bugs I ever had to track down, and
       | I ended up writing a tiny release mode debugger that logged call
       | stacks in order to do it. Once I was able to catch the corruption
       | (after several days of debugging during a crunch weekend),
       | someone on my team noticed the stomp values looked like floating
       | point numbers, and pretty quickly we figured out it was coming
        | from the matrix class trying to be too clever with its reference
       | counting IIRC. There'd been a team of around a dozen people
       | trying to track this down during overtime, so it suddenly hit me
       | once we fixed it that someone's cute idea that took maybe 10
       | seconds to write cost several tens of thousands of dollars to
       | fix.
        
         | ziml77 wrote:
         | This could still happen without exceptions though, right? The
         | flow is more explicit without exceptions, but returning an
         | error code up the stack would have the same effect of causing
         | the memory that select() is referencing to possibly be used for
         | a different purpose when it writes its result.
        
           | mark_undoio wrote:
           | My reading is that the problem was specifically that they
           | were injecting an exception to recover control from the C
           | library back to their code.
           | 
           | It seems like the select() was within its rights to have
           | passed a stack allocated buffer to be written asynchronously
           | by the kernel since it, presumably, knew it couldn't
           | encounter any exceptions. But injecting one has broken that
           | assumption.
           | 
            | If the select() implementation had returned normally with an
            | error, or had been expecting the exception, then I'd assume
            | this wouldn't have happened.
        
           | tom_ wrote:
           | There are no error returns from an APC? The return type is
           | void and the system expects the routine (whatever it is) to
           | return: https://learn.microsoft.com/en-
           | us/windows/win32/api/winnt/nc... - whichever call put the
           | process in the alertable wait state then ends up returning
           | early. This is a little bit like the Win32 analogue of POSIX
           | signal handlers and EINTR, I suppose.
        
         | intelVISA wrote:
         | Clever code is always expensive, either you're paying for
         | somebody smart to work at their cognitive peak which is less
         | productive for them than 'simpler' code, or more likely you'll
         | instead pay multiples more down the line for someone's hubris.
         | 
         | I think this is the rare direction that more langs should
         | follow Rust in that 'clever' code can be more easily
         | quarantined for scrutiny via unsafe.
        
           | hinkley wrote:
           | The time when people see this the clearest is during an
           | outage. You're in a stressful situation and it's being made
           | worse and longer by clever code. Do you see now why maybe
           | clever code isn't a good idea?
           | 
           | Cortisol actually reduces the effectiveness of working
           | memory. Write your code like someone looking at it is already
           | having a bad day, because not only are odds very good they
           | will be, but that's the costliest time for them to be looking
           | at it. Probability x consequences.
        
         | hinkley wrote:
         | I appreciate the notion of Grug Brained development but at the
         | end of the day it's just a slightly sardonic restatement of
         | Kernighan's Law, which is easier to get people to buy into in
         | negotiations.
         | 
         | Grug brain is maybe good for 1:1 interactions or over coffee
         | with the people you vent to or are vented to.
        
       | DustinBrett wrote:
       | It was just a dream, there's no such thing as 2.
        
       | rectang wrote:
       | After trying and failing over several days to track down a
       | squirrely segfault in a C project about 15 years ago, I taught
       | myself Valgrind in order to debug the issue.
       | 
       | Valgrind flagged an "invalid write", which I eventually hunted
       | down as a fencepost error in a dependency which overwrote their
       | allocated stack array by one byte. I recall that it wrote "1"
       | rather than "2", though, haha.
       | 
       | > _Lesson learnt, folks: do not throw exceptions out of
       | asynchronous procedures if you're inside a system call!_
       | 
       | The author's debugging skills are impressive and significantly
       | better than mine, but I find this an unsatisfying takeaway. I
       | yearn for a systemic approach to either prevent such issues
       | altogether or to make them less difficult to troubleshoot. The
       | general solution is to move away from C/C++ to memory safe
       | languages whenever possible, but such choices are of course not
       | always realistic.
       | 
       | With my project, I started running most of the test suite under
        | Valgrind periodically. That took half an hour to finish
       | rather than a few seconds, but it caught many similar memory
       | corruption issues over the next few years.
        
         | pjmlp wrote:
         | Similar experience, spending one week debugging memory
         | corruption issues in production back in 2000, with the customer
         | service pinging our team every couple of hours, due to it being
          | on a high-profile customer, has been my lesson.
        
         | jart wrote:
         | No the solution isn't to rewrite it in Rust. The solution is to
         | have the option of compiling your C/C++ program with memory
         | safety whenever things go loopy. ASAN, MSAN, and UBSAN are one
         | great way to do that. Another up and coming solution that
         | promises even more memory safety is Fil-C which is being made
         | by Epic Games. https://github.com/pizlonator/llvm-project-
         | deluge/blob/delug...
        
           | kelnos wrote:
           | The solution is usually not to do a rewrite, but I think for
           | greenfield projects we should stop using C or C++ unless
           | there is a compelling reason to do so. Memory-safe systems
           | languages are available today; IMO it's professionally
           | irresponsible to not use them, without a good reason.
           | 
           | MSAN, ASAN, and UBSAN are great tools that have saved me a
           | lot of time and headaches, but they don't catch everything
           | that the compiler of a memory safe language can, at least not
           | today.
        
             | jart wrote:
             | Rust isn't standardized. Last time I checked, everyone who
             | uses it depends on its nightly build. Their toolchain is
             | enormous and isn't vendorable. The binaries it builds are
             | quite large. Programs take a very long time to compile. You
             | need to depend on web servers to do your development and
             | use lots of third party libraries maintained by people
             | you've never heard of, because Rust adopted NodeJS'
             | quadratic dependency model. Choosing Rust will greatly
             | limit your audience if you're doing an open source project,
             | since your users need to install Rust to build your
             | program, and there are many platforms Rust doesn't support.
             | 
             | Rust programs use unsafe a lot in practice. One of the
             | greatest difficulties I've had in supporting Rust with
             | Cosmopolitan Libc is that Rust libraries all try to be
             | clever by using raw assembly system calls rather than using
             | libc. So our Rust binaries will break mysteriously when I
             | run them on other OSes. Everyone who does AI or scientific
             | computing with Rust, if you profile their programs, I
             | guarantee you 99% of the time it's going to be inside C/C++
             | code. If better C/C++ tools can give us memory safety, then
             | how much difference does it really make if it's baked into
             | the language syntax. Rust can't prove everything at compile
             | time.
             | 
              | Some of the Rust programs I've used like Alacritty will
             | runtime panic all the time. Because the language doesn't
             | actually save you. What saves you is smart people spending
             | 5+ years hammering out all the bugs. That's why old tools
             | we depend on every day like GNU programs never crash and
             | their memory bugs are rare enough to be newsworthy. The
             | Rust community has a reputation for toxic behavior that
              | raises questions about the reliability of its
             | governance. Rust evangelizes its ideas by attacking other
             | languages and socially ostracizing the developers who use
             | them. Software development is the process of manipulating
             | memory, so do you really want to be relinquishing control
             | over your memory to these kinds of people?
        
           | AlotOfReading wrote:
           | Ubsan is fantastic, but ASAN and the rest have serious
           | caveats. They're not suitable for production use and they
           | have a tendency to break in mysterious, intermittent ways.
           | For example, Ubuntu 24.04 unknowingly broke Clang <=15ish
           | when it increased mmap_rnd_bits. ASAN on Windows will
           | actually check if you have ASLR enabled, disable it, and
           | restart at entry. They interact in fun ways with LD_PRELOAD
           | too.
        
         | ch33zer wrote:
         | Doesn't C++ already support everything you need here? It
         | supports the noexcept keyword which should have been used in
         | the interface to this syscall. That would have prevented
         | throwing callbacks from being used at compile time. My guess is
         | that this is a much older syscall than noexcept though.
        
           | masspro wrote:
           | noexcept doesn't prevent any throws at compile-time, it
           | basically just wraps the function in a `catch(...)` block
           | that will call std::terminate, like a failed assert. IMHO it
           | is a stupid feature for this very confusion.
        
       | diekhans wrote:
        | Nicely written (and executed). Worse than my worst memory
       | corruption.
        
       | tomsmeding wrote:
       | Everyone here is going on about exceptions bad, but let's talk
       | about QueueUserAPC(). Yeah, let's throw an asynchronous interrupt
       | to some other thread that might be doing, you know, anything!
       | 
       | In the Unix world we have this too, and it's called signals, but
       | every documentation about signals is sure to say "in a signal
       | handler, almost nothing is safe!". You aren't supposed to call
       | printf() in a signal handler. Throwing exceptions is unthinkable.
       | 
       | I skimmed the linked QueueUserAPC() documentation page and it
       | says none of this. Exceptions aren't the handgrenade here (though
       | sure, they're nasty) -- QueueUserAPC() is.
        
         | Veserv wrote:
         | That does not seem to be correct. The documentation indicates
         | APC [1] can only occur from a waiting/blocking state. So, the
         | program is in a consistent state and can only be on a few known
         | instructions, unlike signals. As such, most functions should be
         | safe to call.
         | 
         | This is more like select() sometimes calling a user-supplied
         | function in addition to checking for I/O.
         | 
         | [1] https://learn.microsoft.com/en-
         | us/windows/win32/sync/asynchr...
        
           | tomsmeding wrote:
           | I see, I read the docs slightly too quickly. Still, though, I
           | would have expected a conspicuous warning about exceptions in
           | those calls, because MS is in on C++ (so they can't hide
           | behind "but we expected only C") and apparently(?) the APC
           | machinery doesn't catch and block exceptions in user code.
        
           | cryptonector wrote:
           | They call it an "alertable state", and that's:
           | 
           | > A thread enters an alertable state when it calls the
           | SleepEx, SignalObjectAndWait, MsgWaitForMultipleObjectsEx,
           | WaitForMultipleObjectsEx, or WaitForSingleObjectEx function.
           | 
           | So this is a lot less like Unix signals. It only really works
           | if the thread you're doing the async procedure call to is one
           | that's likely to use those.
           | 
           | So APCs are safe enough -- a lot safer than Unix signal
           | handlers.
        
         | rramadass wrote:
          | This comment has ChatGPT explain the problem, and the result is
          | surprisingly understandable -
         | https://news.ycombinator.com/item?id=42559401
        
       | cyberax wrote:
       | A small request: please stop using automatic translation for blog
       | posts or documentation.
       | 
       | Especially when I still have English set as the second priority
       | language.
        
       | jacinda wrote:
       | Related (and hilarious):
       | https://scholar.harvard.edu/files/mickens/files/thenightwatc...
       | 
       | > What is despair? I have known it--hear my song. Despair is when
       | you're debugging a kernel driver and you look at a memory dump
       | and you see that a pointer has a value of 7. THERE IS NO HARDWARE
       | ARCHITECTURE THAT IS ALIGNED ON 7. Furthermore, 7 IS TOO SMALL
       | AND ONLY EVIL CODE WOULD TRY TO ACCESS SMALL NUMBER MEMORY.
       | Misaligned, small-number memory accesses have stolen decades from
       | my life.
       | 
       | All James Mickens' USENIX articles are fun (for a very specific
       | subset of computer scientist - the kind that would comment on
       | this thread). https://mickens.seas.harvard.edu/wisdom-james-
       | mickens
        
         | hinkley wrote:
         | I don't know if it's still a thing but there used to be
         | debugging tools that would put a page of memory marked as
         | either read only or unreadable in front of every malloc call so
         | that any pointer arithmetic with a math error would trigger a
         | page fault which could be debugged. It worked in apps that
         | didn't use too much of total memory or too many fine grained
         | allocations. I mean obviously turning every 8 byte pointer into
         | a whole memory page could consume all of memory very quickly.
         | But in front of arrays or large data structures that could
         | work.
        
         | lupire wrote:
         | I don't understand. Pointers aren't numbers, and can only be
         | compared when inside a common array. What is small number
         | memory?
         | 
         | :-)
        
           | MobiusHorizons wrote:
           | I realize you are probably referring to UB in c/c++, but of
           | course in hardware memory addresses are numbers. And when
           | debugging, it's really the hardware version of events that
           | matters, since the compiler has already done whatever
           | optimizations it wants.
        
       | hinkley wrote:
        | The second piece of code I wrote for pay was an FFI around a C
       | library, which had callbacks to send incremental data back to the
       | caller. I didn't understand why the documented examples would re-
       | acquire the object handles every iteration through the loop so I
       | dropped them. And everything seemed to work until I got to the
       | larger problems and then I was getting mutually exclusive states
       | in data that was marked as immutable, in some of the objects. I
       | pulled my hair on this for days.
       | 
       | What ended up happening is that if the GC ran inside the callback
       | then the objects the native code could see could move, and so the
       | next block of code was smashing the heap by writing to the wrong
       | spots. All the small inputs finished before a GC was called and
       | looked fine but larger ones went into undefined behavior. So
       | dumb.
        
       | jart wrote:
        | This kind of error is a rite of passage with WIN32 programming.
       | For example, to do nontrivial i/o on Windows you have to create
       | an OVERLAPPED object and give it to ReadFile() and WriteFile()
       | which will return a pending status code, and write back to your
       | OVERLAPPED object once the i/o has completed. Usually it makes
       | the most sense to put that object on the stack. So if you return
       | from your function without making sure WIN32 is done with that
       | object, you're going to end up with bugs like this one. You have
       | to call GetOverlappedResult() to do that. That means no throwing
       | or returning until you do. Even if you call CancelIoEx()
       | beforehand, you still need to call the result function. When you
       | mix all that up with your WaitForMultipleObjects() call, it ends
       | up being a whole lot of if statements you could easily get wrong
       | if the ritual isn't burned into your brain.
       | 
       | UNIX system calls never do this. The kernel won't keep references
       | to pointers you pass them and write to them later. It just isn't
       | in the DNA. The only exceptions I can think of would be clone(),
       | which is abstracted by the POSIX threads runtime, and Windows-
       | inspired non-standard event i/o system calls like epoll.
        
         | dataflow wrote:
         | > UNIX system calls never do this. The kernel won't keep
         | references to pointers you pass them and write to them later.
         | It just isn't in the DNA. The only exceptions I can think of
         | would be clone(), which is abstracted by the POSIX threads
         | runtime, and Windows-inspired non-standard system calls like
         | epoll.
         | 
         | I mean, this is because the UNIX model was based on readiness
         | rather than completion. Which is slower. Hence the newer I/O
         | models.
        
           | jart wrote:
           | System calls take 10x longer on Windows than they do on UNIX.
           | 
           | That's what i/o completion ports are working around.
           | 
           | They solve a problem UNIX doesn't have.
        
           | cryptonector wrote:
           | There are evented I/O APIs for Unix, though anything other
           | than select(2) and poll(2) is non-standard, and while some do
           | let you use pointer-sized user cookies to identify the events
           | I've never seen a case where a programmer used a stack
           | address as a cookie. I _have_ seen cases where the address
           | used as the cookie was freed before the event registration
           | was deleted, or before it fired, leading to use-after-free
           | bugs.
        
       | cryptonector wrote:
       | > The fix was pretty straightforward: instead of using
       | QueueUserAPC(), we now create a loopback socket to which we send
       | a byte any time we need to interrupt select().
       | 
       | This is an absolutely standard trick that is known as "the self-
       | pipe trick". I believe DJB created and named it. It is used for
       | turning APIs not based on file handles/descriptors into events
       | for an event loop based on file handles/descriptors, especially
       | for turning signals into events for
       | select(2)/poll(2)/epoll/kqueue/...
        
       ___________________________________________________________________
       (page generated 2024-12-31 23:00 UTC)