[HN Gopher] Debugging memory corruption: who the hell writes "2"...
___________________________________________________________________
Debugging memory corruption: who the hell writes "2" into my stack?
(2016)
Author : pierremenard
Score : 295 points
Date : 2024-12-28 18:06 UTC (3 days ago)
(HTM) web link (unity.com)
(TXT) w3m dump (unity.com)
| saagarjha wrote:
| Scary. I assume standard memory corruption detection tools would
| also have trouble finding this, as the write is coming from
| outside the application itself...
| loeg wrote:
| Yeah. Not tripping a page fault on modifying readonly
| (userspace) pages both makes it hard for userspace tools but
| also paints a pretty specific picture of where the write is
| coming from.
|
| I'm actually not sure if Linux would handle this in the same
| way or what. Plausibly it sees the same leaf page tables as
 | user space, trips a fault, and _doesn't_ scribble the pages
| anyway. Maybe Windows translates the user-provided virtual
| address to a physical address (or other kernel mapping that
| happens to have write permission) upon registration.
| rramadass wrote:
 | A story for the ages. That is some hardcore debugging involving
 | everything, viz. user land, system calls, kernel, disassembly, etc.
| glandium wrote:
| I wonder if Time Travel Debugging would have helped narrow it
| down.
| chris_wot wrote:
| How? Throwing an exception would prevent this wouldn't it?
| ben-schaaf wrote:
| When the assertion on the stack sentinel was reached they
| could have watched the value and then reverse continued,
| which in theory would reveal the APC causing the issue - or
| at least the instruction writing the value. Not sure how well
| reverse debugging works on Windows though, I'm only familiar
| with rr.
| loeg wrote:
| I don't think any reverse debugging system can step the kernel
| backwards to this degree, unless they're doing something really
| clever (slow) with virtual machines and snapshots.
| dzaima wrote:
| While not allowing stepping in the kernel, a large part of rr
| is indeed intercepting all things the kernel may do and re-
| implementing its actions, writing down all changes to memory
| & etc it does (of course for Linux, not Windows). With which
| the kernel doing an asynchronous write would necessarily end
| up as a part of the recording stating what the kernel writes
| at the given point in time, which a debugger could
| deterministically reason about. (of course this relies on the
| recording system reimplementing the things accurately enough,
| but that's at least possible)
| Veserv wrote:
| You are correct. A time travel debugging solution that
| supports recording the relevant system call side effects
| would handle this. In fact, this system call is likely just
| rewriting the program counter register and maybe a few
| others, so it would likely be very easy to support if you
| could hook the relevant kernel operations which may or may
| not be possible in Windows.
|
| The replay system would also be unlikely to pose a problem.
| Replay systems usually just encode and replay the side
| effects, so there is no need to "reimplement" the
| operations. So, if you did some wacky system call, but all
| it did is write 0x2 to a memory location, M, you
| effectively just record: "at time T we issued a system call
| that wrote 0x2 to M". Then, when you get to simulated time
| T in the replay, you do not reissue the wacky system call,
| you just write 0x2 to M and call it a day.
| loeg wrote:
| This system call returned and then asynchronously wrote
| to memory some time later. How does the replay system
| even know the write happened, without scanning all
| memory? It can't generally. With knowledge of the
| specific call it could put just that address on a to-be-
| scanned list to wait for completion, but it still needs
| to periodically poll the memory. It is far more
| complicated to record than a synchronous syscall.
| Veserv wrote:
| You hook the kernel write. That is why I said hook the
| relevant kernel operations.
|
| The primary complexity is actually in creating a
| consistent timeline with respect to parallel asynchronous
| writes. Record-Replay systems like rr usually just
| serialize multithreaded execution during recording to
| avoid such problems. You could also do so by just
| serializing the executing thread and the parallel
| asynchronous write by stopping execution of the thread
| while the write occurs.
|
| Again, not really sure if that would be possible in
| Windows, but there is nothing particularly mechanically
| hard about doing this. It is just a question of whether
| it matches the abstractions and hooks Windows uses and
| supports.
| dzaima wrote:
| I don't think rr hooks actual kernel writes, but rather
| just has hard-coded information on each syscall of how to
| compute what regions of memory it may modify, reading
| those on recording and writing on replay.
|
| As such, for an asynchronous kernel write you'd want to
| set up the kernel to never mutate recordee memory,
| instead having it modify recorder-local memory, which the
| recorder can then copy over to the real process whenever,
| and get to record when it happens while at it (see
| https://github.com/rr-debugger/rr/issues/2613). But such
| can introduce large delays, thereby changing execution
| characteristics (if not make things appear to happen in a
| different order than the kernel would, if done
| improperly). And you still need the recording system to
| have accurately implemented the forwarding of whatever
| edge-case of the asynchronous operation you hit.
|
| And, if done as just that, you'd still hit the problem
| encountered in the article of it looking like unrelated
| code changes the memory (whereas with synchronous
| syscalls you'd at least see the mutation happening on a
| syscall instruction). So you'd want some extra injected
| recordee instruction(s) to present separation of recordee
| actions from asynchronous kernel ones. As a sibling
| comment notes, rr as-is doesn't handle any asynchronous
| kernel write cases (though it's certainly not entirely
| impossible to).
| loeg wrote:
| rr does not record AIO or io_uring operations today,
| because recording syscalls with async behavior is
| challenging.
|
| Maybe Windows TTD records async NtDeviceIoControlFile
 | accurately, maybe it doesn't; I don't know.
| mark_undoio wrote:
| It looks like they didn't actually need to step the kernel in
| the end - it just helped understand the bug (which I'd say
| was in user space - injecting an exception into select() and
 | thus preventing it from exiting normally - even though a kernel
| behaviour was involved in how the bug manifested).
|
| The time travel debugging available with WinDbg should be
| able to wind back to the point of corruption - that'd
| probably have taken a few days off the initial realisation
| that an async change to the stack was causing the problem.
|
| There'd still be another reasoning step required to
| understand why that happened - but you would be able to step
| back in time e.g. to when this buffer was _previously_ used
 | on the stack to see how select() was submitting it to the
| kernel.
|
| In fact, a data breakpoint / watchpoint could likely have
| taken you back from the corruption to the previous valid use,
| which may have been the missing piece.
| lionkor wrote:
| Very wild bug. I feel like this is some kind of a worst-case
| "exceptions bad" lesson, but I've only been doing systems level
| programming for a couple of years so I'm probably talking out my
| ass.
| usrnm wrote:
 | I don't think that the lesson here is "exceptions are bad"; the
| same kind of bug can be easily made without using exceptions.
| usrnm wrote:
| Another thing to note is that exactly the same bug can be
| made in Rust or go, both of which officially don't have
| exceptions. They both, of course, do have exceptions and just
 | call them by a different name.
| LegionMammal978 wrote:
| As it happens, Rust 1.81 recently added a feature where it
| aborts by default when you attempt to unwind from a
| function with the "C" ABI [0], which mitigates this issue
| for most FFI use cases.
|
| Of course, it's not uncommon to find unsafe Rust code that
| misbehaves badly when something panics, which is yet
| another of its hazards that I wish were better documented.
|
| In this case, I'd put down the lesson as "exceptions in C++
| are very dangerous if you're coming out of C code", since
| they can cause undocumented and unexpected behavior that
| ultimately led to the use-after-return issue here.
|
 | [0] https://blog.rust-lang.org/2024/09/05/Rust-1.81.0.html#abort...
| samatman wrote:
| Exceptions aren't bad because of the name. Exceptions are
| bad because lower stack frames unwind higher ones, without
| giving those frames any opportunity to do something sane
| with any captured resources. Calling it panic/recover
| doesn't help anything.
| IX-103 wrote:
| I'm not so sure. The bug was that when an exception occurred
| while select was blocked then select did not properly clean
| up after itself. But no code in select actually dealt with
| exceptions at all, so handling it doesn't really make sense.
|
| Without exceptions the possible control flow is entirely
| explicit. It would have at least been obvious that cleanup
| wasn't properly handled in the select function for all cases.
| wat10000 wrote:
| This experienced systems programmer agrees with you 100%. This
| is an exceptionally bad case, but even in more normal
| circumstances, C++ exceptions are outrageously dangerous.
| Correct behavior requires the cooperation of not just the
| thrower and catcher, but everything in between. And there are
| basically no guardrails to enforce that. If you throw through
| C++ code that hasn't been made exception safe, it just goes.
| You can throw through code written in plain C, which doesn't
| even _have_ exceptions.
|
| It's probably feasible to use them if you draw a tight boundary
| around the exception-using code, use RAII without fail inside
| that boundary to ensure everything cleans up properly, and make
| sure all code that might be called from the outside has
| try/catch blocks. (And, obviously, don't trigger async calls to
| throw from the middle of someone else's function!).
|
| I find it a lot easier to avoid them entirely. Error handling
| is a little more annoying, but it's worth it.
| mhogomchungu wrote:
| Raymond Cheng faced a similar situation here:
| https://devblogs.microsoft.com/oldnewthing/20240927-00/?p=11...
|
| The problem boils down to usage of stack memory after the memory
| is given to somebody else.
| musjleman wrote:
| > The problem boils down to usage of stack memory after the
| memory is given to somebody else.
|
 | While this isn't incorrect in this case, the problem seems to be
| caused by stack unwinding without the consent of lower frames
| rather than a willful bug where the callee forgets about the
| ownership.
| layer8 wrote:
| Yes, it's the consequence of throwing exceptions through
| exception-unaware code, which is a problem when said code
| needs to perform some cleanup logic before returning, like
| releasing resources.
| hun3 wrote:
| (2016)
| saagarjha wrote:
| Added
| mgaunard wrote:
| Set a hardware breakpoint and you'll know immediately. That's
| what he eventually did, but he should have done so sooner.
|
| Then obviously, cancelling an operation is always tricky business
| with lifetime due to asynchronicity. My approach is to always
| design my APIs with synchronous cancel semantics, which is
| sometimes tricky to implement. Many common libraries don't do it
| right.
| saagarjha wrote:
| Hardware breakpoints don't work if the kernel is doing the
| writes, because the kernel won't let you enable them globally
| so they trigger outside of your program.
| machine_coffee wrote:
| Also surprised an async completion was writing to the stack.
| You should normally pass a heap buffer to these functions and
| keep it alive e.g for the lifetime of the object being watched.
| muststopmyths wrote:
| It's not an async completion. The call is synchronous.
|
| Windows allows some synchronous calls to be interrupted by
| another thread to run an APC if the called thread is in an
| "alertable wait" state. The interrupted thread then returns
| to the blocking call, so the pointers in the call are
| expected to be valid.
|
| Edit 2: I should clarify that the thread returns to the
| blocking call, which then exits with WAIT_IO_COMPLETION
 | status. So you have to retry it, but the stack context
| is expected to be safe.
|
| APC is an "Asynchronous procedure call", which is
| asynchronous to the calling thread in that it may or may not
 | get run. Edit: it may or may not run at a future time.
|
 | (https://learn.microsoft.com/en-us/windows/win32/sync/asynchr...)
|
| There are very limited things you are supposed to do in an
| APC, but these are poorly documented and need one to think
| carefully about what is happening when a thread is executing
| in a stack frame and you interrupt it with this horrorshow.
|
| Win32 API is a plethora of footguns. For the uninitiated it
| can be like playing Minesweeper with code. Or like that scene
| in Galaxy Quest where the hammers are coming at you at random
| times as you try to cross a hallway.
|
| A lot of it was designed by people who, I think, would call
| one stupid for holding it wrong.
|
| I suppose it's a relic of the late 80s and 90s when you
| crawled on broken glass because there was no other way to get
| to the other side.
|
| You learn a lot of the underlying systems this way, but these
| days people need to get shit done and move on with their
| lives.
|
 | Us olds are left behind, staring nostalgically at our
| mangled feet while we yell at people to get off our lawns.
| loeg wrote:
| select() (written in C, a language without exceptions) is
| synchronous, its authors just (reasonably) did not expect an
| exception to be thrown in the middle of it invoking a
| blocking syscall. The algorithm was correct in the absence of
| a language feature C simply does not have and that is
| relatively surprising (you don't expect syscalls to throw in
| C++ either).
| alexvitkov wrote:
| He mentioned in the article that the corruption happens at a
 | seemingly random spot in the middle of a large buffer, and you can
| only have a HW breakpoint on 4 addresses in x86-64.
| quotemstr wrote:
| Reproduce the corruption under rr. Replay the rr trace.
| Replay is totally deterministic, so you can just seek to the
| end of the trace, set a hardware breakpoint on the damaged
| stack location, and reverse-continue until you find the
| culprit.
| IshKebab wrote:
| Sure let me just run `rr` on Windows...
| pm215 wrote:
| I would certainly try with a reverse debugger if I had one,
| but where the repro instructions are "run this big complex
| interactive program for 10 minutes" I wouldn't be super
| confident about successfully recording a repro. At least in
| my experience with rr the slowdown is enough to make that
| painful, especially if you need to do multiple "chaos mode"
| runs to get a timing sensitive bug to trigger. It might
| still be worth spending time trying to get a faster repro
| case to make reverse debug a bit more tractable.
| zorgmonkey wrote:
 | rr only works on Linux, and the release of Windows TTD
| was after this blog post was published. Also the huge
| slowdown from time travel debuggers can sometimes make
| tricky bugs like this much harder to reproduce.
| PhiSchle wrote:
| You state this like an obvious fact, but it is only obvious if
| you either heard of something like this, or you've been through
| it.
|
| From that point on I am sure he knew to do that. What's obvious
| to you can also just be your experience.
| AshleysBrain wrote:
| Would memory safe languages avoid these kinds of problems? It
| seems like a good example of a nightmare bug from memory
| corruption - 5 days to fix and the author alludes to it keeping
| them up at night is a pretty strong motivation to avoid memory
| unsafety IMO.
| saagarjha wrote:
| No*. This is one of the bugs that traditional memory safety
| would not fix, because the issue crosses privilege boundaries
| in a way that the language can't protect against.
|
| *This could, in theory, be caught by fancy hardware strategies
| like capabilities. But those are somewhat more esoteric.
| quotemstr wrote:
| Safe code definitely won't have this sort of problem. Any
| code that could invoke a system call to scribble on arbitrary
| memory is by definition unsafe.
| saagarjha wrote:
 | That's basically all code.
| quotemstr wrote:
| No it isn't. You can write safe file IO in Rust despite
| the read and write system calls being unsafe.
| saagarjha wrote:
| I take it you are not familiar with the classic Rust meme
| of opening /proc/self/mem and using it to completely
| wreck your program?
| IshKebab wrote:
| That's obviously outside the scope of the language's
| safety model, and it would be quite hard to do that
| _accidentally_.
| saagarjha wrote:
| That is exactly my point, though: system calls are
| completely outside the scope of a language's safety
| model. You can say, well /proc/self/mem is stupid (it is)
| and our file wrappers for read and write are safe
| (...most languages have at least one), but the
| fundamental problem remains that you can't just expect to
| make system calls without that being implicitly unsafe.
| In the extreme the syscall itself cannot be done safely,
| with no possible safe wrapper around it. My point is that
| if you are calling these Windows APIs you can't do it
| safely from any language; Rust won't magically start
| yelling at you that the kernel still expects you to keep
| the buffer alive. You can design your own wrapper around
| it and try to match the kernel's requirements but you can
| do that in a lot of languages, and that's kind of missing
| the point.
| loeg wrote:
| Right. And of course, it's not just Windows. For example
| the Linux syscall aio_read() similarly registers a user
| address with the kernel for later, asynchronous writing
| (by the kernel). (And I'm sure you get similar lifetime
| issues with io_uring operations.)
| IshKebab wrote:
| The bug was not because a system call was involved. It
| was a multi threaded lifetime issue which is completely
 | within Rust's safety model.
|
| To put it another way, you can design a safe wrapper
| around this in Rust, but you can't in C++.
| saagarjha wrote:
| No. The kernel has no idea what your lifetimes are.
| There's nothing stopping a buggy Rust implementation from
| handing out a pointer for the syscall (...an unsafe
| operation!) and then accidentally dropping the owner. To
| userspace there are no more references and this code is
| fine. The problem is the kernel doesn't care what you
| think, and it has a blank check to write where it wants.
| IshKebab wrote:
| That's no different to FFI with any C code. There's
| nothing unique to this being a kernel or a syscall. There
| are plenty of C libraries that behave in a similar way
| and can be safely wrapped with Rust by adding the
| lifetime requirements.
| kibwen wrote:
| To elaborate, the problem here is that it looks like the OS
| API itself is fundamentally unsafe: it's taking a pointer to
| a memory location and then blindly writing into it, expecting
| that it's still valid without actually doing any sort of
| verification. You could imagine an OS providing a safe API
| instead (with possible performance implications depending on
| the exact approach used), and if your OS API was written in
| e.g. Rust then this unsafe version of the API would be marked
| as `unsafe` with the documented invariant "the caller must
| ensure that the pointer remains valid".
| jpc0 wrote:
| Seeing as rust has no stable ABI and likely never will. How
| would you provide the API in rust, also in golang, also in
| .NET, and swift, and Java, and whatever other language you
| add without doing exactly what Win32 does and go to C which
| has a stable ABI to tie into all those other languages?
| pornel wrote:
| Rust ecosystem solves that by providing packages that are
| thin wrappers around underlying APIs. It's very similar
| to providing an .h file with extra type information,
| except it's an .rs file.
|
| Correctness of the Rust wrapper can't be checked by the
| compiler, just like correctness of C headers is
| unchecked, and it just has to match the actual underlying
| ABI.
|
| The task of making a safe API wrapper can be relatively
| simple, because you don't have to take into consideration
 | safety of an application as a whole, you only need to
| translate requirements of individual APIs to Rust's
| safety requirements, function by function. In this case
| you would need to be aware that the function call may
| unwind, so whether someone making a dedicated safe API
 | for it would have thought of that is only speculation.
| jpc0 wrote:
 | I seem to remember a Linux kernel dev quitting, with not
 | being able to specify exactly what you say this wrapper
 | should abide by being a contributing factor.
|
| If those specifications were written down clearly enough
| then this dev wouldn't have needed to spend 5 days
| debugging this since he spent a significant amount of
| time reading the documentation to find any errors they
| are making that is mentioned in the documentation.
|
| And don't say that they can actually just read the rust
| code and check that since well, I can't read low level
 | rust code and how any of the annotations can interact with
| each other.
|
| A single line of rust code could easily need several
| paragraphs of written documentation so that someone not
 | familiar with what rust is specifying will actually
| understand what that entails.
|
| This is part of why Rust is difficult, you have to nail
| down the specification and a small change to the
| specification causes broad changes to the codebase. The
| same might need to happen in C, but many times it
| doesn't.
| wat10000 wrote:
| What would this safe API look like? The only thing I can
| think of would be to have the kernel allocate memory in the
| process and return that pointer, rather than having the
| caller provide a buffer. Performance would be painful. Is
| there a faster way that preserves safety?
| LorenPechtel wrote:
| No allocation--it returns the address of a buffer in a
| pool. Of course this permits a resource leak. It's a
| problem with no real solution.
| IshKebab wrote:
| Yes memory safe languages would absolutely help here. In Rust
| you would get a compile time error about the destination
| variable not living long enough.
|
| This sort of stuff is why any productivity arguments for C++
| over Rust are bullshit. Sure you spend a little more time
| writing lifetime annotations, but in return you avoid spending
| 5 days debugging one memory corruption bug.
| layer8 wrote:
| Depends. The underlying issue for this bug is that the code
| involved crosses language boundaries (the Windows kernel and
| win32 libraries written in C and the application in C++). The
| code where the lifetime failure occurs is Windows code, not
| application code. However, the Windows code is correct in the
| context of the C language. The error is caused by an APC that
| calls exception-throwing C++ code, being pushed onto the
| waiting-in-C thread. This is a case of language-agnostic OS
| mechanisms conflicting with language-specific stack unwinding
| mechanisms.
|
| This could only be made safe by the OS somehow imposing safety
| mechanisms on the binary level, or by wrapping all OS APIs into
| APIs of the safe language, where the wrappers have to take care
| to ensure both the guarantees implied by the language and the
| assumptions made by the OS APIs. (Writing the OS itself in a
| memory-safe language isn't sufficient, for one because it very
| likely will still require some amount of "unsafe" code, and
| furthermore because one would still want to allow applications
| written in a different language, which even if it also is
| memory-safe, would need memory-correct wrappers/adapters.)
|
| This is similar to the distinction between memory-safe
| languages like Rust where the safety is established primarily
| on the source level, not on the binary level, and memory-safe
| runtimes like the CLR (.NET) and the JVM.
| jpc0 wrote:
| > the Windows kernel and win32 libraries written in C and the
| application in C++
|
| To my knowledge the kernel and win32 is in fact written in
| C++ and only the interface has C linkage and follows C norms.
|
| So this error occurred going C++ > C > C++ never mind
| languages with different memory protection mechanisms like
| Rust > C > C++.
| rurban wrote:
| No, the windows kernel is written in pure C.
| layer8 wrote:
| It's an unholy combination of C, C++, and Microsoft
| extensions at worst. But apart possibly from some COM-
| related DLLs, the spirit is clearly C, and C++ exceptions
| are generally not expected. (There may be use of SEH in
| some parts.)
|
| Of course, you can write C++ without exception safety too,
| but "C++ as a better C" and exception-safe C++ are
| effectively like two different languages.
| muststopmyths wrote:
| In this specific type of Win32 API case, I can think of a way
| to make this safe.
|
| It would involve looking at the function pointer in
| QueueUserAPC and making sure the function being called doesn't
| mess with the stack frame being executed on.
|
| This function will run in the context of the called thread, in
 | _that_ thread's stack. NOT in the calling thread.
|
| It's a weird execution mode where you're allowed to hijack a
| blocked thread and run some code in its context.
|
| Don't know enough about Rust or the like to say if that's
| something that could be done in the language with
| attributes/annotations for a function, but it seems plausible.
| LegionMammal978 wrote:
| Nothing in C can prevent your function from being abnormally
| unwound through (whether it's via C++ exceptions or via C
| longjmp()). The only real fix is "don't use C++ exceptions
| unless you're 100% sure that the code in between is
| exception-safe (and don't use C longjmp() at all outside of
| controlled scenarios)".
| loeg wrote:
| Perhaps simpler would be to just not unwind C++ exceptions
| through non-C++ stack frames and abort instead. (You'd run
| into these crashes at development time, debugging them would
| be pretty obvious, and it'd never release like this.) This
| might not be viable on Windows, though, where there is a lot
| of both C++ and legacy C code.
| charrondev wrote:
 | As I understand it, this was recently stabilized in Rust and
 | is now the default behaviour.
|
 | https://blog.rust-lang.org/2024/09/05/Rust-1.81.0.html#abort...
|
| You have to explicitly opt into unwinding like this now
| otherwise the program will abort.
| rramadass wrote:
| No. The problem was in the architecture of the asynchronous api
| w.r.t. the kernel. The last line of the article states; _Lesson
| learnt, folks: do not throw exceptions out of asynchronous
| procedures if you're inside a system call!_
| LorenPechtel wrote:
| More generally:
|
| 1) The top level of an async routine should have a handler
| that catches all exceptions and dies if it catches one.
|
| 2) If you have a resource you have a cleanup routine for it.
| alexvitkov wrote:
| > The project was quite big (although far from the largest ones);
| it took 40 minutes to build on my machine.
|
| A bit tangential, but I've been crying about the insane Unity
| project build times for years now, and about how they've taken
| zero steps to fix them and are instead trying their hardest to
| sell you cloud builds. Glad to see them having to suffer through
| what they're inflicting on us for once!
|
| Regardless, very good writeup, and yet another reason to never
| ever under any conditions use exceptions.
| noitpmeder wrote:
| Would a ccache or similar help alleviate the pain?
| yard2010 wrote:
| This poor human being doesn't deserve to pay the price for the
| shitty middle management actions though :(
| danaris wrote:
| _WSPSelect_ : 'Twas I who wrote "2" to your stack! And I would've
| gotten away with it, too, if it weren't for that meddling kernel
| debugger!
| greenbit wrote:
| .. and their hardware breakpoint!
| bjornsing wrote:
| Back in the day I used to consult for Sony Ericsson. We had like
| 5000 engineers writing C code that ran as a single executable in
| a single address space(!). Memory corruption was rampant. So
| rampant in fact that when we finally got an MMU it took months
| before we could turn it on in release builds, because there were
| so many memory corruption bugs even in the released product. The
| software just wouldn't work unless it could overwrite memory here
| and there.
| farmdve wrote:
| I remember back in the day of the Sony Satio U1, the last
 | Symbian v5 phone, the software was horrendous (screen tearing,
| random OS freezes) and later, the phone abandoned. I think it
| was afterwards that Sony and Ericsson split?
| hrtk wrote:
| I don't know windows programming and this was a very interesting
| (nightmare-ish) post.
|
| I had a few questions I asked ChatGPT to understand better:
| https://chatgpt.com/share/677411f9-b8a0-8013-8724-8cdff8dc4d...
|
| Very interesting insights about low level programming in general
| rramadass wrote:
| That's a pretty good use case for ChatGPT. Do you do this
| often? And if so, are your results specific to debugging
| consistently good?
| dahart wrote:
| It's been more than a decade since I worked in games, but for my
| entire games career, any and all use of C++ exceptions was
| strictly disallowed due to other horror stories. I was still bit
| in a very similar way by someone's C++ copy constructor trickery
| - a crash that only happened in a release build after playing the
| game for a while, with a stack corruption. Like the author, for
| me this was one of the hardest bugs I ever had to track down, and
| I ended up writing a tiny release mode debugger that logged call
| stacks in order to do it. Once I was able to catch the corruption
| (after several days of debugging during a crunch weekend),
| someone on my team noticed the stomp values looked like floating
| point numbers, and pretty quickly we figured out it was coming
 | from the matrix class trying to be too clever with its reference
| counting IIRC. There'd been a team of around a dozen people
| trying to track this down during overtime, so it suddenly hit me
| once we fixed it that someone's cute idea that took maybe 10
| seconds to write cost several tens of thousands of dollars to
| fix.
| ziml77 wrote:
| This could still happen without exceptions though, right? The
| flow is more explicit without exceptions, but returning an
| error code up the stack would have the same effect of causing
| the memory that select() is referencing to possibly be used for
| a different purpose when it writes its result.
| mark_undoio wrote:
| My reading is that the problem was specifically that they
| were injecting an exception to recover control from the C
| library back to their code.
|
| It seems like the select() was within its rights to have
| passed a stack allocated buffer to be written asynchronously
| by the kernel since it, presumably, knew it couldn't
| encounter any exceptions. But injecting one has broken that
| assumption.
|
| If the select() implementation had returned normally with an
 | error, or was expecting the exception, then I'd assume this wouldn't have
| happened.
| tom_ wrote:
| There are no error returns from an APC? The return type is
| void and the system expects the routine (whatever it is) to
 | return: https://learn.microsoft.com/en-us/windows/win32/api/winnt/nc...
 | - whichever call put the
| process in the alertable wait state then ends up returning
| early. This is a little bit like the Win32 analogue of POSIX
| signal handlers and EINTR, I suppose.
| intelVISA wrote:
| Clever code is always expensive, either you're paying for
| somebody smart to work at their cognitive peak which is less
| productive for them than 'simpler' code, or more likely you'll
| instead pay multiples more down the line for someone's hubris.
|
| I think this is the rare direction that more langs should
| follow Rust in that 'clever' code can be more easily
| quarantined for scrutiny via unsafe.
| hinkley wrote:
| The time when people see this the clearest is during an
| outage. You're in a stressful situation and it's being made
| worse and longer by clever code. Do you see now why maybe
| clever code isn't a good idea?
|
| Cortisol actually reduces the effectiveness of working
| memory. Write your code like someone looking at it is already
| having a bad day, because not only are odds very good they
| will be, but that's the costliest time for them to be looking
| at it. Probability x consequences.
| hinkley wrote:
| I appreciate the notion of Grug Brained development but at the
| end of the day it's just a slightly sardonic restatement of
| Kernighan's Law, which is easier to get people to buy into in
| negotiations.
|
| Grug brain is maybe good for 1:1 interactions or over coffee
| with the people you vent to or are vented to.
| DustinBrett wrote:
| It was just a dream, there's no such thing as 2.
| rectang wrote:
| After trying and failing over several days to track down a
| squirrely segfault in a C project about 15 years ago, I taught
| myself Valgrind in order to debug the issue.
|
| Valgrind flagged an "invalid write", which I eventually hunted
| down as a fencepost error in a dependency which overwrote their
| allocated stack array by one byte. I recall that it wrote "1"
| rather than "2", though, haha.
|
| > _Lesson learnt, folks: do not throw exceptions out of
| asynchronous procedures if you're inside a system call!_
|
| The author's debugging skills are impressive and significantly
| better than mine, but I find this an unsatisfying takeaway. I
| yearn for a systemic approach to either prevent such issues
| altogether or to make them less difficult to troubleshoot. The
| general solution is to move away from C/C++ to memory safe
| languages whenever possible, but such choices are of course not
| always realistic.
|
| With my project, I started running most of the test suite under
| Valgrind periodically. That took half an hour to finish
| rather than a few seconds, but it caught many similar memory
| corruption issues over the next few years.
| pjmlp wrote:
| Similar experience: spending a week in 2000 debugging memory
| corruption issues in production, with customer service pinging
| our team every couple of hours because it hit a high-profile
| customer, was the lesson for me.
| jart wrote:
| No the solution isn't to rewrite it in Rust. The solution is to
| have the option of compiling your C/C++ program with memory
| safety whenever things go loopy. ASAN, MSAN, and UBSAN are one
| great way to do that. Another up and coming solution that
| promises even more memory safety is Fil-C which is being made
| by Epic Games. https://github.com/pizlonator/llvm-project-
| deluge/blob/delug...
| kelnos wrote:
| The solution is usually not to do a rewrite, but I think for
| greenfield projects we should stop using C or C++ unless
| there is a compelling reason to do so. Memory-safe systems
| languages are available today; IMO it's professionally
| irresponsible to not use them, without a good reason.
|
| MSAN, ASAN, and UBSAN are great tools that have saved me a
| lot of time and headaches, but they don't catch everything
| that the compiler of a memory safe language can, at least not
| today.
| jart wrote:
| Rust isn't standardized. Last time I checked, everyone who
| uses it depends on its nightly build. Their toolchain is
| enormous and isn't vendorable. The binaries it builds are
| quite large. Programs take a very long time to compile. You
| need to depend on web servers to do your development and
| use lots of third party libraries maintained by people
| you've never heard of, because Rust adopted NodeJS'
| quadratic dependency model. Choosing Rust will greatly
| limit your audience if you're doing an open source project,
| since your users need to install Rust to build your
| program, and there are many platforms Rust doesn't support.
|
| Rust programs use unsafe a lot in practice. One of the
| greatest difficulties I've had in supporting Rust with
| Cosmopolitan Libc is that Rust libraries all try to be
| clever by using raw assembly system calls rather than using
| libc. So our Rust binaries will break mysteriously when I
| run them on other OSes. Everyone who does AI or scientific
| computing with Rust, if you profile their programs, I
| guarantee you 99% of the time it's going to be inside C/C++
| code. If better C/C++ tools can give us memory safety, then
| how much difference does it really make if it's baked into
| the language syntax? Rust can't prove everything at compile
| time.
|
| Some of the Rust programs I've used like Alacritty will
| runtime panic all the time. Because the language doesn't
| actually save you. What saves you is smart people spending
| 5+ years hammering out all the bugs. That's why old tools
| we depend on every day like GNU programs never crash and
| their memory bugs are rare enough to be newsworthy. The
| Rust community has a reputation for toxic behavior that
| raises questions about the reliability of its
| governance. Rust evangelizes its ideas by attacking other
| languages and socially ostracizing the developers who use
| them. Software development is the process of manipulating
| memory, so do you really want to be relinquishing control
| over your memory to these kinds of people?
| AlotOfReading wrote:
| Ubsan is fantastic, but ASAN and the rest have serious
| caveats. They're not suitable for production use and they
| have a tendency to break in mysterious, intermittent ways.
| For example, Ubuntu 24.04 unknowingly broke Clang <=15ish
| when it increased mmap_rnd_bits. ASAN on Windows will
| actually check if you have ASLR enabled, disable it, and
| restart at entry. They interact in fun ways with LD_PRELOAD
| too.
| ch33zer wrote:
| Doesn't C++ already support everything you need here? It
| supports the noexcept keyword which should have been used in
| the interface to this syscall. That would have prevented
| throwing callbacks from being used at compile time. My guess is
| that this is a much older syscall than noexcept though.
| masspro wrote:
| noexcept doesn't prevent any throws at compile-time, it
| basically just wraps the function in a `catch(...)` block
| that will call std::terminate, like a failed assert. IMHO it
| is a stupid feature precisely because of this confusion.
| diekhans wrote:
| Nicely written (and executed). Worse than my worst memory
| corruption.
| tomsmeding wrote:
| Everyone here is going on about exceptions bad, but let's talk
| about QueueUserAPC(). Yeah, let's throw an asynchronous interrupt
| to some other thread that might be doing, you know, anything!
|
| In the Unix world we have this too, and it's called signals, but
| every documentation about signals is sure to say "in a signal
| handler, almost nothing is safe!". You aren't supposed to call
| printf() in a signal handler. Throwing exceptions is unthinkable.
|
| I skimmed the linked QueueUserAPC() documentation page and it
| says none of this. Exceptions aren't the hand grenade here (though
| sure, they're nasty) -- QueueUserAPC() is.
| Veserv wrote:
| That does not seem to be correct. The documentation indicates
| APC [1] can only occur from a waiting/blocking state. So, the
| program is in a consistent state and can only be on a few known
| instructions, unlike signals. As such, most functions should be
| safe to call.
|
| This is more like select() sometimes calling a user-supplied
| function in addition to checking for I/O.
|
| [1] https://learn.microsoft.com/en-
| us/windows/win32/sync/asynchr...
| tomsmeding wrote:
| I see, I read the docs slightly too quickly. Still, though, I
| would have expected a conspicuous warning about exceptions in
| those calls, because MS is in on C++ (so they can't hide
| behind "but we expected only C") and apparently(?) the APC
| machinery doesn't catch and block exceptions in user code.
| cryptonector wrote:
| They call it an "alertable state", and that's:
|
| > A thread enters an alertable state when it calls the
| SleepEx, SignalObjectAndWait, MsgWaitForMultipleObjectsEx,
| WaitForMultipleObjectsEx, or WaitForSingleObjectEx function.
|
| So this is a lot less like Unix signals. It only really works
| if the thread you're doing the async procedure call to is one
| that's likely to use those.
|
| So APCs are safe enough -- a lot safer than Unix signal
| handlers.
| rramadass wrote:
| This comment has ChatGPT explain the problem which is
| surprisingly understandable -
| https://news.ycombinator.com/item?id=42559401
| cyberax wrote:
| A small request: please stop using automatic translation for blog
| posts or documentation.
|
| Especially when I still have English set as the second priority
| language.
| jacinda wrote:
| Related (and hilarious):
| https://scholar.harvard.edu/files/mickens/files/thenightwatc...
|
| > What is despair? I have known it--hear my song. Despair is when
| you're debugging a kernel driver and you look at a memory dump
| and you see that a pointer has a value of 7. THERE IS NO HARDWARE
| ARCHITECTURE THAT IS ALIGNED ON 7. Furthermore, 7 IS TOO SMALL
| AND ONLY EVIL CODE WOULD TRY TO ACCESS SMALL NUMBER MEMORY.
| Misaligned, small-number memory accesses have stolen decades from
| my life.
|
| All James Mickens' USENIX articles are fun (for a very specific
| subset of computer scientist - the kind that would comment on
| this thread). https://mickens.seas.harvard.edu/wisdom-james-
| mickens
| hinkley wrote:
| I don't know if it's still a thing but there used to be
| debugging tools that would put a page of memory marked as
| either read only or unreadable in front of every malloc call so
| that any pointer arithmetic with a math error would trigger a
| page fault which could be debugged. It worked in apps that
| didn't use too much of total memory or too many fine grained
| allocations. I mean obviously turning every 8 byte pointer into
| a whole memory page could consume all of memory very quickly.
| But in front of arrays or large data structures that could
| work.
| lupire wrote:
| I don't understand. Pointers aren't numbers, and can only be
| compared when inside a common array. What is small number
| memory?
|
| :-)
| MobiusHorizons wrote:
| I realize you are probably referring to UB in c/c++, but of
| course in hardware memory addresses are numbers. And when
| debugging, it's really the hardware version of events that
| matters, since the compiler has already done whatever
| optimizations it wants.
| hinkley wrote:
| The second piece of code I wrote for pay was a FFI around a c
| library, which had callbacks to send incremental data back to the
| caller. I didn't understand why the documented examples would re-
| acquire the object handles every iteration through the loop so I
| dropped them. And everything seemed to work until I got to the
| larger problems, and then I was getting mutually exclusive states
| in data that was marked as immutable in some of the objects. I
| pulled my hair out over this for days.
|
| What ended up happening is that if the GC ran inside the callback
| then the objects the native code could see could move, and so the
| next block of code was smashing the heap by writing to the wrong
| spots. All the small inputs finished before a GC was called and
| looked fine but larger ones went into undefined behavior. So
| dumb.
| jart wrote:
| This kind of error is a rite of passage with WIN32 programming.
| For example, to do nontrivial i/o on Windows you have to create
| an OVERLAPPED object and give it to ReadFile() and WriteFile()
| which will return a pending status code, and write back to your
| OVERLAPPED object once the i/o has completed. Usually it makes
| the most sense to put that object on the stack. So if you return
| from your function without making sure WIN32 is done with that
| object, you're going to end up with bugs like this one. You have
| to call GetOverlappedResult() to do that. That means no throwing
| or returning until you do. Even if you call CancelIoEx()
| beforehand, you still need to call the result function. When you
| mix all that up with your WaitForMultipleObjects() call, it ends
| up being a whole lot of if statements you could easily get wrong
| if the ritual isn't burned into your brain.
|
| UNIX system calls never do this. The kernel won't keep references
| to pointers you pass them and write to them later. It just isn't
| in the DNA. The only exceptions I can think of would be clone(),
| which is abstracted by the POSIX threads runtime, and Windows-
| inspired non-standard event i/o system calls like epoll.
| dataflow wrote:
| > UNIX system calls never do this. The kernel won't keep
| references to pointers you pass them and write to them later.
| It just isn't in the DNA. The only exceptions I can think of
| would be clone(), which is abstracted by the POSIX threads
| runtime, and Windows-inspired non-standard system calls like
| epoll.
|
| I mean, this is because the UNIX model was based on readiness
| rather than completion. Which is slower. Hence the newer I/O
| models.
| jart wrote:
| System calls take 10x longer on Windows than they do on UNIX.
|
| That's what i/o completion ports are working around.
|
| They solve a problem UNIX doesn't have.
| cryptonector wrote:
| There are evented I/O APIs for Unix, though anything other
| than select(2) and poll(2) is non-standard, and while some do
| let you use pointer-sized user cookies to identify the events
| I've never seen a case where a programmer used a stack
| address as a cookie. I _have_ seen cases where the address
| used as the cookie was freed before the event registration
| was deleted, or before it fired, leading to use-after-free
| bugs.
| cryptonector wrote:
| > The fix was pretty straightforward: instead of using
| QueueUserAPC(), we now create a loopback socket to which we send
| a byte any time we need to interrupt select().
|
| This is an absolutely standard trick that is known as "the self-
| pipe trick". I believe DJB created and named it. It is used for
| turning APIs not based on file handles/descriptors into events
| for an event loop based on file handles/descriptors, especially
| for turning signals into events for
| select(2)/poll(2)/epoll/kqueue/...
___________________________________________________________________
(page generated 2024-12-31 23:00 UTC)