[HN Gopher] Debugging memory corruption: who the hell writes "2"...
       ___________________________________________________________________
        
       Debugging memory corruption: who the hell writes "2" into my stack?
       (2016)
        
       Author : darknavi
       Score  : 373 points
       Date   : 2021-11-14 06:24 UTC (16 hours ago)
        
 (HTM) web link (blog.unity.com)
 (TXT) w3m dump (blog.unity.com)
        
       | EdwardDiego wrote:
       | Now that was an enjoyable debugging story, but what was their
       | callback doing that threw an exception?
        
         | gulbanana wrote:
         | They threw it on purpose as a way to interrupt select().
        
       | kajaktum wrote:
        | Damn, all this happened in a mere FIVE days? Really shows how
        | far I need to grow as a developer lol
        
         | otherme123 wrote:
         | And still you'll always find a guy that claims he would have
         | found the bug in 5 minutes.
        
           | rendall wrote:
           | Bingo
           | 
           | https://news.ycombinator.com/item?id=29216076
        
             | jacoblambda wrote:
             | Like others have already mentioned, that's a developer who
             | saw the blood on the walls. They've likely seen this issue
             | before and know the signs to look out for. It doesn't make
             | them any smarter but it does mean they are wiser/more
             | experienced with this tech stack.
             | 
             | As a personal example, my team sees me as the consult for
             | all the arcane fuckery that you find as a result of unusual
             | C++ behaviour or build system issues. This doesn't make me
             | a good dev but it does mean that I've been through my paces
             | there and bled my share. But at the same time I'm
             | practically a blind, helpless child wrt most of the JVM or
             | web tech stacks. IMHO my colleagues are wizards for being
             | able to make heads or tails of issues in those spaces.
             | 
             | Point being, this isn't a matter of "Oh I could solve that
             | in seconds", it's a matter of "Oh god I remember dealing
             | with this type of issue. Here's the cause and how to fix &
             | avoid it".
        
             | Jensson wrote:
             | And that comment is very helpful and informative so I think
             | it is great we have people like that posting! Not sure why
             | anyone would call them out.
        
             | DougBTX wrote:
             | The difference is between working out something from
             | scratch vs recognising something you've seen before. The
             | second will be much faster, but it depends on experience,
             | not talent.
        
           | praptak wrote:
           | And he's probably right. And it also doesn't mean anything.
           | 
           | There are thousands of geeks reading an interesting debug
           | report. It is not very surprising to find one who'd think of
           | the right idea as the first thing.
        
           | [deleted]
        
           | m_mueller wrote:
           | maybe, but to get there it takes a lifetime. (paraphrasing
           | Picasso here)
        
       | dane-pgp wrote:
       | "It was just a dream, Bender. There's no such thing as two."
       | 
       | https://www.tvfanatic.com/quotes/bender-what-is-it-whoa-what...
        
       | bullen wrote:
       | Who uses non-blocking networking on the client?
       | 
       | It's not like you don't have cores/threads for blocking when you
       | just have one server, how many servers are they going to connect
       | to?
       | 
       | Or is this for the Unity server part?
        
         | spacechild1 wrote:
         | Actually, the select() call was _blocking_. Non-blocking in
         | this context would mean periodically calling select() with a
         | timeout of 0.
        
           | bullen wrote:
           | Select being blocking is normal, non-blocking IO does not
           | depend on select being non-blocking.
           | 
           | Select returns the list of non-blocking sockets that have
           | data read/write pending.
        
             | spacechild1 wrote:
             | select() is just a means to multiplex network I/O. It can
             | be either blocking or non-blocking, depending on your
             | application architecture. I have worked on both kinds.
             | 
             | BTW, the sockets themselves don't even have to be non-
             | blocking.
             | 
              | Also, I'd like to stress that the concepts of blocking/
              | non-blocking I/O and synchronous/asynchronous I/O are
              | really orthogonal. You can do synchronous networking with
              | non-blocking sockets and vice versa.
        
               | spacechild1 wrote:
               | Or to put it another way: depending on the timeout value,
               | select() can achieve blocking or non-blocking I/O on
               | several sockets simultaneously. In both cases, the
               | sockets themselves can be either blocking or non-
               | blocking.
               | 
               | There is no practical difference between blocking on
               | select() for multiple sockets or blocking on recv() for a
               | single socket - in both cases the thread can't do
               | anything else.
        
         | kevin_thibedeau wrote:
         | > It's not like you don't have cores/threads for blocking when
         | you just have one server
         | 
         | Some platforms don't have these luxuries.
        
         | lyptt wrote:
         | From what I've seen, it's a common convention with game
         | development to do non-blocking networking and update your
         | network state every tick of your update loop.
         | 
         | It fits in nicely with how everything else works in your game
         | loop, and means you don't need to deal with marshalling data
         | to/from a dedicated thread.
        
           | bullen wrote:
           | You should keep the network async. from the rendering yes,
           | but why non-blocking?
        
         | fomine3 wrote:
         | Why blocking?
        
           | bullen wrote:
            | Because it's simple, and on a client that doesn't have many
            | connections there is no point in using non-blocking just
            | because it's the latest tech!
        
             | Jensson wrote:
              | Non-blocking IO is ancient tech. It was done because
              | spinning up more threads was expensive and made
              | performance unreliable. Thread overhead is much less of an
              | issue today, so just spinning up a new thread and running
              | everything with blocking calls is the modern way of doing
              | things. But many of the idioms for games are from a long
              | time ago, when not every CPU was multi-core: thread
              | switching overhead was huge, so they had to do
              | non-blocking IO to run well.
        
               | bullen wrote:
               | So if your client has one thread for networking why use
               | non-blocking?
               | 
               | The age is relative, sockets are from the 70s, non-
               | blocking is 2000+ and in most languages it's only stable
               | since 2010+.
        
               | Jensson wrote:
               | > So if your client has one thread for networking why use
               | non-blocking?
               | 
               | No, the client would have one thread for everything, not
               | one thread for networking. You shouldn't do blocking IO
               | if you only have one thread that also needs to do other
               | things.
        
               | nikanj wrote:
               | Overlapped (=async) IO has been in Windows since NT came
               | out, so mid-90s. I don't even know what you mean with
               | "most languages". Most programming languages are only
               | stable since 2010+ anyway, we get more languages every
               | year.
        
       | injidup wrote:
       | This is the sort of thing that would have had me breaking my
       | keyboard over the monitor. The sentinel trick is nice though
       | isn't it possible for compilers to insert such sentinels if
       | requested?
        
         | kevingadd wrote:
         | Yeah, they're typically called stack canaries or cookies by the
         | compilers I've used. I doubt they would have caught this
         | though.
        
       | vardump wrote:
       | Immediately guessed the culprit after seeing the stack trace,
       | because (network) I/O was occurring.
       | 
       | Windows overlapped (=async) I/O to a stack allocated struct is
       | pretty certain to blow up just like that, sooner or later.
       | 
        | If you guard for it, it's worse: since canceling is
        | asynchronous as well, at least in theory you could end up in a
        | scenario where you just can't reliably ever let execution
        | continue again. Luckily this should be extremely rare... I
        | hope.
       | 
       | Much better idea: allocate async data structures on the heap. In
       | the worst case you leak some memory. Just never deallocate until
       | the I/O has actually completed.
       | 
        | A variation of the same can happen at the hardware level as
        | well. Don't DMA into the stack unless you can really be sure
        | it'll complete fast enough, or you can afford to wait for it.
        
         | quietbritishjim wrote:
         | Hmm, you immediately knew that async IO was the culprit... but
         | it wasn't, according to the article. The function that called
         | the async IO (which was a Windows API function, not in Unity)
         | was correctly waiting until the IO completed before releasing
         | the struct (by returning).
         | 
         | The actual problem was that the rug was being whipped from
         | under it by an exception being injected into a place where it
         | ought to have been impossible. Your suggestion of using heap
         | allocation instead wouldn't have fixed anything; it would've
          | turned the crash into a memory leak, which would've just
          | masked the problem.
         | 
         | The solution they went for in the end addressed this real
         | issue, and the async IO (happening under the hood of that
         | Windows API function) remains unchanged.
        
           | vardump wrote:
           | Well, overlapped (=async) IO was what _ultimately_ overwrote
           | stack memory. The hazard pattern I recognized from previous
           | experience.
           | 
            | Of course, the exception in the queued APC was the reason
            | why select's WFSO was prematurely interrupted and the stack
            | unwound, so the bug was of course there. It's pretty
            | dangerous to execute code in the context of a waiting
            | thread.
           | 
           | While the solution to fix it by using heap allocation is a
           | bit ugly in some sense, if your choices are to abort
           | execution completely or to infrequently leak some bytes of
           | memory, I know I'd choose to leak and perhaps log what just
           | happened.
           | 
           | There are rare cases when the I/O never completes and
           | CancelIo doesn't complete either. Probably hardware faults or
           | kernel driver bugs. Stuff that should absolutely never
           | happen, but can still be defensively programmed around.
        
         | spockz wrote:
          | Would allocating on the heap have been better than the
          | workaround that they ended up using? I suppose there is
          | something to say for having the same mechanism on multiple
          | platforms.
        
           | Jare wrote:
           | I suspect it would have corrupted the heap instead of the
           | stack. If so, this would be even harder to diagnose, because
           | the corruption would show up elsewhere unrelated to the
           | actual problem.
           | 
            | The underlying problem is that you:
            | 
            | 1) queue an asynchronous callback and reserve some memory
            | for it to use;
            | 2) wait (or so you think) for the function that calls the
            | callback to be done and return;
            | 3) something interrupts the wait, but the callback is still
            | queued for execution sometime in the future;
            | 4) since the wait is over, you assume the callback was
            | called, and the memory it was going to use can be safely
            | freed;
            | 5) if the memory was on the stack, it gets freed purely by
            | returning; if it was on the heap, you'd deallocate it;
            | 6) the callback that was still queued fires up and writes
            | to memory it should no longer be able to use.
           | 
           | To really fix this situation, you'd need to free the memory
           | in the callback. This is not possible in this case unless you
           | can rewrite the library select() function significantly,
           | perhaps even rewriting other parts (not familiar enough with
           | those APIs to be really sure).
           | 
           | Even in similar situations where it is possible (e.g. if you
           | have garbage collection), it will often turn a memory
           | corruption into a logical corruption, because you have a
           | callback fired related to something that is not relevant
           | anymore, and any side effects of that callback will happen
           | without the proper context. (in this case the only side
           | effect is probably the memory access, so in this particular
           | instance the problem would be "fixed")
           | 
            | [Edit: if the heap allocation was exception-aware (e.g.
            | RAII that gets called during exception unwinding), then you
            | would get heap corruption. If it was manual deallocation
            | and thus skipped during exception unwind, then indeed you
            | would just get a memory leak, exactly as you said.]
        
             | vardump wrote:
              | Somehow the lifetime must be at least as long as the
              | async operation takes to complete. All other scenarios
              | could result in memory corruption.
             | 
             | If there's no other option, I'd rather leak that memory.
        
         | [deleted]
        
       | viktorcode wrote:
       | > Lesson learnt, folks: do not throw exceptions out of
       | asynchronous procedures if you're inside a system call!
       | 
       | Maybe unpopular opinion, but don't throw exceptions at all,
       | anywhere. It isn't worth it.
        
       | spacechild1 wrote:
       | > The fix was pretty straightforward: instead of using
       | QueueUserAPC(), we now create a loopback socket to which we send
       | a byte any time we need to interrupt select().
       | 
       | Ha, I have always been using this trick to break from a blocking
       | select() call. I feel less evil now.
       | 
       | On Linux and macOS you could also use a pipe instead since
       | select() doesn't only work with sockets.
       | 
       | On Windows you can alternatively use WSAEventSelect
       | (https://docs.microsoft.com/en-us/windows/win32/api/winsock2/...)
       | on all sockets and have another Event just for signalling, but
       | the downside is that it forces your sockets to be non-blocking.
        
         | throwaway9870 wrote:
          | This is a common enough technique that Linux standardized it
          | to some degree:
          | https://man7.org/linux/man-pages/man2/eventfd.2.html
        
           | spacechild1 wrote:
           | Ah yes, forgot about those... A bit similar to the Windows
           | solution. Unfortunately, the loopback socket method is the
           | only cross-platform solution I am aware of and I always kind
           | of felt bad using it...
        
       | oshiar53-0 wrote:
       | Relevant:
       | https://devblogs.microsoft.com/oldnewthing/20110512-00/?p=10...
        
       | fxtentacle wrote:
       | I read this and couldn't shake the feeling that it's a fantasy
        | tale because it mentions the mystical Unity support engineer who
        | actually fixes bugs - the one I haven't been able to get a hold
        | of for the past 5 years, despite having critical bugs escalated
       | by support... But then I noticed this is from 2016, so before
       | they started kicking out experienced people to polish their
       | profit margin for the IPO.
       | 
       | Still an amazing debugging story, though :)
        
         | nhoughto wrote:
          | Had the same thought; my experience with Unity support is
          | that it's a black hole. Good that there are some good
          | examples, even if from times past.
        
           | fxtentacle wrote:
           | I think it's a sort of tragedy of the commons situation.
           | 
           | When I first met some of the Unity people at the Nordic Game
           | Jam in 2012, they seemed like an awesome team making game
           | development accessible to everyone. At that time, UE was
           | still closed source and out of every indie's reach. And Unity
           | was Mac-only. Since then, Unity has become the de-factor
           | platform for cheap outsourced mobile games, reskin spam, and
           | app store gambling scams. They now have millions of
           | developers using their game engine, but it's an army of
           | people building 1EUR apps. Obviously, you can't charge them
           | much because their revenue is so low. And so they had to make
           | support cheap to compensate.
           | 
           | But the part which I don't get is why Unity doesn't offer
           | paid support for medium-sized indie studios. And why they
           | don't publicly sell source code access. I mean their biggest
           | competitor UE4/5 already has all source code on GitHub.
        
             | iandanforth wrote:
             | "De facto" - No hyphen, no 'r'.
             | 
             | https://en.wikipedia.org/wiki/De_facto
             | 
             | Post left in the spirit that the OP was a typo, but that
             | someone else might not know and be interested in the Latin
             | roots of this phrase.
        
               | fxtentacle wrote:
               | Thanks :)
        
             | Karliss wrote:
             | They sell both paid support and source code for extra money
             | in addition to the subscription. It's in the plan
             | comparison table.
        
               | fxtentacle wrote:
               | I contacted them about it in the past and they refused to
               | even tell me the price. That's why I said they need to
               | publicly sell it, because as-is I don't know anyone who
               | succeeded in buying it.
        
               | Jasper_ wrote:
               | It's determined on a per-contract basis. Expect to spend
               | quite a few months making a deal, and having it come with
               | a bunch of strings attached.
               | 
               | The company I worked for had Unity source access, but we
               | were required to submit any source changes back to Unity
               | (even if they chose not to integrate them), and announce
               | our game at UnityCon 2018.
               | 
               | They also wanted us to help write some of the cutscene
               | timeline tools. I don't know if they used any of the code
               | we wrote, but we definitely did a bunch of work on those
               | timeline tools and sent them back to them.
        
               | franknine wrote:
               | This is worse than I could possibly imagine. Did your
               | studio at least get credited for the work on timeline
               | tools?
        
               | Jasper_ wrote:
               | I don't see why we would -- I don't think Unity has any
               | individual credits listed for their engine developers.
               | That's standard for software development, AFAIK Adobe is
               | really the only mainstream software company that still
               | has credits listed in their product releases.
        
             | gfxgirl wrote:
             | They do sell support
             | 
             | https://unity.com/success-plans
        
               | franknine wrote:
                | The paid support is not cheap and mostly useless, at
                | least in the Asia region. We still needed to go through
                | the normal bug report system, create reproduction steps
                | for them, and got denied back-ported bug fixes every
                | single time. The only thing that changed is that
                | there's a "bug fix progress update" to make you feel a
                | little bit better. We terminated the support right
                | after the term expired.
        
             | Nition wrote:
             | > In 2012...Unity was Mac-only
             | 
             | The timeline is a bit off here. Unity was initially Mac-
             | only, but only for a short time. It ran on Windows from
             | version 2.5 in 2009.[1]
             | 
             | [1]https://blog.unity.com/technology/unity-25-for-mac-and-
             | windo...
        
         | doomlaser wrote:
         | Luckily, this particular developer is still at Unity and still
         | fixing bugs--and doing the world a service by writing
         | fascinating breakdowns about them
         | https://blog.unity.com/technology/fixing-time-deltatime-in-u...
        
         | ncmncm wrote:
         | "Mythical", I guess you mean. A mystical engineer might be even
         | less useful.
        
           | fxtentacle wrote:
           | :)
           | 
           | I now looked up "mystical" in Merriam-Webster: "having a
           | spiritual meaning [..] that is neither apparent to the senses
           | nor obvious to the intelligence"
           | 
           | That sounds like a pretty good description of my relationship
           | with Unity's support engineers. I hope they exist, but to me
           | it's more of a religious thing because I have never met one.
           | 
           | "mythical" says: "existing only in the imagination:
           | fictitious, imaginary, having qualities suitable to myth:
           | legendary".
           | 
           | That also fits well, because so far they exist only in my
           | imagination but reading this story, I am impressed by their
           | legendary problem-solving power.
           | 
           | Thanks to you, today I learned the difference between two
           | similar words :) And now I can even more clearly articulate
           | someone's progress on being less useful ^_^
        
             | nikanj wrote:
             | You can pray to god, and you can pray to unity support.
             | Many people claim they exist and perform miracles, but
             | hardly anyone can point to fungible proof
        
               | dzdt wrote:
               | Which makes _tangible_ vs. _fungible_ your next
               | dictionary dive :^)
        
               | wiredfool wrote:
                | Well, non-tangible token makes a lot of sense.
        
       | azalemeth wrote:
       | Out of curiosity, is it possible to disable ASLR easily on
       | Windows? It's trivial to change on Linux and that (plus Valgrind)
       | has occasionally made obscure debugging life easier.
        
         | kevingadd wrote:
         | The executable has to have ASLR enabled as a flag, so you just
         | compile without the flag
        
           | kuroguro wrote:
           | Also can disable globally, for ex. if source isn't available.
           | 
           | https://stackoverflow.com/a/9561263/1895684
        
             | pongo1231 wrote:
             | Alternatively it's also possible to disable ASLR in the
             | executable itself by removing the flag 0x40 (opt-in flag
             | for ASLR) in the DllCharacteristics field of the header.
             | There's a neat little tool called SetDllCharacteristics[0]
             | which can do that.
             | 
             | [0] https://blog.didierstevens.com/2010/10/17/setdllcharact
             | erist...
        
       | Const-me wrote:
       | Saw the red flag on the very first page:
       | 
       | > the socket polling thread then dequeues these requests one by
       | one, calls select()
       | 
        | The select() API is not good, regardless of the platform.
       | 
       | If you like the semantics and developing for Linux, use poll()
       | instead. It does exactly the same thing, but the API is good.
       | 
       | Windows API supports thread pools, it's usually a good idea to
       | use StartThreadpoolIo instead.
        
         | zbentley wrote:
         | How would poll() have helped in this case? It seems like the
         | error condition in the article (interrupting syscalls in
         | unexpected ways can trash whatever state your syscall wrapper
         | manages for e.g. return values) is a risk regardless of the
         | blocking operation being performed.
        
           | Const-me wrote:
            | poll is not available on Windows. The semantics are similar
            | to WaitForMultipleObjects with overlapped IO, but not quite
            | the same.
           | 
           | Thread pool API doesn't have this class of bugs. One doesn't
           | need to wake up any threads to change the set of handles
           | being polled.
        
       | josalhor wrote:
        | I may not fully remember the details of these articles, but
        | reading them has in the past helped future me debug very big
        | and complicated issues.
        | 
        | I believe this comes from the fact that I am presented with a
        | new _kind_ of bug: an async syscall corrupting memory because
        | the stack was unwound (by an exception). And I integrate this
        | into my mental debug checklist.
        
       | dekhn wrote:
       | mixing threads and select? there's your problem right there!
       | Don't use select in threaded programs.
        
       | oshiar53-0 wrote:
       | > How does one interrupt select() function? Apparently, we used
       | QueueUserAPC() to execute asynchronous procedure while select()
       | was blocked... and threw an exception out of it!
       | 
       | <rant>
       | 
        | Let's just ignore not using overlapped I/O and completion ports
        | for a moment. How did using APCs even _come across_ as a good
        | idea? Like, perhaps a dedicated thread pool for async jobs
        | would work a little better?
       | 
        | If you need to interrupt _that_ specific thread though, sorry
        | for that. Perhaps uhh CancelSynchronousIo() would work since
        | there's IO_STATUS_BLOCK? Or maybe call CancelIo() on the socket
        | in that same APC. However it goes, throwing exceptions in an
        | APC isn't something I'd like to see in production code.
       | 
        | Also, even in the Unix world, the select(2) system call is
        | _infamously_ known as a buffer overflow landmine [1]. Just
        | FD_SET any fd >= FD_SETSIZE and BOOM. If you're on Unix, at
        | least use poll() or better. If you're on Windows (and can't
        | afford switching to proper async I/O), maybe WSAEventSelect()
        | would work better?
       | 
       | </rant>
       | 
        | [1]: This does not apply to Windows Sockets though, where
        | fd_set is not a bit array but internally just an FD array in
        | disguise. You get FD_SETSIZE = 64 though, which is
        | conspicuously close to MAXIMUM_WAIT_OBJECTS in
        | WaitForMultipleObjects.
        
       | anaisbetts wrote:
        | Time-Travel Debugging is practically made for problems like
        | this (though for this particular app it might be hard because
        | it's Too Big) -
        | https://docs.microsoft.com/en-us/windows-hardware/drivers/de...
        | - hit your first corruption, set the memory breakpoint, then
        | just go backwards.
        
         | drmeister wrote:
         | This should be the top response. With a time traveling debugger
         | this would take 15 minutes to solve (minus the time to make the
         | crash happen). On linux I use the Undo debugger - I don't own
         | stock - I just love the product.
        
       | chris_wot wrote:
       | "Somebody had been touching my sentinel's privates - and it
       | definitely wasn't a friend."
       | 
       | A non-C++ programmer might find this weird.
        
       | [deleted]
        
       | rsp1984 wrote:
       | _When another thread pushes a socket to be processed by the
       | socket polling thread, the socket polling thread calls select()
       | on that function. Since select() is a blocking call, when another
       | socket is pushed to the socket polling thread queue it has to
       | somehow interrupt select() so the new socket gets processed ASAP.
       | How does one interrupt select() function? Apparently, we used
       | QueueUserAPC() to execute asynchronous procedure while select()
       | was blocked... and threw an exception out of it!_
       | 
       | Not taking anything away from the great debugging, but that is
       | just incredibly bad design and reeks of bugs from a mile away.
       | System calls block for a reason. If you can't handle the
       | blocking, how about a non-blocking architecture? But in this
       | case, the engineer thought _" hey, I don't want to block on my
       | precious main thread, so let's just give this blocking thing to
       | another thread, that'll surely solve it!"_. Except it doesn't
       | when you use that same thread as a worker to execute all other
       | thread's blocking crap ... but then write the queue push()
       | function to be blocking as well!
       | 
       | Keep it simple folks. Bad design always breeds more bad design.
        
         | syntheticnature wrote:
         | It's particularly surprising to me because the whole point of
         | select() is to handle these sorts of cases easily. The
         | technique I've often seen is just to have a pipe or other local
         | connection to the thread using select, and then you wake up
         | select when you need it to add something to the queue.
        
           | aftbit wrote:
           | That was their fix.
        
             | syntheticnature wrote:
             | My over-editing of my comment got rid of the sense of "they
             | knew it was the fix, why not do that in the first place?"
             | 
              | Which, admittedly, sometimes folks just don't know, but
              | more often I run into complete cluelessness that
              | select()/poll() and friends even exist.
        
         | robflynn wrote:
         | I was once brought over to a project that used exceptions for
         | flow control, sometimes nested several layers deep. It was a
         | nightmare to debug and untangle.
        
           | akira2501 wrote:
            | What's funny to me is that their solution is very natural
            | and commonly used on Unix. My guess is that's because
            | signal handling on Unix is much more prevalent, and it
            | brings with it many of the same control flow issues you'd
            | see with exceptions.
        
         | jstimpfle wrote:
         | I'd go for a different approach: Add some cancellation event
         | for select() to listen to as well. Then, when another thread
         | submits a new event source, it's only a matter of signalling
         | the cancellation event, so the select() thread can update its
         | poll set.
         | 
         | "Non-blocking" sounds very special, but for anything except
         | batch programs it should be the default, because whenever there
         | is more than one possible input (an input can be as trivial as a
         | cancellation signal) you cannot afford to be blocked.
         | 
         | Then again, that doesn't mean that one shouldn't use select().
         | Unless one goes ultra-high frequency where the CPU is busy
         | almost 100% of the time, select() saves CPU cycles, and
         | decreases response time. Without select(), finding an
         | appropriate polling frequency is a tradeoff between wasted
         | compute cycles and increased response time.
        
           | zbentley wrote:
           | > I'd go for a different approach: Add some cancellation
           | event for select() to listen to as well.
           | 
           | That's exactly what the author of the article did.
        
             | [deleted]
        
         | oshiar53-0 wrote:
         | https://www.youtube.com/watch?v=k238XpMMn38 but it's the
         | rapidly dwindling sanity of unity programmers
        
       | donatj wrote:
       | The people with the skill and knowledge to debug something like
       | this truly inspire me.
       | 
       | Reading low level debugging war stories like this it always
       | amazes me that computers work at all let alone as well as they
       | do. It's surely due to people like this carrying the torch for
       | the rest of us.
        
       | ptsneves wrote:
       | I come from Linux land so some of the details flew over my head.
       | With that said, is there not a valgrind equivalent for Windows?
       | The sentinel thing seems to be default functionality in the
       | valgrind virtual machine. I would expect that the initial part of
       | the investigation would be automatically done by valgrind. Of
       | course reading the whole post it looks like it would not get very
       | far.
       | 
       | The other thing is, how can a system call block a thread but
       | somehow that same thread runs other stuff in between that
       | blocking call? Even if that is possible that sounds like a
       | dangerous game to play, and this class of bugs seems natural.
       | 
       | I can imagine the scenario where a thread is cancelled mid system
       | call, but it will be the kernel that cancels the thread and is
       | responsible for knowing that halfway-completed syscalls need to be
       | cleaned up. That way the application does not leak system calls?
       | Also if i remember one debugging session correctly, when you have
       | a blocking syscall interrupted you will get a sigabort on linux.
        
         | CodesInChaos wrote:
         | The syscall writing the data is async, which requires the
         | memory it uses to be available until it completes. Since
         | this was on the stack here, it means that the function that
         | contains this memory on its stack must not return until IO
         | completes.
         | 
         | But the code that was supposed to wait until completion was
         | interrupted by an exception, which unwound the stack beyond
         | the function owning the relevant memory. So the completing
         | async operation writes into memory owned by somebody else now.
         | 
         | This is a general complication with completion based async IO.
         | You could get the same problems without any interrupts, if you
         | just don't wait for async IO to complete in some circumstances.
        
           | CamperBob2 wrote:
           | More like a general complication with SEH. One of several
           | reasons why I don't allow it in any code I touch.
           | 
           | Anything that obscures the flow of control is a violation of
           | literate programming principles, and exceptions are pretty
           | high on that particular shit list.
        
           | _nalply wrote:
           | Thank you. Now I understand why Rust async is so complicated.
           | It's about lifetimes. It's dangerous to write data
           | asynchronously because... eh... that location might be
           | already invalid.
           | 
           | I know this is not about Rust. Forgive me that I mentally
           | connect this story with async Rust.
        
             | CodesInChaos wrote:
             | There are two ways how async IO can work:
             | 
             | * Readiness based: You use `select` (or preferably its
             | successors) to figure out which file descriptor has data
             | available. Then you synchronously read the data, which
             | completes instantly because there is data available. This
             | is the traditional way on Linux, and also the way Rust's
             | async works. This approach doesn't run into any lifetime
             | issues, because you only need to provide the memory (`&mut
             | [u8]`) to a sync call once data is available. This approach
              | works well for sequential streams, like sockets or pipes,
             | but isn't a natural match for random access file IO.
             | 
             | * Completion based: You trigger an IO operation specifying
             | the target memory. The system notifies you when it
             | completes. This approach needs you to keep the memory
             | available until the operation finishes. This is the
             | traditional way on Windows (IO completion ports) and also
             | used by the more recent io_uring on Linux. This one runs
             | into the lifetime issues that caused the bug in the linked
             | article. Thus in Rust it can't safely be used with a `&mut
              | [u8]`; you have to transfer ownership of the buffer to the
             | IO library, so it can keep it alive until the operation
             | completes. The ringbahn crate is an example of this
             | approach in Rust.
        
               | oshiar53-0 wrote:
                | Note that what you're referring to as "readiness-based
                | I/O" is not asynchronous I/O in the strictest sense --
                | it's actually _non-blocking, synchronous_ I/O.
               | 
                | I find this article helpful in clearing up confusion of
                | asynchronous vs non-blocking I/O:
                | http://blog.omega-prime.co.uk/?p=155
        
               | yxhuvud wrote:
               | Good writeup. One thing to note is that Go, Crystal and
               | other systems based on green threads tend to not run into
               | the problem as they have separate stacks for different
               | execution contexts.
        
         | fxtentacle wrote:
         | There's IBM Rational PurifyPlus, which is kinda the same as
         | valgrind, but with a Visual Studio plugin and fancy GUI. But
         | when we bought it, we only got one license for the entire
         | company because it cost around $10k per seat.
        
           | dzdt wrote:
           | PurifyPlus was an old enterprise software similar to valgrind
           | for Sun/Solaris. It eventually got Linux and Windows support,
           | but for more than a decade has been one of those undeveloped
           | systems milking legacy enterprise subscriptions as it slowly
           | bitrots into uselessness. It was bought by IBM and then
           | passed on to Unicom. These days it is not a reasonable choice
           | for anyone not a legacy user.
        
         | anaisbetts wrote:
         | > With that said, is there not a valgrind equivalent for
         | Windows?
         | 
         | Yep, it's called Application Verifier and it will check all
         | kinds of system calls and make the memory allocator incredibly
         | aggro about checking validity.
         | 
         | > The other thing is, how can a system call block a thread but
         | somehow that same thread runs other stuff in between that
         | blocking call?
         | 
         | When you call select(), your thread is in a special state
         | called Alertable Wait; one way to get out of it is to have an
         | APC queued to your thread, which has a similar vibe to a UNIX
         | signal (with all of the similar caveats about how easy it is to
         | fuck things up in it). Basically, OP's scenario is similar to
         | corruption caused by a badly written signal handler
        
         | apocalypses wrote:
         | There's address sanitizer on newer versions of visual studio
         | now but in my experience getting it to actually work with all
         | of your projects' dependencies can be very hit and miss.
         | Windows Debug build config also does a lot more memory checking
         | (part of the reason it's so slow) so you're not totally screwed
         | for automated tools.
        
         | mike_hock wrote:
         | > The other thing is, how can a system call block a thread but
         | somehow that same thread runs other stuff in between that
         | blocking call? Even if that is possible that sounds like a
         | dangerous game to play, and this class of bugs seems natural.
         | 
         | I mean, signal handlers exist on Linux, too ...
         | 
         | > when you have a blocking syscall interrupted you will get a
         | sigabort on linux
         | 
         | Doesn't the syscall just return EINTR?
        
           | nyc_pizzadev wrote:
            | The select() was not interrupted in the traditional sense. It
           | looks like a special windows API was used (QueueUserAPC)
           | which allows a callback to be called in user space while
           | blocked in certain kernel calls. My guess is that when used
           | normally, the callback returns and control goes back to
           | select() being blocked, but in this case an exception was
            | thrown which unwound the stack and left the select() call
           | holding onto an invalid stack address. Jumping out of a
            | syscall using an exception like this will probably cause all
            | kinds of problems, and my guess is that it's completely
            | unsupported.
           | 
           | https://docs.microsoft.com/en-
           | us/windows/win32/api/processth...
        
             | mike_hock wrote:
             | Throwing from a signal handler on Linux is also UB. Or
             | rather, it will blow up on Linux too, and that's why it's
             | UB in the standard, because it'll blow up on any platform.
        
       | afr0ck wrote:
       | On Linux, a conceptually similar idea to the loopback socket
       | could safely and efficiently be implemented with epoll(). The
       | epoll instance itself is a poll-able file descriptor that can be
       | fed to epoll() alongside the sockets.
       | 
       | When a new socket is added, epoll() gets triggered (by the
       | loopback epoll file descriptor that was used to add the new
       | socket, since it's also actively being polled by epoll). epoll()
       | immediately returns to userspace, the event is handled (to do the
       | necessary socket bookkeeping), and the thread goes back to polling
       | with the updated list of sockets.
        
       | trinovantes wrote:
       | Reminds me of a memory corruption bug we encountered in our OS
       | class. We were building a basic OS for a Cortex M3 board and we
       | occasionally printed the wrong strings to the console despite
       | having correct code and assembly. I forgot most of the details
       | but basically we discovered our hardware (?) was not reentrant
       | safe and our bug was caused by an interrupt handler overwriting
       | our callstack. We "fixed" it by disabling interrupts while
       | writing to the console.
        
       | ncmncm wrote:
       | Answer: Windows does!
        
         | Jensson wrote:
         | Technically their program told Windows to write to their stack
         | at an arbitrary time in the future. Windows then did what it
         | was told to do, causing this bug.
        
       | thr0w__4w4y wrote:
       | About 15 years ago I was debugging an ARM7 memory corruption
       | issue on an embedded target. Chip was running at 40 MHz but the
       | instructions were ARM 32 bit instructions, but the external data
       | bus was only 8 bits wide -- reading instructions from external
       | NOR flash, required 4 bus cycles per instruction. So an effective
       | rate of ~10 MHz.
       | 
       | We were good about doing code reviews, stacks weren't
       | overflowing, etc. So it was puzzling. Finally, just like the
       | article said, I figured the only way to find it was to catch it
       | "red handed", in the act.
       | 
       | The good news is that memory locations getting corrupted were
       | always the same.
       | 
       | Long story short, I set up a FIQ [1] -- some of you know the FIQ
       | -- which would check the location on each interrupt. I forget if it
       | checked "for" a value or that it "wasn't" an expected value, ugh,
       | sorry... If the FIQ detected corruption, it did a while (1) that
       | would trigger a breakpoint in the emulator. Then I'd be able to
       | look at the task ID -- we were running Micrium u/C OS-II as I
       | recall -- the call stack, etc.
       | 
       | Originally I set up a timer at 1 MHz to trigger the FIQ, but the
       | overhead of going in & out of the ISR 1 million times per second,
       | at essentially a 10 MHz rate, brought the processor to its knees.
       | 
       | So I slowed the timer interrupt down to 100 kHz (!!), which still
       | soaked up a lot of the CPU slack that we'd been running with. And
       | time after time I'd hit the breakpoint in the FIQ, but the damage
       | had been done usecs earlier and the breadcrumbs didn't finger a
       | culprit.
       | 
       | Then it happened. Remember, the hardware timer is running
       | completely asynchronously with respect to the application.
       | Finally, the FIQ timer ISR had interrupted some task's code in
       | exactly the function, at exactly the place (maybe a couple
       | instructions later) where the corruption had occurred.
       | 
       | Took about a day start to finish, I'd never seen or heard of
       | using a high speed timer to try to "catch memory corruption in
       | the act", but as they say, necessity is the mother of invention.
       | 
       | And to non-embedded developers, this is an embedded CPU. No MMU
       | or MPU, etc. just a flat, wild-west open memory map. Read or
       | write whatever you want. Literally every part of the code was
       | suspect.
       | 
       | Good times.
       | 
       | [1] On ARM 7/9, maybe 11, I think also Cortex R -- the Fast
       | Interrupt Request, or FIQ, uses banked registers and doesn't
       | stack anything on entry -- so it's the lowest-latency, lowest
       | overhead ISR you can have. But you can only have one FIQ I
       | believe, so you have to use it judiciously.
        
       | aetherspawn wrote:
       | Throw an exception to escape a syscall block sounds crazy evil.
       | Allocate syscall I/O buffer to stack, even more so. I think they
       | deserved this bug.
       | 
       | Good diagnosis though. I think I would have given up at the
       | kernel debugger and just refactored that entire module until it
       | went away. I had no idea kernel debugging was even possible.
        
         | iforgotpassword wrote:
         | > Allocate syscall I/O buffer to stack
         | 
         | That part wasn't done by them though, but by select(). Still a
         | good lesson why exceptions should only be used if you know
         | exactly how everything that allocates any resources in between
         | works. And sometimes you can't know until it bites you like
         | this.
         | 
         | On another note, getting old isn't _that_ bad. When I saw the
         | title I knew I'd read this post before but couldn't remember how
         | it went, so the second read was just as exciting. ;)
        
         | xfer wrote:
         | The crazy thing is using an alertable wait state; IOCP doesn't
         | work with it the way you'd expect. I expected frameworks like
         | the Unity engine to use completion ports.
        
       | nikanj wrote:
       | From the article: "Surprisingly, setting up kernel debugging is
       | extremely easy on Windows."
       | 
       | Microsoft (at least used to) take developers very seriously. The
       | OS actually plays along when you're trying to debug things, and
       | Windows error reporting actually sends you data that you can work
       | with.
       | 
       | Say what you want about their business practices, but I'd deal
       | with Microsoft's developer support over Apple's anytime. And
       | mandatory Youtube: https://www.youtube.com/watch?v=Vhh_GeBPOhs
        
         | 29athrowaway wrote:
         | And Linux makes it even easier.
        
           | oshiar53-0 wrote:
           | Pros and cons
           | 
           | Pros: no PatchGuard sh*t, complete source available
           | 
           | Cons: your out-of-box experience may vary depending on your
           | distro
        
         | saagarjha wrote:
         | Doing debugging work on macOS is actually not particularly
         | difficult either; it's just that very few people do so and the
         | number of online resources is much smaller.
        
         | gavinray wrote:
         | I will give Windows this
         | 
         | You have Event Viewer, Windbg built-in.
         | 
         | Windbg Preview blows away other debugging tools.
         | 
         | And then there's freeware "API Monitor". This thing is
         | incredible, it's like a GUI for looking at all syscalls and COM
         | events on every program without needing to explicitly debug.
         | 
         | The closest thing I could approximate it to would be something
         | like BPF
         | 
         | http://www.rohitab.com/apimonitor
        
           | phone8675309 wrote:
           | I'm (among other things) a Linux system engineer and an
           | occasionally reluctant Linux plumber on my team. Debugging
           | kernel panics / bluescreens is night and day between Linux
           | and Windows.
           | 
           | I have a Windows gaming rig that I was getting BSODs on. It
           | took about ten minutes to reconfigure Windows to write full
           | crash dumps, get WinDbg set up, and determine where and what
           | was crashing the system (bad display driver install - DDU and
           | reinstall fixed it). There were plenty of guides as to how to
           | get it set up and how to get it to pull Windows debugging
           | symbols from the Internet. It was completely painless.
           | 
           | I have to keep a well-worn grimoire covered with notes and
           | scribbles on how to get kdump working correctly to do the
           | same sort of work on the Linux machines that I support -
           | assuming that devops even set them up in the right
           | configuration for kdump to work.
        
             | BenjiWiebe wrote:
              | I mess around with ~5 Linux boxes, plus all my neighbors'
              | Windows PCs, so this is a small sample size.
             | 
             | Seems like the Windows PCs are more likely to show you a
             | blue screen/crash than a Linux one.
             | 
             | Would that match your experience too?
        
             | gavinray wrote:
             | I also am a lifelong Linux user and prefer it to Windows.
             | 
             | But I have to hand it to Windows in this realm, despite how
             | I might feel about other parts of it.
             | 
             | Linux has strace, valgrind/kcachegrind, eBPF, etc
             | 
             | And those tools are incredibly powerful, but have a far
              | steeper learning curve and (personal opinion) worse UX than
             | similar Windows tools.
             | 
             | I've used Windbg Preview to debug code written in niche
             | languages like D and it's been able to sync source code
             | from .d files and let me set breakpoints.
             | 
             | That really blew me away. I know that debug info is
             | standardized in PE/COFF and in Linux with DWARF, but still.
             | 
             | Neat (at least I think so) screenshot of Windbg Preview
             | debugging a D .dll with source synced:
             | 
             | https://media.discordapp.net/attachments/625407836473524246
             | /...
             | 
             | The JS scripting is pretty wild too.
             | 
             | Repo full of them here. Example for generating a mermaid
             | diagram image of callgraph, given a function name:
             | 
             | https://github.com/hugsy/windbg_js_scripts/blob/main/script
             | s...
             | 
             | (Material from a Defcon workshop by same author):
             | 
             | https://github.com/hugsy/defcon_27_windbg_workshop
        
           | sys_64738 wrote:
           | API Monitor sounds like SnoopDOS.
        
           | 29athrowaway wrote:
           | In Linux this would be like using one of the many tracing
           | tools such as strace, and a dbus viewer.
        
           | oshiar53-0 wrote:
            | To clarify: WinDbg is not built-in in the _literal_ sense
            | (you have to download it separately), but it does come with
            | plenty of extension commands that let you peek into Windows
            | internals with ease (e.g. PEB, kernel processes/threads, I/O
            | Request Packets with all their Stack Locations, etc.). And it
            | integrates with the operating system easily as a Just-In-Time
            | debugger.
        
             | Const-me wrote:
             | > WinDbg is not built-in in the literal sense (you have to
             | download it separately)
             | 
              | I'm not sure about that. I think you're downloading the
              | GUI, not the debugger. The symbolic debugger engine is in
              | DbgEng.dll and is included in Windows.
        
               | oshiar53-0 wrote:
               | Perhaps we don't agree on certain terminology.
               | 
                | I would opine that a debugger in the usual sense is
                | software that interacts with both the end user (frontend)
                | and the debuggee program (debugging/tracing engine),
               | allowing the user to study, analyze, and manipulate the
               | dynamic execution of said program.
               | 
               | The debugger stacks on e.g. Linux and Windows are as
               | follows:
               | 
               | - Linux: GDB, LLDB (frontend + debugging framework);
               | libbfd, libdwarf (object file/symbol engine); ptrace
               | (debugging/tracing engine)
               | 
               | - Windows: WinDbg, cdb, ntsd, kd (frontend); DbgEng.dll
               | (debugging framework); DbgHelp.dll (object file/symbol
               | engine); DbgUi*, Nt* syscalls, kd stubs
               | (debugging/tracing engine)
               | 
               | I have a question: would you call gdbserver a debugger?
        
               | Const-me wrote:
               | Modern debuggers are complicated. There're good reasons
               | to name all of these components "a debugger". But I think
               | the symbolic engine deserves the name "debugger" more
               | than the GUI.
               | 
               | Note that in Windows, the GUI is the only piece missing
               | from the default OS installation. The rest of the stack
               | (DbgEng.dll, DbgHelp.dll, kernel calls, tracing, and
               | more) is shipped with the OS.
               | 
               | > would you call gdbserver a debugger?
               | 
                | I'm not an expert on Linux. But based on this page
                | https://www.tutorialspoint.com/unix_commands/gdbserver.htm
               | probably not. I think it says that thing's an IPC server
               | which doesn't do debugging on its own, doesn't even need
               | symbols, instead relies on the host to implement the
               | actual debugger.
        
       | hungrybrit wrote:
       | Fantastic write-up. I was gripped!
        
       | baybal2 wrote:
       | The author needs to discover JTAG for himself
        
         | secondcoming wrote:
         | And how much they cost.
         | 
         | They are great though. I used to use a Lauterbach occasionally
         | back when I worked for a mobile OS company.
        
       | z3t4 wrote:
       | Async is hard. But fun when you can do it intuitively.
        
       | schinchan wrote:
       | Ask HN: How do I reach this level of understanding of what to do
       | in these situations..? Really inspired by the post on how we can
       | use different tools and pin down the exact issue
        
         | secondcoming wrote:
         | For these types of situations you can really only learn by
         | doing. You can litter code with printf() [although this can
         | actually hide bugs too if you're very unlucky], comment out
         | code until you come to a minimal program that exhibits the
         | buggy behaviour and then just persevere.
        
       | JoeAltmaier wrote:
       | Oh data breakpoints! Our savior. In the bad old days Intel had a
       | big 'blue box' that ran ISIS and had a $1200 cable and circuit
       | board called 'ICE' for 'In-Circuit Emulator'. It could monitor
       | the bus and trap on certain memory operations. My brother worked
       | on that right out of college. He met his wife over that circuit
       | board!
       | 
       | Anyway fast forward a bit, I'm working at Convergent which was a
       | nascent Intel's biggest single consumer of x86 chips. Intel was
       | coming out with a shiny new version, the 80286, showed our
       | engineers the spec. They complained "Hey! We're your biggest chip
       | buyer but you didn't consult us on features!"
       | 
       | Intel guy said "Ok, what features do you want?" Our folks said
       | "We'll get back to you..." Then they circulated a questionnaire
       | among us software types for ideas.
       | 
       | I suggested "Let's get rid of that Intel Blue Box, have a feature
       | in the new CPU that traps on bus condition mask and value
       | registers." I even drew out what I wanted.
       | 
       | Well, what do you know, Intel did it. And it remains to this day,
       | and is called "Data breakpoints". Probably the only thing I ever
       | did that will last any length of time, since software has a
       | really short sell-by date as a rule.
        
       ___________________________________________________________________
       (page generated 2021-11-14 23:01 UTC)