[HN Gopher] Debugging memory corruption: who the hell writes "2"...
___________________________________________________________________
Debugging memory corruption: who the hell writes "2" into my stack?
(2016)
Author : darknavi
Score : 373 points
Date : 2021-11-14 06:24 UTC (16 hours ago)
(HTM) web link (blog.unity.com)
(TXT) w3m dump (blog.unity.com)
| EdwardDiego wrote:
| Now that was an enjoyable debugging story, but what was their
| callback doing that threw an exception?
| gulbanana wrote:
| They threw it on purpose as a way to interrupt select().
| kajaktum wrote:
| Damn, all this happened in a mere FIVE days? Really shows how far
| I need to grow as a developer lol
| otherme123 wrote:
| And still you'll always find a guy that claims he would have
| found the bug in 5 minutes.
| rendall wrote:
| Bingo
|
| https://news.ycombinator.com/item?id=29216076
| jacoblambda wrote:
| Like others have already mentioned, that's a developer who
| saw the blood on the walls. They've likely seen this issue
| before and know the signs to look out for. It doesn't make
| them any smarter but it does mean they are wiser/more
| experienced with this tech stack.
|
| As a personal example, my team sees me as the consult for
| all the arcane fuckery that you find as a result of unusual
| C++ behaviour or build system issues. This doesn't make me
| a good dev but it does mean that I've been through my paces
| there and bled my share. But at the same time I'm
| practically a blind, helpless child wrt most of the JVM or
| web tech stacks. IMHO my colleagues are wizards for being
| able to make heads or tails of issues in those spaces.
|
| Point being, this isn't a matter of "Oh I could solve that
| in seconds", it's a matter of "Oh god I remember dealing
| with this type of issue. Here's the cause and how to fix &
| avoid it".
| Jensson wrote:
| And that comment is very helpful and informative so I think
| it is great we have people like that posting! Not sure why
| anyone would call them out.
| DougBTX wrote:
| The difference is between working out something from
| scratch vs recognising something you've seen before. The
| second will be much faster, but it depends on experience,
| not talent.
| praptak wrote:
| And he's probably right. And it also doesn't mean anything.
|
| There are thousands of geeks reading an interesting debug
| report. It is not very surprising to find one who'd think of
| the right idea as the first thing.
| [deleted]
| m_mueller wrote:
| maybe, but to get there it takes a lifetime. (paraphrasing
| Picasso here)
| dane-pgp wrote:
| "It was just a dream, Bender. There's no such thing as two."
|
| https://www.tvfanatic.com/quotes/bender-what-is-it-whoa-what...
| bullen wrote:
| Who uses non-blocking networking on the client?
|
| It's not like you don't have cores/threads for blocking when you
| just have one server, how many servers are they going to connect
| to?
|
| Or is this for the Unity server part?
| spacechild1 wrote:
| Actually, the select() call was _blocking_. Non-blocking in
| this context would mean periodically calling select() with a
| timeout of 0.
| bullen wrote:
| Select being blocking is normal, non-blocking IO does not
| depend on select being non-blocking.
|
| Select returns the list of non-blocking sockets that have
| data read/write pending.
| spacechild1 wrote:
| select() is just a means to multiplex network I/O. It can
| be either blocking or non-blocking, depending on your
| application architecture. I have worked on both kinds.
|
| BTW, the sockets themselves don't even have to be non-
| blocking.
|
| Also, I'd like to stress that the concepts of blocking/non-
| blocking I/O and synchronous/asynchronous I/O are really
| orthogonal. You can do synchronous networking with non-
| blocking sockets and vice versa.
| spacechild1 wrote:
| Or to put it another way: depending on the timeout value,
| select() can achieve blocking or non-blocking I/O on
| several sockets simultaneously. In both cases, the
| sockets themselves can be either blocking or non-
| blocking.
|
| There is no practical difference between blocking on
| select() for multiple sockets or blocking on recv() for a
| single socket - in both cases the thread can't do
| anything else.
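The point about the timeout argument can be sketched in C (an illustrative fragment, not from the thread; `poll_sockets` is a made-up helper name):

```c
#include <assert.h>
#include <unistd.h>
#include <sys/select.h>

/* Whether select() blocks is controlled by its timeout argument,
   independently of whether the sockets in the set are non-blocking. */
static int poll_sockets(int maxfd, fd_set *readfds, int blocking)
{
    if (blocking) {
        /* NULL timeout: block until at least one fd is ready. */
        return select(maxfd + 1, readfds, NULL, NULL, NULL);
    }
    /* Zero timeout: return immediately with the current readiness. */
    struct timeval tv = { 0, 0 };
    return select(maxfd + 1, readfds, NULL, NULL, &tv);
}
```

Either way, the fds in `readfds` can themselves be blocking or non-blocking; the timeout only decides whether this particular wait blocks.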
| kevin_thibedeau wrote:
| > It's not like you don't have cores/threads for blocking when
| you just have one server
|
| Some platforms don't have these luxuries.
| lyptt wrote:
| From what I've seen, it's a common convention with game
| development to do non-blocking networking and update your
| network state every tick of your update loop.
|
| It fits in nicely with how everything else works in your game
| loop, and means you don't need to deal with marshalling data
| to/from a dedicated thread.
| bullen wrote:
| You should keep the network async. from the rendering yes,
| but why non-blocking?
| fomine3 wrote:
| Why blocking?
| bullen wrote:
| Because it's simple, and on the client, which does not have many
| connections, there is no point in using non-blocking just
| because it's the latest tech!
| Jensson wrote:
| Non blocking IO is ancient tech, it is done because
| spinning up more threads is expensive and makes performance
| unreliable. Thread overhead is much less of an issue today
| so just spinning up a new thread and running everything
| with blocking calls is the modern way of doing things, but
| many of the idioms for games are from a long time ago when
| not every CPU was multi core so thread switching overhead
| was huge and therefore they had to do non blocking IO to
| run well.
| bullen wrote:
| So if your client has one thread for networking why use
| non-blocking?
|
| The age is relative, sockets are from the 70s, non-
| blocking is 2000+ and in most languages it's only stable
| since 2010+.
| Jensson wrote:
| > So if your client has one thread for networking why use
| non-blocking?
|
| No, the client would have one thread for everything, not
| one thread for networking. You shouldn't do blocking IO
| if you only have one thread that also needs to do other
| things.
| nikanj wrote:
| Overlapped (=async) IO has been in Windows since NT came
| out, so mid-90s. I don't even know what you mean with
| "most languages". Most programming languages are only
| stable since 2010+ anyway, we get more languages every
| year.
| injidup wrote:
| This is the sort of thing that would have had me breaking my
| keyboard over the monitor. The sentinel trick is nice though
| isn't it possible for compilers to insert such sentinels if
| requested?
| kevingadd wrote:
| Yeah, they're typically called stack canaries or cookies by the
| compilers I've used. I doubt they would have caught this
| though.
| vardump wrote:
| Immediately guessed the culprit after seeing the stack trace,
| because (network) I/O was occurring.
|
| Windows overlapped (=async) I/O to a stack allocated struct is
| pretty certain to blow up just like that, sooner or later.
|
| If you guard against it, it's even worse: since canceling is
| asynchronous as well, at least in theory you could end up in a
| scenario where you can never reliably let execution continue
| again. Luckily this should be extremely rare... I hope.
|
| Much better idea: allocate async data structures on the heap. In
| the worst case you leak some memory. Just never deallocate until
| the I/O has actually completed.
|
| Variation of the same can happen on the hardware level as well.
| Don't DMA into stack, unless you can really be sure it'll
| complete fast enough or you can afford to wait for it.
| quietbritishjim wrote:
| Hmm, you immediately knew that async IO was the culprit... but
| it wasn't, according to the article. The function that called
| the async IO (which was a Windows API function, not in Unity)
| was correctly waiting until the IO completed before releasing
| the struct (by returning).
|
| The actual problem was that the rug was being whipped from
| under it by an exception being injected into a place where it
| ought to have been impossible. Your suggestion of using heap
| allocation instead wouldn't have fixed anything; it would've
| turned the crash into a memory leak, which would've just masked
| the problem.
|
| The solution they went for in the end addressed this real
| issue, and the async IO (happening under the hood of that
| Windows API function) remains unchanged.
| vardump wrote:
| Well, overlapped (=async) IO was what _ultimately_ overwrote
| stack memory. The hazard pattern I recognized from previous
| experience.
|
| Of course exception in the queued APC was the reason why
| select's WFSO was prematurely interrupted and stack rewound,
| so the bug was of course there. It's pretty dangerous to
| execute code in the context of a waiting thread.
|
| While the solution to fix it by using heap allocation is a
| bit ugly in some sense, if your choices are to abort
| execution completely or to infrequently leak some bytes of
| memory, I know I'd choose to leak and perhaps log what just
| happened.
|
| There are rare cases when the I/O never completes and
| CancelIo doesn't complete either. Probably hardware faults or
| kernel driver bugs. Stuff that should absolutely never
| happen, but can still be defensively programmed around.
| spockz wrote:
| Would allocating on the heap have been better than the work
| around that they used now? I suppose there is something to say
| for having the same mechanism on multiple platforms.
| Jare wrote:
| I suspect it would have corrupted the heap instead of the
| stack. If so, this would be even harder to diagnose, because
| the corruption would show up elsewhere unrelated to the
| actual problem.
|
| The underlying problem is that you:
|
| 1) queue an asynchronous callback and reserve some memory for
| it to use
| 2) wait (or so you think) for the function that calls the
| callback to be done and return
| 3) something interrupts the wait, but the callback is still
| queued for execution sometime in the future
| 4) since the wait is over, you assume the callback was called,
| and the memory it was going to use can be safely freed
| 5) if the memory was on the stack, it gets freed purely by
| returning; if it was in the heap, you'd deallocate it
| 6) the callback that was still queued fires up and writes to
| memory it no longer should be able to use
|
| To really fix this situation, you'd need to free the memory
| in the callback. This is not possible in this case unless you
| can rewrite the library select() function significantly,
| perhaps even rewriting other parts (not familiar enough with
| those APIs to be really sure).
|
| Even in similar situations where it is possible (e.g. if you
| have garbage collection), it will often turn a memory
| corruption into a logical corruption, because you have a
| callback fired related to something that is not relevant
| anymore, and any side effects of that callback will happen
| without the proper context. (in this case the only side
| effect is probably the memory access, so in this particular
| instance the problem would be "fixed")
|
| [Edit: if the heap allocation was exception-aware (e.g. RAII
| that gets called during exception unwinding), then you would
| get heap corruption. If it was manual deallocation and thus
| skipped during exception unwind, then indeed you would just
| get a memory leak, exactly as you said]
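A toy sketch of the "free the memory in the callback" fix described above (all names invented for illustration; real Windows code would hang the buffer off an OVERLAPPED structure). Ownership of the heap buffer is transferred to the operation, and only its completion path releases it, so an early unwind of the initiating frame can at worst leak:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Buffer for a completion-based async op. It lives on the heap and is
   owned by the operation from start to completion. */
typedef struct {
    uint32_t payload;
} async_buf;

static async_buf *pending = NULL;   /* stands in for the OS's queued op */

/* Initiator: allocates the buffer and hands ownership to the op. */
static void start_async_op(void)
{
    pending = malloc(sizeof *pending);
    pending->payload = 0;
    /* If this frame is unwound now, nothing dangles: the buffer is
       heap-owned, not part of this stack frame. */
}

/* Completion callback: writes the result, then frees the buffer. */
static uint32_t complete_async_op(void)
{
    if (!pending)
        return 0;
    pending->payload = 2;           /* the write that, in the article's
                                       version, landed on a dead stack */
    uint32_t result = pending->payload;
    free(pending);                  /* freed here, not by the initiator */
    pending = NULL;
    return result;
}
```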
| vardump wrote:
| Somehow the lifetime must be at least as long as the async
| operation takes to complete. All other scenarios could
| result in memory corruption.
|
| If there's no other option, I'd rather leak that memory.
| [deleted]
| viktorcode wrote:
| > Lesson learnt, folks: do not throw exceptions out of
| asynchronous procedures if you're inside a system call!
|
| Maybe unpopular opinion, but don't throw exceptions at all,
| anywhere. It isn't worth it.
| spacechild1 wrote:
| > The fix was pretty straightforward: instead of using
| QueueUserAPC(), we now create a loopback socket to which we send
| a byte any time we need to interrupt select().
|
| Ha, I have always been using this trick to break from a blocking
| select() call. I feel less evil now.
|
| On Linux and macOS you could also use a pipe instead since
| select() doesn't only work with sockets.
|
| On Windows you can alternatively use WSAEventSelect
| (https://docs.microsoft.com/en-us/windows/win32/api/winsock2/...)
| on all sockets and have another Event just for signalling, but
| the downside is that it forces your sockets to be non-blocking.
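The loopback trick reduces to a few lines of POSIX C using the pipe variant mentioned above (illustrative sketch; `self_pipe_wakeup` is a made-up name, and in real code the write would come from another thread):

```c
#include <assert.h>
#include <unistd.h>
#include <sys/select.h>

/* Self-pipe trick: returns 1 if select() was woken by the pipe. */
static int self_pipe_wakeup(void)
{
    int wakeup[2];                  /* [0] = read end watched by select() */
    if (pipe(wakeup) < 0)
        return 0;

    /* Another thread would do this write to interrupt the wait; we do
       it inline just to demonstrate the mechanism. */
    (void)write(wakeup[1], "x", 1);

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(wakeup[0], &readfds);

    int ready = select(wakeup[0] + 1, &readfds, NULL, NULL, NULL);
    int woken = ready > 0 && FD_ISSET(wakeup[0], &readfds);

    char buf[8];
    (void)read(wakeup[0], buf, sizeof buf);  /* drain the wakeup byte */
    close(wakeup[0]);
    close(wakeup[1]);
    return woken;
}
```

In a real polling loop the read end would simply sit in the same fd_set as the sockets being watched.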
| throwaway9870 wrote:
| This is a common enough technique that linux standardized it to
| some degree: https://man7.org/linux/man-
| pages/man2/eventfd.2.html
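A minimal Linux-only sketch of the eventfd variant (made-up helper name; a single fd replaces the pipe pair as the wakeup channel):

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/select.h>

/* eventfd as a wakeup channel: returns 1 if select() saw the signal. */
static int eventfd_wakeup(void)
{
    int efd = eventfd(0, 0);              /* counter starts at 0 */
    if (efd < 0)
        return 0;

    uint64_t one = 1;
    (void)write(efd, &one, sizeof one);   /* the "interrupt select()" signal */

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(efd, &readfds);
    int ready = select(efd + 1, &readfds, NULL, NULL, NULL);

    uint64_t counter = 0;
    (void)read(efd, &counter, sizeof counter);  /* drains, resets to 0 */
    close(efd);
    return ready > 0 && counter == 1;
}
```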
| spacechild1 wrote:
| Ah yes, forgot about those... A bit similar to the Windows
| solution. Unfortunately, the loopback socket method is the
| only cross-platform solution I am aware of and I always kind
| of felt bad using it...
| oshiar53-0 wrote:
| Relevant:
| https://devblogs.microsoft.com/oldnewthing/20110512-00/?p=10...
| fxtentacle wrote:
| I read this and couldn't shake the feeling that it's a fantasy
| tale because it mentions the mystical Unity support engineer who
| actually fixes bugs - the one I haven't been able to get a hold
| of for the past 5 years, despite having critical bugs escalated
| by support... But then I noticed this is from 2016, so before
| they started kicking out experienced people to polish their
| profit margin for the IPO.
|
| Still an amazing debugging story, though :)
| nhoughto wrote:
| Had the same thought, my experience with Unity support is it's a
| black hole. Good that there are some good examples, even if from
| times past.
| fxtentacle wrote:
| I think it's a sort of tragedy of the commons situation.
|
| When I first met some of the Unity people at the Nordic Game
| Jam in 2012, they seemed like an awesome team making game
| development accessible to everyone. At that time, UE was
| still closed source and out of every indie's reach. And Unity
| was Mac-only. Since then, Unity has become the de-factor
| platform for cheap outsourced mobile games, reskin spam, and
| app store gambling scams. They now have millions of
| developers using their game engine, but it's an army of
| people building 1EUR apps. Obviously, you can't charge them
| much because their revenue is so low. And so they had to make
| support cheap to compensate.
|
| But the part which I don't get is why Unity doesn't offer
| paid support for medium-sized indie studios. And why they
| don't publicly sell source code access. I mean their biggest
| competitor UE4/5 already has all source code on GitHub.
| iandanforth wrote:
| "De facto" - No hyphen, no 'r'.
|
| https://en.wikipedia.org/wiki/De_facto
|
| Post left in the spirit that the OP was a typo, but that
| someone else might not know and be interested in the Latin
| roots of this phrase.
| fxtentacle wrote:
| Thanks :)
| Karliss wrote:
| They sell both paid support and source code for extra money
| in addition to the subscription. It's in the plan
| comparison table.
| fxtentacle wrote:
| I contacted them about it in the past and they refused to
| even tell me the price. That's why I said they need to
| publicly sell it, because as-is I don't know anyone who
| succeeded in buying it.
| Jasper_ wrote:
| It's determined on a per-contract basis. Expect to spend
| quite a few months making a deal, and having it come with
| a bunch of strings attached.
|
| The company I worked for had Unity source access, but we
| were required to submit any source changes back to Unity
| (even if they chose not to integrate them), and announce
| our game at UnityCon 2018.
|
| They also wanted us to help write some of the cutscene
| timeline tools. I don't know if they used any of the code
| we wrote, but we definitely did a bunch of work on those
| timeline tools and sent them back to them.
| franknine wrote:
| This is worse than I could possibly imagine. Did your
| studio at least get credited for the work on timeline
| tools?
| Jasper_ wrote:
| I don't see why we would -- I don't think Unity has any
| individual credits listed for their engine developers.
| That's standard for software development, AFAIK Adobe is
| really the only mainstream software company that still
| has credits listed in their product releases.
| gfxgirl wrote:
| They do sell support
|
| https://unity.com/success-plans
| franknine wrote:
| The paid support is not cheap and mostly useless, at
| least in Asia region. We still need to go through normal
| bug report system, create reproduction steps for them,
| and get denied for back-porting bug fixes every single
| time. The only thing changed is that there's a "bug fix
| progress update" to make you feel a little bit better. We
| terminated the support right after the term expired.
| Nition wrote:
| > In 2012...Unity was Mac-only
|
| The timeline is a bit off here. Unity was initially Mac-
| only, but only for a short time. It ran on Windows from
| version 2.5 in 2009.[1]
|
| [1]https://blog.unity.com/technology/unity-25-for-mac-and-
| windo...
| doomlaser wrote:
| Luckily, this particular developer is still at Unity and still
| fixing bugs--and doing the world a service by writing
| fascinating breakdowns about them
| https://blog.unity.com/technology/fixing-time-deltatime-in-u...
| ncmncm wrote:
| "Mythical", I guess you mean. A mystical engineer might be even
| less useful.
| fxtentacle wrote:
| :)
|
| I now looked up "mystical" in Merriam-Webster: "having a
| spiritual meaning [..] that is neither apparent to the senses
| nor obvious to the intelligence"
|
| That sounds like a pretty good description of my relationship
| with Unity's support engineers. I hope they exist, but to me
| it's more of a religious thing because I have never met one.
|
| "mythical" says: "existing only in the imagination:
| fictitious, imaginary, having qualities suitable to myth:
| legendary".
|
| That also fits well, because so far they exist only in my
| imagination but reading this story, I am impressed by their
| legendary problem-solving power.
|
| Thanks to you, today I learned the difference between two
| similar words :) And now I can even more clearly articulate
| someone's progress on being less useful ^_^
| nikanj wrote:
| You can pray to god, and you can pray to unity support.
| Many people claim they exist and perform miracles, but
| hardly anyone can point to fungible proof
| dzdt wrote:
| Which makes _tangible_ vs. _fungible_ your next
| dictionary dive :^)
| wiredfool wrote:
| Well, non tangible token makes a lot of sense.
| azalemeth wrote:
| Out of curiosity, is it possible to disable ASLR easily on
| Windows? It's trivial to change on Linux and that (plus Valgrind)
| has occasionally made obscure debugging life easier.
| kevingadd wrote:
| The executable has to have ASLR enabled as a flag, so you just
| compile without the flag
| kuroguro wrote:
| Also can disable globally, for ex. if source isn't available.
|
| https://stackoverflow.com/a/9561263/1895684
| pongo1231 wrote:
| Alternatively it's also possible to disable ASLR in the
| executable itself by removing the flag 0x40 (opt-in flag
| for ASLR) in the DllCharacteristics field of the header.
| There's a neat little tool called SetDllCharacteristics[0]
| which can do that.
|
| [0] https://blog.didierstevens.com/2010/10/17/setdllcharact
| erist...
| Const-me wrote:
| Saw the red flag on the very first page:
|
| > the socket polling thread then dequeues these requests one by
| one, calls select()
|
| The select() API is not good regardless of the platform.
|
| If you like the semantics and developing for Linux, use poll()
| instead. It does exactly the same thing, but the API is good.
|
| Windows API supports thread pools, it's usually a good idea to
| use StartThreadpoolIo instead.
| zbentley wrote:
| How would poll() have helped in this case? It seems like the
| error condition in the article (interrupting syscalls in
| unexpected ways can trash whatever state your syscall wrapper
| manages for e.g. return values) is a risk regardless of the
| blocking operation being performed.
| Const-me wrote:
| poll is not available on Windows. The semantics are similar to
| WaitForMultipleObjects with overlapped IO, but not quite.
|
| Thread pool API doesn't have this class of bugs. One doesn't
| need to wake up any threads to change the set of handles
| being polled.
| josalhor wrote:
| I don't fully remember the details of these articles, but it has
| happened in the past that reading these has helped future me
| debug very big and complicated issues.
|
| I believe this comes from the fact that I am presented with a new
| _kind_ of bug: async syscall corrupting memory because the stack
| was unrolled (by an exception). And I integrate this into my
| mental debug checklist.
| dekhn wrote:
| mixing threads and select? there's your problem right there!
| Don't use select in threaded programs.
| oshiar53-0 wrote:
| > How does one interrupt select() function? Apparently, we used
| QueueUserAPC() to execute asynchronous procedure while select()
| was blocked... and threw an exception out of it!
|
| <rant>
|
| Let's just ignore not using overlapped I/O and completion ports
| for a moment. How did using APCs even _seem_ like a good idea?
| Like, perhaps a dedicated thread pool for async jobs would work a
| little better?
|
| If you need to interrupt _that_ specific thread though, sorry for
| that. Perhaps uhh CancelSynchronousIo() would work since there's
| IO_STATUS_BLOCK? Or maybe call CancelIo() on the socket in that
| same APC. However it goes, throwing exceptions in an APC isn't
| something I'd like to see in production code.
|
| Also, even in Unix world, the select(2) system call is
| _infamously_ known as a buffer overflow landmine [1]. Just FD_SET
| any fd >= FD_SETSIZE and BOOM. If you're on Unix, at least use
| poll() or better. If you're on Windows (and can't afford
| switching to proper async I/O), maybe WSAEventSelect() would work
| better?
|
| </rant>
|
| [1]: This does not apply to Windows Sockets though, where fd_set
| is not a bit array but internally just an FD array in disguise.
| You get FD_SETSIZE = 64 though, which is conspicuously close to
| MAXIMUM_WAIT_OBJECTS in WaitForMultipleObjects.
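On POSIX, the landmine in [1] can be fenced off with a trivial checked wrapper, since fd_set there is a fixed-size bitmap (illustrative sketch; `fd_set_checked` is a made-up name):

```c
#include <assert.h>
#include <sys/select.h>

/* FD_SET with fd >= FD_SETSIZE writes past the end of the bitmap on
   POSIX systems. Refuse such fds instead of corrupting memory. */
static int fd_set_checked(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;      /* caller should fall back to poll() or similar */
    FD_SET(fd, set);
    return 0;
}
```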
| anaisbetts wrote:
| For problems like this (though for this particular app it might
| be hard because it's Too Big), Time-Travel Debugging is
| practically made for it - https://docs.microsoft.com/en-us/windows-
| hardware/drivers/de... - hit your first corruption, set the
| memory breakpoint, then just go backwards
| drmeister wrote:
| This should be the top response. With a time traveling debugger
| this would take 15 minutes to solve (minus the time to make the
| crash happen). On linux I use the Undo debugger - I don't own
| stock - I just love the product.
| chris_wot wrote:
| "Somebody had been touching my sentinel's privates - and it
| definitely wasn't a friend."
|
| A non-C++ programmer might find this weird.
| [deleted]
| rsp1984 wrote:
| _When another thread pushes a socket to be processed by the
| socket polling thread, the socket polling thread calls select()
| on that function. Since select() is a blocking call, when another
| socket is pushed to the socket polling thread queue it has to
| somehow interrupt select() so the new socket gets processed ASAP.
| How does one interrupt select() function? Apparently, we used
| QueueUserAPC() to execute asynchronous procedure while select()
| was blocked... and threw an exception out of it!_
|
| Not taking anything away from the great debugging, but that is
| just incredibly bad design and reeks of bugs from a mile away.
| System calls block for a reason. If you can't handle the
| blocking, how about a non-blocking architecture? But in this
| case, the engineer thought _"hey, I don't want to block on my
| precious main thread, so let's just give this blocking thing to
| another thread, that'll surely solve it!"_ Except it doesn't
| when you use that same thread as a worker to execute all other
| threads' blocking crap... but then write the queue push()
| function to be blocking as well!
|
| Keep it simple folks. Bad design always breeds more bad design.
| syntheticnature wrote:
| It's particularly surprising to me because the whole point of
| select() is to handle these sorts of cases easily. The
| technique I've often seen is just to have a pipe or other local
| connection to the thread using select, and then you wake up
| select when you need it to add something to the queue.
| aftbit wrote:
| That was their fix.
| syntheticnature wrote:
| My over-editing of my comment got rid of the sense of "they
| knew it was the fix, why not do that in the first place?"
|
| Which, admittedly, sometimes folks just don't know, but more
| often I run into complete cluelessness that select()/poll()
| and friends exist.
| robflynn wrote:
| I was once brought over to a project that used exceptions for
| flow control, sometimes nested several layers deep. It was a
| nightmare to debug and untangle.
| akira2501 wrote:
| What's funny to me is their solution is very natural and
| commonly used in unix. My guess is because signal handling on
| unix is much more prevalent and brings with it many of the
| same control flow issues you'd see in exceptions.
| jstimpfle wrote:
| I'd go for a different approach: Add some cancellation event
| for select() to listen to as well. Then, when another thread
| submits a new event source, it's only a matter of signalling
| the cancellation event, so the select() thread can update its
| poll set.
|
| "Non-blocking" sounds very special but for anything except
| batch programs it should be used by default, because whenever
| there is more than one possible input (inputs can be as
| trivial as a cancellation signal) one cannot afford to be
| blocked.
|
| Then again, that doesn't mean that one shouldn't use select().
| Unless one goes ultra-high frequency where the CPU is busy
| almost 100% of the time, select() saves CPU cycles, and
| decreases response time. Without select(), finding an
| appropriate polling frequency is a tradeoff between wasted
| compute cycles and increased response time.
| zbentley wrote:
| > I'd go for a different approach: Add some cancellation
| event for select() to listen to as well.
|
| That's exactly what the author of the article did.
| [deleted]
| oshiar53-0 wrote:
| https://www.youtube.com/watch?v=k238XpMMn38 but it's the
| rapidly dwindling sanity of unity programmers
| donatj wrote:
| The people with the skill and knowledge to debug something like
| this truly inspire me.
|
| Reading low level debugging war stories like this it always
| amazes me that computers work at all let alone as well as they
| do. It's surely due to people like this carrying the torch for
| the rest of us.
| ptsneves wrote:
| I come from Linux land so some of the details flew over my head.
| With that said, is there not a valgrind equivalent for Windows?
| The sentinel thing seems to be default functionality of the
| valgrind virtual machine. I would expect that the initial part of
| the investigation would be automatically done by valgrind. Of
| course reading the whole post it looks like it would not get very
| far.
|
| The other thing is, how can a system call block a thread but
| somehow that same thread runs other stuff in between that
| blocking call? Even if that is possible that sounds like a
| dangerous game to play, and this class of bugs seems natural.
|
| I can imagine the scenario where a thread is cancelled mid system
| call, but it will be the kernel that cancels the thread and is
| responsible for knowing that halfway syscalls need to be
| cleaned up. This way the application does not get system call
| leaks?
| Also if i remember one debugging session correctly, when you have
| a blocking syscall interrupted you will get a sigabort on linux.
| CodesInChaos wrote:
| The syscall writing the data is async, which requires the
| memory it uses to be available until it completes. Since
| this was on the stack here, it means that the function that
| contains this memory on its stack must not return until IO
| completes
|
| But the code that was supposed to wait until completion was
| interrupted by an exception, which unrolled the stack beyond
| the function owning the relevant memory. So the completing
| async operation writes into memory owned by somebody else now.
|
| This is a general complication with completion based async IO.
| You could get the same problems without any interrupts, if you
| just don't wait for async IO to complete in some circumstances.
| CamperBob2 wrote:
| More like a general complication with SEH. One of several
| reasons why I don't allow it in any code I touch.
|
| Anything that obscures the flow of control is a violation of
| literate programming principles, and exceptions are pretty
| high on that particular shit list.
| _nalply wrote:
| Thank you. Now I understand why Rust async is so complicated.
| It's about lifetimes. It's dangerous to write data
| asynchronously because... eh... that location might be
| already invalid.
|
| I know this is not about Rust. Forgive me that I mentally
| connect this story with async Rust.
| CodesInChaos wrote:
| There are two ways how async IO can work:
|
| * Readiness based: You use `select` (or preferably its
| successors) to figure out which file descriptor has data
| available. Then you synchronously read the data, which
| completes instantly because there is data available. This
| is the traditional way on Linux, and also the way Rust's
| async works. This approach doesn't run into any lifetime
| issues, because you only need to provide the memory (`&mut
| [u8]`) to a sync call once data is available. This approach
| works well for sequential streams, like sockets or pipes,
| but isn't a natural match for random access file IO.
|
| * Completion based: You trigger an IO operation specifying
| the target memory. The system notifies you when it
| completes. This approach needs you to keep the memory
| available until the operation finishes. This is the
| traditional way on Windows (IO completion ports) and also
| used by the more recent io_uring on Linux. This one runs
| into the lifetime issues that caused the bug in the linked
| article. Thus in Rust it can't safely be used with a `&mut
| [u8]`, you have to transfer ownership of the buffer to the
| IO library, so it can keep it alive until the operation
| completes. The ringbahn crate is an example of this
| approach in Rust.
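The readiness-based style in the first bullet can be condensed into a short POSIX sketch (illustrative; `read_when_ready` is a made-up helper). The buffer is only needed during the synchronous read itself, so no lifetime has to span a wait:

```c
#include <assert.h>
#include <unistd.h>
#include <sys/select.h>

/* Wait until fd is readable, then do a synchronous read that completes
   immediately. Returns bytes read, or -1 on error. */
static long read_when_ready(int fd, char *buf, size_t len)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    if (select(fd + 1, &readfds, NULL, NULL, NULL) <= 0)
        return -1;
    return (long)read(fd, buf, len);   /* data is there; won't block */
}
```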
| oshiar53-0 wrote:
| Note that what you're referring to as "readiness-based
| I/O" is not asynchronous I/O in the strictest sense--
| it's actually _non-blocking, synchronous_ I/O.
|
| I find this article helpful in clearing up confusion of
| asynchronous vs non-blocking I/O: http://blog.omega-
| prime.co.uk/?p=155
| yxhuvud wrote:
| Good writeup. One thing to note is that Go, Crystal and
| other systems based on green threads tend to not run into
| the problem as they have separate stacks for different
| execution contexts.
| fxtentacle wrote:
| There's IBM Rational PurifyPlus, which is kinda the same as
| valgrind, but with a Visual Studio plugin and fancy GUI. But
| when we bought it, we only got one license for the entire
| company because it cost around $10k per seat.
| dzdt wrote:
| PurifyPlus was an old enterprise software similar to valgrind
| for Sun/Solaris. It eventually got Linux and Windows support,
| but for more than a decade has been one of those undeveloped
| systems milking legacy enterprise subscriptions as it slowly
| bitrots into uselessness. It was bought by IBM and then
| passed on to Unicom. These days it is not a reasonable choice
| for anyone not a legacy user.
| anaisbetts wrote:
| > With that said, is there not a valgrind equivalent for
| Windows?
|
| Yep, it's called Application Verifier and it will check all
| kinds of system calls and make the memory allocator incredibly
| aggro towards checking validity
|
| > The other thing is, how can a system call block a thread but
| somehow that same thread runs other stuff in between that
| blocking call?
|
| When you call select(), your thread is in a special state
| called Alertable Wait; one way to get out of it is to have an
| APC queued to your thread, which has a similar vibe to a UNIX
| signal (with all of the similar caveats about how easy it is to
| fuck things up in it). Basically, OP's scenario is similar to
| corruption caused by a badly written signal handler.
| apocalypses wrote:
| There's AddressSanitizer in newer versions of Visual Studio
| now, but in my experience getting it to actually work with
| all of your projects' dependencies can be very hit and
| miss. The Windows Debug build config also does a lot more
| memory checking (part of the reason it's so slow), so
| you're not totally screwed for automated tools.
| mike_hock wrote:
| > The other thing is, how can a system call block a thread but
| somehow that same thread runs other stuff in between that
| blocking call? Even if that is possible that sounds like a
| dangerous game to play, and this class of bugs seems natural.
|
| I mean, signal handlers exist on Linux, too ...
|
| > when you have a blocking syscall interrupted you will get a
| sigabort on linux
|
| Doesn't the syscall just return EINTR?
| nyc_pizzadev wrote:
| The select() was not interrupted in the traditional sense.
| It looks like a special Windows API was used (QueueUserAPC)
| which allows a callback to be called in user space while
| blocked in certain kernel calls. My guess is that when used
| normally, the callback returns and control goes back to
| select() being blocked, but in this case an exception was
| thrown which unwound the stack and left the select() call
| holding onto an invalid stack address. Jumping out of a
| syscall using an exception like this will probably cause
| all kinds of problems, and my guess is this is completely
| unsupported.
|
| https://docs.microsoft.com/en-
| us/windows/win32/api/processth...
| mike_hock wrote:
| Throwing from a signal handler on Linux is also UB. Or
| rather, it will blow up on Linux too, and that's why it's
| UB in the standard, because it'll blow up on any platform.
| afr0ck wrote:
| On Linux, a conceptually similar idea to the loopback socket
| could safely and efficiently be implemented with epoll(). The
| epoll instance itself is a poll-able file descriptor that can be
| fed to epoll() alongside the sockets.
|
| When a new socket is added, epoll_wait() gets woken up (by
| the loopback epoll file descriptor that was used to add the
| new socket, since it's also actively being polled).
| epoll_wait() immediately returns to userspace, the event is
| handled (to do the necessary socket bookkeeping), and the
| loop then goes back to polling with the updated list of
| sockets.
| trinovantes wrote:
| Reminds me of a memory corruption bug we encountered in our OS
| class. We were building a basic OS for a Cortex M3 board and we
| occasionally printed the wrong strings to the console despite
| having correct code and assembly. I forgot most of the details
| but basically we discovered our hardware (?) was not reentrant
| safe and our bug was caused by an interrupt handler overwriting
| our callstack. We "fixed" it by disabling interrupts while
| writing to the console.
| ncmncm wrote:
| Answer: Windows does!
| Jensson wrote:
| Technically their program told Windows to write to their
| stack at an arbitrary time in the future. Windows then did
| what it was told, causing this bug.
| thr0w__4w4y wrote:
| About 15 years ago I was debugging an ARM7 memory corruption
| issue on an embedded target. The chip was running at 40
| MHz, but the instructions were 32-bit ARM instructions and
| the external data bus was only 8 bits wide -- reading
| instructions from external NOR flash required 4 bus cycles
| per instruction, so an effective rate of ~10 MHz.
|
| We were good about doing code reviews, stacks weren't
| overflowing, etc. So it was puzzling. Finally, just like the
| article said, I figured the only way to find it was to catch it
| "red handed", in the act.
|
| The good news is that memory locations getting corrupted were
| always the same.
|
| Long story short, I set up a FIQ [1] -- some of you know
| the FIQ -- which would check the location on each
| interrupt. I forget if it checked "for" a value or that it
| "wasn't" an expected value, ugh, sorry... If the FIQ
| detected corruption, it did a while (1) that
| would trigger a breakpoint in the emulator. Then I'd be able to
| look at the task ID -- we were running Micrium u/C OS-II as I
| recall -- the call stack, etc.
|
| Originally I set up a timer at 1 MHz to trigger the FIQ, but the
| overhead of going in & out of the ISR 1 million times per second,
| at essentially a 10 MHz rate, brought the processor to its knees.
|
| So I slowed the timer interrupt down to 100 kHz (!!), which still
| soaked up a lot of the CPU slack that we'd been running with. And
| time after time I'd hit the breakpoint in the FIQ, but the damage
| had been done usecs earlier and the breadcrumbs didn't
| finger a culprit.
|
| Then it happened. Remember, the hardware timer is running
| completely asynchronously with respect to the application.
| Finally, the FIQ timer ISR had interrupted some task's code in
| exactly the function, at exactly the place (maybe a couple
| instructions later) where the corruption had occurred.
|
| Took about a day start to finish, I'd never seen or heard of
| using a high speed timer to try to "catch memory corruption in
| the act", but as they say, necessity is mother of invention.
|
| And to non-embedded developers, this is an embedded CPU. No MMU
| or MPU, etc. just a flat, wild-west open memory map. Read or
| write whatever you want. Literally every part of the code was
| suspect.
|
| Good times.
|
| [1] On ARM 7/9, maybe 11, I think also Cortex R -- the Fast
| Interrupt Request, or FIQ, uses banked registers and doesn't
| stack anything on entry -- so it's the lowest-latency, lowest
| overhead ISR you can have. But you can only have one FIQ I
| believe, so you have to use it judiciously.
| aetherspawn wrote:
| Throwing an exception to escape a blocking syscall sounds
| crazy evil. Allocating a syscall I/O buffer on the stack,
| even more so. I think they deserved this bug.
|
| Good diagnosis though. I think I would have given up at the
| kernel debugger and just refactored that entire module until it
| went away. I had no idea kernel debugging was even possible.
| iforgotpassword wrote:
| > Allocate syscall I/O buffer to stack
|
| That part wasn't done by them though, but by select(). Still a
| good lesson why exceptions should only be used if you know
| exactly how everything that allocates any resources in between
| works. And sometimes you can't know until it bites you like
| this.
|
| On another note, getting old isn't _that_ bad. When I saw the
| title I knew I'd read this post before but couldn't remember how
| it went, so the second read was just as exciting. ;)
| xfer wrote:
| The crazy thing is using an alertable wait state; IOCP
| doesn't work with it as you'd expect. I expected frameworks
| like the Unity engine to use completion ports.
| nikanj wrote:
| From the article: "Surprisingly, setting up kernel debugging is
| extremely easy on Windows."
|
| Microsoft (at least used to) take developers very seriously. The
| OS actually plays along when you're trying to debug things, and
| Windows error reporting actually sends you data that you can work
| with.
|
| Say what you want about their business practices, but I'd deal
| with Microsoft's developer support over Apple's anytime. And
| mandatory Youtube: https://www.youtube.com/watch?v=Vhh_GeBPOhs
| 29athrowaway wrote:
| And Linux makes it even easier.
| oshiar53-0 wrote:
| Pros and cons
|
| Pros: no PatchGuard sh*t, complete source available
|
| Cons: your out-of-box experience may vary depending on your
| distro
| saagarjha wrote:
| Doing debugging work on macOS is actually not particularly
| difficult either; it's just that very few people do so and the
| number of online resources is much smaller.
| gavinray wrote:
| I will give Windows this
|
| You have Event Viewer, Windbg built-in.
|
| Windbg Preview blows away other debugging tools.
|
| And then there's freeware "API Monitor". This thing is
| incredible, it's like a GUI for looking at all syscalls and COM
| events on every program without needing to explicitly debug.
|
| The closest thing I could approximate it to would be something
| like BPF
|
| http://www.rohitab.com/apimonitor
| phone8675309 wrote:
| I'm (among other things) a Linux system engineer and an
| occasionally reluctant Linux plumber on my team. Debugging
| kernel panics / bluescreens is night and day between Linux
| and Windows.
|
| I have a Windows gaming rig that I was getting BSODs on. It
| took about ten minutes to reconfigure Windows to write full
| crash dumps, get WinDbg set up, and determine where and what
| was crashing the system (bad display driver install - DDU and
| reinstall fixed it). There were plenty of guides as to how to
| get it set up and how to get it to pull Windows debugging
| symbols from the Internet. It was completely painless.
|
| I have to keep a well-worn grimoire covered with notes and
| scribbles on how to get kdump working correctly to do the
| same sort of work on the Linux machines that I support -
| assuming that devops even set them up in the right
| configuration for kdump to work.
| BenjiWiebe wrote:
| I mess around with ~5 Linux boxes, so this is a small
| sample size, and also all my neighbors' Windows PCs.
|
| Seems like the Windows PCs are more likely to show you a
| blue screen/crash than a Linux one.
|
| Would that match your experience too?
| gavinray wrote:
| I also am a lifelong Linux user and prefer it to Windows.
|
| But I have to hand it to Windows in this realm, despite how
| I might feel about other parts of it.
|
| Linux has strace, valgrind/kcachegrind, eBPF, etc
|
| And those tools are incredibly powerful, but have a far
| steeper learning curve and (personal opinion) worse UX than
| similar Windows tools.
|
| I've used Windbg Preview to debug code written in niche
| languages like D and it's been able to sync source code
| from .d files and let me set breakpoints.
|
| That really blew me away. I know that debug info is
| standardized in PE/COFF and in Linux with DWARF, but still.
|
| Neat (at least I think so) screenshot of Windbg Preview
| debugging a D .dll with source synced:
|
| https://media.discordapp.net/attachments/625407836473524246
| /...
|
| The JS scripting is pretty wild too.
|
| Repo full of them here. Example for generating a mermaid
| diagram image of callgraph, given a function name:
|
| https://github.com/hugsy/windbg_js_scripts/blob/main/script
| s...
|
| (Material from a Defcon workshop by same author):
|
| https://github.com/hugsy/defcon_27_windbg_workshop
| sys_64738 wrote:
| API Monitor sounds like SnoopDOS.
| 29athrowaway wrote:
| In Linux this would be like using one of the many tracing
| tools such as strace, and a dbus viewer.
| oshiar53-0 wrote:
| To clarify: WinDbg is not built-in in the _literal_ sense
| (you have to download it separately), but it does come with
| plenty of extension commands that let you peek into Windows
| internals with ease (e.g. PEB, kernel processes/threads,
| I/O Request Packets with all their Stack Locations, etc.).
| And it integrates with the operating system easily as a
| Just-In-Time debugger.
| Const-me wrote:
| > WinDbg is not built-in in the literal sense (you have to
| download it separately)
|
| I'm not sure about that. I think you're downloading the GUI,
| not the debugger. The symbolic debugger engine is in
| DbgEng.dll and is included in Windows.
| oshiar53-0 wrote:
| Perhaps we don't agree on certain terminology.
|
| I would opine that a debugger in the usual sense is
| software that interacts with both the end user (frontend)
| and the debuggee program (debugging/tracing engine),
| allowing the user to study, analyze, and manipulate the
| dynamic execution of said program.
|
| The debugger stacks on e.g. Linux and Windows are as
| follows:
|
| - Linux: GDB, LLDB (frontend + debugging framework);
| libbfd, libdwarf (object file/symbol engine); ptrace
| (debugging/tracing engine)
|
| - Windows: WinDbg, cdb, ntsd, kd (frontend); DbgEng.dll
| (debugging framework); DbgHelp.dll (object file/symbol
| engine); DbgUi*, Nt* syscalls, kd stubs
| (debugging/tracing engine)
|
| I have a question: would you call gdbserver a debugger?
| Const-me wrote:
| Modern debuggers are complicated. There're good reasons
| to name all of these components "a debugger". But I think
| the symbolic engine deserves the name "debugger" more
| than the GUI.
|
| Note that in Windows, the GUI is the only piece missing
| from the default OS installation. The rest of the stack
| (DbgEng.dll, DbgHelp.dll, kernel calls, tracing, and
| more) is shipped with the OS.
|
| > would you call gdbserver a debugger?
|
| I'm not an expert on Linux. But based on this page https:
| //www.tutorialspoint.com/unix_commands/gdbserver.htm
| probably not. I think it says that thing's an IPC server
| which doesn't do debugging on its own, doesn't even need
| symbols, instead relies on the host to implement the
| actual debugger.
| hungrybrit wrote:
| Fantastic write-up. I was gripped!
| baybal2 wrote:
| The author needs to discover JTAG for himself
| secondcoming wrote:
| And how much they cost.
|
| They are great though, I used to use a Lauterbach
| occasionally back when I worked for a mobile OS company.
| z3t4 wrote:
| Async is hard. But fun when you can do it intuitively.
| schinchan wrote:
| Ask HN: How do I reach this level of understanding of what to do
| in these situations..? Really inspired by the post on how we can
| use different tools and pin down the exact issue
| secondcoming wrote:
| For these types of situations you can really only learn by
| doing. You can litter code with printf() [although this can
| actually hide bugs too if you're very unlucky], comment out
| code until you come to a minimal program that exhibits the
| buggy behaviour and then just persevere.
| JoeAltmaier wrote:
| Oh data breakpoints! Our savior. In the bad old days Intel had a
| big 'blue box' that ran ISIS and had a $1200 cable and circuit
| board called 'ICE' for 'In-Circuit Emulator'. It could monitor
| the bus and trap on certain memory operations. My brother worked
| on that right out of college. He met his wife over that circuit
| board!
|
| Anyway fast forward a bit, I'm working at Convergent which was a
| nascent Intel's biggest single consumer of x86 chips. Intel was
| coming out with a shiny new version, the 80286, showed our
| engineers the spec. They complained "Hey! We're your biggest chip
| buyer but you didn't consult us on features!"
|
| Intel guy said "Ok, what features do you want?" Our folks said
| "We'll get back to you..." Then they circulated a questionnaire
| among us software types for ideas.
|
| I suggested "Let's get rid of that Intel Blue Box, have a feature
| in the new CPU that traps on bus condition mask and value
| registers." I even drew out what I wanted.
|
| Well, what do you know, Intel did it. And it remains to this day,
| and is called "Data breakpoints". Probably the only thing I ever
| did that will last any length of time, since software has a
| really short sell-by date as a rule.
___________________________________________________________________
(page generated 2021-11-14 23:01 UTC)