[HN Gopher] "Unexplainable" core dump (2011)
___________________________________________________________________
"Unexplainable" core dump (2011)
Author : curling_grad
Score : 84 points
Date : 2023-01-03 13:02 UTC (9 hours ago)
(HTM) web link (stackoverflow.com)
(TXT) w3m dump (stackoverflow.com)
| markus_zhang wrote:
| Sounds fun, wish I had the talent to do some serious debugging.
| Izkata wrote:
| > Our code and compilers are constantly changing, and the problem
| disappeared as suddenly as it appeared ... only to happen again 2
| years later in a completely unrelated executable.
|
| It does not encourage me how much this sounds like the short
| story "Coding Machines". The original post even happened right
| about 2 years after the short story was posted, then in that
| comment reoccurred after another 2 years.
|
| https://www.teamten.com/lawrence/writings/coding-machines/
| moffkalast wrote:
| Damn, what a story.
|
| Though in reality it definitely would've been some guy on the
| compiler dev team adding it before publishing binaries. I
| wonder if you could set it up to inject a halt into compiled
| code that runs if some conditions are met, crashing most of the
| world's infrastructure on a predetermined date.
| highspeedbus wrote:
| That was a great read, thanks.
| ericbarrett wrote:
| This is great. Reminds me of a crash I saw early in my career. It
| was a null-pointer exception, except it occurred right after
| confirming the address was non-null. This was on a single core
| with a non-preemptible kernel. So the processor just took the
| wrong branch! There was simply no other explanation.
| mcculley wrote:
| What hardware/platform was this? I worked on AIX on POWER a
| long time ago and it had to map the zero page read-only just to
| support speculative execution of dereferencing the NULL
| pointer, if I remember right.
|
| If you were on a platform that did this wrong, speculative
| execution could have been dereferencing the NULL pointer.
| jrpelkonen wrote:
| Interesting, how did you fix it? Negate the comparison with an
| appropriate comment?
| Jiro wrote:
| Are you sure the compiler didn't say "since having a null
| pointer gives undefined behavior, we can optimize out the part
| that confirms the address is non-null"?
| logicchop wrote:
| This is likely your answer. C++ story. I worked at a large
| company that had a "no exceptions" policy and a custom
| operator new. If a new expression failed it would return
| nullptr instead of throwing. So lots of people wrote
| "checking" code to make sure the result wasn't nullptr,
| except that the compiler would always just elide that code
| since the standard mandates that the result cannot be
| nullptr. Many weird crashes ensued.
| aw1621107 wrote:
| There are non-throwing operator new overloads that can
| return nullptr, but I'm not sure if those are a relatively
| recent development. Did the non-throwing operator new
| overloads not exist at the time?
| logicchop wrote:
| Hard to say. Most of the uses probably predated the
| custom operator new and so nobody thought about it. Not
| to mention the places you cannot sneak into to switch to
| std::nothrow.
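
A minimal sketch of the std::nothrow form discussed above, assuming a
hypothetical helper named try_alloc_ints (not from the thread). A plain
new expression is assumed by the compiler to never yield nullptr, so a
check after it may be elided; the std::nothrow form is specified to
return nullptr on failure, so the check below has to stay.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical helper: allocate n zeroed ints, or return nullptr on failure.
int* try_alloc_ints(std::size_t n) noexcept {
    // Non-throwing form: failure is reported as nullptr, not std::bad_alloc.
    int* p = new (std::nothrow) int[n];
    if (p == nullptr) {
        return nullptr;  // this check is meaningful and cannot be optimized away
    }
    for (std::size_t i = 0; i < n; ++i) {
        p[i] = 0;
    }
    return p;
}
```

A caller would test the result before use and release it with delete[].
This does not help with allocations made inside code you cannot change,
which is the limitation logicchop points out.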
| JoeAltmaier wrote:
| Got to read the assembly to really know what happened.
|
| E.g. if the architecture has pointers with non-address bits
| (modes or segments or whatever) and those bits were set yet the
| rest of the address was 'null', and the check was for 'all bits
| zero' then you could conceivably get that situation.
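
A toy illustration of that scenario, assuming an invented layout (top 8
bits of a 64-bit word are a tag, low 56 bits are the address). The
layout and all names here are hypothetical and do not describe any
particular architecture.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: low 56 bits hold the address, high 8 bits hold a tag.
constexpr std::uint64_t kAddressMask = (std::uint64_t{1} << 56) - 1;

// Naive check: "non-null" means "not all bits zero".
constexpr bool naive_is_non_null(std::uint64_t word) {
    return word != 0;
}

// Stricter check: only the address bits decide whether the pointer is null.
constexpr bool address_is_non_null(std::uint64_t word) {
    return (word & kAddressMask) != 0;
}

// A word whose tag is set but whose address part is zero.
constexpr std::uint64_t kTaggedNull = std::uint64_t{0x01} << 56;
```

With these definitions, naive_is_non_null(kTaggedNull) is true while
address_is_non_null(kTaggedNull) is false, so code guarded by the naive
check would still dereference address zero.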
| dekhn wrote:
| One of the best bugs I've seen had a description fairly similar
| to this. Hot routine run at scale (floating point math for ads ML
| training) fails at a rate of about 0.000000001. Turned out to be a
| very obscure bug in the context switching code in the Linux
| kernel: the FP registers weren't being restored properly.
| logicchop wrote:
| I suspect that Windows still has a subtle FP restoration bug.
| We do large scale validation of floating point data and
| occasionally get ever so subtly different results.
| tremon wrote:
| Given that you say "subtly", have you ruled out
| rounding/precision errors? I wouldn't be surprised if some
| processors would play fast-and-loose with the number of
| significant bits they really honour.
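
As a small illustration of the rounding question (not a diagnosis of
the case above), floating point addition is not associative, so
evaluation order alone can change a result. The function names are
made up for this sketch.

```cpp
#include <cassert>

// Adds left to right: (a + b) + c.
double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

// Adds right to left: a + (b + c).
double sum_right(double a, double b, double c) {
    return a + (b + c);
}
```

With a = 1e30, b = -1e30, c = 1.0, sum_left gives 1.0 because the large
values cancel first, while sum_right gives 0.0 because 1.0 is lost when
added to -1e30. Any change that reorders a reduction (threading,
vectorization, a different compiler) can produce this kind of
difference without any hardware or kernel fault.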
| jacooper wrote:
| Debugging that must've been a PITA for sure.
| cybrox wrote:
| If you want to experience this in a slightly more controlled
| way, I'd recommend you give the game "Turing Complete" a spin.
|
| It lets you build your own Turing-complete processor and define
| a simple assembly language, starting from NAND gates, and you
| can create your own arbitrarily wild edge cases for specific
| opcode combinations.
| dekhn wrote:
| In another one, the debugging was aided by the fact that the
| developers had ensured that everything was accessed through const
| pointers, so it wasn't their code corrupting their memory.
| sidewndr46 wrote:
| It has been a while, but a switch to kernel mode followed by a
| switch back to the same user mode process doesn't actually mess
| with FP registers. The idea being, the kernel should not be
| using those anyways.
|
| Also minor point: a const pointer is a pointer which always
| points at the same address. You can still change what is
| pointed at. You probably meant "a pointer to const".
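
The distinction can be sketched as follows; const_pointer_demo is a
hypothetical name used only for this example.

```cpp
#include <cassert>

// Contrasts "const pointer" with "pointer to const".
int const_pointer_demo() {
    int a = 1;
    int b = 2;

    int* const const_ptr = &a;     // const pointer: always points at a
    *const_ptr = 10;               // allowed: the pointee is still writable
    // const_ptr = &b;             // would not compile: cannot be reseated

    const int* ptr_to_const = &a;  // pointer to const: read-only view of the pointee
    ptr_to_const = &b;             // allowed: the pointer itself can be reseated
    // *ptr_to_const = 20;         // would not compile: cannot write through it

    return a + *ptr_to_const;      // a is now 10, ptr_to_const points at b (2)
}
```

Only the pointer-to-const form stops code from writing through the
pointer, and even then the object can still be changed through another
non-const path.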
| lisper wrote:
| I had one of these back in the 90s that turned out to be a
| compiler bug. It was code that ran a mobile robot with an arm.
| Exact same code running on a Sun workstation never failed, but
| running on an embedded system running VxWorks crashed
| intermittently, but only when the arm was moving. Entire heap
| was corrupted, so by the time the crash occurred there was no
| hope of getting a stack trace or any hint of what went wrong
| upstream. Turned out to be two mis-ordered instructions that
| accessed a value on the stack after the stack pointer had been
| popped. On VxWorks, interrupts used the same stack as the
| currently running process, so if an interrupt occurred exactly
| between these two instructions it would clobber that value, and
| chaos ensued.
|
| Took a full year to figure it out. Good times.
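
A toy model of the race described above, under stated assumptions: the
stack is an array that grows downward, and toy_interrupt stands in for
an interrupt handler that borrows the running task's stack. None of
this is the actual compiler output or VxWorks code; every name is
invented for the sketch.

```cpp
#include <cassert>
#include <cstddef>

// Toy downward-growing stack shared by the task and the interrupt handler.
struct ToyStack {
    int slots[8] = {};
    std::size_t sp = 8;  // index of the top element; 8 means empty

    void push(int v) { slots[--sp] = v; }
    void pop() { ++sp; }  // frees the slot but leaves its old contents behind
};

// Stand-in for an interrupt that uses the current stack for scratch space.
void toy_interrupt(ToyStack& s) {
    s.push(-1);  // lands in whatever slot sits just below sp
    s.pop();
}

// Mis-ordered sequence: pop first, then read the slot that was just freed.
int read_after_pop(ToyStack& s, bool interrupt_in_window) {
    std::size_t slot = s.sp;
    s.pop();
    if (interrupt_in_window) toy_interrupt(s);  // overwrites the freed slot
    return s.slots[slot];
}

// Correct sequence: read while the slot is still owned, then pop.
int read_before_pop(ToyStack& s, bool interrupt_in_window) {
    int v = s.slots[s.sp];
    if (interrupt_in_window) toy_interrupt(s);  // writes below sp, v is untouched
    s.pop();
    return v;
}
```

After pushing 42, read_after_pop returns 42 when no interrupt lands in
the window and -1 when one does, which mirrors why the failure was
intermittent. read_before_pop returns 42 either way.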
| aw1621107 wrote:
| How did you end up piecing together what happened?
| lisper wrote:
| Long story but the tldr is that it happened in two stages.
| First someone figured out a way to reliably reproduce the
| problem. And then I spent a very long time single stepping
| through machine instructions until I had a eureka moment.
___________________________________________________________________
(page generated 2023-01-03 23:00 UTC)