https://stackoverflow.com/questions/4703844/unexplainable-core-dump

"Unexplainable" core dump

I've seen many core dumps in my life, but this one has me stumped.

Context:

* multi-threaded Linux/x86_64 program running on a cluster of AMD Barcelona CPUs
* the code that crashes is executed a lot
* running 1000 instances of the program (the exact same optimized binary) under load produces 1-2 crashes per hour
* the crashes happen on different machines (but the machines themselves are pretty identical)
* the crashes all look the same (same exact address, same call stack)

Here are the details of the crash:

    Program terminated with signal 11, Segmentation fault.
    #0  0x00000000017bd9fd in Foo()
    (gdb) x/i $pc
    => 0x17bd9fd <_Z3Foov+349>: rex.RB orb $0x8d,(%r15)
    (gdb) x/6i $pc-12
       0x17bd9f1 <_Z3Foov+337>: mov    (%rbx),%eax
       0x17bd9f3 <_Z3Foov+339>: mov    %rbx,%rdi
       0x17bd9f6 <_Z3Foov+342>: callq  *0x70(%rax)
       0x17bd9f9 <_Z3Foov+345>: cmp    %eax,%r12d
       0x17bd9fc <_Z3Foov+348>: mov    %eax,-0x80(%rbp)
       0x17bd9ff <_Z3Foov+351>: jge    0x17bd97e <_Z3Foov+222>

You'll notice that the crash happened in the middle of instruction at 0x17bd9fc, which is after return from a call at 0x17bd9f6 to a virtual function.

When I examine the virtual table, I see that it is not corrupted in any way:

    (gdb) x/a $rbx
    0x2ab094951f80: 0x3f8c550 <_ZTI4Foo1+16>
    (gdb) x/a 0x3f8c550+0x70
    0x3f8c5c0 <_ZTI4Foo1+128>: 0x2d3d7b0 <_ZN4Foo13GetEv>

and that it points to this trivial function (as expected by looking at the source):

    (gdb) disas 0x2d3d7b0
    Dump of assembler code for function _ZN4Foo13GetEv:
       0x0000000002d3d7b0 <+0>: push   %rbp
       0x0000000002d3d7b1 <+1>: mov    0x70(%rdi),%eax
       0x0000000002d3d7b4 <+4>: mov    %rsp,%rbp
       0x0000000002d3d7b7 <+7>: leaveq
       0x0000000002d3d7b8 <+8>: retq
    End of assembler dump.

Further, when I look at the return address that Foo1::Get() should have returned to:

    (gdb) x/a $rsp-8
    0x2afa55602048: 0x17bd9f9 <_Z3Foov+345>

I see that it points to the right instruction, so it's as if during the return from Foo1::Get(), some gremlin came along and incremented %rip by 4.

Plausible explanations?

Tags: linux, segmentation-fault, x86-64

asked Jan 16, 2011 at 4:42 by Employed Russian

Comments:

* Did you ever find out what caused this? If so, I'd be very interested to hear what it was! - us2012 Mar 24, 2013 at 16:11
* @us2012 I believe we did find the cause. See my answer. - Employed Russian Apr 6, 2013 at 19:00

2 Answers

So, unlikely as it may seem, we appear to have hit an actual bona-fide CPU bug.

https://web.archive.org/web/20130228081435/http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf has erratum #721:

    721 Processor May Incorrectly Update Stack Pointer

    Description

    Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the stack pointer after a long series of push and/or near-call instructions, or a long series of pop and/or near-return instructions. The processor must be in 64-bit mode for this erratum to occur.

    Potential Effect on System

    The stack pointer value jumps by a value of approximately 1024, either in the positive or negative direction. This incorrect stack pointer causes unpredictable program or system behavior, usually observed as a program exception or crash (for example, a #GP or #UD).

    Suggested Workaround

    System software may set MSRC001_1029[0] = 1b.

answered Apr 6, 2013 at 18:59 by Employed Russian

Comments:

* Ouch. Is it actually a "highly specific" condition - i.e., did you manage to fix it by slightly changing the code produced at the problematic point? - us2012 Apr 6, 2013 at 20:37
* @us2012 Our code and compilers are constantly changing, and the problem disappeared as suddenly as it appeared ... only to happen again 2 years later in a completely unrelated executable. - Employed Russian Apr 6, 2013 at 21:46

I've once seen an "illegal opcode" crash right in the middle of an instruction. I was working on a Linux port.
Long story short, Linux subtracts from the instruction pointer in order to restart a syscall, and in my case this was happening twice (if two signals arrived at the same time).

So that's one possible culprit: the kernel fiddling with your instruction pointer. There may be some other cause in your case.

Bear in mind that sometimes the processor will understand the data it's processing as an instruction, even when it's not supposed to be. So the processor may have executed the "instruction" at 0x17bd9fa and then moved on to 0x17bd9fd and then generated an illegal opcode exception. (I just made that number up, but experimenting with a disassembler can show you where the processor might have "entered" the instruction stream.)

Happy debugging!

answered Jan 16, 2011 at 4:56 by Artelius

Comments:

* I have considered signals, but there are several "strikes" against them being the cause: 1. note that there are no system calls anywhere around this code; 2. this thread should not be receiving any async signals; 3. if a signal was causing this, how do you explain the crash happening on exact same address in all crashed programs? - Employed Russian Jan 16, 2011 at 5:07
* I didn't suggest your problem may be signals. (That was just the bug in the port that was behind my problem.) My point was that factors completely external to your program - like a kernel bug - may be causing this problem. Another thing that can mess with your instruction pointer is exception handling. - Artelius Jan 22, 2011 at 0:19
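The disassembler experiment suggested in the answer above can start from plain address arithmetic. The following is only an illustrative sketch, not part of the original discussion: it takes the addresses printed in the question's gdb session and reports which listed instruction the crash PC falls inside, and how far it sits from the saved return address. The names `INSN_STARTS` and `locate` are hypothetical helpers introduced here, and the start addresses are used exactly as gdb printed them.

```python
from bisect import bisect_right

# Instruction start addresses as printed by `x/6i $pc-12` in the question.
INSN_STARTS = [0x17bd9f1, 0x17bd9f3, 0x17bd9f6, 0x17bd9f9, 0x17bd9fc, 0x17bd9ff]

CRASH_PC = 0x17bd9fd     # from `x/i $pc`
RETURN_ADDR = 0x17bd9f9  # from `x/a $rsp-8`


def locate(pc, starts):
    """Return (start, offset) for the last listed instruction starting at or before pc."""
    i = bisect_right(starts, pc) - 1
    if i < 0:
        raise ValueError("pc precedes the listing")
    return starts[i], pc - starts[i]


start, offset = locate(CRASH_PC, INSN_STARTS)
print(f"crash pc {CRASH_PC:#x} is {offset} byte(s) into the instruction at {start:#x}")
print(f"crash pc is {CRASH_PC - RETURN_ADDR} byte(s) past the return address {RETURN_ADDR:#x}")
```

With the question's numbers, the crash PC lands one byte into the `mov` at 0x17bd9fc and four bytes past the return address 0x17bd9f9, which matches the "incremented %rip by 4" observation.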