Subj: Re: How to diagnose/resolve libthread panic?
To: comp.programming.threads
From: David Butenhof
Date: Fri Jun 03 2005 02:17 am

Paul F. Pearson wrote:
> I've searched Google, and this group specifically, and can't find any
> good pointers on how to identify what's causing a libthread panic.
> Could someone point me in the right direction?
>
> Specifically, the panic is occurring in a mostly-Ada program (with some
> C linked in) on Solaris 8, patches up-to-date as of a couple of months
> ago. Our compiler is gnat 3.15p1, and the C code is compiled using the
> gcc included in that gnat. The process usually has about 277 lwps
> running. We're using the Solaris native threads. The problem isn't
> very repeatable, but happens just often enough to be troubling.
>
> I don't have handy the specific output of the libthread panic, but it's
> a SIGSEGV (signal 11).

This may not be reassuring, but there's approximately a 99.99999% chance that a SIGSEGV anywhere in a threaded process indicates a memory corruption bug. There is no reason at all to believe (or in fact even to hypothetically suspect) that there's a bug in the thread library (or libc, or any other library) just because code within that library was the victim. Shared data in a threaded process is shared.

The biggest cause of memory corruption bugs is writes through uninitialized pointers on the stack. (There are many others, but that's the biggest.) And they're most likely (statistically speaking) caused by the main application program rather than the system shared libraries. (Not that they don't have bugs, too; and they DO sometimes show up first for YOU, and they CAN sometimes have this result... but I've dealt with customers having this sort of problem for a number of years, and that's usually the safe bet at first glance.)

Problems like this usually show up under load, and are not easily repeatable. That's because the symptom is asynchronous. The thread that CAUSES it does the damage and moves on.
It may have trashed some other thread's stack, but the favorite is heap data. That's because when you interpret some random location on a thread's stack as a pointer, it most likely points into the calling thread's stack, the application's code, or the heap. Most likely, the address trashed is either free memory, or sitting on a queue somewhere, passive... like a time bomb ticking away. The problem shows up later (likely MUCH later, and in a "thread far, far away") when some innocent victim touches the corrupted data... to pull a free block off the heap cache, or to follow a perfectly innocent queue link that ought to (and once did) point to a blocked thread. The thread library and libc are especially vulnerable, because they tend to be called a lot; so the addresses of their internal data will frequently appear on the stack.

(I've often thought that a useful, though expensive, diagnostic mode would be a compiler switch that would "scrub" the stack in the procedure epilogue to prevent this... then again, you wouldn't compile that way for production code, so it'd just be less likely that you'd hit the problem in testing and even more likely that it'd show up at the big customer site on their busiest day of the year.)

The Tru64 UNIX "ATOM" tool could help to catch a lot of this sort of problem with low-overhead binary instrumentation. There are other tools of that nature. The problem is that if you didn't catch it the first time, it may be really difficult to find it again. And rebuilding (or even instrumenting the binary you were running, as ATOM could do) can change the timing and stack layout enough that you'll blow right by the symptom you were trying to catch.

This class of bug is notoriously difficult to track down. As a co-worker of mine once said when asked by a manager to estimate how long it'd take to fix a critical problem: "I'm 99% certain it'll take 2 seconds to fix. The problem is that it may take a month to figure out WHAT to fix."
Often the best strategy is to figure out the rough area of code where it's likely to be happening, and then sit down with a couple of good friends and a pizza and do a thorough line-by-line code review.

--
Dave Butenhof, David.Butenhof@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062