Subj: Re: How to diagnose/resolve libthread panic?
To: comp.programming.threads
From: David Butenhof
Date: Fri Jun 03 2005 02:17 am

Paul F. Pearson wrote:
> I've searched Google, and this group specifically, and can't find any
> good pointers on how to identify what's causing a libthread panic.
> Could someone point me in the right direction?
>
> Specifically, the panic is occurring in a mostly-Ada program (with some
> C linked in) on Solaris 8, patches up-to-date as of a couple of months
> ago. Our compiler is gnat 3.15p1, and the C code is compiled using the
> gcc included in that gnat. The process usually has about 277 lwps
> running. We're using the Solaris native threads. The problem isn't
> very repeatable, but happens just often enough to be troubling.
>
> I don't have handy the specific output of the libthread panic, but it's
> a SIGSEGV (signal 11).

This may not be reassuring, but there's approximately a 99.99999% chance that a SIGSEGV anywhere in a threaded process indicates a memory corruption bug. There is no reason at all to believe (or in fact even to hypothetically suspect) that there's a bug in the thread library (or libc, or any other library) just because code within that library was the victim. Shared data in a threaded process is shared.

The biggest cause of memory corruption bugs is writes through uninitialized pointers on the stack. (There are many others, but that's the biggest.) And they're most likely (statistically speaking) caused by the main application program rather than the system shared libraries. (Not that they don't have bugs, too; and they DO sometimes show up first for YOU, and they CAN sometimes have this result... but I've dealt with customers having this sort of problem for a number of years, and that's usually the safe bet at first glance.)

Problems like this usually show up under load, and are not easily repeatable. That's because the symptom is asynchronous. The thread that CAUSES it does the damage and moves on.
It may have trashed some other thread's stack, but the favorite is heap data. That's because when you interpret some random location on a thread's stack as a pointer, it most likely points into the calling thread's stack, the application's code, or the heap. Most likely, the address trashed is either free memory, or sitting on a queue somewhere, passive... like a time bomb ticking away. The problem shows up later (likely MUCH later, and in a "thread far, far away") when some innocent victim touches the corrupted data... to pull a free block off the heap cache, or to follow a perfectly innocent queue link that ought to (and once did) point to a blocked thread. The thread library and libc are especially vulnerable, because they tend to be called a lot; so the addresses of their internal data will frequently appear on the stack.

(I've often thought that a useful, though expensive, diagnostic mode would be a compiler switch that would "scrub" the stack in the procedure epilogue to prevent this... then again, you wouldn't compile that way for production code, so it'd just be less likely that you'd hit the problem in testing and even more likely that it'd show up at the big customer site on their busiest day of the year.)

The Tru64 UNIX "ATOM" tool could help to catch a lot of this sort of problem with low-overhead binary instrumentation. There are other tools of that nature. The problem is that if you didn't catch it the first time, it may be really difficult to find it again. And rebuilding (or even instrumenting the binary you were running, as ATOM could do) can change the timing and stack layout enough that you'll blow right by the symptom you were trying to catch.

This class of bug is notoriously difficult to track down. As a co-worker of mine once said when asked by a manager to estimate how long it'd take to fix a critical problem: "I'm 99% certain it'll take 2 seconds to fix. The problem is that it may take a month to figure out WHAT to fix."
Often the best strategy is to figure out the rough area of code where it's likely to be happening, and then sit down with a couple of good friends and a pizza and do a thorough line-by-line code review.

--
Dave Butenhof, David.Butenhof@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062