Subj : Re: lwp_mutex_lock hang?
To   : comp.programming.threads
From : Johan Piculell
Date : Wed Mar 23 2005 02:25 pm

wrote in message news:1111555329.030413.248920@l41g2000cwc.googlegroups.com...
>
> Johan Piculell wrote:
> > Hi.
> > I have a strange lockup problem in our application that I just cannot
> > understand. We have only experienced this at one customer and not been
> > able to reproduce so far. Problem seems to occur several times per week
> > however, and it hangs on a shared memory mutex in several threads:
> >
> [snip]
> >
> > The reason for having a shared mutex is because we can parallelize using
> > threads and/or processes, but I cannot see any reference to f9800000 in
> > any other process either and not even a lwp_mutex_lock.
> > The hardware they are running on is a Sunfire 6800 with Solaris 8. We are
> > linking with the alternate thread library (/usr/lib/lwp), not sure about
> > the patch status but can easily be checked.
> >
> > Even if we have messed something up in our code (which usually is the
> > case :-( ), is it really possible to have 3 threads in lwp_mutex_lock()
> > on the same mutex?
>
> Yes, if the original owner thread (or process containing the thread)
> terminated while it was still holding the lock.
>
> Also, the shared memory need not be mapped to the same address
> in all of your processes.
>
> You need to attach a debugger to the wedged process and examine
> the contents of the mutex at f9800000. See the declaration of
> lwp_mutex_t in /usr/include/sys/synch.h.
>
> Roger Faulkner
> Sun Microsystems
>

Thanks for replying Roger, not the first time you have helped me out...

Debugging will be hard since the problem is on the customer's system. I have,
however, written a small program that extracts the information, and I will ask
them to try it. I have tested it here, but it would be good to have some more
info on the contents of the mutex struct. Is there anywhere I can find this?
When I run my tool on a locked shared mutex I get something like this:

  flag1       : 0
  flag2       :
  ceiling     :
  bcptype     : 0
  count_type1 :
  count_type2 :
  magic       : 4d58
  pad         :
  ownerpid    : 0
  lockword    : ff000000
  owner64     : ff000000
  data        : ff030a00

Does this seem feasible? Why can't I see the owner pid, for example?

I'm quite sure that the memory area is mapped correctly in all processes. We
have had this code running for at least 4 years now, and I'm 100% sure that we
would have noticed if there was something wrong with the mapping.

So you are saying that if the process cores or a thread terminates without
unlocking, all other threads trying to lock can end up this way? Shouldn't a
mutex_lock() on such a mutex return an error code (like EOWNERDEAD), after
which we reinitialize it with mutex_init()? This is at least what we depend
upon in our code today.

regards
/Johan
Ericsson AB - Sweden
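
For reference, the EOWNERDEAD behaviour asked about above can be sketched with
the POSIX robust mutex API. This is only an illustration under assumptions: it
uses pthread_mutexattr_setrobust() and pthread_mutex_consistent() rather than
the Solaris 8 mutex_init()/mutex_lock() calls our code uses, an anonymous
mmap() region stands in for the real shared memory segment, and the function
name lock_after_owner_died() is made up for the example. The point it shows is
that a lock only reports a dead owner if the mutex was initialized as robust;
an ordinary shared mutex would simply block.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* assumption: glibc; exposes MAP_ANONYMOUS and robust mutexes */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo, not the application's code.
 * A child process locks a robust, process-shared mutex and exits without
 * unlocking. Returns what pthread_mutex_lock() reports to the parent
 * afterwards, or -1 if the setup failed.
 */
int lock_after_owner_died(void)
{
    /* Anonymous shared mapping stands in for the real shared segment. */
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;

    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(m, sizeof *m);
        return -1;
    }
    /* Both attributes are needed: shared between processes, and robust. */
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0 ||
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) != 0 ||
        pthread_mutex_init(m, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        munmap(m, sizeof *m);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    pid_t pid = fork();
    if (pid < 0) {
        pthread_mutex_destroy(m);
        munmap(m, sizeof *m);
        return -1;
    }
    if (pid == 0) {
        pthread_mutex_lock(m);
        _exit(0);               /* owner dies while still holding the lock */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid) {
        munmap(m, sizeof *m);
        return -1;
    }

    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        /* We now hold the lock, but the protected data may be inconsistent.
         * After repairing it, mark the mutex usable again and release it. */
        int ok = pthread_mutex_consistent(m);
        assert(ok == 0);
        (void)ok;
        pthread_mutex_unlock(m);
    } else if (rc == 0) {
        pthread_mutex_unlock(m);
    }

    pthread_mutex_destroy(m);
    munmap(m, sizeof *m);
    return rc;
}
```

A caller can compare the return value against EOWNERDEAD to see whether the
dead owner was detected. Note that this sketch recovers the mutex in place
with pthread_mutex_consistent() instead of reinitializing it with
mutex_init() as described above.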