Subj : Re: lwp_mutex_lock hang?
To : comp.programming.threads
From : Johan Piculell
Date : Wed Mar 23 2005 03:25 pm

"Johan Piculell" wrote in message
news:d1rqfn$l8g$1@newstree.wise.edt.ericsson.se...
>
> wrote in message
> news:1111555329.030413.248920@l41g2000cwc.googlegroups.com...
> >
> > Johan Piculell wrote:
> > > Hi.
> > > I have a strange lockup problem in our application that I just cannot
> > > understand. We have only experienced this at one customer and not
> > been able
> > > to reproduce so far. Problem seems to occur several times per week
> > however,
> > > and it hangs on a shared memory mutex in several threads:
> > >
> > [snip]
> > >
> > > The reason for having a shared mutex is because we can parallelize
> > using
> > > threads and/or processes, but I cannot see any reference to f9800000
> > in any
> > > other process either and not even a lwp_mutex_lock.
> > > The hardware they are running on is a Sunfire 6800 with Solaris 8. We
> > are
> > > linking with the alternate thread library (/usr/lib/lwp), not sure
> > about the
> > > patch status but can easily be checked.
> > >
> > > Even if we have messed something up in our code (which usually is the
> > case
> > > :-( ), is it really possible to have 3 threads in lwp_mutex_lock() on
> > the
> > > same mutex?
> >
> > Yes, if the original owner thread (or process containing the thread)
> > terminated while it was still holding the lock.
> >
> > Also, the shared memory need not be mapped to the same address
> > in all of your processes.
> >
> > You need to attach a debugger to the wedged process and examine
> > the contents of the mutex at f9800000. See the declaration of
> > lwp_mutex_t in /usr/include/sys/synch.h.
> >
> > Roger Faulkner
> > Sun Microsystems
> >
>
> Thanks for replying Roger, not the first time you help me out...
> Debugging will be hard since the problem is on customer system, I have
> however written a small prog. that extracts the information that I will ask
> them to try. I have tested this here but would be good to have some more
> info on the contents of the mutex struct. Anywhere I can find this?
> When I run my tool on a locked shared mutex I get something like this:
> flag1 : 0
> flag2 :
> ceiling :
> bcptype : 0
> count_type1 :
> count_type2 :
> magic : 4d58
> pad :
> ownerpid : 0
> lockword : ff000000
> owner64 : ff000000
> data : ff030a00
>
> Does this seem feasible? Why can't I see the owner pid for example?
>
> I'm quite sure that the memory area is mapped correctly in all processes
> since we have had this code running for at least 4 years now and I'm 100%
> sure that we would have noticed any problems at all with this if there was
> something wrong.
>
> So you are saying that if the process cores or a thread terminates without
> doing unlock, all other threads trying to lock can end up in this way?
> Shouldn't a mutex_lock() to such a mutex return an error code (like
> EOWNERDEAD), and then we reinitialize it with mutex_init()? This is at least
> what we depend upon in our code today.
>
> regards
> /Johan
> Ericsson AB - Sweden
>
>

Just a small comment, the output above was on Solaris 9. Just tried on
Solaris 8 and then the output made more sense, the ownerpid matched for some
reason.

/Johan
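
For reference, the recovery pattern Johan describes (mutex_lock() failing
with EOWNERDEAD on a robust, process-shared mutex, followed by
re-initialization with mutex_init()) looks roughly like the sketch below,
using the Solaris threads API from <synch.h>. This is only an illustration
of that sequence, not the actual application code; recover_shared_state()
is a hypothetical stand-in for whatever application-specific repair the
protected data needs.

    #include <synch.h>
    #include <errno.h>
    #include <stdio.h>

    /* Hypothetical application-specific routine that repairs the data
     * protected by the mutex after its previous owner died. */
    extern void recover_shared_state(void);

    /* One process initializes the mutex that lives in shared memory.
     * USYNC_PROCESS_ROBUST makes it both process-shared and robust, so
     * mutex_lock() reports a dead owner with EOWNERDEAD instead of
     * blocking forever. */
    void init_shared_mutex(mutex_t *mp)
    {
            mutex_init(mp, USYNC_PROCESS_ROBUST, NULL);
    }

    void do_work_locked(mutex_t *mp)
    {
            int rc = mutex_lock(mp);

            if (rc == EOWNERDEAD) {
                    /* The previous owner (thread or whole process) died
                     * while holding the lock. We now hold it, but the
                     * shared data may be inconsistent: repair it and then
                     * reinitialize the mutex, as the mutex_init man page
                     * recommends, before carrying on. */
                    recover_shared_state();
                    mutex_init(mp, USYNC_PROCESS_ROBUST, NULL);
            } else if (rc != 0) {
                    fprintf(stderr, "mutex_lock: error %d\n", rc);
                    return;
            }

            /* ... critical section on the shared data ... */

            mutex_unlock(mp);
    }

On Solaris 8/9 this would be built against libthread (-lthread); whether
the standard library or the alternate one in /usr/lib/lwp that Johan
mentions is used is decided at run time by the library search path.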