Subj : Re: lwp_mutex_lock hang?
To   : comp.programming.threads
From : roger.faulkner
Date : Wed Mar 23 2005 09:09 am

Johan Piculell wrote:
> wrote in message
> news:1111555329.030413.248920@l41g2000cwc.googlegroups.com...
> > > Even if we have messed something up in our code (which usually is
> > > the case :-( ), is it really possible to have 3 threads in
> > > lwp_mutex_lock() on the same mutex?
> >
> > Yes, if the original owner thread (or process containing the thread)
> > terminated while it was still holding the lock.
> >
> > Also, the shared memory need not be mapped to the same address
> > in all of your processes.
> >
> > You need to attach a debugger to the wedged process and examine
> > the contents of the mutex at f9800000. See the declaration of
> > lwp_mutex_t in /usr/include/sys/synch.h.
>
> Thanks for replying Roger, not the first time you have helped me out...
> Debugging will be hard since the problem is on a customer system. I have,
> however, written a small program that extracts the information, which I
> will ask them to run. I have tested it here, but it would be good to have
> some more information on the contents of the mutex struct. Is there
> anywhere I can find this?
> When I run my tool on a locked shared mutex I get something like this:
>
> flag1       : 0
> flag2       :
> ceiling     :
> bcptype     : 0
> count_type1 :
> count_type2 :
> magic       : 4d58
> pad         :
> ownerpid    : 0
> lockword    : ff000000
> owner64     : ff000000
> data        : ff030a00
>
> Does this seem feasible? Why can't I see the owner pid, for example?
>
> I'm quite sure that the memory area is mapped correctly in all processes,
> since we have had this code running for at least 4 years now and I'm 100%
> sure we would have noticed if something were wrong with it.
>
> So you are saying that if the process dumps core or a thread terminates
> without unlocking, all other threads trying to lock can end up this way?
> Shouldn't a mutex_lock() on such a mutex return an error code (like
> EOWNERDEAD), after which we reinitialize it with mutex_init()? That is
> at least what we depend on in our code today.

By default in this case, you will block forever. You can get EOWNERDEAD
only if you initialize the mutex to be a priority-inheritance mutex (see
the pthread_mutexattr_getrobust_np() man page) or, if you are using
Solaris threads, if you initialize it with the USYNC_PROCESS_ROBUST flag
(see the mutex_init() man page). This, however, makes
pthread_mutex_lock()/mutex_lock() always be a slow operation, because the
kernel must be informed about the lock and must keep track of all such
locks owned by every thread.

Your dump of the mutex looks correct. 'ff000000' in the 'lockword' field
indicates that the mutex is locked. The 'data' field is the address, in
the owning process, of the thread struct of the thread that owns the
mutex. The 'ownerpid' field should hold the process-id of the owning
process. (There is a small race condition: in mutex_unlock(), 'ownerpid'
is cleared before the lock byte is cleared, and in mutex_lock(), the lock
byte is set before 'ownerpid' is set. That is probably not relevant for
your test case, only for your customer's application.) Of course, if the
mutex is not initialized as a process-shared mutex, 'ownerpid' is never
touched.

Roger Faulkner
Sun Microsystems
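
To make the EOWNERDEAD path Roger describes concrete, here is a minimal
sketch using the Solaris non-portable robust-mutex calls from the man page
he cites (pthread_mutexattr_setrobust_np() and its companion
pthread_mutex_consistent_np()). The helper names and the placement of the
mutex in shared memory are assumptions for illustration, not code from
the thread:

#include <pthread.h>
#include <errno.h>
#include <stdio.h>

/* Initialize a process-shared robust mutex.  On Solaris, robustness
 * requires the priority-inheritance protocol, as Roger notes.  The
 * mutex itself would live in the shared memory segment that all of
 * the cooperating processes map. */
int init_robust_mutex(pthread_mutex_t *mp)
{
        pthread_mutexattr_t attr;
        int err;

        if ((err = pthread_mutexattr_init(&attr)) != 0)
                return (err);
        (void) pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        (void) pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        (void) pthread_mutexattr_setrobust_np(&attr, PTHREAD_MUTEX_ROBUST_NP);
        err = pthread_mutex_init(mp, &attr);
        (void) pthread_mutexattr_destroy(&attr);
        return (err);
}

/* Lock, recovering if the previous owner died while holding the lock. */
int lock_with_recovery(pthread_mutex_t *mp)
{
        int err = pthread_mutex_lock(mp);

        if (err == EOWNERDEAD) {
                /* We now own the lock, but the data it protects may be
                 * inconsistent.  Repair the shared state here, then mark
                 * the mutex consistent so that later lockers do not get
                 * ENOTRECOVERABLE. */
                (void) fprintf(stderr, "previous owner died; recovering\n");
                /* ... application-specific repair ... */
                err = pthread_mutex_consistent_np(mp);
        }
        return (err);
}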
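
The Solaris-threads variant that Johan's code apparently relies on looks
similar, sketched under the same assumptions. Here recovery after
EOWNERDEAD is done by reinitializing the mutex with mutex_init(), which
is what Johan describes depending on; see the mutex_init() man page:

#include <synch.h>
#include <thread.h>
#include <errno.h>

/* mp points into the shared memory segment; one process initializes
 * it once with the robust flag before any locking begins. */
void setup_mutex(mutex_t *mp)
{
        (void) mutex_init(mp, USYNC_PROCESS_ROBUST, NULL);
}

int lock_robust(mutex_t *mp)
{
        int err = mutex_lock(mp);

        if (err == EOWNERDEAD) {
                /* The previous owner died while holding the lock.  Make
                 * the protected data consistent again, then reinitialize
                 * the mutex, the recovery step described in the thread. */
                /* ... application-specific repair ... */
                err = mutex_init(mp, USYNC_PROCESS_ROBUST, NULL);
        }
        return (err);
}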