Subj : Re: lwp_mutex_lock hang?
To   : comp.programming.threads
From : roger.faulkner
Date : Wed Mar 23 2005 09:09 am

Johan Piculell wrote:
> wrote in message
> news:1111555329.030413.248920@l41g2000cwc.googlegroups.com...
> > > Even if we have messed something up in our code (which usually is
> > > the case :-( ), is it really possible to have 3 threads in
> > > lwp_mutex_lock() on the same mutex?
> >
> > Yes, if the original owner thread (or process containing the thread)
> > terminated while it was still holding the lock.
> >
> > Also, the shared memory need not be mapped to the same address
> > in all of your processes.
> >
> > You need to attach a debugger to the wedged process and examine
> > the contents of the mutex at f9800000. See the declaration of
> > lwp_mutex_t in /usr/include/sys/synch.h.
>
> Thanks for replying Roger, not the first time you have helped me out...
> Debugging will be hard since the problem is on a customer system. I have,
> however, written a small program that extracts the information, which I
> will ask them to run. I have tested it here, but it would be good to have
> some more information on the contents of the mutex struct. Is there
> anywhere I can find this?
> When I run my tool on a locked shared mutex I get something like this:
>
> flag1       : 0
> flag2       :
> ceiling     :
> bcptype     : 0
> count_type1 :
> count_type2 :
> magic       : 4d58
> pad         :
> ownerpid    : 0
> lockword    : ff000000
> owner64     : ff000000
> data        : ff030a00
>
> Does this seem feasible? Why can't I see the owner pid, for example?
>
> I'm quite sure that the memory area is mapped correctly in all processes,
> since we have had this code running for at least 4 years now and I'm 100%
> sure we would have noticed if something were wrong with it.
>
> So you are saying that if the process dumps core or a thread terminates
> without unlocking, all other threads trying to lock can end up this way?
> Shouldn't a mutex_lock() on such a mutex return an error code (like
> EOWNERDEAD), after which we reinitialize it with mutex_init()? That is
> at least what we depend on in our code today.

By default in this case, you will block forever. You can get EOWNERDEAD
only if you initialize the mutex to be a priority-inheritance mutex (see
the pthread_mutexattr_getrobust_np() man page) or, if you are using
Solaris threads, if you initialize it with the USYNC_PROCESS_ROBUST flag
(see the mutex_init() man page). This, however, makes
pthread_mutex_lock()/mutex_lock() always be a slow operation, because the
kernel must be informed about the lock and must keep track of all such
locks owned by every thread.

Your dump of the mutex looks correct. 'ff000000' in the 'lockword' field
indicates that the mutex is locked. The 'data' field is the address, in
the owning process, of the thread struct of the thread that owns the
mutex. The 'ownerpid' field should hold the process-id of the owning
process. (There is a small race condition: in mutex_unlock(), 'ownerpid'
is cleared before the lock byte is cleared, and in mutex_lock(), the lock
byte is set before 'ownerpid' is set. That is probably not relevant for
your test case, only for your customer's application.) Of course, if the
mutex is not initialized as a process-shared mutex, 'ownerpid' is never
touched.

Roger Faulkner
Sun Microsystems
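
To make the EOWNERDEAD path Roger describes concrete, here is a minimal
sketch using the Solaris non-portable robust-mutex calls from the man page
he cites (pthread_mutexattr_setrobust_np() and its companion
pthread_mutex_consistent_np()). The helper names and the placement of the
mutex in shared memory are assumptions for illustration, not code from
the thread:

#include <pthread.h>
#include <errno.h>
#include <stdio.h>

/* Initialize a process-shared robust mutex.  On Solaris, robustness
 * requires the priority-inheritance protocol, as Roger notes.  The
 * mutex itself would live in the shared memory segment that all of
 * the cooperating processes map. */
int init_robust_mutex(pthread_mutex_t *mp)
{
        pthread_mutexattr_t attr;
        int err;

        if ((err = pthread_mutexattr_init(&attr)) != 0)
                return (err);
        (void) pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        (void) pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        (void) pthread_mutexattr_setrobust_np(&attr, PTHREAD_MUTEX_ROBUST_NP);
        err = pthread_mutex_init(mp, &attr);
        (void) pthread_mutexattr_destroy(&attr);
        return (err);
}

/* Lock, recovering if the previous owner died while holding the lock. */
int lock_with_recovery(pthread_mutex_t *mp)
{
        int err = pthread_mutex_lock(mp);

        if (err == EOWNERDEAD) {
                /* We now own the lock, but the data it protects may be
                 * inconsistent.  Repair the shared state here, then mark
                 * the mutex consistent so that later lockers do not get
                 * ENOTRECOVERABLE. */
                (void) fprintf(stderr, "previous owner died; recovering\n");
                /* ... application-specific repair ... */
                err = pthread_mutex_consistent_np(mp);
        }
        return (err);
}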
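
The Solaris-threads variant that Johan's code apparently relies on looks
similar, sketched under the same assumptions. Here recovery after
EOWNERDEAD is done by reinitializing the mutex with mutex_init(), which
is what Johan describes depending on; see the mutex_init() man page:

#include <synch.h>
#include <thread.h>
#include <errno.h>

/* mp points into the shared memory segment; one process initializes
 * it once with the robust flag before any locking begins. */
void setup_mutex(mutex_t *mp)
{
        (void) mutex_init(mp, USYNC_PROCESS_ROBUST, NULL);
}

int lock_robust(mutex_t *mp)
{
        int err = mutex_lock(mp);

        if (err == EOWNERDEAD) {
                /* The previous owner died while holding the lock.  Make
                 * the protected data consistent again, then reinitialize
                 * the mutex, the recovery step described in the thread. */
                /* ... application-specific repair ... */
                err = mutex_init(mp, USYNC_PROCESS_ROBUST, NULL);
        }
        return (err);
}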