Subj : Re: lwp_mutex_lock hang?
To   : comp.programming.threads
From : Johan Piculell
Date : Wed Mar 23 2005 06:45 pm

wrote in message
news:1111597775.582037.290480@l41g2000cwc.googlegroups.com...
>
> Johan Piculell wrote:
> > wrote in message
> > news:1111555329.030413.248920@l41g2000cwc.googlegroups.com...
> >
> > > > Even if we have messed something up in our code (which usually is
> > > > the case :-( ), is it really possible to have 3 threads in
> > > > lwp_mutex_lock() on the same mutex?
> > >
> > > Yes, if the original owner thread (or process containing the thread)
> > > terminated while it was still holding the lock.
> > >
> > > Also, the shared memory need not be mapped to the same address
> > > in all of your processes.
> > >
> > > You need to attach a debugger to the wedged process and examine
> > > the contents of the mutex at f9800000. See the declaration of
> > > lwp_mutex_t in /usr/include/sys/synch.h.
> >
> > Thanks for replying, Roger; not the first time you have helped me out...
> > Debugging will be hard since the problem is on a customer system. I have,
> > however, written a small program that extracts the information, and I
> > will ask them to try it. I have tested it here, but it would be good to
> > have some more information on the contents of the mutex struct. Is there
> > anywhere I can find this?
> >
> > When I run my tool on a locked shared mutex I get something like this:
> >
> > flag1       : 0
> > flag2       :
> > ceiling     :
> > bcptype     : 0
> > count_type1 :
> > count_type2 :
> > magic       : 4d58
> > pad         :
> > ownerpid    : 0
> > lockword    : ff000000
> > owner64     : ff000000
> > data        : ff030a00
> >
> > Does this seem feasible? Why can't I see the owner pid, for example?
> >
> > I'm quite sure that the memory area is mapped correctly in all processes,
> > since we have had this code running for at least 4 years now and I'm 100%
> > sure we would have noticed if something was wrong with it.
> >
> > So you are saying that if the process dumps core, or a thread terminates
> > without doing the unlock, all other threads trying to lock can end up
> > this way? Shouldn't a mutex_lock() on such a mutex return an error code
> > (like EOWNERDEAD), after which we reinitialize it with mutex_init()?
> > This is at least what we depend upon in our code today.
>
> By default in this case, you will block forever.
>
> You can get EOWNERDEAD only if you initialize the mutex to be
> a priority inheritance mutex (see the pthread_mutexattr_getrobust_np()
> man page) or, if you are using Solaris threads, it must be initialized
> with the USYNC_PROCESS_ROBUST flag (see the mutex_init() man page).
>
> This, however, makes pthread_mutex_lock()/mutex_lock() always be a slow
> operation because the kernel must be informed about the lock and it
> must keep track of all such locks owned by every thread.
>
> Your dump of the mutex looks correct. 'ff000000' in the 'lockword'
> field indicates that the mutex is locked. The 'data' field is
> the address in the owning process of the thread struct of the
> thread that owns the mutex. The 'ownerpid' field should hold
> the process-id of the owning process (there is a small race condition
> in mutex_unlock() where 'ownerpid' is cleared before the lock byte
> is cleared, and in mutex_lock() where the lock byte is set before
> the 'ownerpid' field is set, but that is probably not relevant for
> your test case, only for your customer's application).
>
> Of course, if the mutex is not initialized as a process-shared mutex,
> 'ownerpid' is never touched.
>
> Roger Faulkner
> Sun Microsystems
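For anyone following along, the pattern Roger describes boils down to roughly
the following. This is only a sketch against the Solaris threads API (see the
mutex_init(3C) man page); the names shared_region_t, region_init(),
region_lock() and the repair step are made up for illustration and are not
our actual code. Link with -lthread.

/*
 * Sketch only: a robust, process-shared Solaris mutex as discussed above.
 */
#include <synch.h>
#include <errno.h>

typedef struct {
        mutex_t lock;           /* lives in the shared memory segment */
        /* ... application data protected by 'lock' ... */
} shared_region_t;

/* Run once by the process that creates the shared segment. */
int
region_init(shared_region_t *r)
{
        return (mutex_init(&r->lock, USYNC_PROCESS_ROBUST, NULL));
}

/*
 * Lock the region, recovering if a previous owner died while holding
 * the mutex.  Returns 0 with the lock held, or an error number.
 */
int
region_lock(shared_region_t *r)
{
        int err = mutex_lock(&r->lock);

        if (err == EOWNERDEAD) {
                /*
                 * A previous owner terminated inside the critical
                 * section; we now hold the lock but the protected data
                 * may be inconsistent.  Repair it (application-specific),
                 * then reinitialize the mutex as described above and in
                 * mutex_init(3C), and carry on as the owner.
                 */
                /* repair_shared_data(r);  -- application-specific */
                (void) mutex_init(&r->lock, USYNC_PROCESS_ROBUST, NULL);
                err = 0;
        }
        return (err);
}

The point being that every caller of mutex_lock() has to be prepared for
EOWNERDEAD, repair whatever the dead owner may have left half-done, and
reinitialize before carrying on.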
We are definitely initializing with USYNC_PROCESS_ROBUST, and performance is
not very critical here since we hold the lock for quite long periods. But I
just got more evidence that the mutex seems to be failing: two stack traces
showing threads in two different processes inside the critical region while
yet another thread appeared to hold the lock. I don't know what happened
there, but I need to go through our code and see what we have changed
recently, because this always worked flawlessly before. I should have some
information from the memory dump tomorrow and will see where that leads.

/Johan
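PS: For anyone curious, the kind of extraction tool mentioned above can be as
simple as attaching the shared segment read-only and hex-dumping the mutex_t,
then reading the bytes against the lwp_mutex_t declaration in
/usr/include/sys/synch.h. The sketch below is not the actual tool; it assumes
a System V shared memory segment, and SHM_KEY and MUTEX_OFFSET are
placeholders.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <synch.h>
#include <stdio.h>

#define SHM_KEY         0x1234          /* placeholder */
#define MUTEX_OFFSET    0               /* placeholder */

int
main(void)
{
        int id;
        char *base;
        unsigned char *p;
        size_t i;

        if ((id = shmget(SHM_KEY, 0, 0)) == -1) {
                perror("shmget");
                return (1);
        }
        if ((base = shmat(id, NULL, SHM_RDONLY)) == (char *)-1) {
                perror("shmat");
                return (1);
        }

        /*
         * Hex-dump the raw mutex; match the byte offsets against the
         * lwp_mutex_t fields (flag1, magic, ownerpid, lockword, data, ...)
         * declared in sys/synch.h.
         */
        p = (unsigned char *)(base + MUTEX_OFFSET);
        for (i = 0; i < sizeof (mutex_t); i++)
                printf("%02x%c", p[i], ((i + 1) % 8) ? ' ' : '\n');

        (void) shmdt(base);
        return (0);
}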