Subj : Re: lwp_mutex_lock hang?
To   : comp.programming.threads
From : Johan Piculell
Date : Wed Mar 23 2005 02:25 pm


<roger.faulkner@sun.com> wrote in message
news:1111555329.030413.248920@l41g2000cwc.googlegroups.com...
>
> Johan Piculell wrote:
> > Hi.
> > I have a strange lockup problem in our application that I just cannot
> > understand. We have only experienced this at one customer and not been
> > able to reproduce so far. Problem seems to occur several times per week
> > however, and it hangs on a shared memory mutex in several threads:
> >
> [snip]
> >
> > The reason for having a shared mutex is because we can parallelize using
> > threads and/or processes, but I cannot see any reference to f9800000 in
> > any other process either and not even a lwp_mutex_lock.
> > The hardware they are running on is a Sunfire 6800 with Solaris 8. We are
> > linking with the alternate thread library (/usr/lib/lwp), not sure about
> > the patch status but can easily be checked.
> >
> > Even if we have messed something up in our code (which usually is the
> > case :-( ), is it really possible to have 3 threads in lwp_mutex_lock()
> > on the same mutex?
>
> Yes, if the original owner thread (or process containing the thread)
> terminated while it was still holding the lock.
>
> Also, the shared memory need not be mapped to the same address
> in all of your processes.
>
> You need to attach a debugger to the wedged process and examine
> the contents of the mutex at f9800000.  See the declaration of
> lwp_mutex_t in /usr/include/sys/synch.h.
>
> Roger Faulkner
> Sun Microsystems
>

Thanks for replying, Roger; not the first time you have helped me out...
Debugging will be hard since the problem is on the customer's system. I have,
however, written a small program that extracts the information, and I will ask
them to try it. I have tested it here, but it would be good to have some more
information on the contents of the mutex struct. Is it documented anywhere?
When I run my tool on a locked shared mutex I get something like this:
flag1       : 0
flag2       :
ceiling     :
bcptype     : 0
count_type1 :
count_type2 :
magic       : 4d58
pad         :
ownerpid    : 0
lockword    : ff000000
owner64     : ff000000
data        : ff030a00

Does this look plausible? Why can't I see the owner pid, for example?
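
Several fields above print as blank, so a raw byte dump next to the
field-by-field output may be easier to interpret. A minimal sketch in C,
assuming the tool already has a pointer to the mapped mutex; hex_dump() is a
hypothetical helper name, not part of any Solaris library:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format len bytes starting at p as space-separated two-digit hex values.
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if the output buffer is too small. hex_dump() is a hypothetical
 * helper for illustration only. */
int hex_dump(const void *p, size_t len, char *out, size_t outsz)
{
    const unsigned char *b = p;
    size_t used = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < len; i++) {
        /* Separate bytes with a single space, no leading space. */
        int n = snprintf(out + used, outsz - used, "%s%02x",
                         i ? " " : "", b[i]);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;          /* would not fit */
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling it as hex_dump(mp, sizeof *mp, buf, sizeof buf) on the mutex pointer
mp shows every byte, including the single-byte members, regardless of how the
individual fields are formatted.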

I'm quite sure that the memory area is mapped correctly in all processes.
We have had this code running for at least 4 years now, and I'm confident
that we would have noticed a mapping problem long ago if there were one.

So you are saying that if the process dumps core, or a thread terminates
without unlocking, all other threads trying to lock can end up like this?
Shouldn't a mutex_lock() on such a mutex return an error code (like
EOWNERDEAD), after which we reinitialize it with mutex_init()? That is at
least what our code depends upon today.
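
To make the expected behaviour concrete, here is a minimal sketch using the
POSIX robust-mutex interface rather than the Solaris mutex_init() calls we
actually use, so it is an analogy and not our code. It assumes a platform
that provides pthread_mutexattr_setrobust() and anonymous shared mappings;
lock_after_owner_death() is a hypothetical name. As far as I know, the
Solaris counterpart depends on the mutex having been initialized with the
robust type (USYNC_PROCESS_ROBUST); without it a waiter would not be told
that the owner died.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS on some systems */
#endif
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A child process locks a process-shared robust mutex and exits without
 * unlocking. Returns what the parent's next pthread_mutex_lock() reports,
 * or -1 if the setup fails. Hypothetical helper for illustration only. */
int lock_after_owner_death(void)
{
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;

    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0 ||
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0 ||
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) != 0 ||
        pthread_mutex_init(m, &attr) != 0) {
        munmap(m, sizeof *m);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(m, sizeof *m);
        return -1;
    }
    if (pid == 0) {
        /* Child: take the lock and die while still holding it. */
        _exit(pthread_mutex_lock(m) == 0 ? 0 : 1);
    }
    waitpid(pid, NULL, 0);

    /* Parent: with a robust mutex this returns EOWNERDEAD instead of
     * blocking forever, and the caller now holds the lock. */
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD)
        pthread_mutex_consistent(m);    /* mark the state as repaired */
    if (rc == 0 || rc == EOWNERDEAD)
        pthread_mutex_unlock(m);

    pthread_mutex_destroy(m);
    munmap(m, sizeof *m);
    return rc;
}
```

If the same mutex were created without the robust attribute, the parent's
lock call would simply block, which matches the hang described above.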

regards
/Johan
Ericsson AB - Sweden
