Subj : Re: lwp_mutex_lock hang?
To : comp.programming.threads
From : Johan Piculell
Date : Wed Mar 23 2005 03:25 pm

"Johan Piculell" wrote in message
news:d1rqfn$l8g$1@newstree.wise.edt.ericsson.se...
>
> wrote in message
> news:1111555329.030413.248920@l41g2000cwc.googlegroups.com...
> >
> > Johan Piculell wrote:
> > > Hi.
> > > I have a strange lockup problem in our application that I just cannot
> > > understand. We have only experienced this at one customer and not
> > been able
> > > to reproduce so far. Problem seems to occur several times per week
> > however,
> > > and it hangs on a shared memory mutex in several threads:
> > >
> > [snip]
> > >
> > > The reason for having a shared mutex is because we can parallelize
> > using
> > > threads and/or processes, but I cannot see any reference to f9800000
> > in any
> > > other process either and not even a lwp_mutex_lock.
> > > The hardware they are running on is a Sunfire 6800 with Solaris 8. We
> > are
> > > linking with the alternate thread library (/usr/lib/lwp), not sure
> > about the
> > > patch status but can easily be checked.
> > >
> > > Even if we have messed something up in our code (which usually is the
> > case
> > > :-( ), is it really possible to have 3 threads in lwp_mutex_lock() on
> > the
> > > same mutex?
> >
> > Yes, if the original owner thread (or process containing the thread)
> > terminated while it was still holding the lock.
> >
> > Also, the shared memory need not be mapped to the same address
> > in all of your processes.
> >
> > You need to attach a debugger to the wedged process and examine
> > the contents of the mutex at f9800000. See the declaration of
> > lwp_mutex_t in /usr/include/sys/synch.h.
> >
> > Roger Faulkner
> > Sun Microsystems
> >
>
> Thanks for replying Roger, not the first time you help me out...
> Debugging will be hard since the problem is on customer system, I have
> however written a small prog. that extracts the information that I will ask
> them to try. I have tested this here but would be good to have some more
> info on the contents of the mutex struct. Anywhere I can find this?
> When I run my tool on a locked shared mutex I get something like this:
> flag1 : 0
> flag2 :
> ceiling :
> bcptype : 0
> count_type1 :
> count_type2 :
> magic : 4d58
> pad :
> ownerpid : 0
> lockword : ff000000
> owner64 : ff000000
> data : ff030a00
>
> Does this seem feasible? Why can't I see the owner pid for example?
>
> I'm quite sure that the memory area is mapped correctly in all processes
> since we have had this code running for at least 4 years now and I'm 100%
> sure that we would have noticed any problems at all with this if there was
> something wrong.
>
> So you are saying that if the process cores or a thread terminates without
> doing unlock, all other threads trying to lock can end up in this way?
> Shouldn't a mutex_lock() to such a mutex return an error code (like
> EOWNERDEAD), and then we reinitialize it with mutex_init()? This is at least
> what we depend upon in our code today.
>
> regards
> /Johan
> Ericsson AB - Sweden
>
>

Just a small comment, the output above was on Solaris 9. Just tried on
Solaris 8 and then the output made more sense, the ownerpid matched for some
reason.

/Johan
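
For reference, the recovery pattern Johan describes (mutex_lock() failing
with EOWNERDEAD on a robust, process-shared mutex, followed by
re-initialization with mutex_init()) looks roughly like the sketch below,
using the Solaris threads API from <synch.h>. This is only an illustration
of that sequence, not the actual application code; recover_shared_state()
is a hypothetical stand-in for whatever application-specific repair the
protected data needs.

    #include <synch.h>
    #include <errno.h>
    #include <stdio.h>

    /* Hypothetical application-specific routine that repairs the data
     * protected by the mutex after its previous owner died. */
    extern void recover_shared_state(void);

    /* One process initializes the mutex that lives in shared memory.
     * USYNC_PROCESS_ROBUST makes it both process-shared and robust, so
     * mutex_lock() reports a dead owner with EOWNERDEAD instead of
     * blocking forever. */
    void init_shared_mutex(mutex_t *mp)
    {
            mutex_init(mp, USYNC_PROCESS_ROBUST, NULL);
    }

    void do_work_locked(mutex_t *mp)
    {
            int rc = mutex_lock(mp);

            if (rc == EOWNERDEAD) {
                    /* The previous owner (thread or whole process) died
                     * while holding the lock. We now hold it, but the
                     * shared data may be inconsistent: repair it and then
                     * reinitialize the mutex, as the mutex_init man page
                     * recommends, before carrying on. */
                    recover_shared_state();
                    mutex_init(mp, USYNC_PROCESS_ROBUST, NULL);
            } else if (rc != 0) {
                    fprintf(stderr, "mutex_lock: error %d\n", rc);
                    return;
            }

            /* ... critical section on the shared data ... */

            mutex_unlock(mp);
    }

On Solaris 8/9 this would be built against libthread (-lthread); whether
the standard library or the alternate one in /usr/lib/lwp that Johan
mentions is used is decided at run time by the library search path.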