Subj : Re: C++ desired features
To   : (Usenet)
From : Joe Seigh
Date : Fri Feb 25 2005 09:34 pm

On 24 Feb 2005 20:10:45 -0500, Hans Boehm wrote:
> Joe Seigh wrote:
>> Multithreading usage patterns tend to fall into producer/consumer or
>> reader/writer patterns. Message passing works well with the former
>> but doesn't work well with the latter. Mutexes don't scale well with
>> the latter by definition. rwlocks don't scale so well either.
>> Lock-free stuff seems to do quite well. You can crank the contention
>> level up to 11, which will kill mutex and rwlock based solutions,
>> and the lock-free stuff doesn't even seem to breathe heavily.
>
> In my experience, lock-free solutions are great, especially when
> someone else has already dealt with the tricky correctness issues. In
> particular, they can safely be used from signal/interrupt handlers,
> etc.
>
> But if you look simply at performance, in my experience at least, they
> can win or lose, depending on implementation details, hardware
> characteristics, etc. For example, I reported a small experiment in
> http://portal.acm.org/citation.cfm?doid=1011767.1011774 (or
> http://www.hpl.hp.com/techreports/2004/HPL-2004-105.html). Besides the
> actual content of the paper, the last section compares a simple
> lock-free stack vs. a solution based on simple spin-sleep locks. The
> winners are different for a 2xPII/266 and a 4xPPro machine, and that's
> pretty similar hardware. (There were also OS differences, but I'm not
> sure they matter for this particular comparison. If they do, I don't
> understand the effect.)

Even though I did an eventcount implementation on Linux, which is
designed to be used with lock-free algorithms, I haven't really messed
around with it much. It seems that unless you block only rarely, you
might as well use mutexes with condvars: I could only do 10% better
than a fast condvar implementation when blocking was actually
occurring.

> As many people have pointed out, the real problem with contention is
> typically that a critical cache line (either holding the lock, or the
> lock-free data structure) is transferred back and forth between
> processor caches for every access. That's inherently not cheap.
> (Inappropriate lock implementations may of course make it much worse,
> as you can also see in the above test.)

The stuff I was experimenting with was reader/writer, with the writers
using mutexes and the readers being lock-free. The lock-free algorithms
were an RCU implementation for preemptive user threads, and a proxy GC
based on atomic_ptr, an atomic lock-free reference-counted smart
pointer. The testcase has reader threads traversing a linked list while
writer threads modify the list. The testcase exits after the writer
threads have done a fixed number of modifications, so it's the read and
write rates, not the testcase run time, that are significant. There's a
-t option that prints the running counts and rates, so if you crank up
the contention you can verify that the mutex and rwlock cases aren't
going anywhere. It's at http://atomic-ptr-plus.sourceforge.net/

The project is on hiatus for a while, though I might port atomic_ptr to
PowerPC using a different algorithm (no cmpxchg8b, just lwarx/stwcx).

--
Joe Seigh

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
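
A rough illustration of the eventcount usage pattern mentioned in the
post. This is a minimal sketch only: the class name eventcount and its
get/wait/signal interface are invented for illustration, and this
version is built entirely on a mutex and condvar, whereas a real
eventcount (such as the Linux implementation Seigh refers to) keeps the
get() and signal() fast paths lock-free. That it leans on mutex+condvar
throughout is exactly why it would not beat plain mutex/condvar code
when blocking is common.

    // Minimal eventcount sketch (interface invented for illustration).
    #include <condition_variable>
    #include <mutex>

    class eventcount {
        std::mutex m;
        std::condition_variable cv;
        unsigned count = 0;
    public:
        // Take a "ticket" before re-checking the lock-free structure.
        unsigned get() {
            std::lock_guard<std::mutex> g(m);
            return count;
        }
        // Block until a signal() has advanced the count past the ticket.
        void wait(unsigned ticket) {
            std::unique_lock<std::mutex> g(m);
            cv.wait(g, [&] { return count != ticket; });
        }
        // Producer side: advance the count and wake any waiters.
        void signal() {
            { std::lock_guard<std::mutex> g(m); ++count; }
            cv.notify_all();
        }
    };

    // Typical consumer loop against some lock-free queue q:
    //   for (;;) {
    //       if (item* i = q.try_pop()) return i;  // fast path, no block
    //       unsigned t = ec.get();
    //       if (item* i = q.try_pop()) return i;  // re-check: avoids a
    //       ec.wait(t);                           // lost wakeup
    //   }
    // Producer side: q.push(i); ec.signal();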
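
And a sketch of the reader/writer shape of the testcase: readers
traverse without taking any lock, while writers serialize among
themselves and publish through an atomically updated, reference-counted
head pointer. This is rendered with the C++11 shared_ptr atomic free
functions rather than the actual atomic_ptr code; most standard
libraries back those functions with an internal spinlock, so unlike
atomic_ptr this is not genuinely lock-free, and unlike the real
testcase the writers here only push at the front, so that interior next
pointers stay immutable once published.

    // Sketch only: the access pattern, not the atomic-ptr-plus code.
    #include <memory>
    #include <mutex>

    struct Node {
        int value = 0;
        std::shared_ptr<Node> next;  // refcount keeps unlinked nodes
    };                               // alive until the last reader drops them

    std::shared_ptr<Node> head;      // shared list head
    std::mutex writer_mutex;         // writers serialize only against writers

    // Reader: no lock; grab a counted reference to the head and walk.
    long reader_pass() {
        long sum = 0;
        for (auto p = std::atomic_load(&head); p; p = p->next)
            sum += p->value;
        return sum;
    }

    // Writer: push a new node at the front and publish the new head.
    void writer_push(int v) {
        std::lock_guard<std::mutex> guard(writer_mutex);
        auto n = std::make_shared<Node>();
        n->value = v;
        n->next = std::atomic_load(&head);
        std::atomic_store(&head, n);
    }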