Subj : Re: Lock Free -- where to start
To   : comp.programming.threads
From : David Butenhof
Date : Mon Oct 03 2005 02:42 am

chris noonan wrote:

> David Schwartz wrote (in various posts):
>
>> Contention is bad. Lock free algorithms are about maximizing contention
>
> If locks are efficient, how is it that many programs on a quad
> processor machine give more throughput if only one thread is
> started, as opposed to four?

Largely because many programs are written badly. It IS the contention
that's killing these applications. And a complication (and one of the
big wrenches in lock-free design) is that unless you're coding your own
embedded/monolithic system from scratch, there's an enormous amount of
contention in the system (both on behalf of and independent of your
application) that you can't control or even see.

> If locks are efficient, why does the technique called I/O Completion
> Port exist?

That's a completely different issue. It's not locking vs lock-free.
It's because while threads may be much lighter weight than processes,
they're not free. If you're using processes only to get I/O parallelism,
you may switch to threads because they're lighter weight (smaller
resource footprint, quicker to switch); but if you're pushing the
boundaries you'll soon discover that even threads are heavier than you
really want or need. I/O completions (a form of asynchronous I/O) are
just a way to parallelize your I/O operations at a lower level, without
any scheduling overhead that's not actually needed for the purpose.

> This thread (in the other sense of the word) has been vague
> about the mechanism by which contention between threads
> for a lock can affect performance. Allow me to rectify
> that:
>
> Imagine a uniprocessor machine running 1000 threads (not
> performing I/O). Contrary to popular belief, performance
> is not affected in the slightest by this number of threads.

At a simple level, that's true. However, it's not entirely true, either.
For example, most schedulers don't scale perfectly with long run lists.
On a 16-way multiprocessor, a 10-thread application may never suffer a
context switch... but 1000 threads almost certainly will, even if "most"
are "usually" waiting for I/O. That's not even getting into subtle
effects on hardware, such as cache associativity.

> One context switch takes place every timeslice (say 20
> milliseconds) whether there are two threads or 1000.
> This frequency (50 Hz) of context switching reduces
> throughput slightly but not greatly. How do I know?
> Because the operating system timeslice is determined
> as a tradeoff between throughput and latency.

It's a tradeoff based on arbitrary data that doesn't apply to ANY real
application except occasionally and purely by accident. In many
application workloads the impact can indeed be "great". On average, no
application is average...

> Now introduce a lock, contended for by the threads,
> such that the lock is held for 1% of a thread's
> running time. Every hundredth timeslice, a thread
> is switched out when it is holding the lock. During
> the subsequent 20ms timeslice, all the other threads
> run up against the lock, leading to 999 context
> switches before the original thread gets the processor
> again. Accordingly, every hundred timeslices there are
> 1099 switches instead of 100, an eleven-fold increase.
> That *is* going to sap performance.
>
> That example had threads holding a lock 1% of the
> time. Things can be much worse. The literature suggests
> that typical programs may spend 20% of their running
> time inside the allocator (heap manager) routines.
> It's lucky there are lock-free allocators available ;)

And 20% is probably optimistic as an average. A lot of code is really
badly designed, and ends up being pretty much entirely serialized.

> If it seems possible by clever design to reduce the time
> spent holding a lock to almost nothing, bear in mind
> that on recent Pentiums just acquiring the lock is going
> to take over a hundred machine cycles; there is a
> limit.

Also remember that MOST (though not all) of the time required to acquire
a lock is the exact same time (and machine instructions) necessary for a
lock-free transaction. There ARE real savings, especially on machines
like Alpha, PowerPC, and Itanium, where you can often avoid expensive
"fence" or "barrier" instructions that are critical for locks but not
necessarily for (some) lock-free operations; but that isn't much of a
factor on a Pentium.

Most (practical) lock-free really could be described more as "reducing
the time spent holding the lock to nothing" than as "no locking".
Sometimes you can reliably read without any memory system
synchronization, but you can't ever reliably write that way if the
result needs to be shared.

"Lock-free" really just recognizes that the hardware doesn't usually do
"lock" and "unlock" -- it's performing a synchronized operation on data.
A lock merely performs that operation on a special flag, and "proper
use" implicitly extends the synchronization properties to your data.
Lock-free uses the same operations but performs them on the data you
really wanted to change in the first place. The "lock" model is
conceptually simpler and more flexible; but it also adds overhead
because you're working through the proxy of the "lock" data.
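To make that concrete, here's a minimal sketch of the two models side by
side, assuming a simple shared counter (the counter and the helper names
are purely illustrative). It uses C11 atomics and a POSIX mutex for
brevity; C11 postdates this post, but the interlocked instructions
underneath are the same ones discussed above. Both paths execute the
same kind of synchronized memory operation; the difference is whether
it's applied to a proxy flag or to the counter itself.

/* Illustrative sketch: incrementing a shared counter both ways. */
#include <pthread.h>
#include <stdatomic.h>

/* Lock model: the synchronized operation is performed on the mutex (the
 * proxy flag); "proper use" extends its effect to 'counter'. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

void locked_increment(void)
{
    pthread_mutex_lock(&counter_lock);    /* synchronized op on the proxy */
    counter++;                            /* ordinary load/store, now protected */
    pthread_mutex_unlock(&counter_lock);  /* another synchronized op on the proxy */
}

/* Lock-free model: the synchronized operation (compare-and-swap) is
 * performed directly on the data we wanted to change. */
static atomic_long lf_counter;

void lockfree_increment(void)
{
    long old = atomic_load(&lf_counter);
    /* If another thread changed the value between our read and our
     * attempted write, 'old' is refreshed and we retry. */
    while (!atomic_compare_exchange_weak(&lf_counter, &old, old + 1))
        ;
}

The CAS loop may have to retry under contention, but a thread can never
be preempted while "holding" anything, so the convoy scenario described
above simply can't happen to it.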
The first SMP machine I programmed (a short-lived and long gone variant
of the PDP-11) had only one instruction that synchronized memory across
processors: ASRB (arithmetic shift right, byte). That was it. The only
"lock free" operation you could do was an arithmetic divide by a power
of two (and only for very small ranges of data). That's as close as I've
come to a machine that only implements "a lock" in hardware.

Once you get to "compare and swap", you gain an extremely powerful tool
to operate on real data instead of a lock -- and we've finally reached
the point where this flexibility is pretty nearly universal. And that
enables lock-free programming to become something more than a curiosity.

--
Dave Butenhof, David.Butenhof@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062