Subj : Re: Lock Free -- where to start
To   : comp.programming.threads
From : David Butenhof
Date : Mon Oct 03 2005 02:42 am

chris noonan wrote:

> David Schwartz wrote (in various posts):
>
>> Contention is bad. Lock free algorithms are about maximizing contention
>
> If locks are efficient, how is it that many programs on a quad
> processor machine give more throughput if only one thread is
> started, as opposed to four?

Largely because many programs are written badly. It IS the contention
that's killing these applications. And a complication (and one of the
big wrenches in lock-free design) is that unless you're coding your own
embedded/monolithic system from scratch, there's an enormous amount of
contention in the system (both on behalf of and independent of your
application) that you can't control or even see.

> If locks are efficient, why does the technique called I/O Completion
> Port exist?

That's a completely different issue. It's not locking vs lock-free.
It's because while threads may be much lighter weight than processes,
they're not free. If you're using processes only to get I/O parallelism,
you may switch to threads because they're lighter weight (smaller
resource footprint, quicker to switch); but if you're pushing the
boundaries you'll soon discover that even threads are heavier than you
really want or need. I/O completions (a form of asynchronous I/O) are
just a way to parallelize your I/O operations at a lower level, without
any scheduling overhead that's not actually needed for the purpose.

> This thread (in the other sense of the word) has been vague
> about the mechanism by which contention between threads
> for a lock can affect performance. Allow me to rectify
> that:
>
> Imagine a uniprocessor machine running 1000 threads (not
> performing I/O). Contrary to popular belief, performance
> is not affected in the slightest by this number of threads.

At a simple level, that's true. However, it's not entirely true, either.
For example, most schedulers don't scale perfectly with long run lists.
On a 16-way multiprocessor, a 10-thread application may never suffer a
context switch... but 1000 threads almost certainly will, even if "most"
are "usually" waiting for I/O. That's not even getting into subtle
effects on hardware, such as cache associativity.

> One context switch takes place every timeslice (say 20
> milliseconds) whether there are two threads or 1000.
> This frequency (50 Hz) of context switching reduces
> throughput slightly but not greatly. How do I know?
> Because the operating system timeslice is determined
> as a tradeoff between throughput and latency.

It's a tradeoff based on arbitrary data that doesn't apply to ANY real
application except occasionally and purely by accident. In many
application workloads the impact can indeed be "great". On average, no
application is average...

> Now introduce a lock, contended for by the threads,
> such that the lock is held for 1% of a thread's
> running time. Every hundredth timeslice, a thread
> is switched out when it is holding the lock. During
> the subsequent 20ms timeslice, all the other threads
> run up against the lock, leading to 999 context
> switches before the original thread gets the processor
> again. Accordingly, every hundred timeslices there are
> 1099 switches instead of 100, an eleven-fold increase.
> That *is* going to sap performance.
>
> That example had threads holding a lock 1% of the
> time. Things can be much worse. The literature suggests
> that typical programs may spend 20% of their running
> time inside the allocator (heap manager) routines.
> It's lucky there are lock-free allocators available ;)

And 20% is probably optimistic as an average. A lot of code is really
badly designed, and ends up being pretty much entirely serialized.

> If it seems possible by clever design to reduce the time
> spent holding a lock to almost nothing, bear in mind
> that on recent Pentiums just acquiring the lock is going
> to take over a hundred machine cycles; there is a
> limit.

Also remember that MOST (though not all) of the time required to acquire
a lock is the exact same time (and machine instructions) necessary for a
lock-free transaction. There ARE real savings, especially on machines
like Alpha, PowerPC, and Itanium, where you can often avoid expensive
"fence" or "barrier" instructions that are critical for locks but not
necessarily for (some) lock-free operations; but that isn't much of a
factor on a Pentium.

Most (practical) lock-free really could be described more as "reducing
the time spent holding the lock to nothing" than as "no locking".
Sometimes you can reliably read without any memory system
synchronization, but you can't ever reliably write that way if the
result needs to be shared.

"Lock-free" really just recognizes that the hardware doesn't usually do
"lock" and "unlock" -- it's performing a synchronized operation on data.
A lock merely performs that operation on a special flag, and "proper
use" implicitly extends the synchronization properties to your data.
Lock-free uses the same operations but performs them on the data you
really wanted to change in the first place. The "lock" model is
conceptually simpler and more flexible; but it also adds overhead
because you're working through the proxy of the "lock" data.
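To make that concrete, here's a minimal sketch of the two models side by
side, assuming a simple shared counter (the counter and the helper names
are purely illustrative). It uses C11 atomics and a POSIX mutex for
brevity; C11 postdates this post, but the interlocked instructions
underneath are the same ones discussed above. Both paths execute the
same kind of synchronized memory operation; the difference is whether
it's applied to a proxy flag or to the counter itself.

/* Illustrative sketch: incrementing a shared counter both ways. */
#include <pthread.h>
#include <stdatomic.h>

/* Lock model: the synchronized operation is performed on the mutex (the
 * proxy flag); "proper use" extends its effect to 'counter'. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

void locked_increment(void)
{
    pthread_mutex_lock(&counter_lock);    /* synchronized op on the proxy */
    counter++;                            /* ordinary load/store, now protected */
    pthread_mutex_unlock(&counter_lock);  /* another synchronized op on the proxy */
}

/* Lock-free model: the synchronized operation (compare-and-swap) is
 * performed directly on the data we wanted to change. */
static atomic_long lf_counter;

void lockfree_increment(void)
{
    long old = atomic_load(&lf_counter);
    /* If another thread changed the value between our read and our
     * attempted write, 'old' is refreshed and we retry. */
    while (!atomic_compare_exchange_weak(&lf_counter, &old, old + 1))
        ;
}

The CAS loop may have to retry under contention, but a thread can never
be preempted while "holding" anything, so the convoy scenario described
above simply can't happen to it.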
The first SMP machine I programmed (a short-lived and long gone variant
of the PDP-11) had only one instruction that synchronized memory across
processors: ASRB (arithmetic shift right, byte). That was it. The only
"lock free" operation you could do was an arithmetic divide by a power
of two (and only for very small ranges of data). That's as close as I've
come to a machine that only implements "a lock" in hardware.

Once you get to "compare and swap", you gain an extremely powerful tool
to operate on real data instead of a lock -- and we've finally reached
the point where this flexibility is pretty nearly universal. And that
enables lock-free programming to become something more than a curiosity.

--
Dave Butenhof, David.Butenhof@hp.com
HP Utility Pricing software, POSIX thread consultant
Manageability Solutions Lab (MSL), Hewlett-Packard Company
110 Spit Brook Road, ZK2/3-Q18, Nashua, NH 03062