[HN Gopher] An Introduction to Lockless Algorithms
___________________________________________________________________
An Introduction to Lockless Algorithms
Author : signa11
Score : 89 points
Date : 2021-03-04 18:19 UTC (4 hours ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| dragontamer wrote:
| pthread_join is a blocking call, and therefore against the spirit
| of lockless programing. pthread_join will block until the child-
| thread exists, which means there's a lock in there somewhere. I
| personally think the first example they show is a terrible
| example as a result.
|
| ------------
|
| However, the smp_store_release and smp_load_acquire discussion is
| good... very good. But I'm not sure if there's enough discussion
| or theory to discuss why this is required.
|
| The gist is: L1 caches on POWER9 or ARM are allowed to reorder
| loads/stores. thread 1
| -------------------------------- a.x = 1;
| message = &a;
|
| "a.x = 1" should happens-before "message = &a". However, L1
| caches do NOT always follow this logic. "message=&a" could happen
| first from the perspective of DDR4 RAM.
|
| How? Why? Well, maybe message=&a is least-recently-used, while
| the thread continues to read/write to a.x variable. So later on,
| when the cache updates DDR4 RAM, the message=&a happens-before
| a.x=1 (from a DDR4 RAM perspective).
|
| This never happens on x86 (the x86 Cores preserves all store
| orderings). But it is potentially possible on ARM or POWER9.
| (There are other possible reasons for reorderings: Out-of-order
| computations on the CPU, Load/store forwarding on the Core, as
| well as compiler optimizations).
|
| It turns out that violating the total-store-ordering requirement
| allows for L1 caches and CPU-cores to operate slightly faster.
| Thanks to overly aggressive multicore processors and
| architectures (like ARM or POWER9), we have to deal with this
| non-obvious ordering violation.
|
| ---------
|
| The barrier forces the happens-before relationship to be
| preserved. You need a barrier on two sides: the write side
| (release) and the read-side (acquire).
|
| ---------
|
| Anyway... "happens-before" is about as bloody simple as you think
| it is. The issue is that "happens-before" is now violated on a
| regular basis on modern CPUs in multithreaded environments.
|
| Forcing "happens-before" relationships to happen in the expected
| order is now a requirement for multithreaded programmers.
|
| Note: all lock() and unlock() functions innately have the proper
| happens-before relationships in the right place. So a typical
| programmer who uses mutex_lock() will never have to worry about
| this happens-before reordering. All synchronization mechanisms:
| locks, semaphores, condition variables, etc. etc. need to be
| carefully written with these acquire / release barriers to work
| on modern architectures.
|
| However, if you wish to step into the realm of high-performance
| lock-free algorithms, you must absolutely understand the modern
| processor, and how it violates happens-before relationships. (as
| well as understanding the memory barriers that RESTORE the
| happens-before relationship).
| anonymousDan wrote:
| But isn't there also the issue that the compiler could reorder
| (and not just the hardware)? Edit: just noticed you also
| mentioned that
| CodeMage wrote:
| > _pthread_join is a blocking call, and therefore against the
| spirit of lockless programing. pthread_join will block until
| the child-thread exists, which means there 's a lock in there
| somewhere. I personally think the first example they show is a
| terrible example as a result._
|
| But they didn't use it as an example of lockless programming,
| did they? They were explaining what "acquire" and "release"
| mean, and they wanted to explain it in terms the reader should
| already be familiar with.
|
| In fact, they explicitly say it's not lockless:
| In the previous paragraph we have seen how the acquire and
| release semantics of pthread_create() and pthread_join() allow
| the creator of a thread to exchange information with that
| thread and vice versa. We will now see how this kind of
| communication can happen in a lockless manner while threads
| run.
| wtallis wrote:
| Bruce Dawson recently put a bit of a different spin on things
| with his explanation:
| https://randomascii.wordpress.com/2020/11/29/arm-and-lock-fr...
|
| > _If you have circa-2005 code without these memory barriers
| then your code is broken, and has always been broken, on all
| CPUs, because compilers have always been allowed to rearrange
| writes. With these memory barriers (implemented as needed for
| different compilers and platforms) your code is beautiful and
| portable._
|
| > _It turns out that ARM's weak memory model really doesn't
| make things any more complicated. If you are writing lock-free
| code and not using any sort of memory barriers then your code
| is potentially broken everywhere due to compiler reordering. If
| you are using memory barriers then it should be easy to extend
| them to include hardware memory barriers._
| dragontamer wrote:
| The 00s were all about this multithreaded / multiCPU debate.
| For Java programmers, the solution was to establish memory-
| ordering at the language level (which then defined which
| optimizations a Java compiler was allowed, or not allowed to
| do).
|
| C++ programmers chose performance, though it wasn't until
| C++11 when the memory model was fully standardized. Indeed:
| ARM CPU-designers were still arguing for a weaker model than
| acquire/release: ARM CPU-designers wanted to implement
| consume/release semantics instead. (consume/release is
| "weaker": allowing for more reorderings and therefore more
| opportunities for higher-performance CPUs and/or code
| optimizations).
|
| Java's sequentially-consistent models are too strong (not
| enough optimizations available) for most programmers and most
| situations. Consume/release is in C++ standard, but few
| people understand it. Acquire/release is the model that seems
| to be winning out after all of these years.
|
| As a result, ARM and POWER are updating their CPUs to better-
| provide acquire/release semantics. (as the consume/release
| model of ARMv7 is simply not as popular).
| pdkl95 wrote:
| > For this to happen, the store and load must be endowed with
| release and acquire semantics respectively. To that end, we have
| the "store-release" and "load-acquire" operations.
|
| I suppose release and acquire semantics are baked into Rust as
| the language's "ownership" concept. A variable is moved (release)
| out of the calling scope and the function takes ownership
| (acquire) at the function call, which synchronizes the symmetric
| operations. In the pthread_create/pthread_join example, thread 2
| is borrowing the variable "s".
| millstone wrote:
| acquire/release semantics restricts reordering of _other_ loads
| and stores. That is, `a=1; b.store(2, release);` here the
| release operation on b affects the previous store to a. Rust 's
| ownership model cannot express this, as it treats separate
| variables as independent.
|
| Your analogy to borrowing is correct in the pthread case,
| because it has no race. But typically lockless programming has
| races, and I don't think it's useful to think of it in terms of
| ownership.
| rurban wrote:
| Thanks Paolo!
___________________________________________________________________
(page generated 2021-03-04 23:00 UTC)