[HN Gopher] A flexible, lightweight, spin-lock barrier
       ___________________________________________________________________
        
       A flexible, lightweight, spin-lock barrier
        
       Author : kencausey
       Score  : 61 points
       Date   : 2022-03-14 13:10 UTC (1 day ago)
        
 (HTM) web link (nullprogram.com)
 (TXT) w3m dump (nullprogram.com)
        
       | ativzzz wrote:
        | This seems very similar to the centralized counting barrier
        | described in [1], where each thread, upon hitting the barrier,
        | decrements the count and spins until a global "sense" boolean
        | is reversed. The last thread to hit the barrier resets the
        | counter and flips the "sense", freeing up all the other
        | threads. I guess the difference is that this one treats the
        | bool as odd/even values of an integer, but does that work for
        | n_threads > 2?
       | 
        | [1] http://web.mit.edu/6.173/www/currentsemester/readings/R06-sc...
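        | 
        | For reference, a minimal C11 sketch of the scheme in [1]
        | (illustrative; names and details are mine, not the paper's):
        | 
        |   #include <stdatomic.h>
        |   #include <stdbool.h>
        | 
        |   static int n_threads;       // fixed at startup
        |   static atomic_int count;    // initialized to n_threads
        |   static atomic_bool sense;   // the global "sense"
        | 
        |   // local_sense is per-thread state, initially false
        |   void central_barrier(bool *local_sense)
        |   {
        |       *local_sense = !*local_sense;  // sense to wait for
        |       if (atomic_fetch_sub(&count, 1) == 1) {
        |           // last arrival: reset the counter, then flip the
        |           // sense, releasing every spinning thread
        |           atomic_store(&count, n_threads);
        |           atomic_store(&sense, *local_sense);
        |       } else {
        |           while (atomic_load(&sense) != *local_sense)
        |               ;  // spin
        |       }
        |   }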
        
         | hinkley wrote:
          | These boolean arrangements ("flipping") fail when the number
          | of watchers is high and the interval is small. The toggle can
          | switch twice while you're queued to be run again, and then
          | all bets are off, because it looks to you like nothing has
          | happened. Your messaging rate can be at most half the
          | frequency at which the kernel guarantees to wake every
          | thread. And many OSes offer no ironclad guarantee at all.
        
           | ativzzz wrote:
            | How? Only one thread (the last one, when the counter hits
            | 0) flips the switch. Before the switch is flipped, the
            | counter is reset, so the next flip will only happen when
            | every thread once again hits the next barrier and the
            | counter decrements to 0.
        
             | hinkley wrote:
             | State observed by thread 1: counter = 5, state = blue;
             | 
             | State observed by threads 2-4: counter = [4,5,1,2,3,4,5,6],
             | state = [blue, green, blue]
             | 
             | Thread 1 wakes up, sees counter = 6, state = blue;
             | 
             | Logical conclusion: system state is counter++, state
             | unchanged.
             | 
             | Actual scenario: counter+=6, state+=2
             | 
              | I believe this was first covered in my distributed
              | computing class, but it was probably most concretely
              | illustrated in the case of concurrent garbage collection.
              | Any unfairness in the thread scheduler can lead to some
              | threads getting far more clock cycles than others. In
              | low-frequency scenarios you can convince yourself it's
              | not a problem, only to be punished by success and
              | discover you're wrong when you have 10,000 active users
              | instead of 10.
              | 
              | This is especially true with spinlocks, because they are
              | a greedy algorithm that trades fairness for
              | responsiveness, which can result in low throughput or
              | even priority inversion. They are a very sharp knife:
              | indispensable for very specific tasks, certain bloodshed
              | for others.
        
               | gpderetta wrote:
                | It can probably be classified as an instance of the
                | ABA problem.
        
               | hinkley wrote:
               | Precisely. I forgot the moniker but that's exactly it.
        
               | ativzzz wrote:
                | The counting barrier in the paper I linked is meant
                | specifically for shared-memory multiprocessors, not
                | distributed systems, so what you're saying makes more
                | sense in that context. Both the counter and the state
                | are shared variables across all threads. This clearly
                | doesn't scale well with the number of
                | threads/processors due to cache contention, but the
                | idea is to have a "simple" barrier. There are a few
                | barrier implementations further on in the paper that
                | operate via message passing.
                | 
                | But either way, the counter can never be 6, since the
                | maximum counter value is set to the number of threads,
                | and each thread at the barrier decrements the counter.
                | Once a thread is spinning, it doesn't care about the
                | counter anymore, only the state. Once the last thread
                | reaches the barrier, the state flips and all threads
                | awaken. Each thread maintains a local copy of what the
                | "reversed state" should be in order to awaken. If one
                | thread gets preference and advances to the next
                | barrier, it will decrement the counter that was just
                | reset and will spin waiting for the global state to
                | flip again. So even if it reaches the second barrier
                | before any of the other threads awaken from the first,
                | it will wait for the rest of the threads to catch up.
        
               | hinkley wrote:
               | The differences between multiprocessing and IPC are
               | smaller than you think, and they have a habit of
               | disappearing and coming back in different hardware
               | refresh cycles.
               | 
               | Multicore systems have hardware support for some
               | consensus algorithms. Speed of light is practically a
               | rounding error, but not always. And clock synchronization
               | is... I'll use the word attenuated, since I hesitate to
               | call it 'solved' or 'avoided'.
        
       | atq2119 wrote:
        | It's a nice write-up. I did this once with a slight
        | modification: the phase really only needs one bit, and an
        | arbitrary number of threads can be supported in a number of
        | ways by having the last thread to arrive at the barrier clean
        | up the state based on a second word that holds the number of
        | threads.
        | 
        | For example, compare the thread-counter part of the "v" value
        | to the number of threads. If they are equal, do an atomic
        | store that writes the opposite phase with a thread count of
        | 0. If not, spin until the phase bit changes.
        | 
        | This approach avoids a division and can support threads that
        | leave the barrier entirely. (And theoretically also threads
        | entering the barrier, but that's always dicey regardless of
        | the barrier implementation.)
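        | 
        | A hedged sketch of one possible encoding (phase in bit 0,
        | arrival count in the upper bits; the encoding and names here
        | are mine, not necessarily what is described above):
        | 
        |   #include <stdatomic.h>
        | 
        |   static atomic_uint v;         // bit 0: phase; bits 1+: count
        |   static atomic_uint nthreads;  // second word: thread count
        | 
        |   void barrier_wait(void)
        |   {
        |       unsigned old = atomic_fetch_add(&v, 2);  // arrive
        |       unsigned phase = old & 1;
        |       if ((old >> 1) + 1 == atomic_load(&nthreads)) {
        |           // last arrival: zero the count, flip the phase
        |           atomic_store(&v, phase ^ 1);
        |       } else {
        |           while ((atomic_load(&v) & 1) == phase)
        |               ;  // spin until the phase bit changes
        |       }
        |   }
        | 
        | No division is needed, and because the last arrival resets the
        | word with a plain store rather than arithmetic, nthreads can
        | change between rounds.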
        
         | gpderetta wrote:
          | So, I understand that the difficulty is being able to reset
          | the barrier without a quiescent period.
          | 
          | It seems to me that a counter that can count to n is enough,
          | and there is no need for an explicit phase bit:
          | 
          |   Wait(int *b, int n):
          |     1: x = *b
          |        if (x == n-1)
          |            *b = 0            // last thread: reset
          |        else if (cas(b, x, x+1))
          |     2:     if (*b) goto 2    // spin until the reset
          |        else
          |            goto 1            // cas failed: retry
        
           | atq2119 wrote:
            | Your spin loop could easily miss the window during which
            | *b == 0. The article explains that; a separate phase bit
            | is required.
        
             | gpderetta wrote:
             | Yes. I see it now.
        
       | api wrote:
        | Be careful with spin locks. We used to have one in the
        | ZeroTier core. We pulled it due to a crazy, insane bug where
        | it would hang on single-core machines, and another where CPU
        | usage would run away on massively multicore machines. These
        | bugs had the same cause. Think about it.
        | 
        | Don't use spin locks unless you have profiled and have a clear
        | understanding of the contention factors involved. Also be sure
        | to test on both single-core and massively multicore machines.
       | 
        | Also look at whether you can use atomics. Atomics are better
        | than lock-guarded data in almost every way. The only time
        | locks beat atomics is when you are updating a bunch of things
        | at once, in which case it may be faster to lock once than to
        | update 10 atomic variables.
        | 
        | Message passing can also sometimes win over concurrent access,
        | even if it involves more copying. Profile and think.
        
         | kqr wrote:
         | Or as Torvalds said, don't use spinlocks in user space.
         | https://www.realworldtech.com/forum/?threadid=189711&curpost...
        
         | PaulDavisThe1st wrote:
         | > The only time locks beat atomics is when you are updating a
         | bunch of things in which case it may be faster to lock once
         | than to update 10 atomic variables.
         | 
          | Locks are _mandatory_ when you're updating data that does
          | not fit into a single atomic type. You cannot use N atomics
          | to create an atomic update of a data structure that must be
          | semantically atomic across its N members.
        
           | gpderetta wrote:
            | Technically, any algorithm can be implemented in a
            | lock-free way, and there are n-way CAS constructs that
            | indeed allow updating N memory locations atomically. In
            | practice they are very slow.
        
             | atq2119 wrote:
             | You can use RCU to do this, but I'd argue that everything
             | else ends up being more or less a lock by another name
             | (maybe you could do read/write where the read is
             | optimistic, but at least the write side will have some sort
             | of lock). I'm curious what you had in mind?
        
               | [deleted]
        
               | gpderetta wrote:
                | I misplaced my copy of The Art of Multiprocessor
                | Programming, which has the details, but with a simple
                | CAS primitive you can build a general N-CAS algorithm,
                | and from that implement any algorithm in a truly
                | wait-free way. It will be slow and won't scale, but it
                | will indeed be wait-free.
        
               | PaulDavisThe1st wrote:
               | Most RCU requires a lock, if only to handle cleaning up
               | the "dead wood" that occurs on every update (write).
        
               | MaxGanzII wrote:
               | Lock-free garbage collection exists and there's quite a
               | body of literature.
        
               | PaulDavisThe1st wrote:
               | Yeah, I've read a lot of it. It all comes with caveats
               | and potential performance issues.
        
               | PaulDavisThe1st wrote:
                | You can do this sort of thing by write-ordering a
                | "generation counter" that is updated before and after
                | writing a bunch of other data elements. The problem is
                | that it makes reads slow: you have to use a CAS-style
                | retry loop even to read anything (you must check that
                | the generation counter has the same value before and
                | after reading the other data). It also adds more
                | memory barriers than a simple lock.
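                | 
                | Roughly, a seqlock-style sketch of this (my code, not
                | a reference implementation; C11 _Atomic operations
                | default to seq-cst, which makes the barrier-heaviness
                | visible):
                | 
                |   #include <stdatomic.h>
                | 
                |   static atomic_uint gen;  // odd while a write runs
                |   static atomic_int a, b;  // the multi-member data
                | 
                |   void write_pair(int x, int y)  // single writer
                |   {
                |       gen++;     // now odd: readers will retry
                |       a = x;
                |       b = y;
                |       gen++;     // even again: data is stable
                |   }
                | 
                |   void read_pair(int *x, int *y)
                |   {
                |       unsigned g;
                |       do {       // retry if a write overlapped us
                |           g = gen;
                |           *x = a;
                |           *y = b;
                |       } while ((g & 1) || g != gen);
                |   }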
        
           | api wrote:
           | Sure, forgot to mention that, but that's only if it must be
           | semantically atomic. That's not always the case.
        
           | MaxGanzII wrote:
           | You can produce this behaviour. Look for MCAS, from Cambridge
           | Uni.
           | 
           | It never took off. I think it didn't scale.
        
       | dragontamer wrote:
       | Hmmmm.
       | 
        | A lot of the "difficult parts" of the barrier are the
        | synchronization / memory-barrier parts (a memory barrier is
        | not the same thing as a spin-lock barrier. Ugggh, we need
        | someone to revamp the language of low-level multiprocessor
        | programming).
        | 
        | Anyway, the memory-barrier part is "solved" by using atomics
        | and then handwaving all the loads/stores as sequentially
        | consistent. Yeah, that works, but it leads to subpar
        | performance on all systems: x86 has total store ordering
        | (weaker than sequential consistency), so x86 compilers will
        | probably need to emit a memory barrier.
        | 
        | ARM / POWER8/POWER9/POWER10 systems are "even weaker", so they
        | need an even more powerful barrier to be emitted by the
        | compiler.
       | 
       | -------
       | 
        | There's also the "hyperthread yield" instruction that should
        | be placed somewhere on POWER/x86 systems. (Most ARM systems
        | don't have hyperthreads/SMT, but the ARM E2 does...). Without
        | a hyperthread yield, the pipeline is going to be uselessly
        | full of load/store instructions from this tight inner loop.
        | 
        | EDIT: I had to search for it, but I now know the name of the
        | instruction on x86: "PAUSE" is the hyperthread-yield
        | instruction. https://www.felixcloutier.com/x86/pause
        | 
        | You may be livelock-starving your sibling hyperthread because
        | your current thread is using up all of the pipeline / decode
        | bandwidth. By using the hyperthread yield, your code will use
        | the minimum amount of decode bandwidth.
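        | 
        | For illustration, a spin-wait sketch using the intrinsic
        | (x86-specific; the surrounding names are made up):
        | 
        |   #include <immintrin.h>   // _mm_pause
        |   #include <stdatomic.h>
        | 
        |   void spin_until_phase_change(atomic_uint *v, unsigned phase)
        |   {
        |       while ((atomic_load(v) & 1) == phase)
        |           _mm_pause();  // emits PAUSE: frees decode/pipeline
        |                         // resources for the sibling
        |                         // hyperthread while we wait
        |   }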
       | 
       | -----
       | 
        | The article does talk about memory barriers, but seems to be
        | missing the discussion of hyperthread yield. So it's probably
        | not as efficient as it could get yet.
        | 
        | I can believe that the implementation works as discussed in
        | the blog post. But it's still incomplete: a faster
        | implementation probably exists, maybe not with fully relaxed
        | atomics, but probably with acquire/release atomics (which are
        | free on x86, and significantly faster on ARM/POWER9 than
        | seq-cst atomics).
        | 
        | Note: ARMv8+ and POWER9+ now have load-acquire and
        | store-release instructions, serving as an intermediate
        | performance point between fully relaxed atomics and seq-cst
        | atomics. (These instructions may have existed earlier, but I
        | know for sure they're in the modern versions of these chips.)
        | 
        | Once the acquire/release barriers are figured out in the right
        | locations, the use of hyperthread yield on x86 and POWER
        | systems would be best on highly loaded systems. I'm not sure
        | how to test the efficacy / throughput of hyperthread yield,
        | however; I only know of its theoretical benefits. Coming up
        | with a good test (to make sure we're actually improving
        | performance) is itself a difficult question.
        | 
        | TL;DR: A good first-pass implementation, and good discussion
        | to get things started. But there's still a long way to go
        | before I'd consider it a top-end implementation! For many
        | projects, what is seen here is probably sufficient (especially
        | professional projects where you need to move on to some other
        | programming task). But a hobby project could keep optimizing a
        | bit longer to squeeze out a few more cycles of efficiency.
        | 
        | Coming up with a test to ensure the acquire/release is
        | properly done is also somewhat difficult, and would require
        | ARM / POWER hardware.
        
         | gpderetta wrote:
          | It seems to me that a barrier primitive needs at the very
          | least acq+rel semantics [1], which means that TSO's implicit
          | acquire and release are neither necessary nor sufficient
          | [2]. In practice, on x86 this means an atomic RMW, which is
          | seq-cst. I'm not familiar enough with ARM/POWER to know
          | whether acq+rel is significantly faster there than seq-cst.
          | 
          | [1] See the Laws of Order paper.
          | 
          | [2] Without using asymmetric synchronization, which is not
          | appropriate for a general-purpose barrier.
        
           | dragontamer wrote:
            | I admit that I'm not 100% solid on the various memory
            | models at play here. But my understanding is...
            | 
            | 1. Seq-cst -- strongest/slowest (aka Java volatile)
            | 
            | 2. Total store order (aka x86)
            | 
            | 3. Acq-release (aka newer ARM / newer POWER)
            | 
            | 4. Acq-consume (aka old ARM / old POWER)
            | 
            | 5. Relaxed -- weakest/fastest (DEC Alpha)
            | 
            | That is to say, acquire-release is often compiled into a
            | no-op on x86, because x86's total store ordering is
            | "stronger" than acquire-release.
            | 
            | Acquire-consume was thought to be sufficient, but the
            | C++11 debate argued otherwise and settled on the stronger
            | acquire-release model. ARM/POWER designers then
            | implemented the acquire-release model.
           | 
           | --------
           | 
            | I understand that things aren't perfectly linear, and that
            | at the lower level the discussion is about read/read,
            | read/write, write/read, and write/write barriers. But my
            | brain has never really grokked this level of detail.
            | (Instead, I think in terms of the simplified list above.)
            | So please correct me if I'm wrong in the above
            | assumptions.
        
             | gpderetta wrote:
              | An acquire+release fence is equivalent to two
              | indivisible acquire and release fences, but it can't be
              | implemented by simply issuing the two fences separately,
              | as they would be unordered with respect to each other.
              | x86 has no native acq+release fence, and it can't be
              | simulated with, for example, a load plus a store (i.e.
              | separate acquire and release fences); you need an mfence
              | or an atomic RMW.
              | 
              | ARM and POWER do allow independent reads of independent
              | writes (or maybe only POWER), so in principle an acq+rel
              | fence could be cheaper than a full seq-cst one, although
              | I don't know whether it is in practice.
        
         | secondcoming wrote:
         | Didn't Intel bork PAUSE on some CPUs? IIRC it can take 140
         | cycles to execute.
        
           | dragontamer wrote:
           | I don't think that's really a "bork" though.
           | 
            | 1. PAUSE should only be executed inside the spin loop.
            | 
            | 2. The fast path won't touch the PAUSE instruction. (Ex:
            | the lock is unlocked, so you grab it, bypass the while
            | loop, and carry on.)
            | 
            | 3. The slow path is... well... the slow path. You'll be
            | spinning in there for thousands of cycles, waiting for
            | "the other thread". Saving power by effectively "sleeping"
            | a bit longer isn't a big issue.
           | 
           | -----
           | 
            | I could very well assume that Intel did the measurements,
            | and that a PAUSE with 100+ cycles of latency gave higher
            | performance / throughput than their original PAUSE
            | implementation. It is a "slow-path" instruction, after
            | all.
            | 
            | If the spin lock has any contention whatsoever, you're
            | looking at 100+ cycles just to communicate with the L3
            | cache (~40 nanoseconds == 160 cycles to talk to L3 cache
            | at all).
            | 
            | It makes no sense to retry any faster than this, because
            | the L3 cache hasn't changed yet. You might as well sleep
            | until the next point in time where the L3 cache has had a
            | chance to update (i.e. the value of the spinlock has
            | changed).
            | 
            | The L2 and L1 caches are "private" to each core, so it's
            | impossible for another core to change the values in L1 or
            | L2. So the "closest" the spinlock's value can be is the L3
            | cache. The only exception I can think of is if both
            | threads are SMT/hyperthreads on the same core. That case
            | has probably gotten slower, but all other cases have
            | probably gotten faster (or more precisely: higher
            | throughput).
        
           | usefulcat wrote:
            | In Skylake, they increased the duration of PAUSE from
            | "about 10 cycles" to "as many as 140 cycles".
            | 
            | https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpu...
        
       | MaxGanzII wrote:
       | I may be wrong, but I don't see this working in theory.
       | 
       | The basic problem is that for any physical core, there is no
       | guarantee that _any_ writes will _ever_ be seen by any other core
       | (unless the necessary extra magic is done to make this so).
       | 
        | (There's a matching problem when reading, too, in that in
        | theory writes made by another core may never be seen by the
        | reading core, even if they have reached the domain covered by
        | cache coherency, unless the necessary extra magic is done,
        | etc.)
        | 
        | So with this code, in theory, the writes being made are simply
        | _never_ seen by another core, and so it doesn't work.
        | 
        | In practice, writes typically make their way to other physical
        | cores in a reasonably short time, whether or not the extra
        | magic is performed; but this is not guaranteed. Also, of
        | course, the order in which those writes emerge is not
        | guaranteed (although on some platforms additional guarantees
        | are provided - Intel is particularly generous in this regard,
        | and performance suffers because of it).
        
         | gpderetta wrote:
         | There is an implied seq-cst barrier in the atomic INC.
        
           | MaxGanzII wrote:
           | A barrier, of any kind, makes no difference here.
           | 
           | Barriers only influence the _order_ in which events become
           | visible _if and when they DO become visible_.
           | 
           | Barriers say nothing about whether or not events _actually
           | do_ become visible; and it may be that they never do (if the
           | necessary magic is not performed, etc).
           | 
            | You mention an _atomic_ increment. I don't see atomic
            | increments in the C11 code, but I'm not up to speed with
            | that version of the specification.
           | 
           | (Note that an atomic increment will solve the write-side
           | problem, but not the read-side problem. The reader must still
           | use a read barrier; but if there's an implicit full barrier,
           | then that's being done.)
        
             | gpderetta wrote:
              | The fact that another thread sees the incremented value
              | implies that that thread will see all stores that
              | happened-before that incremented value.
              | 
              | A fence or atomic operation guarantees that a store will
              | be visible globally in a finite amount of time.
              | 
              | Edit: that's true for C11 _Atomic as well (I had to
              | double-check), i.e. x++ where x is an _Atomic int is
              | both atomic and seq-cst.
        
               | MaxGanzII wrote:
                | > The fact that another thread sees the incremented
                | value implies that that thread will see all stores
                | that happened-before that incremented value.
                | 
                | Firstly, if the writer has used no barriers, writes
                | can emerge in any order (and may never emerge at all,
                | anyway, even if barriers were used). Secondly, if the
                | reader has not used barriers, then even if writes have
                | been made in some given order, the reader will by no
                | means observe that order.
                | 
                | > A fence or atomic operation guarantees that a store
                | will be visible globally in a finite amount of time
                | 
                | A fence does _NOT_ guarantee this. A fence controls,
                | and only controls, the order of events. It has _NO_
                | influence on whether events actually emerge. An atomic
                | operation does, but that only solves the write side of
                | the problem; the reader still needs to issue a read
                | barrier to guarantee seeing atomically written values.
        
               | gpderetta wrote:
                | Sure, but the INC, in addition to the fence, also
                | issues the required reads and writes.
        
         | dragontamer wrote:
          | > The basic problem is that for any physical core, there is
          | no guarantee that any writes will ever be seen by any other
          | core (unless the necessary extra magic is done to make this
          | so).
          | 
          | They're using atomic reads/writes with sequential
          | consistency.
          | 
          | This means that the compiler will automatically put memory
          | barriers in the appropriate locations to guarantee
          | sequential consistency, but probably at a significant
          | performance cost.
          | 
          | The implementation is likely correct, but it's the slowest
          | option available (seq-cst is the simplest / most braindead
          | atomic ordering, but the slowest, because you probably don't
          | need all those memory barriers).
          | 
          | I'd expect that acquire/release style barriers are possible
          | for this use case, which would be faster on x86, POWER, and
          | ARM systems.
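          | 
          | For example, a sketch of a counter-based barrier with
          | explicit orderings (my code, not the article's; whether
          | acq_rel suffices here is debated upthread, and this assumes
          | a fixed n and an arrival counter that never wraps
          | mid-round):
          | 
          |   #include <stdatomic.h>
          | 
          |   void barrier_wait(atomic_uint *b, unsigned n)
          |   {
          |       // release half: publish our pre-barrier writes
          |       unsigned v = atomic_fetch_add_explicit(
          |           b, 1, memory_order_acq_rel);
          |       unsigned phase = v / n;  // the round we arrived in
          |       // acquire: observe everyone's pre-barrier writes
          |       while (atomic_load_explicit(b, memory_order_acquire)
          |                  / n == phase)
          |           ;  // spin until all n arrivals of this round
          |   }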
        
           | MaxGanzII wrote:
            | > They're using atomic reads/writes with sequential
            | consistency.
            | 
            | In the C11 code? I'm not up to speed with that version of
            | the spec, but unless there is behaviour invoked by the use
            | of _Atomic, the code looks to me to be performing normal,
            | non-atomic reads/writes.
            | 
            | Another reply says that full barriers are automatically
            | used, but that doesn't address the problem I've described.
        
             | dragontamer wrote:
              | > but unless there is behaviour invoked by the use of
              | _Atomic
              | 
              | That is exactly the behavior invoked by _Atomic.
              | 
              | The general gist is that "manual" read/write barriers
              | were too difficult to think about and led to bugs.
              | Instead of having the programmer put the read/write
              | barriers in the correct locations, modern compilers put
              | a type specification on variables... and then it's the
              | compiler's job to put the barriers in the correct
              | locations.
              | 
              | This turned out to be necessary because the compiler's
              | optimizer kept rearranging code (!!!) around memory
              | barriers anyway. So the compiler needs to participate in
              | any of the barriers, especially at higher levels of
              | optimization (-O3 rearranges a lot of code). If the
              | compiler needs to participate, you might as well have
              | the compiler handle the placement of those barriers.
              | 
              | _Atomic variables by default are sequentially consistent
              | (i.e. the _maximum_ use of memory barriers for the
              | _maximum_ level of consistency).
        
               | MaxGanzII wrote:
               | You're talking here about memory barriers, but I was
               | asking if atomic writes are in use.
               | 
               | Atomic writes have nothing to do with memory barriers. On
               | modern platforms atomic writes are available without
               | memory barriers of any kind.
               | 
               | Memory barriers do not cause events to become visible.
               | 
               | Only atomic writes do this; and so the question is not
               | whether barriers are in use due to _Atomic, but whether
               | atomic writes are in use due to the use of _Atomic.
        
               | dragontamer wrote:
               | https://en.cppreference.com/w/c/language/atomic
               | 
                | _Atomic types can be read-modified-written atomically
                | (when using a compound-assignment operator or the
                | pre/post-increment operators).
                | 
                | In those circumstances, the memory barriers used for
                | the atomic operations follow the sequentially
                | consistent memory model.
                | 
                | ------
                | 
                | So yes, that "++*barrier" operation is an atomic
                | read-modify-write AND comes with memory barriers in
                | the sequential-consistency style.
                | 
                | -----
                | 
                | I don't believe that atomicity alone is enough to
                | solve this problem. At a minimum, I expect acq-consume
                | to be required (not that the acq-consume model even
                | works in practice... so acq-release for today's
                | compilers). That's why I'm still talking about the
                | memory model / fences here.
                | 
                | Not only is an atomic operation required, but the
                | "publishing" of this information also needs to happen
                | in the proper order. Fortunately, C11's _Atomic
                | implementation solves both issues.
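                | 
                | Concretely, these two forms should be equivalent under
                | C11 (a sketch; "arrive" is just an illustrative name):
                | 
                |   #include <stdatomic.h>
                | 
                |   void arrive(_Atomic unsigned *barrier)
                |   {
                |       ++*barrier;  // atomic RMW, seq-cst by default
                |       // equivalent to:
                |       // atomic_fetch_add_explicit(barrier, 1,
                |       //     memory_order_seq_cst);
                |   }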
        
               | MaxGanzII wrote:
                | > So yes, that "++*barrier" operation is an atomic
                | read-modify-write AND comes with memory barriers in
                | the sequential-consistency style
                | 
                | The quoted material does not support this assertion.
        
               | dragontamer wrote:
               | > Built-in increment and decrement operators and compound
               | assignment are read-modify-write atomic operations with
               | total sequentially consistent ordering (as if using
               | memory_order_seq_cst). If less strict synchronization
               | semantics are desired, the standard library functions may
               | be used instead.
        
               | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-03-15 23:01 UTC)