[HN Gopher] A flexible, lightweight, spin-lock barrier
___________________________________________________________________
A flexible, lightweight, spin-lock barrier
Author : kencausey
Score : 61 points
Date : 2022-03-14 13:10 UTC (1 days ago)
(HTM) web link (nullprogram.com)
(TXT) w3m dump (nullprogram.com)
| ativzzz wrote:
| This seems very similar to the centralized counting barrier
| described in [1], where each thread, upon hitting the barrier,
| decrements the count and spins until a global "sense" boolean is
| reversed. The last thread to hit the barrier resets the counter
| and flips the "sense", freeing up all other threads. I guess the
| difference is that this one treats the bool as odd/even values of
| an integer, but does that work for n_threads > 2?
|
| [1] http://web.mit.edu/6.173/www/currentsemester/readings/R06-sc...
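|
| (A minimal sketch of that sense-reversing scheme in C11 atomics;
| an illustration of the description above rather than code from
| the paper, with N_THREADS as a placeholder.)
|
|     #include <stdatomic.h>
|
|     #define N_THREADS 4                    /* assumed fixed */
|
|     static _Atomic int count = N_THREADS;  /* arrivals left  */
|     static _Atomic int sense = 0;          /* global "sense" */
|
|     void barrier_wait(void)
|     {
|         static _Thread_local int local_sense = 0;
|         local_sense = !local_sense;     /* phase to wait for */
|         if (atomic_fetch_sub(&count, 1) == 1) {
|             /* last arrival: reset counter, then flip sense */
|             atomic_store(&count, N_THREADS);
|             atomic_store(&sense, local_sense);
|         } else {
|             while (atomic_load(&sense) != local_sense)
|                 ;                  /* spin on the sense flag */
|         }
|     }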
| hinkley wrote:
| These boolean arrangements ("flipping") fail when the number of
| watchers is high and the interval is small. The toggle can
| switch twice while you're queued to be run again, and then all
| bets are off, because it looks to you like nothing has
| happened. Your messaging rate can only be half of the frequency
| guaranteed by the kernel for waking every thread. And many OSes
| offer no ironclad guarantee.
| ativzzz wrote:
| How? Only one thread, the last one when the counter is 0,
| flips the switch. Before the switch is flipped, the counter
| is reset, so the next flip will only happen when every thread
| once again hits the next barrier and the counter decrements
| to 0.
| hinkley wrote:
| State observed by thread 1: counter = 5, state = blue;
|
| State observed by threads 2-4: counter = [4,5,1,2,3,4,5,6],
| state = [blue, green, blue]
|
| Thread 1 wakes up, sees counter = 6, state = blue;
|
| Logical conclusion: system state is counter++, state
| unchanged.
|
| Actual scenario: counter+=6, state+=2
|
| I believe this was first covered in my distributed
| computing class, but was probably most concretely
| illustrated in the case of concurrent garbage collection.
| Any unfairness in the thread scheduler can lead to some
| threads getting way more clock cycles than others. In low
| frequency scenarios you can convince yourself it's not a
| problem, only to be punished by success into finding out
| you're wrong when you have 10,000 active users instead of
| 10.
|
| This is especially true with spinlocks, because they're
| greedy algorithms that trade fairness for responsiveness.
| They can result in low throughput or even priority inversion.
| They are a very sharp knife. Indispensable for very specific
| tasks, certain bloodshed for others.
| gpderetta wrote:
| It can probably be classified as an instance of the ABA
| problem.
| hinkley wrote:
| Precisely. I forgot the moniker but that's exactly it.
| ativzzz wrote:
| The counting barrier in the paper I linked is meant
| specifically for shared memory multiprocessors, not
| distributed systems, so what you're saying makes more
| sense. Both the counter and the state are shared
| variables across all threads. This clearly doesn't scale
| well with the number of threads/processors due to cache
| contention, but the idea is to have a "simple" barrier.
| There are a few barrier implementations that operate via
| message passing further in the paper.
|
| But either way, the counter can never be 6 since the max
| counter is set to the number of threads, and each thread
| at the barrier decrements the counter. Once a thread is
| spinning, it doesn't care about the counter anymore, only
| the state. Once the last thread reaches the barrier, the
| state flips, and all threads awaken. Each thread
| maintains a local copy of what the "reversed state"
| should be to awaken. If one thread gets preference and
| advances to the next barrier, it will decrement the
| counter it reset and will spin for the global state to
| flip again. So even if it reaches the second barrier
| before any of the other threads awaken from the first, it
| will wait for the rest of the threads to catch up.
| hinkley wrote:
| The differences between multiprocessing and IPC are
| smaller than you think, and they have a habit of
| disappearing and coming back in different hardware
| refresh cycles.
|
| Multicore systems have hardware support for some
| consensus algorithms. Speed of light is practically a
| rounding error, but not always. And clock synchronization
| is... I'll use the word attenuated, since I hesitate to
| call it 'solved' or 'avoided'.
| atq2119 wrote:
| It's a nice write-up. I did this once with a slight modification.
| The phase really only needs one bit, and an arbitrary number of
| threads can be supported in a number of ways by having the last
| thread to arrive at the barrier clean up the state based on a
| second word that holds the number of threads.
|
| For example, compare the thread counter part of the "v" value to
| the number of threads. If equal, do an atomic store which writes
| the opposite phase with a thread count of 0. If not equal, spin
| until the phase bit changes.
|
| This approach avoids a division and can support threads that
| leave the barrier entirely. (And theoretically also threads
| entering the barrier, but that's always dicey regardless of the
| barrier implementation.)
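|
| (A rough sketch of one reading of this scheme; the names "v"
| and "nthreads" and the bit layout are assumptions, not the
| parent's actual code.)
|
|     #include <stdatomic.h>
|
|     /* Assumed layout: bit 0 = phase, higher bits = arrivals. */
|     static _Atomic unsigned v;        /* phase bit + counter  */
|     static _Atomic unsigned nthreads; /* set at startup to the
|                                          number of threads    */
|
|     void barrier_wait(void)
|     {
|         unsigned old = atomic_fetch_add(&v, 2); /* count += 1 */
|         unsigned phase = old & 1u;
|         if ((old >> 1) + 1 == atomic_load(&nthreads)) {
|             /* last arrival: one store writes the opposite
|                phase together with a count of 0 */
|             atomic_store(&v, phase ^ 1u);
|         } else {
|             while ((atomic_load(&v) & 1u) == phase)
|                 ;          /* spin until the phase bit flips */
|         }
|     }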
| gpderetta wrote:
| So, I understand that the difficulty is being able to reset the
| barrier without a quiescent period.
|
| It seems to me that a counter being able to count to n is
| enough and there is no need for an explicit phase bit:
|
|          Wait(int *b, int n):
|           1: x = *b
|              if (x == n-1)
|                  *b = 0               // last, reset
|              else if cas(b, x, x+1)
|           2:     if *b goto 2
|              else
|                  goto 1
| atq2119 wrote:
| Your spin loop could easily miss the time during which *b ==
| 0. The article explains that. A separate phase bit is
| required.
| gpderetta wrote:
| Yes. I see it now.
| api wrote:
| Be careful with spin locks. We used to have one in the ZeroTier
| core. We pulled it due to a crazy insane bug where it would hang
| on single core machines and another one where it would run away
| on CPU usage on massively multicore machines. These bugs had the
| same cause. Think about it.
|
| Don't use spin locks unless you have profiled and have a clear
| understanding of the contention factors involved. Also be sure to
| test on single core and massively multicore.
|
| Also look at whether you can use atomics. Atomics are better than
| lock guarded things in almost every way. The only time locks beat
| atomics is when you are updating a bunch of things in which case
| it may be faster to lock once than to update 10 atomic variables.
|
| Message passing can also win sometimes over concurrent access
| even if it involves more copying. Profile and think.
| kqr wrote:
| Or as Torvalds said, don't use spinlocks in user space.
| https://www.realworldtech.com/forum/?threadid=189711&curpost...
| PaulDavisThe1st wrote:
| > The only time locks beat atomics is when you are updating a
| bunch of things in which case it may be faster to lock once
| than to update 10 atomic variables.
|
| Locks are _mandatory_ when you're updating data that does not
| fit into a single atomic type. You cannot use N atomics to
| create atomic update of a data structure that must be
| semantically atomic across its N members.
| gpderetta wrote:
| Technically any algorithm can be implemented in a lock free
| way and there are n-way CAS constructs that indeed allow
| updating N memory locations atomically. In practice they are
| very slow.
| atq2119 wrote:
| You can use RCU to do this, but I'd argue that everything
| else ends up being more or less a lock by another name
| (maybe you could do read/write where the read is
| optimistic, but at least the write side will have some sort
| of lock). I'm curious what you had in mind?
| [deleted]
| gpderetta wrote:
| I misplaced my copy of The Art Of Multiprocessor
| Programming with the details, but with a simple CAS
| primitive you can build a general N-CAS algorithm and
| from that implement any algorithm in a truly wait-free
| way. It will be slow and won't scale, but it will indeed
| be wait-free.
| PaulDavisThe1st wrote:
| Most RCU requires a lock, if only to handle cleaning up
| the "dead wood" that occurs on every update (write).
| MaxGanzII wrote:
| Lock-free garbage collection exists and there's quite a
| body of literature.
| PaulDavisThe1st wrote:
| Yeah, I've read a lot of it. It all comes with caveats
| and potential performance issues.
| PaulDavisThe1st wrote:
| You can do this sort of thing with write-ordering a
| "generation counter" that is updated before and after
| writing a bunch of other data elements. The problem is
| that it makes the read slow - you have to use a CAS-style
| loop even to read anything (you must check that the
| generation counter has the same value before and after
| reading the other data). It also adds more memory
| barriers than a simple lock.
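|
| (A sketch of the idea for two fields and a single writer; the
| names are made up, and everything is left at the default
| seq-cst ordering precisely because choosing weaker orderings
| here is the subtle part.)
|
|     #include <stdatomic.h>
|
|     static _Atomic unsigned gen;  /* odd while a write is in
|                                      progress */
|     static _Atomic int data_a, data_b;
|
|     void write_pair(int a, int b) /* single writer assumed;
|                                      multiple writers would
|                                      still need a lock among
|                                      themselves */
|     {
|         atomic_fetch_add(&gen, 1);      /* becomes odd  */
|         atomic_store(&data_a, a);
|         atomic_store(&data_b, b);
|         atomic_fetch_add(&gen, 1);      /* even again   */
|     }
|
|     void read_pair(int *a, int *b)
|     {
|         unsigned before, after;
|         do {       /* CAS-style retry loop on the read side */
|             before = atomic_load(&gen);
|             *a = atomic_load(&data_a);
|             *b = atomic_load(&data_b);
|             after = atomic_load(&gen);
|         } while ((before & 1u) || before != after);
|     }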
| api wrote:
| Sure, forgot to mention that, but that's only if it must be
| semantically atomic. That's not always the case.
| MaxGanzII wrote:
| You can produce this behaviour. Look for MCAS, from Cambridge
| Uni.
|
| It never took off. I think it didn't scale.
| dragontamer wrote:
| Hmmmm.
|
| A lot of the "difficult part" of a barrier is the
| synchronization / memory-barrier part (not the same thing as a
| spin-lock barrier. Ugggh, we need someone to revamp the language
| of low-level multiprocessor programming).
|
| Anyway, the memory-barrier part is "solved" by using atomics and
| then handwaving all the loads/stores as sequentially-consistent.
| Yeah, that works, but would lead to subpar performance on all
| systems, as x86 has total-store ordering (less powerful than
| sequentially consistent), so x86-compilers would probably need to
| emit a memory barrier.
|
| ARM / POWER8/POWER9/POWER10 systems are "even weaker", so they
| need an even more powerful barrier to be emitted by the compiler.
|
| -------
|
| There's also the "Hyperthread yield" instruction that should be
| placed somewhere in POWER/x86 systems. (Most ARM systems largely
| don't have hyperthreads/SMT, but the ARM E2 does...). Without
| "hyperthread yield", the pipeline is going to be uselessly full
| of load/store instructions from this tight inner-loop.
|
| EDIT: I had to search for it, but I now know the name of the
| instruction on x86. "PAUSE" is the hyperthread-yield instruction.
| https://www.felixcloutier.com/x86/pause
|
| You may be "livelock starving" your brother-hyperthread because
| your current thread is using up all of the pipeline / decoding
| bandwidth. By using "hyperthread yield", your code will use the
| minimum amount of decoding bandwidth.
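|
| (For reference, the hint typically goes inside the spin loop
| like this; illustrative only, and "flag" is a stand-in for
| whatever value the barrier spins on.)
|
|     #include <immintrin.h>  /* _mm_pause(): x86 PAUSE */
|     #include <stdatomic.h>
|
|     static void spin_until_changed(_Atomic int *flag, int old)
|     {
|         while (atomic_load(flag) == old)
|             _mm_pause();   /* back off: frees decode/pipeline
|                               slots for the sibling
|                               hyperthread */
|     }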
|
| -----
|
| The article does talk about memory-barriers, but seems to be
| missing the discussion on hyperthread yield. So it's probably not
| "as efficient" as it could get yet.
|
| I can believe that the implementation works as discussed in the
| blogpost. But it's still incomplete: a faster implementation
| probably exists, maybe not with "Fully Relaxed" atomics, but
| probably with acquire/release atomics (which are free on x86, and
| significantly faster on ARM/POWER9 than seq-cst atomics).
|
| Note: ARM8+ and POWER9+ now have load-acquire and store-release
| instructions, serving as an intermediate performance between
| fully relaxed atomics and seq-cst atomics. (These instructions
| may have existed earlier, but I know for sure they're in the
| modern version of these chips).
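|
| (For concreteness, this is how the weaker ordering would be
| requested in C11 -- a sketch only; whether acq_rel is actually
| sufficient for a barrier is debated in the replies below. The
| default ++ on an _Atomic is seq-cst.)
|
|     #include <stdatomic.h>
|
|     /* illustrative arrival increment with explicit ordering */
|     unsigned arrive(_Atomic unsigned *v)
|     {
|         return atomic_fetch_add_explicit(v, 1,
|                                          memory_order_acq_rel);
|     }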
|
| Once acquire/release barriers are figured out in the right
| location, the use of hyperthread-yield on x86 and POWER systems
| would be best on highly-loaded systems. I'm not sure how to test
| the efficacy / throughput of hyperthread-yield, however; I only
| know of its theoretical benefits. Coming up with a good test (to
| make sure we're actually improving performance) is itself a
| difficult question.
|
| TL;DR: Good first pass implementation, and good discussion to get
| things started. But there's still a long way before I consider it
| a top-end implementation! For many projects, what is seen here is
| probably sufficient (especially professional projects where you
| need to move onto some other programming task). But a hobby
| project could keep optimizing a bit longer to squeeze out a few
| more cycles of efficiency here.
|
| Coming up with a test to ensure the acquire-release is properly
| done is also somewhat difficult, and would require ARM / POWER
| hardware.
| gpderetta wrote:
| It seems to me that a barrier primitive needs at the very least
| acq+rel semantics [1], which means that TSO implicit acquire
| and release are neither necessary nor sufficient[2]. In
| practice on x86 this means an atomic RMW which is seq-cst. I'm
| not familiar enough with ARM/Power to know if acq+rel is
| significantly faster than seq-cst.
|
| [1] See the Laws of Orders paper.
|
| [2] without using asymmetric synchronization which is not
| appropriate for a general purpose barrier.
| dragontamer wrote:
| I admit that I'm not 100% perfect with the various memory
| models at play here. But my understanding is...
|
| 1. Seq-cst -- Strongest/slowest. (Aka: Java volatile)
|
| 2. Total Store order (aka: x86)
|
| 3. Acq-Release (aka: newer-ARM / newer-POWER)
|
| 4. Acq-Consume (aka: old-ARM / old-Power)
|
| 5. Relaxed -- Weakest/fastest, DEC Alpha.
|
| That is to say, acquire-release is often compiled into no-op
| on x86, because x86 Total-store ordering is "stronger" than
| acquire-release.
|
| Acquire-Consume was thought to be sufficient, but C++11
| debate argued otherwise, and designed a stronger Acquire-
| release model. ARM/POWER designers then implemented acquire-
| release model.
|
| --------
|
| I understand that things aren't perfectly linear, and that at
| the lower level the discussion is about read/read,
| read/write, write/read, and write/write barriers. But my
| brain has never really grok'd this level of detail. (Instead,
| I think in terms of the simplified list above). So please
| correct me if I'm wrong with the above assumptions.
| gpderetta wrote:
| An acquire+release fence is equivalent to two indivisible
| acquire and release fences but can't be implemented by
| simply issuing the two fences in isolation, as they would
| be unordered with each other. x86 has no native acq+release
| fence and it can't be simulated, for example, with a load
| plus a store (i.e. separate acquire and release fences),
| but you need an mfence or atomic rmw.
|
| ARM and Power do allow for independent reads of independent
| writes (or maybe only Power), so in principle an acq+rel
| fence could be cheaper than a full seq-cst one, although I
| don't know if it is in practice.
| secondcoming wrote:
| Didn't Intel bork PAUSE on some CPUs? IIRC it can take 140
| cycles to execute.
| dragontamer wrote:
| I don't think that's really a "bork" though.
|
| 1. PAUSE should only be executed inside of the spinlock.
|
| 2. Fast-path won't touch the "PAUSE" instruction. (Ex: lock
| is unlocked. So you grab the lock, bypass the while loop and
| carry on).
|
| 3. Slow-path is... well... the slow-path. You'll be spinning
| in there for thousands of cycles, waiting for "the other
| thread". Saving power by effectively "sleeping" a bit longer
| isn't a big issue.
|
| -----
|
| I could very well believe that Intel did the measurements, and
| that a PAUSE with 100+ cycles of latency gave higher performance
| / throughput than their original PAUSE implementation. It is
| a "slow-path" instruction after all.
|
| If the spin-lock has any contention whatsoever, you're
| looking at 100+ cycles to even communicate to L3 cache
| (~40-nanoseconds == 160 cycles to talk to L3 cache at all).
|
| It makes no sense to be "any faster" than this, because L3
| cache hasn't changed yet. You might as well sleep until the
| next point-in-time where L3 cache had a chance to update
| (aka: the value of the spinlock has changed).
|
| L2 cache and L1 cache are "private" to each core, so it's
| impossible for another core to change the values in L1 or L2.
| So the "closest" the spinlock's value can be is L3 cache. The
| only exception I can think of... is if both threads are
| SMT/Hyperthreads on the same core. This case has probably
| gotten slower, but all other cases have probably gotten
| faster (or more precisely: higher throughput).
| usefulcat wrote:
| In Skylake, they increased the duration of PAUSE from "about
| 10 cycles" to "as many as 140 cycles".
|
| https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpu...
| MaxGanzII wrote:
| I may be wrong, but I don't see this working in theory.
|
| The basic problem is that for any physical core, there is no
| guarantee that _any_ writes will _ever_ be seen by any other core
| (unless the necessary extra magic is done to make this so).
|
| (There's a matching problem when reading, too, in that in theory
| writes made by another core can be never seen by the reading
| core, even if they have reached the domain covered by cache
| coherency, unless the necessary extra magic, etc.)
|
| So here with this code, in theory, the writes being made simply
| are _never_ seen by another core, and so it doesn't work.
|
| As it is, in practice, writes typically make their way to other
| physical cores in a reasonably short time, whether or not the
| extra magic is performed; but this is not guaranteed. Also of
| course the ordering of the emergence of those writes is not
| guaranteed (although on some platforms, additional guarantees are
| provided - Intel is particularly generous in this regard, and
| performance suffers because of it).
| gpderetta wrote:
| There is an implied seq-cst barrier in the atomic INC.
| MaxGanzII wrote:
| A barrier, of any kind, makes no difference here.
|
| Barriers only influence the _order_ in which events become
| visible _if and when they DO become visible_.
|
| Barriers say nothing about whether or not events _actually
| do_ become visible; and it may be that they never do (if the
| necessary magic is not performed, etc).
|
| You mention an _atomic_ increment. I don't see atomic
| increments in the C11 code, but I'm not up to speed with
| that version of the specification.
|
| (Note that an atomic increment will solve the write-side
| problem, but not the read-side problem. The reader must still
| use a read barrier; but if there's an implicit full barrier,
| then that's being done.)
| gpderetta wrote:
| The fact that another thread sees the incremented value
| implies that that thread will see all stores that happened-
| before that incremented value.
|
| A fence or atomic operation guarantees that a store will be
| visible globally in a finite amount of time.
|
| Edit: that's true for C11 _Atomic as well (I had to double
| check), i.e. x++ where x is an _Atomic int is both atomic and
| seq-cst.
| MaxGanzII wrote:
| > The fact that another thread sees the incremented value
| implies that that thread will see all stores that
| happened-before that incremented value.
|
| Firstly, if the writer has used no barriers, writes can
| emerge in any order (and may never emerge at all, anyway,
| even if barriers were used). Secondly, if the reader has
| not used barriers, then even if writes have been written
| in some given order, the reader will by no means observe
| that order.
|
| > A fence or atomic operation guarantee that a store will
| be visible globally in a finite amount of time
|
| A fence does _NOT_ guarantee this. A fence controls, and
| only controls, the order of events. It has _NO_ influence
| on whether events actually emerge. An atomic operation
| does, but this only solves the write-side of the problem;
| the reader still needs to issue a read barrier to
| guarantee seeing atomically written values.
| gpderetta wrote:
| Sure, but the inc, in addition to the fence, also issues
| the required reads and writes.
| dragontamer wrote:
| > The basic problem is that for any physical core, there is no
| guarantee that any writes will ever be seen by any other core
| (unless the necessary extra magic is done to make this so).
|
| They're using atomic read/writes with sequential-consistency.
|
| This means that the compiler will automatically put memory-
| barriers in the appropriate locations to guarantee sequential-
| consistency, but probably at a significant performance cost.
|
| The implementation is likely correct, but it's the slowest
| option available (seq-cst is the simplest / most braindead
| atomic ordering, but the slowest because you probably don't
| need all those memory barriers).
|
| I'd expect that acq-release style barriers are possible for this
| use case, which would be faster on x86, POWER, and ARM systems.
| MaxGanzII wrote:
| > They're using atomic read/writes with sequential-
| consistency.
|
| In the C11 code? I'm not up to speed with that version of the
| spec, but unless there is behaviour invoked by the use of
| _Atomic, the code looks to me to be performing normal, non-
| atomic read/writes.
|
| Another reply says there are full barriers being
| automatically used, but that doesn't address the problem I've
| described.
| dragontamer wrote:
| > but unless there is behaviour invoked by the use of
| _Atomic
|
| That is the behavior invoked by "atomic".
|
| The general gist is that "manual" read/write barriers were
| too difficult to think about and led to bugs. Instead of
| having the programmer put the read/write barriers in the
| correct locations, modern compilers put a type-
| specification on variables... and then it's the compiler's
| job to put the barriers in the correct location.
|
| This turned out to be necessary, because the compiler's
| optimizer kept rearranging code (!!!) around memory
| barriers anyway. So the "compiler" needs to participate in
| any of the barriers, especially at higher levels of
| optimization (-O3 rearranges a lot of code). If the
| compiler needs to participate, you might as well have the
| compiler handle the placement of those barriers.
|
| "Atomic" variables by default will be sequentially-
| consistent (ie: the _maximum_ use of memory barriers for
| _maximum_ levels of consistency).
| MaxGanzII wrote:
| You're talking here about memory barriers, but I was
| asking if atomic writes are in use.
|
| Atomic writes have nothing to do with memory barriers. On
| modern platforms atomic writes are available without
| memory barriers of any kind.
|
| Memory barriers do not cause events to become visible.
|
| Only atomic writes do this; and so the question is not
| whether barriers are in use due to _Atomic, but whether
| atomic writes are in use due to the use of _Atomic.
| dragontamer wrote:
| https://en.cppreference.com/w/c/language/atomic
|
| _Atomic types can be read-modified-written atomically when
| using a compound-assignment operator or the
| post-increment/pre-increment operators.
|
| In these circumstances, the memory-barriers used for
| those atomic-operations will be of the sequentially-
| consistent memory model.
|
| ------
|
| So yes, that "++*barrier" operation is read-modify-write
| atomically AND with memory-barriers in sequential-
| consistency style
|
| -----
|
| I don't believe that "atomics" are enough to solve this
| problem. At a minimum, I expect acq-consume to be
| required (not that acq-consume model even works in
| practice... so acq-release for today's compilers). That's
| why I'm still talking memory model / fences here.
|
| Not only is an atomic operation required, but the
| "publishing" of this information also needs to happen in
| the proper order. Fortunately, C11's _Atomic
| implementation solves both issues.
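|
| (A concrete illustration; the variable name is mine, not the
| article's.)
|
|     #include <stdatomic.h>
|
|     _Atomic int counter;
|
|     void arrive(void)
|     {
|         ++counter;  /* atomic read-modify-write, seq-cst by
|                        default per the cppreference rule linked
|                        above; equivalent to
|                        atomic_fetch_add_explicit(&counter, 1,
|                            memory_order_seq_cst); */
|     }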
| MaxGanzII wrote:
| > So yes, that "++*barrier" operation is read-modify-
| write atomically AND with memory-barriers in sequential-
| consistency style
|
| The quoted material does not describe this assertion.
| dragontamer wrote:
| > Built-in increment and decrement operators and compound
| assignment are read-modify-write atomic operations with
| total sequentially consistent ordering (as if using
| memory_order_seq_cst). If less strict synchronization
| semantics are desired, the standard library functions may
| be used instead.
| [deleted]
___________________________________________________________________
(page generated 2022-03-15 23:01 UTC)