[HN Gopher] A collection of lock-free data structures written in...
___________________________________________________________________
A collection of lock-free data structures written in standard C++11
Author : dnedic
Score : 108 points
Date : 2023-05-10 11:56 UTC (11 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| samsquire wrote:
| I think I lean towards per-thread sharding instead of mutex-based
| or lock-free data structures, except for lock-free ringbuffers.
|
| You can get embarrassingly parallel performance if you split your
| data by thread and aggregate periodically.
|
| If you need a consistent view of your entire set of data, that is
| a slow path with sharding.
|
| In my experiments with multithreaded software I simulate a bank
| where many bank accounts are randomly withdrawn from and deposited
| to. https://github.com/samsquire/multiversion-concurrency-
| contro... ShardedBank.java, ShardedBank2.java,
| ShardedBankNonRandom.java, ShardedHashMap.java,
| ShardedNonRandomBank2.java, ShardedTotalOrder.java,
| ShardedTotalRandomOrder.java
|
| I get 700 million requests per second over 12 threads due to the
| sharding of money over accounts. Here I prioritise throughput of
| transactions per second over balance checks (a consistent view of
| the data).
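|
| As a concrete C++11 sketch of the sharding idea (in the spirit of
| those Java experiments, not a translation of them): each thread
| owns a cache-line-padded shard, and a consistent total only
| exists after the aggregation pass.
|
|     #include <atomic>
|     #include <cstdint>
|     #include <cstdio>
|     #include <thread>
|     #include <vector>
|
|     // One counter per thread, padded to a cache line so threads
|     // never contend on the same line (the parallel fast path).
|     struct alignas(64) Shard {
|         std::atomic<uint64_t> value{0};
|     };
|
|     int main() {
|         const unsigned kThreads = 4;
|         std::vector<Shard> shards(kThreads);
|         std::vector<std::thread> workers;
|         for (unsigned i = 0; i < kThreads; ++i) {
|             workers.emplace_back([&shards, i] {
|                 for (int j = 0; j < 1000000; ++j)
|                     // Relaxed: no ordering needed, only the sum.
|                     shards[i].value.fetch_add(
|                         1, std::memory_order_relaxed);
|             });
|         }
|         for (auto& t : workers) t.join();
|         uint64_t total = 0;  // slow path: aggregate all shards
|         for (auto& s : shards)
|             total += s.value.load(std::memory_order_relaxed);
|         std::printf("total = %llu\n", (unsigned long long)total);
|         return 0;
|     }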
|
| I am also experimenting with left-right concurrency control,
| which trades memory usage for performance: it keeps two copies
| of the data, an active copy that is currently being read and an
| inactive copy that the writer updates. You switch the two copies
| around periodically to "make changes visible" to readers.
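|
| A minimal C++11 sketch of that left-right shape (illustrative,
| not the published algorithm, which is more careful): readers
| count themselves in on the copy they read, the single writer
| updates the inactive copy, flips, waits for old readers to
| drain, then replays the update. Default seq_cst atomics are
| used throughout for simplicity.
|
|     #include <atomic>
|
|     template <typename T>
|     struct LeftRight {
|         T copy[2];
|         std::atomic<int> active;
|         std::atomic<int> readers[2];
|
|         LeftRight() : active(0) { readers[0] = 0; readers[1] = 0; }
|
|         template <typename Fn>
|         void read(Fn fn) {
|             for (;;) {
|                 int i = active.load();
|                 readers[i].fetch_add(1);
|                 if (active.load() == i) {  // writer will wait on us
|                     fn(copy[i]);
|                     readers[i].fetch_sub(1);
|                     return;
|                 }
|                 readers[i].fetch_sub(1);   // flipped under us: retry
|             }
|         }
|
|         template <typename Fn>
|         void write(Fn fn) {                // exactly one writer
|             int inactive = 1 - active.load();
|             fn(copy[inactive]);            // nobody is reading this
|             active.store(inactive);        // new readers go here
|             while (readers[1 - inactive].load() != 0) {
|                 // wait out readers still on the old copy
|             }
|             fn(copy[1 - inactive]);        // replay on the old copy
|         }
|     };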
| asmnzxklopqw wrote:
| [flagged]
| jylam wrote:
| Maybe I don't understand the concept, but aren't lock-free
| structures "just" delegating the locking mechanism to the CPU via
| atomic operations? (although even if that's the case I can
| understand the speedup) (and if so, why aren't all those lockfree
| structures the default, instead of using mutexes?)
| gpderetta wrote:
| Atomic operations can't be meaningfully said to perform any
| locking.
|
| In any case, lock-free is defined in terms of progress
| guarantees.
| 12323 wrote:
| [flagged]
| bluGill wrote:
| From the FAQ:
|
| > The biggest reason you would want to use a lockfree data
| structure in such a scenario would be performance. Locking has a
| non-neglegible runtime cost on hosted systems as every lock
| requires a syscall.
|
| This is misleading. While a lock does have a runtime cost, in
| some cases that cost is less than all the forced CPU cache
| synchronization that lock-free code needs to do. With a lock you
| only have to sync once, after all the operations are done. You
| need to carefully measure this to see which is more performant
| for your application.
| RcouF1uZ4gsC wrote:
| For me the biggest benefit of lock free programming is avoiding
| deadlocks.
|
| Locks are not composable. Unless you are aware of what locks
| every function in your call tree is using, you can easily end
| up with a deadlock.
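|
| The classic composition failure in a C++11 sketch: two functions
| that are each correct in isolation, but take the same two locks
| in opposite orders.
|
|     #include <mutex>
|
|     std::mutex a, b;
|
|     // Thread 1 runs f(), thread 2 runs g(): each can end up
|     // holding one lock while waiting forever for the other.
|     void f() {
|         std::lock_guard<std::mutex> la(a);
|         std::lock_guard<std::mutex> lb(b);
|     }
|     void g() {
|         std::lock_guard<std::mutex> lb(b);
|         std::lock_guard<std::mutex> la(a);
|     }
|
| std::lock(a, b) acquires both in a consistent deadlock-free
| order, but it only helps if every function in the call tree
| cooperates.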
| smat wrote:
| I assume the author of the library works under a hard real-time
| constraint. Under such circumstances (an example would be low
| latency audio) you cannot tolerate the latency impact of a
| sporadic syscall.
| quietbritishjim wrote:
| Perhaps, but that is almost the opposite of what they said:
| in hard real time you can tolerate a longer average latency
| in return for a shorter maximum latency. That matches what
| I would expect from a lock-free data structure. But
| that doesn't match the (dubious) claim that locks are usually
| slower.
| dnedic wrote:
| You make good points, I will try to rephrase that part of
| the README to be more accurate and explain benefits not
| mentioned yet like determinism.
| duped wrote:
| You often _can_ tolerate the latency. The problem of locks is
| they are potentially unbounded and you can miss a deadline.
|
| It's not about performance so much as determinism.
| bluGill wrote:
| Lock free doesn't solve this. One thread will always make
| progress, but you have no way to ensure it is your thread
| so you can miss a deadline with lock free. When the data is
| under a lot of contention across many cores this is an
| issue (most of us don't have hundreds of cores so we don't
| see this).
|
| Generally lock-free is better for these situations: odds
| are that while you hold a lock, at least some CPU cycles are
| used for something that isn't directly modifying the data
| that needs the lock, and with lock-free the other CPUs can
| touch the data during those cycles. (Note that when using a
| lock there is a trade-off: often it is better to hold the
| lock for longer than needed instead of dropping it, doing a
| couple of operations, and then locking again.)
| tialaramex wrote:
| If you must make progress locally (not just globally) the
| guarantee you need is wait freedom which is even stronger
| than lock freedom.
|
| (All wait free algorithms are lock free because
| necessarily if every thread is guaranteed to eventually
| make progress then overall progress is definitely made)
| bcrl wrote:
| This is why RCU, aka Read-Copy-Update, exists. RCU avoids the
| expensive synchronization part by delaying the release of the
| older version of the data structure until a point at which
| _all_ CPUs are guaranteed to see the new version of the data.
| The patents have now expired, so it's worth investigating for
| people that need to write high performance multithreaded code
| that hits certain shared data structures really hard.
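|
| A crude C++11 sketch of the read-copy-update shape, using the
| atomic shared_ptr free functions (real RCU is much cheaper for
| readers, who never touch a shared reference count; this only
| shows the publish-and-defer-reclaim idea):
|
|     #include <memory>
|
|     template <typename T>
|     struct RcuIsh {
|         std::shared_ptr<const T> current;
|
|         // Readers never block; the old version stays alive
|         // until the last reader drops its reference.
|         std::shared_ptr<const T> read() {
|             return std::atomic_load(&current);
|         }
|
|         // Copy, modify, publish. Concurrent writers must be
|         // serialized externally in this sketch.
|         template <typename Fn>
|         void update(Fn fn) {
|             auto old_v = std::atomic_load(&current);
|             auto new_v = std::make_shared<T>(old_v ? *old_v : T());
|             fn(*new_v);
|             std::atomic_store(
|                 &current, std::shared_ptr<const T>(std::move(new_v)));
|         }
|     };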
| dnedic wrote:
| The linked talk from Herb Sutter says as much, but it is a good
| suggestion to make that visible upfront, thanks.
|
| Additionally, cacheline alignment of indexes is something
| that's there to mitigate the false sharing phenomenon and
| reduce some of the cache synchronization cost.
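|
| The idea in a C++11 sketch (field names are illustrative, not
| the library's actual members): give each index its own cache
| line so the producer bumping the write index doesn't invalidate
| the line the consumer's read index lives on.
|
|     #include <atomic>
|     #include <cstddef>
|
|     struct Indexes {
|         alignas(64) std::atomic<std::size_t> write_idx{0}; // producer
|         alignas(64) std::atomic<std::size_t> read_idx{0};  // consumer
|     };
|
|     static_assert(sizeof(Indexes) >= 128,
|                   "each index padded to its own 64-byte line");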
| ashvardanian wrote:
| Both lock-free and mutex-based approaches have their
| applications. The general rule of thumb: for 2-4 threads in the
| same NUMA node, lock-free is faster. Need more cores? Use a
| proper heavy mutex.
| planede wrote:
| Moreover, is "every lock requires a syscall" accurate? Probably
| depends on target platform and standard library, but my
| impression was that at low contention it doesn't really require
| syscalls, at least on Linux and glibc's pthread.
| m4nu3l wrote:
| That's my understanding too. On Linux mutexes are implemented
| with futexes (Fast Userspace mutexes). If there is no
| contention they are guaranteed not to perform a syscall:
| https://en.wikipedia.org/wiki/Futex
| cout wrote:
| Maybe this has changed, but last time I looked at futexes
| there was no syscall for locking (assuming no contention),
| but unlocking always made a syscall. This was many years
| ago so it could be different now.
| m4nu3l wrote:
| The code isn't the easiest to read, but in glibc it seems
| that the syscall is only performed if waiters are
| detected in userspace during an unlock operation:
|
| https://github.com/lattera/glibc/blob/master/nptl/pthread
| _mu...
| gpderetta wrote:
| Indeed, you only need to FUTEX_WAKE if you know there are
| waiters (or if you lost track of the number of waiters).
| kevincox wrote:
| Definitely not. I don't think there are many standard libraries
| that use syscalls for every lock. It is near universal to
| attempt a lock in user space (maybe even spin a few times)
| and only call the kernel if you need to wait. So low-
| contention locks should very rarely make a syscall.
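|
| A Linux-only sketch of that split, after Drepper's "Futexes Are
| Tricky" (states: 0 = unlocked, 1 = locked, 2 = locked with
| possible waiters; assumes std::atomic<int> is layout-compatible
| with int, which holds on Linux):
|
|     #include <atomic>
|     #include <linux/futex.h>
|     #include <sys/syscall.h>
|     #include <unistd.h>
|
|     struct FutexLock {
|         std::atomic<int> state{0};
|
|         void lock() {
|             int c = 0;
|             if (state.compare_exchange_strong(c, 1))
|                 return;             // fast path: one CAS, no syscall
|             if (c != 2)
|                 c = state.exchange(2);  // announce we will sleep
|             while (c != 0) {
|                 syscall(SYS_futex, &state, FUTEX_WAIT, 2,
|                         nullptr, nullptr, 0);
|                 c = state.exchange(2);
|             }
|         }
|
|         void unlock() {
|             if (state.exchange(0) == 2) // syscall only if waiters
|                 syscall(SYS_futex, &state, FUTEX_WAKE, 1,
|                         nullptr, nullptr, 0);
|         }
|     };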
| adzm wrote:
| Critical sections in Windows certainly behave this way with
| a configurable spin count.
| maldev wrote:
| You can literally just do an atomic swap on some memory
| location. A few lines of assembly, like:
|
|     mov al, 1                  ; the value to swap in: 1 = locked
|     acquire_lock:              ; attempt to acquire the lock
|     xchg byte [ADDR_LOCK], al  ; atomically swap lock value with al
|     test al, al                ; was the old value 0 (unlocked)?
|     jnz acquire_lock           ; if not, loop until we acquire it
|
| The downside is you want a backoff to sleep the thread so it
| doesn't spin in a busy loop. But the actual lock code is simple.
| You can easily have this be your function "AcquireLock()" and
| then do
|
| while(!AcquireLock()) { //pause thread execution. }
|
| And I think this is where they get the syscall being needed,
| since this will normally require a syscall to pause the
| thread from the scheduler.
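|
| The portable C++11 equivalent of that xchg loop is a few lines
| (real implementations add a relaxed-load spin and eventually
| fall back to the OS so waiters don't burn a core forever):
|
|     #include <atomic>
|
|     struct SpinLock {
|         std::atomic_flag flag = ATOMIC_FLAG_INIT;
|         void lock() {
|             while (flag.test_and_set(std::memory_order_acquire)) {
|                 // spin; a real lock would back off / sleep here
|             }
|         }
|         void unlock() { flag.clear(std::memory_order_release); }
|     };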
| dragontamer wrote:
| Pikus has a number of C++ parallelism performance talks that
| discuss this issue.
|
| In particular, the "cost of synchronization" is roughly the
| price of an L3 memory read/write, or ~50 cycles. In contrast, a
| read/write to L1 cache has ~4 cycles of latency and can be done
| multiple times per clock tick, and you can go faster still (IE:
| register space).
|
| So an unsynchronized "counter++" will be done ~4 billion times
| a second.
|
| But a "counter++; sync()" will be ~50x slower. That
| is to say, the "sync()" is the "expensive part" to think about.
|
| ------------
|
| A lock is two sync() statements: one sync() when you lock, and
| a 2nd sync() when you unlock. Really, half-syncs, since one is
| an acquire half-sync and the other is a release half-sync.
|
| If your lock-free data structure uses more than two sync()
| statements, you're _probably_ slower than a dumb lock-based
| method (!!!). This is because the vast majority of
| lock()/unlocks() are uncontested in practice.
|
| So the TL;DR is, use locks because they're so much easier to
| think about. When you need more speed, measure first, because
| it turns out that locks are really damn fast in practice and
| kind of hard to beat.
|
| That being said, there are other reasons to go lock-free. On
| some occasions, you will be forced to use lock-free code because
| blocking cannot be tolerated. (Ex: lock-free interrupts. It's
| fine if the interrupt takes a bit longer than expected during
| contention.) So in this case, even if lock-free is slower, the
| guarantee of forward progress is what you're going for, rather
| than performance.
|
| So the study of lock-free data structures is still really
| useful. But don't always reach for them for performance reasons,
| because these data structures may very well be slower than
| std-library + lock() in many cases.
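|
| To make the sync-counting concrete, the three counters being
| compared look like this in C++11 (the cycle figures above are
| rough orders of magnitude, not measurements of this code):
|
|     #include <atomic>
|     #include <mutex>
|
|     long plain_counter = 0;
|     std::atomic<long> atomic_counter{0};
|     std::mutex m;
|     long locked_counter = 0;
|
|     void bump_plain()  { ++plain_counter; } // no sync, L1-speed
|
|     void bump_atomic() {                    // one synchronized RMW
|         atomic_counter.fetch_add(1);
|     }
|
|     void bump_locked() {                    // two half-syncs:
|         std::lock_guard<std::mutex> g(m);   // acquire on lock,
|         ++locked_counter;                   // release on unlock
|     }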
| gpderetta wrote:
| A sync, assuming it is your typical memory barrier, is not
| bound by the L3 latency. You pay (in first approximation) the
| L3 cost when you touch a contended cache line, whether you
| are doing a plain write or a full atomic CAS.
|
| Separately, fences and atomic RMWs are slower than plain
| reads/writes, but that's because of the (partially)
| serialising effects they have on a CPU pipeline, and very
| little to do with L3 (or any memory) latency.
|
| Case in point: A CAS on intel is 20ish cycles, the L3 latency
| is 30-40 cycles or more. On the other hand you can have
| multiple L3 misses outstanding, but CAS hardly pipelines.
| dragontamer wrote:
| Perhaps my brain is conflating multiple things.
|
| So here's what I know. L1 / L2 caches are "inside the
| core". To talk to other cores, your data must leave L1/L2
| cache and talk to a network. It is on _THIS_ network that
| the L3 cache exists.
|
| It's not really "L3 cache", it's just the memory network that
| implements the MOESI protocol (or whatever proprietary
| variant: MOESIF or whatever) that sits between L2 cache, L3
| cache, and core-to-core communications. So I don't want to
| make it sound like "You're waiting on L3 cache", because
| you're right. It's not really related to L3 cache.
|
| So I don't really know how that network works (I know the
| high-level details of MOESI, but I also know that's the
| textbook explanation and the real world is way more
| complex), aside from "It'd be roughly on the same order-of-
| magnitude cost" as the L3 cache.
|
| -------
|
| What's really going on, is that Core#1 is talking to all
| the other cores and basically performing "git pull" and
| "git push" to the data asynchronously. Core#1 prefers to
| perform "git checkin" and "git checkout" (aka: talk to
| L1/L2 cache), because that's faster.
|
| But when Core#1 does a full sync (git push/git pull), it is
| required to talk on the network that leads to other cores
| and/or L3 cache.
|
| Those "git push / git pull" messages are MOESI (F and
| others), meaning Modified/Owned/Exclusive/Shared/Invalid
| (and Forwarding, and other proprietary states), which line
| up with version control way better than most people expect.
| gpderetta wrote:
| Indeed L3 being shared and also often working as the
| MOESI directory works out to the interthread latency
| being the same order of magnitude as the L3 latency.
|
| My point is that sync has nothing to do with caches.
| Caches are coherent all the time and do not need
| barriers. In particular I don't think the git pull/push
| analogy maps well to MOESI, as it is an optimistic protocol
| and only requires transferring opportunistically, on demand,
| what is actually needed by a remote core, not conservatively
| everything that has changed like in git.
|
| The explicit sync model is more representative of non
| coherent caches, which are not really common as they are
| hard to use.
|
| Memory barriers are, for typical CPUs, purely an
| internal matter of the core and synchronize internal
| queues with L1. In a simple x86 model, where the only
| visible source of reordering is StoreLoad, a memory
| barrier simply stalls the pipeline until all preceding
| stores in program order are flushed out of the write
| buffer into L1.
|
| In practice, things these days are more complex and
| a fence doesn't fully stall the pipeline; it potentially
| only needs to visibly prevent loads from completing.
|
| Other more relaxed CPUs also need to synchronise load
| queues, but still for the most part fences are a core
| local matter.
|
| Some architectures have indeed remote fences (even x86 is
| apparently getting some in the near future) but these are
| more exotic and, AFAIK, do not usually map to default
| c++11 atomics.
| dragontamer wrote:
| > The explicit sync model is more representative of non
| coherent caches, which are not really common as they are
| hard to use.
|
| Not that I'm a professional GPU programmer. But I'm
| pretty certain that GPU caches are non-coherent.
|
| But yeah, cache-coherence is just assumed on modern CPUs.
| Your clarification on store-queues and load-queues is
| helpful (even if the caches are coherent, the store-queue
| and load-queue can still introduce an invalid reordering.
| So it sounds like your point is that the various sync()
| instructions are more about these queues?)
| ot wrote:
| This is true in principle and it is good to call it out, but in
| practice I've never seen a mutex-based data structure beat an
| equivalent lock-free data structure, even at low contention,
| unless the latter is extremely contrived.
|
| A mutex transaction generally requires 2 fences, one on lock
| and one on unlock. The one on unlock would not be strictly
| necessary in principle (on x86 archs the implicit acquire-
| release semantics would be enough) but you generally do a CAS
| anyway to atomically check whether there are any waiters that
| need a wake-up, which implies a fence.
|
| Good lock-free data structures OTOH require just one CAS (or
| other fenced RMW) on the shared state.
|
| Besides, at large scale, no matter how small your critical
| section is, it will be preempted every once in a while, and
| when you care about tail latency that is visible. Lock-free
| data structures have more predictable latency characteristics
| (even better if wait-free).
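|
| For a sense of "just one fenced RMW": a Treiber stack push is a
| single CAS on the shared head (a sketch; pop needs ABA and
| reclamation care that is deliberately not shown):
|
|     #include <atomic>
|
|     template <typename T>
|     struct TreiberStack {
|         struct Node { T value; Node* next; };
|         std::atomic<Node*> head{nullptr};
|
|         void push(T v) {
|             Node* n = new Node{
|                 std::move(v), head.load(std::memory_order_relaxed)};
|             // On failure, n->next is refreshed with the current
|             // head, so we just retry the single CAS.
|             while (!head.compare_exchange_weak(
|                 n->next, n,
|                 std::memory_order_release,
|                 std::memory_order_relaxed)) {
|             }
|         }
|     };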
| kerkeslager wrote:
| I appreciate your polite tone here. To expand on this at the
| risk of sounding a bit rude: nobody should listen to anyone
| who speaks about performance in terms of reasoning about a
| system instead of profiling it.
|
| Computers are _shockingly_ complex. I can't tell you how
| many times I've reasoned about a system, ran the profiler,
| and discovered I was completely wrong.
|
| When I was working on an interpreter for a Lisp, I
| implemented my first cut of scopes (all the variables within
| a scope and their values) as a naive unsorted list of
| key/value pairs, thinking I'd optimize later. When I came
| back to optimize, I reimplemented this as a hashmap, but when
| I ran my test programs, to my horror, they were all 10x
| slower. I plugged in a hashmap library used in lots of
| production systems and got a significant 2x performance gain,
| which was still slower than looping over an unsorted list of
| key/value pairs. The fact is, most scopes have <10 variables,
| and at that size, looping over a list is faster than the
| constant time of a hashmap. I can reason about why this is,
| but that's just fitting my reasons to the facts ex post
| facto. _Reasoning_ didn't lead me to the correct answer,
| _observation_ did.
|
| Returning to parallel data structures, the fact is, _I don't
| know why lock-free structures are faster than mutex-based
| structures, I just know that they are in every situation
| where I've profiled them_.
|
| Reasoning isn't completely useless--reasoning is how you
| intuit what you should be profiling. But if you're just
| reasoning about how two alternatives will perform and not
| profiling them in real-life production systems you're wasting
| everyone's time.
| slaymaker1907 wrote:
| I don't think that's quite correct because there are many
| optimizations which are impossible (or nearly impossible)
| to do after a system is implemented. Daniel Lemire wrote an
| excellent post on this exact subject
| https://lemire.me/blog/2023/04/27/hotspot-performance-
| engine....
|
| In terms of programming languages, I think python is an
| excellent example of a language which has many features
| that have ended up making it extremely difficult to
| optimize even compared to other dynamic languages like
| LISPs. Even if you don't have to worry about backwards
| compatibility, there are design decisions that can limit
| performance which end up necessitating a rewrite of the
| entire system to actually change.
| kerkeslager wrote:
| > I don't think that's quite correct because there are
| many optimizations which are impossible (or nearly
| impossible) to do after a system is implemented.
|
| Who said you have to profile _after_ a system is
| implemented? Certainly I didn't: if anything, I prefer
| to profile during prototyping, although it seems few
| companies outside the largest ones budget for any real
| prototyping these days. Usually I settle for timing things and
| profiling as early as possible so that you can catch any
| performance issues before any calcifying structure is
| built around the non-performant code.
|
| Yes, I did profile after the fact in the Lisp story, but
| my point in that story was that my reasoning led me to
| the wrong conclusions, not that I did everything
| perfectly (on the contrary, it's a story about learning
| from my mistakes!).
|
| I agree that Daniel Lemire post is excellent, but nothing
| in that post leads me to believe he'd disagree with
| anything I've said.
| OskarS wrote:
| Assuming low or no contention, it is easy to imagine a
| scenario where a mutex vastly outperforms it: if you need to
| push 1000 things into the queue, it's still just two fences
| for the mutex but it's now 1000 CASes.
|
| Moreover: the point with mutexes is that your data structure
| can be optimized assuming no thread-safety. There are
| lots of, like, hyper-optimized hash table variants (with all
| sorts of SIMD nonsense and stuff) that are just not possible
| to do lock-free. The very "lock-freedomness" of the
| datastructure slows it down enough that in low contention
| scenarios mutexes clearly would outperform them without being
| particularly contrived.
| ot wrote:
| If you are going to do batch operations, your data
| structure should be optimized to support them, so you're
| back to one CAS. The same would apply to the locked
| scenario, where you probably don't want to copy 1000
| elements in the critical section.
|
| About the sufficiently smart optimizations, sure,
| everything is easy to imagine, but in my experience this
| never happened, and I'd be curious to hear practical
| examples if you have any.
| sakras wrote:
| Here's one I had: I was trying to build a Bloom filter in
| parallel. Each thread had large-ish batches of hashes it
| wanted to insert into the filter. Naively, you'd just
| have each thread iterate through the batches and do
| __sync_fetch_and_or for each of the hashes (this was a
| register-blocked Bloom filter so we only needed to
| perform one 8-byte OR operation per hash).
|
| What ended up being MUCH faster was to partition the
| filter, and to have a lock per partition. Each thread
| would attempt to grab a random lock, perform its inserts
| into that partition, then release the lock and try to
| grab another random lock that it hasn't grabbed yet.
| Granted, these locks were just atomic booleans, not
| std::mutex or anything like that. But I think this
| illustrates that partitioning+locking can be better for
| throughput. If you want predictable latency for single
| inserts, then I'd imagine the __sync_fetch_and_or
| strategy would work better. Which maybe brings up a
| broader point that this whole discussion relies a lot on
| exactly what "faster" means to you.
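|
| A sketch of that partition+try-lock shape in C++11 (illustrative
| sizes and names, not the original code): hashes are pre-grouped
| per partition, and each thread claims any free partition,
| inserts its whole batch with plain ORs, then moves on.
|
|     #include <atomic>
|     #include <cstddef>
|     #include <cstdint>
|     #include <vector>
|
|     struct PartitionedBloom {
|         static const std::size_t kParts = 64;
|         static const std::size_t kWords = 1 << 16; // per partition
|         std::vector<uint64_t> bits;
|         std::atomic_flag locks[kParts];
|
|         PartitionedBloom() : bits(kParts * kWords, 0) {
|             for (std::size_t p = 0; p < kParts; ++p)
|                 locks[p].clear();
|         }
|
|         // batches[p] holds this thread's hashes for partition p.
|         void insert(std::vector<std::vector<uint64_t> >& batches) {
|             std::vector<bool> done(kParts, false);
|             std::size_t remaining = kParts;
|             while (remaining > 0) {
|                 for (std::size_t p = 0; p < kParts; ++p) {
|                     if (done[p]) continue;
|                     if (locks[p].test_and_set(
|                             std::memory_order_acquire))
|                         continue;       // busy, try another one
|                     for (uint64_t h : batches[p]) {
|                         std::size_t w = (h >> 6) % kWords;
|                         bits[p * kWords + w] |= 1ull << (h & 63);
|                     }
|                     locks[p].clear(std::memory_order_release);
|                     done[p] = true;
|                     --remaining;
|                 }
|             }
|         }
|     };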
| dnedic wrote:
| This is why you would use the Ring Buffer or Bipartite
| Buffer to place 1000 elements at a time, not the queue.
| Check the documentation for more info.
| T0pH4t wrote:
| ^ This can definitely be the case in a multi-
| producer/multi-consumer (MPMC) scenario if CAS is involved
| with loops. Great care has to be taken when writing MPMC
| data structures without locks that are more performant than
| lock equivalents; they are far more complex. I think it
| should be called out that it seems most (if not all) of the
| data structures provided are single producer/consumer, which
| generally have much simpler designs (and limitations
| of use) than MPMC.
| T0pH4t wrote:
| I suppose I should also say this applies to MPSC and SPMC.
| Basically anything other than SPSC.
| [deleted]
| josephg wrote:
| Are there any good benchmarks which demonstrate the
| performance characteristics you're talking about? Or case
| studies where an application moved from mutexes to lock free
| data structures, and compared the resulting performance?
| xfs wrote:
| https://youtu.be/_qaKkHuHYE0 (CppCon) A senior software
| engineer at Google tried to optimize tcmalloc by replacing
| a mutex with a lockless MPMC queue. After many bugs and
| tears, the result was not statistically significant in
| production systems.
| bluGill wrote:
| This is impossible to do in a useful way for publication.
| You can do case studies, but changes in factors that seem
| minor can make a massive difference in benchmarks. As such
| you need to find real world data, used
| in a real world scenario, for your application: then
| benchmark it. Even then you have a benchmark useful for
| your application only, and not worth publishing.
| kerkeslager wrote:
| I disagree. I've gotten a lot from reading publications
| about people profiling their applications and posting
| about the results with descriptions of their application
| design and load. Yes, obviously you can't read that and
| draw conclusions about how the same data structure or
| algorithm will perform in your application, but it helps
| you build an intuition for what is likely to work in
| applications with different characteristics.
|
| Here's an example:
| https://queue.acm.org/detail.cfm?id=1814327
|
| If you read that for the first time and don't learn
| something, I'd be quite surprised.
| andersced wrote:
| Should be benchmarked against ->
|
| https://github.com/Deaod/spsc_queue
|
| If proven faster OK.. If not.. Well.. back to the drawing board.
|
| I gave it a try -> https://github.com/andersc/fastqueue
|
| Deaod is the kingpin.
| jcelerier wrote:
| https://max0x7ba.github.io/atomic_queue/html/benchmarks.html
| for an existing set of benchmarks where this could be added
| whiteboardma wrote:
| I've only looked at the queue implementation, but both push and
| pop contain obvious race conditions; I would highly suggest
| adding tests that actually use the data structures from multiple
| threads.
| dnedic wrote:
| Could you elaborate on the alleged race conditions? Any advice
| on reliably testing the race conditions? The problem with
| adding those is that they will give lots of false
| negatives, and if you rely on them you have a problem.
| whiteboardma wrote:
| Looking at the Push operation defined in queue_impl.hpp, if
| multiple threads perform concurrent pushes, they might end up
| writing their element to the same slot in _data since the
| current position _w is not incremented atomically
| ape4 wrote:
| Just add a lock ;)
| dnedic wrote:
| This is a multi producer scenario, the README clearly
| states that these data structures are only single producer
| single consumer multi thread/interrupt safe.
|
| I will also add disclaimers to the javadocs comments on
| methods just to reduce confusion.
| whiteboardma wrote:
| Whoops, my bad
| gjadi wrote:
| You could use TLA+ to model the data structure operations and
| check the invariant. Checking the invariant with assert is
| also useful in my limited experience with concurrency.
|
| https://lamport.azurewebsites.net/tla/tla.html
| dnedic wrote:
| Thanks a lot, will check this out!
| kilotaras wrote:
| - A lot of code won't work for types with no default
| constructors, but that is at least a compile error
|
| - Using memcpy[0] for arbitrary types is just wrong, see [1]
|
| [0]
| https://github.com/DNedic/lockfree/blob/main/lockfree/inc/bi...
|
| [1] https://www.open-
| std.org/jtc1/sc22/wg21/docs/papers/2023/p11...
| dnedic wrote:
| This is already noted in the Queue readme, only the Queue
| constructs the type, the other 2 data structures are meant for
| PODs only.
|
| I will take a look at adding support for constructing in-place
| for the other 2 data structures, but at the moment, they are
| just for PODs.
| Notbrainiac wrote:
| Every datastructure is lock free. Locks are required when you
| have multiple writers. The article states these are useful only
| for certain circumstances: single consumer single producer
| scenarios. So yeah, within these assumptions you can make
| something work.
| jimktrains2 wrote:
| You may need a lock if you have operations that are not atomic,
| even with a single writer, as a reader could find an
| inconsistency.
| ot wrote:
| This seems unnecessarily pedantic. Lock-free conventionally
| implies concurrent, otherwise it's meaningless.
| Koshkin wrote:
| Note that "lock-free" is a technical term that has a very
| specific, precise meaning (implying that the _user_ of the
| structure does not need to use locking explicitly).
| dnedic wrote:
| Even in single producer single consumer scenarios you need
| locks for multithreaded/interrupt use if you're not using
| atomics and proper fences.
| elbigbad wrote:
| [flagged]
| cjensen wrote:
| An issue in C++ is that it only supports atomic changes to the
| builtin types. For example, you can only CAS a 64-bit value if
| your largest integer/pointer type is 64-bits.
|
| Good lock free algorithms use double-width instructions like
| cmpxchg16b which compare 64-bits but swap 128-bits. You can then
| use the compared 64-bits as a kind of version number to prevent
| the ABA problem.
|
| Using only the built-in atomics is working with a hand tied
| behind your back. With the wider version, it's trivial to write
| multi-producer multi-consumer stacks with no limits to the number
| of objects stored. It's also pretty easy (if you copy the
| published algorithm) to do the same with queues.
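|
| A sketch of the version-counter idea in standard C++11 (whether
| this compiles to cmpxchg16b or a libatomic lock pool depends on
| the compiler and flags such as -mcx16; see the sibling comments
| for the full-width-compare caveat):
|
|     #include <atomic>
|     #include <cstdint>
|
|     struct Node { Node* next; };
|
|     struct alignas(16) Head {
|         Node*    ptr;
|         uint64_t version;  // bumped on every successful swap, so
|                            // a recycled pointer still fails the
|                            // compare (the ABA problem)
|     };
|
|     std::atomic<Head> head{Head{nullptr, 0}};
|
|     void push(Node* n) {
|         Head expected = head.load();
|         Head desired;
|         do {
|             n->next = expected.ptr;
|             desired.ptr = n;
|             desired.version = expected.version + 1;
|         } while (!head.compare_exchange_weak(expected, desired));
|     }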
| dnedic wrote:
| True, that would help immensely in creating MPMC data
| structures, but as these are SPSC there is no problem. Also to
| clarify, this is only for the indexes, the data members can be
| anything.
|
| Using these intrinsics or inline assembly would break
| portability or create situations where platforms have different
| feature levels, which is not something I intend to do. I want
| the library to be compatible with everything from a tiny MCU to
| x86.
| gpderetta wrote:
| Actually C++ only requires TriviallyCopyable for std::atomic.
| The issue with 2CAS is that intel until very recently only
| provided cmpxchg16b[1] but no 128-bit atomic loads and stores:
| SSE 128-bit memory operations were not guaranteed to be atomic (and
| in fact were observed not to be on some AMDs).
|
| So a 128-bit std::atomic on intel was not only suboptimal, as
| the compiler had to use 2cas for loads and stores as well, but
| actually wrong, as an atomic load from readonly memory would
| fault. So at some point the ABI was changed to use the spinlock
| pool. Not sure if it has changed since.
|
| If you do it "by hand", when you only need a 2cas, a 128 bit
| load that is not atomic is fine as any tearing will be detected
| by the CAS and 'fixed', but it is hard for the compiler to
| optimize generic code.
|
| [1] which actually does full 128bit compare and swap, you are
| probably confusing it with the Itanium variant.
| aseipp wrote:
| The lack of DWCAS as a primitive in general is really annoying,
| C++ aside. RISC-V's A extension has no form of it, either; you
| only get XLEN-sized AMO + LL/SC (no 2*XLEN LL/SC either!)
|
| It's one of those features that when you want it, you really
| really _really_ want it, and the substitutions are all pretty
| bad in comparison.
| CyberDildonics wrote:
| > Good lock free algorithms use double-width instructions like
| cmpxchg16b which compare 64-bits but swap 128-bits
|
| The instructions should compare 128 bits and swap 128 bits.
|
| I don't know why 'good' algorithms would use these if they
| don't need to, because 128 bit operations are slower.
|
| Not only that, 128 bit compare and swap doesn't work if it is
| not 128 bit aligned while 64 bit compare and swap will work
| even if they aren't 64 bit aligned.
| gpderetta wrote:
| On x86, any CAS on a misaligned address that crosses a cache
| line boundary can fault in the best case (if the mis-feature
| is disabled by the OS) or cost thousands of clock cycles on
| _all_ cores. So it "works" only for small values of "works".
___________________________________________________________________
(page generated 2023-05-10 23:01 UTC)