[HN Gopher] A collection of lock-free data structures written in...
       ___________________________________________________________________
        
       A collection of lock-free data structures written in standard C++11
        
       Author : dnedic
       Score  : 108 points
       Date   : 2023-05-10 11:56 UTC (11 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | samsquire wrote:
        | I think I lean towards per-thread sharding instead of mutex-based
        | or lock-free data structures, except for lock-free ring buffers.
       | 
        | You can get embarrassingly parallel performance if you split your
       | data by thread and aggregate periodically.
       | 
       | If you need a consistent view of your entire set of data, that is
       | a slow path with sharding.
       | 
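        | A minimal sketch of the idea in C++ (illustrative only, names made
        | up): each thread owns a cache-line-padded slot, and a reader sums
        | the slots periodically, so the hot path never shares a cache line.
        | 
        |     #include <array>
        |     #include <atomic>
        |     #include <cstdint>
        |     
        |     constexpr int kMaxThreads = 12;
        |     
        |     struct alignas(64) Slot {          // one cache line per thread
        |         std::atomic<std::int64_t> value{0};
        |     };
        |     std::array<Slot, kMaxThreads> slots;
        |     
        |     // Hot path: each thread touches only its own slot.
        |     void add(int thread_id, std::int64_t delta) {
        |         slots[thread_id].value.fetch_add(delta, std::memory_order_relaxed);
        |     }
        |     
        |     // Slow path: aggregate for an (approximate) global view.
        |     std::int64_t total() {
        |         std::int64_t sum = 0;
        |         for (auto& s : slots)
        |             sum += s.value.load(std::memory_order_relaxed);
        |         return sum;
        |     }
        | 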
        | In my experiments with multithreaded software I simulate a bank
        | where many bank accounts are randomly withdrawn from and
        | deposited to. https://github.com/samsquire/multiversion-concurrency-
        | contro...
        | 
        |     ShardedBank.java
        |     ShardedBank2.java
        |     ShardedBankNonRandom.java
        |     ShardedHashMap.java
        |     ShardedNonRandomBank2.java
        |     ShardedTotalOrder.java
        |     ShardedTotalRandomOrder.java
       | 
       | I get 700 million requests per second over 12 threads due to the
       | sharding of money over accounts. Here I prioritise throughput of
        | transactions per second over balance checks (a consistent view of
       | the data).
       | 
        | I am also experimenting with left-right concurrency control, which
        | trades memory usage for performance. It basically keeps two copies
        | of the data: one that is currently being read, and an inactive
        | copy that the writer modifies. You switch the two data structures
        | around periodically to "make changes visible" to the reading
        | thread.
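        | 
        | A very rough sketch of the flip (hypothetical, single writer; the
        | real left-right algorithm also tracks readers per instance before
        | the writer may touch the retired copy again):
        | 
        |     #include <atomic>
        |     #include <map>
        |     #include <string>
        |     
        |     std::map<std::string, int> copies[2];  // two full copies
        |     std::atomic<int> active{0};            // which copy readers use
        |     
        |     int read(const std::string& key) {
        |         return copies[active.load(std::memory_order_acquire)].at(key);
        |     }
        |     
        |     void write(const std::string& key, int value) {  // single writer
        |         int inactive = 1 - active.load(std::memory_order_relaxed);
        |         copies[inactive][key] = value;     // mutate the hidden copy
        |         active.store(inactive, std::memory_order_release);  // publish
        |         // ...wait for readers to drain, then replay the change on
        |         // the other copy so both stay in sync (omitted here).
        |     }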
        
       | asmnzxklopqw wrote:
       | [flagged]
        
       | jylam wrote:
        | Maybe I don't understand the concept, but aren't lock-free
        | structures "just" delegating the locking mechanism to the CPU via
        | atomic operations? (Although even if that's the case I can
        | understand the speedup.) (And if so, why aren't all those lock-free
        | structures the default, instead of using mutexes?)
        
         | gpderetta wrote:
         | Atomic operations can't be meaningfully said to perform any
         | locking.
         | 
          | In any case, lock-free is defined in terms of progress
          | guarantees.
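          | 
          | For illustration (not code from this library), a Treiber-stack
          | push: no thread ever holds a lock, and a failed CAS means some
          | other thread's CAS succeeded, so system-wide progress is always
          | made, which is exactly the lock-free progress guarantee.
          | 
          |     #include <atomic>
          |     
          |     struct Node { int value; Node* next; };
          |     std::atomic<Node*> head{nullptr};
          |     
          |     void push(int v) {
          |         Node* n = new Node{v, head.load(std::memory_order_relaxed)};
          |         // On failure, n->next is refreshed with the current head
          |         // and we simply retry.
          |         while (!head.compare_exchange_weak(n->next, n,
          |                                            std::memory_order_release,
          |                                            std::memory_order_relaxed)) {
          |         }
          |     }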
        
       | 12323 wrote:
       | [flagged]
        
       | bluGill wrote:
       | From the FAQ:
       | 
       | > The biggest reason you would want to use a lockfree data
       | structure in such a scenario would be performance. Locking has a
        | non-negligible runtime cost on hosted systems as every lock
       | requires a syscall.
       | 
       | This is misleading. While a lock does have a runtime cost, in
        | some cases that cost is less than all the forced CPU cache
        | synchronization that lock-free needs to do. With a lock you
       | only have to sync once, after all the operations are done. You
       | need to carefully measure this to see which is more performant
       | for your application.
        
         | RcouF1uZ4gsC wrote:
         | For me the biggest benefit of lock free programming is avoiding
         | deadlocks.
         | 
         | Locks are not composable. Unless you are aware of what locks
         | every function in your call tree is using, you can easily end
         | up with a deadlock.
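          | 
          | The classic non-composability hazard, for illustration: two
          | functions that each look fine in isolation can deadlock when
          | called from different threads, because they take the same two
          | locks in opposite order.
          | 
          |     #include <mutex>
          |     
          |     std::mutex a, b;
          |     
          |     void f() {                              // thread 1
          |         std::lock_guard<std::mutex> la(a);
          |         std::lock_guard<std::mutex> lb(b);  // waits for b
          |     }
          |     
          |     void g() {                              // thread 2
          |         std::lock_guard<std::mutex> lb(b);
          |         std::lock_guard<std::mutex> la(a);  // waits for a: deadlock
          |     }
          | 
          | (C++11's std::lock(a, b) avoids this when you know all the locks
          | up front, but not when they are hidden inside callees.)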
        
         | smat wrote:
         | I assume the author of the library works under a hard real-time
         | constraint. Under such circumstances (an example would be low
          | latency audio) you cannot tolerate the latency impact of a
         | sporadic syscall.
        
           | quietbritishjim wrote:
           | Perhaps, but that is almost the opposite of what they said:
            | in hard real time you can tolerate a longer average latency
            | in return for a shorter maximum latency. That matches
            | what I would expect from a lock-free data structure. But
           | that doesn't match the (dubious) claim that locks are usually
           | slower.
        
             | dnedic wrote:
             | You make good points, I will try to rephrase that part of
             | the README to be more accurate and explain benefits not
             | mentioned yet like determinism.
        
           | duped wrote:
            | You often _can_ tolerate the latency. The problem with locks is
           | they are potentially unbounded and you can miss a deadline.
           | 
           | It's not about performance so much as determinism.
        
             | bluGill wrote:
             | Lock free doesn't solve this. One thread will always make
             | progress, but you have no way to ensure it is your thread
             | so you can miss a deadline with lock free. When the data is
             | under a lot of contention across many cores this is an
             | issue (most of us don't have hundreds of cores so we don't
             | see this).
             | 
              | Generally lock-free is better for these situations, as odds
              | are that while you hold a lock at least some CPU cycles are
              | spent on something that isn't directly modifying the data
              | that needs the lock; with lock-free, another CPU can touch
              | the data during those cycles. (Note that when using a lock
              | there is a trade-off: often it is better to hold the lock
              | for longer than needed instead of dropping it, doing a
              | couple of operations, and then locking again.)
        
               | tialaramex wrote:
               | If you must make progress locally (not just globally) the
               | guarantee you need is wait freedom which is even stronger
               | than lock freedom.
               | 
               | (All wait free algorithms are lock free because
               | necessarily if every thread is guaranteed to eventually
               | make progress then overall progress is definitely made)
        
         | bcrl wrote:
         | This is why RCU, aka Read-Copy-Update, exists. RCU avoids the
         | expensive synchronization part by delaying the release of the
         | older version of the data structure until a point at which
         | _all_ CPUs are guaranteed to see the new version of the data.
          | The patents have now expired, so it's worth investigating for
         | people that need to write high performance multithreaded code
         | that hits certain shared data structures really hard.
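          | 
          | Not real RCU, but a C++11 sketch of the same read-copy-update
          | shape (illustrative only): readers grab the current snapshot,
          | the writer publishes a new copy, and the old version is
          | reclaimed only after the last reader drops it (here the
          | shared_ptr refcount stands in for RCU's grace period).
          | 
          |     #include <map>
          |     #include <memory>
          |     #include <string>
          |     
          |     using Table = std::map<std::string, int>;
          |     std::shared_ptr<Table> current = std::make_shared<Table>();
          |     
          |     int reader(const std::string& key) {
          |         std::shared_ptr<Table> snap = std::atomic_load(&current);
          |         auto it = snap->find(key);
          |         return it != snap->end() ? it->second : -1;
          |     }
          |     
          |     void writer(const std::string& key, int value) {
          |         // Copy the current version, update the copy, publish it.
          |         auto next = std::make_shared<Table>(*std::atomic_load(&current));
          |         (*next)[key] = value;
          |         std::atomic_store(&current, next);
          |     }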
        
         | dnedic wrote:
         | The linked talk from Herb Sutter says as much, but it is a good
          | suggestion to make that visible upfront, thanks.
         | 
         | Additionally, cacheline alignment of indexes is something
          | that's there to mitigate the false sharing phenomenon and
         | reduce some of the cache synchronization cost.
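          | 
          | A sketch of what that mitigation looks like (the constant name
          | is made up, not the library's actual code): keeping the producer
          | and consumer indexes on separate cache lines, so a write by one
          | side does not invalidate the other side's line.
          | 
          |     #include <atomic>
          |     #include <cstddef>
          |     
          |     constexpr std::size_t kCacheLine = 64;
          |     
          |     struct Indexes {
          |         alignas(kCacheLine) std::atomic<std::size_t> write_idx{0};  // producer
          |         alignas(kCacheLine) std::atomic<std::size_t> read_idx{0};   // consumer
          |     };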
        
         | ashvardanian wrote:
         | Both lock-free and mutex-based approaches have their
         | applications. The general rule of thumb, for 2-4 threads in the
         | same NUMA node lock-free is faster. Need more cores? Use a
         | proper heavy mutex.
        
         | planede wrote:
         | Moreover, is "every lock requires a syscall" accurate? Probably
         | depends on target platform and standard library, but my
         | impression was that at low contention it doesn't really require
            | syscalls, at least on Linux with glibc's pthreads.
        
           | m4nu3l wrote:
           | That's my understanding too. In Linux mutexes are implemented
           | by futexes (Fast User-Space mutexes). If there is no
            | contention they are guaranteed not to perform a syscall:
           | https://en.wikipedia.org/wiki/Futex
        
             | cout wrote:
             | Maybe this has changed, but last time I looked at futexes
             | there was no syscall for locking (assuming no contention),
             | but unlocking always made a syscall. This was many years
             | ago so it could be different now.
        
               | m4nu3l wrote:
               | The code isn't the easiest to read but in glibc it seems
               | that the syscall is only performed if waiters are
                | detected in userspace during an unlock operation.
               | 
               | https://github.com/lattera/glibc/blob/master/nptl/pthread
               | _mu...
        
               | gpderetta wrote:
                | Indeed, you only need to FUTEX_WAKE if you know there are
                | waiters (or if you lost track of the number of waiters).
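                | 
                | A minimal sketch in the spirit of Drepper's "Futexes Are
                | Tricky" (simplified, not glibc's actual code): state 0 =
                | unlocked, 1 = locked, 2 = locked with possible waiters.
                | Uncontended lock/unlock never enter the kernel.
                | 
                |     #include <atomic>
                |     #include <linux/futex.h>
                |     #include <sys/syscall.h>
                |     #include <unistd.h>
                |     
                |     static std::atomic<int> state{0};
                |     
                |     static void futex(int op, int val) {
                |         syscall(SYS_futex, &state, op, val, nullptr, nullptr, 0);
                |     }
                |     
                |     void lock() {
                |         int c = 0;
                |         // Fast path: uncontended, no syscall at all.
                |         if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
                |             return;
                |         // Slow path: advertise waiters, then sleep.
                |         if (c != 2)
                |             c = state.exchange(2, std::memory_order_acquire);
                |         while (c != 0) {
                |             futex(FUTEX_WAIT, 2);
                |             c = state.exchange(2, std::memory_order_acquire);
                |         }
                |     }
                |     
                |     void unlock() {
                |         // Only FUTEX_WAKE if someone advertised waiting.
                |         if (state.exchange(0, std::memory_order_release) == 2)
                |             futex(FUTEX_WAKE, 1);
                |     }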
        
           | kevincox wrote:
            | Definitely not. I don't think there are many standard libraries
           | that use syscalls for every lock. It is near universal to
           | attempt a lock in user space (maybe even spin a few times)
           | and only call the kernel if you need to wait. So low-
           | contention locks should very rarely make a syscall.
        
             | adzm wrote:
             | Critical sections in Windows certainly behave this way with
             | a configurable spin count.
        
           | maldev wrote:
           | You can literally just do an atomic swap on some memory
           | location. Three lines of assembly, like.
           | mov rax, 1  ; load the value to exchange into rax
           | acquire_lock:         ; attempt to acquire the lock
           | xchg byte [ADDR_LOCK], al  ; atomically swap the lock value
           | with rax         test al, al  ; test if the original lock
           | value was 0 (unlocked)         jnz acquire_lock  ; if it was
           | not, loop until we can acquire the lock
           | 
           | The downside is you want a backoff to sleep the thread so it
           | doesn't go into a loop. But the actual lock code is simple.
           | You can easily have this be your function "AcquireLock()" and
           | then do
           | 
           | while(!AcquireLock()) { //pause thread execution. }
           | 
           | And I think this is where they get the syscall being needed,
           | since this will normally require a syscall to pause the
           | thread from the scheduler.
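            | 
            | The same idea in portable C++11 (a test-and-set spinlock with
            | a yield as the crude backoff; a real mutex would eventually
            | park the thread on a futex/condvar instead of burning CPU):
            | 
            |     #include <atomic>
            |     #include <thread>
            |     
            |     std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
            |     
            |     void acquire() {
            |         while (lock_flag.test_and_set(std::memory_order_acquire))
            |             std::this_thread::yield();  // where the syscall sneaks in
            |     }
            |     
            |     void release() {
            |         lock_flag.clear(std::memory_order_release);
            |     }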
        
         | dragontamer wrote:
         | Pikus has a number of C++ Parallelism performance talks that
         | discusses this issue.
         | 
          | In particular, the "cost of synchronization" is roughly the
          | price of an L3 memory read/write, or ~50 cycles. In contrast, a
          | read/write to L1 cache has a latency of about 4 cycles and can
          | be done multiple times per clock tick, and you can go faster
          | still (i.e. register space).
          | 
          | So an unsynchronized "counter++" can be done ~4 billion times
          | a second.
          | 
          | But a "counter++; sync()" will be roughly 50x slower. That
          | is to say, the "sync()" is the "expensive part" to think about.
         | 
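          | Roughly, the shape of that comparison (illustrative only, not a
          | real benchmark):
          | 
          |     #include <atomic>
          |     
          |     long local = 0;
          |     std::atomic<long> shared{0};
          |     
          |     void hot_loop() {
          |         local++;              // register/L1 speed
          |         shared.fetch_add(1);  // the "sync()": an RMW whose result
          |                               // must be made globally visible,
          |                               // roughly an order of magnitude (or
          |                               // more) slower under contention
          |     }
          | 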
         | ------------
         | 
         | A lock is two sync() statements. One sync() when you lock, and
         | a 2nd sync() when you unlock. Really, half-sync() since one is
         | an acquire half-sync and the other is a release half-sync.
         | 
         | If your lock-free data structure uses more than two sync()
         | statements, you're _probably_ slower than a dumb lock-based
         | method (!!!). This is because the vast majority of
         | lock()/unlocks() are uncontested in practice.
         | 
         | So the TL;DR is, use locks because they're so much easier to
         | think about. When you need more speed, measure first, because
         | it turns out that locks are really damn fast in practice and
         | kind of hard to beat.
         | 
          | That being said, there are other reasons to go lock-free. On
          | some occasions, you will be forced to use lock-free code because
          | blocking cannot be tolerated. (Ex: lock-free interrupts. It's
          | fine if the interrupt takes a bit longer than expected during
          | contention.) So in this case, even if lock-free is slower, the
          | guarantee of forward progress is what you're going for, rather
          | than performance.
          | 
          | So the study of lock-free data structures is still really
          | useful. But don't always reach for them for performance reasons,
          | because these data structures may very well be slower than
          | std-library + lock() in many cases.
        
           | gpderetta wrote:
           | A sync, assuming it is your typical memory barrier, is not
           | bound by the L3 latency. You pay (in first approximation) the
           | L3 cost when you touch a contended cache line, whether you
           | are doing a plain write or a full atomic CAS.
           | 
            | Separately, fences and atomic RMWs are slower than plain
            | reads/writes, but that's because of the (partially)
            | serialising effects they have on the CPU pipeline, and very
            | little to do with L3 (or any memory) latency.
           | 
           | Case in point: A CAS on intel is 20ish cycles, the L3 latency
           | is 30-40 cycles or more. On the other hand you can have
           | multiple L3 misses outstanding, but CAS hardly pipelines.
        
             | dragontamer wrote:
              | Perhaps my brain is conflating multiple things.
             | 
             | So here's what I know. L1 / L2 caches are "inside the
             | core". To talk to other cores, your data must leave L1/L2
             | cache and talk to a network. It is on _THIS_ network that
             | the L3 cache exists.
             | 
              | It's not really "L3 cache", it's just the memory-network that
             | implements the MOESI protocol (or whatever proprietary
             | variant: MOESIF or whatever) that sits between L2 cache, L3
             | cache, and core-to-core communications. So I don't want to
             | make it sound like "You're waiting on L3 cache", because
              | you're right. It's not really related to L3 cache.
             | 
             | So I don't really know how that network works (I know the
             | high-level details of MOESI, but I also know that's the
             | textbook explanation and the real world is way more
             | complex), aside from "It'd be roughly on the same order-of-
             | magnitude cost" as the L3 cache.
             | 
             | -------
             | 
             | What's really going on, is that Core#1 is talking to all
             | the other cores and basically performing "git pull" and
             | "git push" to the data asynchronously. Core#1 prefers to
             | perform "git checkin" and "git checkout" (aka: talk to
             | L1/L2 cache), because that's faster.
             | 
             | But when Core#1 does a full sync (git push/git pull), it is
             | required to talk on the network that leads to other cores
             | and/or L3 cache.
             | 
             | Those "git push / git pull" messages are MOESI (F and
             | others), meaning Modified/Owned/Exclusive/Shared/Invalid
             | (and Forwarding, and other proprietary states). Which line
              | up with version control way better than most people expect.
        
               | gpderetta wrote:
               | Indeed L3 being shared and also often working as the
               | MOESI directory works out to the interthread latency
               | being the same order of magnitude as the L3 latency.
               | 
               | My point is that sync has nothing to do with caches.
               | Caches are coherent all the time and do not need
                | barriers. In particular I don't think the git pull/push
                | analogy maps well to MOESI, as it is an optimistic protocol
                | that only transfers opportunistically, on demand, what
                | is actually needed by a remote core, not conservatively
                | everything that has changed like in git.
               | 
               | The explicit sync model is more representative of non
               | coherent caches, which are not really common as they are
               | hard to use.
               | 
                | Memory barriers are, for typical CPUs, purely an
               | internal matter of the core and synchronize internal
               | queues with L1. In a simple x86 model, where the only
               | visible source of reordering is StoreLoad, a memory
               | barrier simply stalls the pipeline until all preceding
               | stores in program order are flushed out of the write
               | buffer into L1.
               | 
                | In practice, things these days are more complex and
                | a fence doesn't fully stall the pipeline; it potentially
                | only needs to visibly prevent loads from completing.
               | 
               | Other more relaxed CPUs also need to synchronise load
               | queues, but still for the most part fences are a core
               | local matter.
               | 
               | Some architectures have indeed remote fences (even x86 is
               | apparently getting some in the near future) but these are
               | more exotic and, AFAIK, do not usually map to default
               | c++11 atomics.
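                | 
                | The classic StoreLoad case (Dekker-style, sketched for
                | illustration): without a full barrier, each store can
                | still be sitting in its core's write buffer when the
                | other thread loads, so both threads can read 0.
                | 
                |     #include <atomic>
                |     
                |     std::atomic<int> x{0}, y{0};
                |     int r1, r2;
                |     
                |     void thread1() {
                |         x.store(1, std::memory_order_relaxed);
                |         std::atomic_thread_fence(std::memory_order_seq_cst);
                |         r1 = y.load(std::memory_order_relaxed);
                |     }
                |     
                |     void thread2() {
                |         y.store(1, std::memory_order_relaxed);
                |         std::atomic_thread_fence(std::memory_order_seq_cst);
                |         r2 = x.load(std::memory_order_relaxed);
                |     }
                |     
                |     // With the fences, r1 == 0 && r2 == 0 is impossible.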
        
               | dragontamer wrote:
               | > The explicit sync model is more representative of non
               | coherent caches, which are not really common as they are
               | hard to use.
               | 
               | Not that I'm a professional GPU programmer. But I'm
               | pretty certain that GPU caches are non-coherent.
               | 
               | But yeah, cache-coherence is just assumed on modern CPUs.
               | Your clarification on store-queues and load-queues is
               | helpful (even if the caches are coherent, the store-queue
               | and load-queue can still introduce an invalid reordering.
               | So it sounds like your point is that the various sync()
               | instructions are more about these queues?)
        
         | ot wrote:
          | This is true in principle and it is good to call it out, but in
         | practice I've never seen a mutex-based data structure beat an
         | equivalent lock-free data structure, even at low contention,
         | unless the latter is extremely contrived.
         | 
         | A mutex transaction generally requires 2 fences, one on lock
         | and one on unlock. The one on unlock would not be strictly
         | necessary in principle (on x86 archs the implicit acquire-
         | release semantics would be enough) but you generally do a CAS
         | anyway to atomically check whether there are any waiters that
         | need a wake-up, which implies a fence.
         | 
         | Good lock-free data structures OTOH require just one CAS (or
         | other fenced RMW) on the shared state.
         | 
         | Besides, at large scale, no matter how small your critical
         | section is, it will be preempted every once in a while, and
         | when you care about tail latency that is visible. Lock-free
         | data structures have more predictable latency characteristics
         | (even better if wait-free).
        
           | kerkeslager wrote:
           | I appreciate your polite tone here. To expand on this at the
           | risk of sounding a bit rude: nobody should listen to anyone
           | who speaks about performance in terms of reasoning about a
           | system instead of profiling it.
           | 
            | Computers are _shockingly_ complex. I can't tell you how
            | many times I've reasoned about a system, run the profiler,
           | and discovered I was completely wrong.
           | 
           | When I was working on an interpreter for a Lisp, I
           | implemented my first cut of scopes (all the variables within
           | a scope and their values) as a naive unsorted list of
           | key/value pairs, thinking I'd optimize later. When I came
           | back to optimize, I reimplemented this as a hashmap, but when
           | I ran my test programs, to my horror, they were all 10x
           | slower. I plugged in a hashmap library used in lots of
           | production systems and got a significant 2x performance gain,
           | which was still slower than looping over an unsorted list of
           | key/value pairs. The fact is, most scopes have <10 variables,
           | and at that size, looping over a list is faster than the
           | constant time of a hashmap. I can reason about why this is,
            | but that's just fitting my reasons to the facts ex post
            | facto. _Reasoning_ didn't lead me to the correct answer,
           | _observation_ did.
           | 
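            | (For the curious, the kind of "scope" in question looks
            | roughly like this; a hypothetical sketch, not my actual
            | interpreter code. For a handful of entries, a linear scan over
            | a contiguous vector is a few cache-friendly compares with no
            | hashing, which is why it can beat a hash map.)
            | 
            |     #include <string>
            |     #include <utility>
            |     #include <vector>
            |     
            |     using Value = int;  // placeholder for the real value type
            |     
            |     struct Scope {
            |         std::vector<std::pair<std::string, Value>> vars;
            |     
            |         Value* lookup(const std::string& name) {
            |             for (auto& kv : vars)          // linear scan
            |                 if (kv.first == name)
            |                     return &kv.second;
            |             return nullptr;
            |         }
            |     };
            | 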
            | Returning to parallel data structures, the fact is, _I don't
           | know why lock-free structures are faster than mutex-based
           | structures, I just know that they are in every situation
           | where I've profiled them_.
           | 
           | Reasoning isn't completely useless--reasoning is how you
           | intuit what you should be profiling. But if you're just
           | reasoning about how two alternatives will perform and not
           | profiling them in real-life production systems you're wasting
           | everyone's time.
        
             | slaymaker1907 wrote:
             | I don't think that's quite correct because there are many
             | optimizations which are impossible (or nearly impossible)
             | to do after a system is implemented. Daniel Lemire wrote an
             | excellent post on this exact subject
             | https://lemire.me/blog/2023/04/27/hotspot-performance-
             | engine....
             | 
                | In terms of programming languages, I think Python is an
             | excellent example of a language which has many features
             | that have ended up making it extremely difficult to
             | optimize even compared to other dynamic languages like
             | LISPs. Even if you don't have to worry about backwards
             | compatibility, there are design decisions that can limit
             | performance which end up necessitating a rewrite of the
             | entire system to actually change.
        
               | kerkeslager wrote:
               | > I don't think that's quite correct because there are
               | many optimizations which are impossible (or nearly
               | impossible) to do after a system is implemented.
               | 
               | Who said you have to profile _after_ a system is
                | implemented? Certainly I didn't: if anything, I prefer
                | to profile during prototyping, although few companies
                | outside the largest ones budget for any real prototyping
                | these days, it seems. Usually I settle for timing things and
               | profiling as early as possible so that you can catch any
               | performance issues before any calcifying structure is
               | built around the non-performant code.
               | 
               | Yes, I did profile after the fact in the Lisp story, but
               | my point in that story was that my reasoning led me to
               | the wrong conclusions, not that I did everything
               | perfectly (on the contrary, it's a story about learning
               | from my mistakes!).
               | 
               | I agree that Daniel Lemire post is excellent, but nothing
               | in that post leads me to believe he'd disagree with
               | anything I've said.
        
           | OskarS wrote:
           | Assuming low or no contention, it is easy to imagine a
           | scenario where a mutex vastly outperforms it: if you need to
            | push 1000 things into the queue, it's still just two fences
            | for the mutex but it's now 1000 CASes.
           | 
           | Moreover: the point with mutexes is that your data structure
            | can be optimized assuming no thread-safety. There are
           | lots of, like, hyper-optimized hash table variants (with all
           | sorts of SIMD nonsense and stuff) that are just not possible
           | to do lock-free. The very "lock-freedomness" of the
            | data structure slows it down enough that in low contention
           | scenarios mutexes clearly would outperform them without being
           | particularly contrived.
        
             | ot wrote:
             | If you are going to do batch operations, your data
             | structure should be optimized to support them, so you're
             | back to one CAS. The same would apply to the locked
             | scenario, where you probably don't want to copy 1000
             | elements in the critical section.
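              | 
              | The generic shape of "batch, then publish once" for an SPSC
              | ring (hypothetical code, not this library's API):
              | 
              |     #include <atomic>
              |     #include <cstddef>
              |     
              |     template <typename T, std::size_t N>
              |     struct SpscRing {
              |         T buf[N];
              |         std::atomic<std::size_t> write_idx{0}, read_idx{0};
              |     
              |         // Producer: copy the whole batch, then make it visible
              |         // with ONE release store instead of one synchronizing
              |         // operation per element.
              |         bool push_batch(const T* items, std::size_t count) {
              |             std::size_t w = write_idx.load(std::memory_order_relaxed);
              |             std::size_t r = read_idx.load(std::memory_order_acquire);
              |             if (N - (w - r) < count)
              |                 return false;            // not enough free space
              |             for (std::size_t i = 0; i < count; i++)
              |                 buf[(w + i) % N] = items[i];
              |             write_idx.store(w + count, std::memory_order_release);
              |             return true;
              |         }
              |     };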
             | 
             | About the sufficiently smart optimizations, sure,
             | everything is easy to imagine, but in my experience this
              | has never happened, and I'd be curious to hear practical
             | examples if you have any.
        
               | sakras wrote:
               | Here's one I had: I was trying to build a Bloom filter in
               | parallel. Each thread had large-ish batches of hashes it
               | wanted to insert into the filter. Naively, you'd just
               | have each thread iterate through the batches and do
               | __sync_fetch_and_or for each of the hashes (this was a
               | register-blocked Bloom filter so we only needed to
                | perform one 8-byte OR operation per hash).
               | 
               | What ended up being MUCH faster was to partition the
               | filter, and to have a lock per partition. Each thread
               | would attempt to grab a random lock, perform its inserts
               | into that partition, then release the lock and try to
               | grab another random lock that it hasn't grabbed yet.
               | Granted, these locks were just atomic booleans, not
               | std::mutex or anything like that. But I think this
               | illustrates that partitioning+locking can be better for
               | throughput. If you want predictable latency for single
               | inserts, then I'd imagine the __sync_fetch_and_or
               | strategy would work better. Which maybe brings up a
               | broader point that this whole discussion relies a lot on
               | exactly what "faster" means to you.
        
             | dnedic wrote:
             | This is why you would use the Ring Buffer or Bipartite
             | Buffer to place 1000 elements at a time, not the queue.
             | Check the documentation for more info.
        
             | T0pH4t wrote:
             | ^ This can definitely be the case in a multi-
             | producer/multi-consumer (MPMC) scenario if CAS is involved
             | with loops. Great care has to be taken when writing MPMC
              | data structures without locks that are more performant than
             | lock equivalents; they are far more complex. I think it
             | should be called out that it seems most (if not all) of the
             | data structures provided are single producer/consumer which
             | generally always have much simpler designs (and limitations
              | of use) than MPMC.
        
               | T0pH4t wrote:
                | I should also say this applies to MPSC and SPMC.
               | Basically anything other than SPSC.
        
             | [deleted]
        
           | josephg wrote:
           | Are there any good benchmarks which demonstrate the
           | performance characteristics you're talking about? Or case
           | studies where an application moved from mutexes to lock free
           | data structures, and compared the resulting performance?
        
             | xfs wrote:
             | https://youtu.be/_qaKkHuHYE0 (CppCon) A senior software
             | engineer at Google tried to optimize tcmalloc by replacing
              | a mutex with a lockless MPMC queue. After many bugs and
              | tears, the result was not statistically significant in
             | production systems.
        
             | bluGill wrote:
             | This is impossible to do in a useful way for publication.
              | You can do case studies, but changes in various
             | factors that seem minor can make a massive difference in
             | benchmarks. As such you need to find real world data, used
             | in a real world scenario, for your application: then
             | benchmark it. Even then you have a benchmark useful for
             | your application only, and not worth publishing.
        
               | kerkeslager wrote:
               | I disagree. I've gotten a lot from reading publications
               | about people profiling their applications and posting
               | about the results with descriptions of their application
               | design and load. Yes, obviously you can't read that and
               | draw conclusions about how the same data structure or
               | algorithm will perform in your application, but it helps
               | you build an intuition for what is likely to work in
               | applications with different characteristics.
               | 
               | Here's an example:
               | https://queue.acm.org/detail.cfm?id=1814327
               | 
               | If you read that for the first time and don't learn
               | something, I'd be quite surprised.
        
       | andersced wrote:
       | Should be benchmarked against ->
       | 
       | https://github.com/Deaod/spsc_queue
       | 
       | If proven faster OK.. If not.. Well.. back to the drawing board.
       | 
       | I gave it a try -> https://github.com/andersc/fastqueue
       | 
       | Deaod is the kingpin.
        
         | jcelerier wrote:
         | https://max0x7ba.github.io/atomic_queue/html/benchmarks.html
         | for an existing set of benchmarks where this could be added
        
       | whiteboardma wrote:
       | I've only looked at the queue implementation, but both push and
       | pop contain obvious race conditions; I would highly suggest
       | adding tests that actually use the data structures from multiple
       | threads.
        
         | dnedic wrote:
         | Could you elaborate on the alleged race conditions? Any advice
         | on reliably testing the race conditions? The problem with
         | adding those is the fact that they will give lots of false
         | negatives and if you rely on them you have a problem.
        
           | whiteboardma wrote:
           | Looking at the Push operation defined in queue_impl.hpp, if
           | multiple threads perform concurrent pushes, they might end up
           | writing their element to the same slot in _data since the
           | current position _w is not incremented atomically
        
             | ape4 wrote:
             | Just add a lock ;)
        
             | dnedic wrote:
              | This is a multi-producer scenario; the README clearly
              | states that these data structures are only single producer
              | single consumer multi thread/interrupt safe.
             | 
              | I will also add disclaimers to the javadoc comments on
             | methods just to reduce confusion.
        
               | whiteboardma wrote:
                | Whoops, my bad
        
           | gjadi wrote:
           | You could use TLA+ to model the data structure operations and
           | check the invariant. Checking the invariant with assert is
           | also useful in my limited experience with concurrency.
           | 
           | https://lamport.azurewebsites.net/tla/tla.html
        
             | dnedic wrote:
             | Thanks a lot, will check this out!
        
       | kilotaras wrote:
       | - A lot of code won't work for types with no default
        | constructors, but that is at least a compile error
       | 
       | - Using memcpy[0] for arbitrary types is just wrong, see [1]
       | 
       | [0]
       | https://github.com/DNedic/lockfree/blob/main/lockfree/inc/bi...
       | 
       | [1] https://www.open-
       | std.org/jtc1/sc22/wg21/docs/papers/2023/p11...
        
         | dnedic wrote:
          | This is already noted in the Queue README: only the Queue
          | constructs the type; the other 2 data structures are meant for
          | PODs only.
         | 
         | I will take a look at adding support for constructing in-place
         | for the other 2 data structures, but at the moment, they are
         | just for PODs.
        
       | Notbrainiac wrote:
        | Every data structure is lock-free. Locks are required when you
        | have multiple writers. The article states these are useful only
        | for certain circumstances: single consumer, single producer
        | scenarios. So yeah, within these assumptions you can make
        | something work.
        
         | jimktrains2 wrote:
         | You may need a lock if you have operations that are not atomic,
         | even with a single writer, as a reader could find an
         | inconsistency.
        
         | ot wrote:
         | This seems unnecessarily pedantic. Lock-free conventionally
         | implies concurrent, otherwise it's meaningless.
        
         | Koshkin wrote:
         | Note that "lock-free" is a technical term that has a very
         | specific, precise meaning (implying that the _user_ of the
         | structure does not need to use locking explicitly).
        
         | dnedic wrote:
         | Even in single producer single consumer scenarios you need
         | locks for multithreaded/interrupt use if you're not properly
         | using atomics and proper fences.
        
         | elbigbad wrote:
         | [flagged]
        
       | cjensen wrote:
       | An issue in C++ is that it only supports atomic changes to the
       | builtin types. For example, you can only CAS a 64-bit value if
       | your largest integer/pointer type is 64-bits.
       | 
       | Good lock free algorithms use double-width instructions like
       | cmpxchg16b which compare 64-bits but swap 128-bits. You can then
       | use the compared 64-bits as a kind of version number to prevent
       | the a-b-a problem.
       | 
       | Using only the built-in atomics is working with a hand tied
       | behind your back. With the wider version, it's trivial to write
       | multi-producer multi-consumer stacks with no limits to the number
       | of objects stored. It's also pretty easy (if you copy the
       | published algorithm) to do the same with queues.
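        | 
        | The usual shape of that trick, sketched (hypothetical code; a
        | 16-byte std::atomic is only lock-free on platforms/compilers that
        | support it, e.g. with -mcx16 on x86-64):
        | 
        |     #include <atomic>
        |     #include <cstdint>
        |     
        |     struct Node { int value; Node* next; };
        |     
        |     struct alignas(16) TaggedPtr {
        |         Node*         ptr;
        |         std::uint64_t tag;   // version counter defeats ABA
        |     };
        |     
        |     std::atomic<TaggedPtr> head{TaggedPtr{nullptr, 0}};
        |     
        |     Node* pop() {
        |         TaggedPtr old_head = head.load();
        |         while (old_head.ptr != nullptr) {
        |             TaggedPtr new_head{old_head.ptr->next, old_head.tag + 1};
        |             // Fails (and reloads old_head) if either field changed.
        |             if (head.compare_exchange_weak(old_head, new_head))
        |                 return old_head.ptr;
        |         }
        |         return nullptr;
        |     }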
        
         | dnedic wrote:
         | True, that would help immensely in creating MPMC data
         | structures, but as these are SPSC there is no problem. Also to
         | clarify, this is only for the indexes, the data members can be
         | anything.
         | 
         | Using these intrinsics or inline assembly would break
         | portability or create situations where platforms have different
         | feature levels, which is not something I intend to do. I want
         | the library to be compatible with everything from a tiny MCU to
         | x86.
        
         | gpderetta wrote:
          | Actually C++ only requires TriviallyCopyable for std::atomic.
          | The issue with 2CAS is that intel until very recently only
          | provided cmpxchg16b[1] but no 128-bit atomic loads and stores: SSE
         | 128 bit memory operations were not guaranteed to be atomic (and
         | in fact were observed not to be on some AMDs).
         | 
         | So a 128 bit std::atomic on intel was not only suboptimal as
         | the compiler had to use 2cas for load and stores as well, but
         | actually wrong as an atomic load from readonly memory would
         | fault. So at some point the ABI was changed to use the spinlock
         | pool. Not sure if it has changed since.
         | 
         | If you do it "by hand", when you only need a 2cas, a 128 bit
         | load that is not atomic is fine as any tearing will be detected
         | by the CAS and 'fixed', but it is hard for the compiler to
         | optimize generic code.
         | 
         | [1] which actually does full 128bit compare and swap, you are
         | probably confusing it with the Itanium variant.
        
         | aseipp wrote:
         | The lack of DWCAS as a primitive in general is really annoying,
         | C++ aside. RISC-V's -A extension has no form of it, either; you
         | only get XLEN-sized AMO + LL/SC (no 2*XLEN LL/SC either!)
         | 
         | It's one of those features that when you want it, you really
         | really _really_ want it, and the substitutions are all pretty
         | bad in comparison.
        
         | CyberDildonics wrote:
         | > Good lock free algorithms use double-width instructions like
         | cmpxchg16b which compare 64-bits but swap 128-bits
         | 
         | The instructions should compare 128 bits and swap 128 bits.
         | 
         | I don't know why 'good' algorithms would use these if they
         | don't need to, because 128 bit operations are slower.
         | 
         | Not only that, 128 bit compare and swap doesn't work if it is
         | not 128 bit aligned while 64 bit compare and swap will work
         | even if they aren't 64 bit aligned.
        
           | gpderetta wrote:
           | On x86, any CAS on a misaligned address that crosses a cache
           | line boundary can fault in the best case (if the mis-feature
           | is disabled by the os) or cost thousands of clock cycles on
           | _all_ cores. So it  "works" only for small values of "works".
        
       ___________________________________________________________________
       (page generated 2023-05-10 23:01 UTC)