[HN Gopher] The concurrency trap: How an atomic counter stalled ...
       ___________________________________________________________________
        
       The concurrency trap: How an atomic counter stalled a pipeline
        
       Author : delifue
       Score  : 80 points
       Date   : 2025-06-07 11:25 UTC (6 days ago)
        
 (HTM) web link (www.conviva.com)
 (TXT) w3m dump (www.conviva.com)
        
       | hinkley wrote:
       | > However, further analysis in the perf environment revealed that
       | while there was a spike in the number of active sessions, those
       | gradually dropped off while the DAG processing time continued to
       | stay high.
       | 
       | Thundering herds will do that. Queuing theory. Once the queue
       | builds up it takes far longer to clear than intuition suggests.
        
       | hinkley wrote:
       | > The big difference between a concurrent hash map and ArcSwap is
       | that ArcSwap requires swapping out the entirety of the underlying
       | data with every write but trades this off with very cheap reads.
       | Writers don't even have to wait for all readers to finish since a
       | new epoch is created with the new version of data.
       | 
       | This is where tree data structures win out. Git and subversion
       | and pure functional languages don't replace the entire data
       | structure. They make a new element and attach it to the tree by
       | creating a new version of the intended parent, and a new version
       | of its parent, on up to the root element.
       | 
       | So somewhere out there is space for a hybrid of ArcSwap and
       | DashMap that uses sharded copy on write. These guys probably
       | don't need it given their low write rate, but shards should help
       | with write contention for other use cases.
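        | 
        | A minimal sketch of that path-copying idea in Rust (hypothetical
        | types, not the article's code): only the nodes on the path from
        | the changed leaf up to the root are re-allocated; everything
        | else is shared via Arc.
        | 
        |   use std::sync::Arc;
        | 
        |   // Hypothetical persistent map node keyed by u64.
        |   enum Node {
        |       Leaf,
        |       Branch {
        |           key: u64,
        |           val: u64,
        |           left: Arc<Node>,
        |           right: Arc<Node>,
        |       },
        |   }
        | 
        |   // Returns a new root; only nodes along the search path are
        |   // copied, untouched subtrees are shared via Arc::clone.
        |   fn insert(node: &Arc<Node>, key: u64, val: u64) -> Arc<Node> {
        |       match &**node {
        |           Node::Leaf => Arc::new(Node::Branch {
        |               key,
        |               val,
        |               left: Arc::new(Node::Leaf),
        |               right: Arc::new(Node::Leaf),
        |           }),
        |           Node::Branch { key: k, val: v, left, right } => {
        |               if key < *k {
        |                   Arc::new(Node::Branch {
        |                       key: *k,
        |                       val: *v,
        |                       left: insert(left, key, val), // new path
        |                       right: Arc::clone(right),     // shared
        |                   })
        |               } else if key > *k {
        |                   Arc::new(Node::Branch {
        |                       key: *k,
        |                       val: *v,
        |                       left: Arc::clone(left),
        |                       right: insert(right, key, val),
        |                   })
        |               } else {
        |                   Arc::new(Node::Branch {
        |                       key: *k,
        |                       val,                          // updated
        |                       left: Arc::clone(left),
        |                       right: Arc::clone(right),
        |                   })
        |               }
        |           }
        |       }
        |   }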
        
         | masklinn wrote:
         | You still need something close to ArcSwap with your persistent
         | data structure, because once you've built your new tree
         | (possibly using transients) you need to overwrite the old root
         | with the new, atomically (and probably in a CAS loop really).
         | 
          | Left-right is an option which IIRC avoids the CAS loop, as
          | it's a single-writer primitive.
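          | 
          | For illustration only, a rough sketch in Rust of that atomic
          | root swap done as a CAS loop over a raw pointer (hypothetical
          | names; a real implementation would use ArcSwap or similar,
          | since the hard part is safely reclaiming the old root, which
          | this sketch deliberately leaks):
          | 
          |   use std::collections::HashMap;
          |   use std::sync::atomic::{AtomicPtr, Ordering};
          | 
          |   // Hypothetical global root of the shared map.
          |   static ROOT: AtomicPtr<HashMap<String, u64>> =
          |       AtomicPtr::new(std::ptr::null_mut());
          | 
          |   fn publish_update(key: &str, val: u64) {
          |       let mut cur = ROOT.load(Ordering::Acquire);
          |       loop {
          |           // Build the new version from the current one
          |           // (read-copy-update).
          |           let mut new_map = if cur.is_null() {
          |               HashMap::new()
          |           } else {
          |               unsafe { (*cur).clone() }
          |           };
          |           new_map.insert(key.to_string(), val);
          |           let new = Box::into_raw(Box::new(new_map));
          | 
          |           // Try to swing the root; if another writer won the
          |           // race, retry against the newer root.
          |           match ROOT.compare_exchange(
          |               cur, new, Ordering::AcqRel, Ordering::Acquire)
          |           {
          |               // The old root is leaked here; reclaiming it
          |               // safely is exactly what RCU / hazard pointers
          |               // / ArcSwap exist for.
          |               Ok(_old) => return,
          |               Err(actual) => {
          |                   unsafe { drop(Box::from_raw(new)) };
          |                   cur = actual;
          |               }
          |           }
          |       }
          |   }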
        
         | cmrdporcupine wrote:
         | im::HashMap is a great tool
         | 
          | I would love, though, to have a similar persistent structure
          | where the node pointers at each level were thread-sharded
          | instead of using e.g. atomics. A mutation at a branch could
          | record that the change is "just" for the thread ID X that made
          | it. At some later point a "commit" could reconcile the storage
          | back to global, maybe.
        
         | cryptonector wrote:
         | You could easily organize the hashmap hierarchically, yes.
        
         | deepsun wrote:
          | So the root element is a single point of contention for all
          | writes, right?
        
           | hinkley wrote:
            | With filesystem-based trees that's often sufficient to reduce
            | the amount of write contention, because you're not serializing
            | the whole goddamned thing back to disk, just the parts that
            | changed, which are logarithmic in the size of the edit
            | history. And git uses GC to clean up writes that get to the
            | end and then fail due to concurrent access collisions.
            | 
            | For in-memory problems there are B-trees, which are an older
            | solution than most of HN's userbase, including myself. Then
            | you still have the sharding, which I did in fact point out
            | would be part of a better hybrid solution.
        
       | nick_ wrote:
        | It's unclear whether the author understands false sharing. The
        | _excessive_ "ping-pong"ing of cache lines between cores means the
        | cache line the counter sits on is shared with other memory that's
        | also being accessed under contention.
        | 
        | Making sure the counter variable sits alone on its own cache line
        | would completely remove the _excessive_ ping-pong behaviour. The
        | cache line with the counter would, of course, still ping-pong as
        | it's mutated, but not excessively.
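        | 
        | For illustration, one common way to give a hot counter its own
        | cache line in Rust (crossbeam's CachePadded does the same thing;
        | this hand-rolled sketch assumes 64-byte lines and a hypothetical
        | ACTIVE_SESSIONS counter):
        | 
        |   use std::sync::atomic::{AtomicU64, Ordering};
        | 
        |   // Force the counter onto its own 64-byte cache line so
        |   // unrelated hot data can't share the line with it (i.e.
        |   // remove the false-sharing component).
        |   #[repr(align(64))]
        |   struct PaddedCounter {
        |       value: AtomicU64,
        |   }
        | 
        |   static ACTIVE_SESSIONS: PaddedCounter = PaddedCounter {
        |       value: AtomicU64::new(0),
        |   };
        | 
        |   fn on_session_start() {
        |       // The line still bounces between cores on every
        |       // increment (true sharing); the alignment only removes
        |       // the *false* sharing.
        |       ACTIVE_SESSIONS.value.fetch_add(1, Ordering::Relaxed);
        |   }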
        
         | jeffbee wrote:
          | Long blog, corporate formatting covered in popups and hero
          | images, screenshots of the incredibly ugly ketchup-and-mustard
          | flame graph, discussion with some errors or vagueness about a
          | completely vanilla performance detail. These are the universal
          | ingredients of the midsize SaaS company semitechnical marketing
          | game.
        
           | finnh wrote:
           | It's also unclear how the first graph, whose y-axis is
           | labelled "req/s", is showing a latency spike. Latency is not
           | visible on that graph, AFAICT, only throughput.
        
         | gpderetta wrote:
          | I don't think false sharing matters here. Cacheline bouncing of
          | the counter, which is written even by readers, would be enough
          | to explain the bottleneck. The barrier, whether implied or
          | explicit, in the atomic RMW used to update the counter will
          | also stall the pipeline, preventing any other memory operation
          | from proceeding concurrently.
          | 
          | All threads repeatedly CASing on the same memory location is
          | enough to bring any CPU to its knees.
          | 
          | The critical part of the fix is that readers no longer have to
          | modify any shared memory location; they rely on deferred
          | reclamation instead.
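          | 
          | A toy benchmark along these lines (a sketch; absolute numbers
          | vary wildly by machine) makes the contrast easy to see:
          | 
          |   use std::sync::atomic::{AtomicU64, Ordering};
          |   use std::sync::Arc;
          |   use std::thread;
          |   use std::time::Instant;
          | 
          |   fn main() {
          |       const THREADS: usize = 8;
          |       const ITERS: u64 = 5_000_000;
          | 
          |       // Every "reader" does an atomic RMW on one location
          |       // (like a shared reader count): the line ping-pongs.
          |       let counter = Arc::new(AtomicU64::new(0));
          |       let t0 = Instant::now();
          |       let hs: Vec<_> = (0..THREADS).map(|_| {
          |           let c = Arc::clone(&counter);
          |           thread::spawn(move || {
          |               for _ in 0..ITERS {
          |                   c.fetch_add(1, Ordering::AcqRel);
          |               }
          |           })
          |       }).collect();
          |       for h in hs { h.join().unwrap(); }
          |       println!("contended RMW:   {:?}", t0.elapsed());
          | 
          |       // Readers that only load (no shared writes) scale
          |       // almost linearly: every core keeps the line in the
          |       // Shared state.
          |       let value = Arc::new(AtomicU64::new(42));
          |       let t1 = Instant::now();
          |       let hs: Vec<_> = (0..THREADS).map(|_| {
          |           let v = Arc::clone(&value);
          |           thread::spawn(move || {
          |               let mut sum = 0u64;
          |               for _ in 0..ITERS {
          |                   sum = sum.wrapping_add(
          |                       v.load(Ordering::Acquire));
          |               }
          |               std::hint::black_box(sum);
          |           })
          |       }).collect();
          |       for h in hs { h.join().unwrap(); }
          |       println!("read-only loads: {:?}", t1.elapsed());
          |   }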
        
         | senderista wrote:
         | Why not? If enough readers are mutating the same ref counter,
         | with very short critsecs, then this could definitely be a
         | problem, regardless of false sharing.
        
           | forrestthewoods wrote:
           | > If enough readers are mutating the same ref counter
           | 
           | Well if they were mutating the ref counter I'd call them
           | writers not readers. =P
           | 
           | But yes. Many people incorrectly assume that writing an
           | atomic_int from many threads is super fast. I mean it's
           | atomic! It couldn't be a smaller atom of work! Alas this is
           | incredibly incorrect. Even atomic ints can be extremely slow.
        
             | gpderetta wrote:
             | They are logically readers, the mutation is an
             | implementation detail.
        
               | senderista wrote:
               | Which relates to a fundamental principle of concurrent
               | programming: "(logical) readers shouldn't (physically)
               | write*".
               | 
               | *to contended shared memory
        
               | gpderetta wrote:
               | Well yes, hence the fix.
        
         | duskwuff wrote:
         | > The cache line with the counter would, of course, still ping-
         | pong as it's mutated, but not excessively.
         | 
          | And the way you solve this is by using a separate counter for
          | each core, spaced at least a cache line apart so each gets its
          | own cache line. The downside is that you lose the ability to
          | read the total atomically - but if the counter is incrementing
          | fast enough that you need per-CPU counters, a point-in-time
          | value is basically meaningless anyway.
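          | 
          | A minimal sketch of that idea in Rust (per-thread rather than
          | strictly per-CPU, since user space can't easily pin shards to
          | cores; names are made up for illustration):
          | 
          |   use std::collections::hash_map::DefaultHasher;
          |   use std::hash::{Hash, Hasher};
          |   use std::sync::atomic::{AtomicU64, Ordering};
          | 
          |   const SHARDS: usize = 16; // ideally >= core count
          | 
          |   // Each shard is padded to its own 64-byte cache line so
          |   // the shards never share a line with each other.
          |   #[repr(align(64))]
          |   struct Shard(AtomicU64);
          | 
          |   struct ShardedCounter {
          |       shards: [Shard; SHARDS],
          |   }
          | 
          |   impl ShardedCounter {
          |       fn new() -> Self {
          |           ShardedCounter {
          |               shards: std::array::from_fn(
          |                   |_| Shard(AtomicU64::new(0))),
          |           }
          |       }
          | 
          |       // Each thread bumps "its" shard, chosen by hashing the
          |       // thread id (assumption: good enough spread; real
          |       // implementations usually hand out per-thread slots).
          |       fn incr(&self) {
          |           let mut h = DefaultHasher::new();
          |           std::thread::current().id().hash(&mut h);
          |           let idx = (h.finish() as usize) % SHARDS;
          |           self.shards[idx].0.fetch_add(1, Ordering::Relaxed);
          |       }
          | 
          |       // The total is only approximate while increments are
          |       // in flight -- which, as noted above, is fine.
          |       fn read(&self) -> u64 {
          |           self.shards.iter()
          |               .map(|s| s.0.load(Ordering::Relaxed))
          |               .sum()
          |       }
          |   }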
        
           | gpderetta wrote:
           | The "counter" here is an RWMutex (or RWSpinLock). You can't
           | really make it distributed while guaranteeing mutual
           | exclusion [1]. You could use an hashmap with more fine
           | grained locking, but to deal with rehashing you end up
           | reinventing something like ArcSwap anyway.
           | 
           | [1] True no-writes-on-shared-locking RWMutexes do exist, but
           | are significantly more complex and require epoch detection
           | like the ArcSwap, so there is no significant advantage.
        
           | hinkley wrote:
           | For popcounts this is the classical solution. Make
           | aggregation lazy and worry about keeping k smaller numbers
           | accurate cheaply, and let the reader worry about calculating
           | a total, which is likely eventually consistent anyway.
           | 
            | But for multiple-reader/exclusive-writer this is never going
            | to work. Or at least not without extra silicon. Which maybe
            | should be a thing. The problem, of course, would be that a
            | multi-word atomic sum instruction would actually have to
            | cross cache lines to avoid false sharing. So you'd end up
            | with counters on contiguous cache lines but occupying 64
            | bytes apiece, which is a lot.
            | 
            | It would almost call for a special region of memory that has
            | different cache-coherency behavior, and I can't see that
            | being easy, fast, or backward compatible, which is maybe why
            | we don't do it.
        
         | cryptonector wrote:
         | Or maybe it just so happens that under the observed load most
         | readers really wanted to frob that particular counter, in which
         | case the only solution is to get rid of the counter.
        
         | hinkley wrote:
          | I think it may have been worth calling out false sharing for
          | readers who are unfamiliar with the problem domain, but I don't
          | get the impression this was false sharing - just regular old
          | sharing.
        
           | nick_ wrote:
           | I think you're right.
        
       | vrosas wrote:
       | > While we knew where to look, this investigation had already
       | taken weeks and things took a turn for the worse when we hit the
       | issue again on February 23rd.
       | 
        | It blows me away that an issue like this could take _weeks_ to
        | track down. If I were in any leadership position at this company,
        | I'd be rolling heads over the lack of telemetry and domain
        | knowledge for these systems.
        
         | dblohm7 wrote:
         | I can't say I'm surprised, TBH. I had a rough idea of where the
         | problem might lie just by reading the title of the post. But I
         | was fortunate enough to do an undergraduate degree where
         | concurrency was actually taught, plus I've learned a lot over
         | the years working in highly-concurrent, asynchronous
         | environments.
         | 
         | Concurrent programming has been mainstream for some time now,
         | but I don't think the level of expertise of most engineers has
         | kept up. That becomes most apparent when software starts
         | hitting concurrency pitfalls: performance problems, deadlocks,
         | UAFs, and so on...
        
         | cryptonector wrote:
         | In my career I've had to deal with two big outages that took
         | over a week to diagnose and fix. One of those involved my code,
         | though the conditions that caused it involved bugs in others'
         | code interacting with bugs in mine such that in isolation none
         | of them would have caused an outage. The first one took two
         | weeks to diagnose, and I had to write specialty tools to help
         | diagnose it. The second took lots of data gathering, code
         | inspection, and making and testing many hypotheses. In neither
         | case did heads roll. The staff who go through such a trial by
         | fire come out of it all the more valuable than before, and that
         | includes those whose "fault" it might have been.
         | 
         | Sometimes you can't have the domain knowledge you'd want
         | beforehand. The reasons vary a great deal. For example, the
         | software in question might be commercial, and you might not
         | have an alternative vendor to switch to. Other times your "bus
         | factor" might have dropped to uncomfortably small values
         | through no fault of anyone in the organization (people leaving
         | too quickly for reasons having nothing to do with the way the
          | org is run, people dying, headcount and budgets not allowing
          | for remediating the situation, or simply having way too many
          | critical systems to keep every bus factor wide at all times).
        
       | kccqzy wrote:
       | I haven't seen a pure user space implementation of RCU that works
       | well without kernel support. The RCU methodology is easy to
       | understand, but how does a user space implementation understand
       | when it is safe to dispose of a piece of data structure? Don't
       | you need a primitive to run some code on each CPU?
        
         | j_seigh wrote:
         | You just need to keep track of each thread's quiescent states
         | however they are implemented.
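          | 
          | A minimal sketch of that quiescent-state bookkeeping (QSBR) in
          | Rust, with made-up names; a real implementation also has to
          | handle threads that go idle or exit, which this skips:
          | 
          |   use std::sync::atomic::{AtomicU64, Ordering};
          | 
          |   // One counter per registered thread, bumped whenever that
          |   // thread is quiescent (holds no references to shared data).
          |   #[repr(align(64))]
          |   struct QState(AtomicU64);
          | 
          |   struct Qsbr {
          |       states: Vec<QState>, // one slot per registered thread
          |   }
          | 
          |   impl Qsbr {
          |       fn new(nthreads: usize) -> Self {
          |           Qsbr {
          |               states: (0..nthreads)
          |                   .map(|_| QState(AtomicU64::new(0)))
          |                   .collect(),
          |           }
          |       }
          | 
          |       // Readers call this between read-side sections, e.g.
          |       // at the top of their event loop.
          |       fn quiescent(&self, tid: usize) {
          |           self.states[tid].0.fetch_add(1, Ordering::Release);
          |       }
          | 
          |       // The reclaimer calls this after unlinking an old
          |       // version: once every thread has passed a quiescent
          |       // state, nobody can still hold a reference to it.
          |       fn synchronize(&self) {
          |           let before: Vec<u64> = self.states.iter()
          |               .map(|s| s.0.load(Ordering::Acquire))
          |               .collect();
          |           for (s, b) in self.states.iter().zip(before) {
          |               while s.0.load(Ordering::Acquire) == b {
          |                   std::thread::yield_now();
          |               }
          |           }
          |       }
          |   }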
        
           | kccqzy wrote:
           | Right but isn't that difficult to implement in a library?
        
             | cryptonector wrote:
             | Not really. https://github.com/nicowilliams/ctp is mine --
             | the API is not RCU, but a simpler and easier to use
             | "thread-safe variable" construct. You'll find two C
             | implementations there. I've been using this in production
             | for almost a decade without any trouble. (For a long time I
             | had a missing producer membar in the create primitive
             | (oops!), but because it only happens once it never caused
             | me any problems.)
             | 
             | To be fair I thought about it for a long time first. But
             | nowadays these techniques are well known, so now it's
             | easier to "copy" the thinking others have done, and that
             | thinking is the hard part.
        
               | kccqzy wrote:
               | I took a brief look at your link. How does this even
               | qualify to be RCU? The read-copy-update paradigm requires
                | that the update step be able to know whether the value
                | being updated is derived from the most recent copy. The
                | set function here doesn't even allow you to atomically
                | compare your read version against the current version.
                | Imagine you are implementing a simple array with RCU: you
                | read the array, make a copy, append to the copy, and then
                | you need to atomically compare the array pointer against
                | its pre-append value and set the new array, just so that
                | you don't lose writes by other threads.
        
               | cryptonector wrote:
               | The get function always returns you a stable value if at
               | least one value was ever written (else you get NULL,
               | natch).
               | 
               | The set function will not invalidate any values seen by
               | get that might still be referenced, but the new value
               | will be seen by all subsequent gets.
               | 
               | Old values will be destructed when no longer referenced.
               | 
                | To read-copy-update you'll need to take a lock (one that
                | readers don't need), get the current value (it will stay
                | current given you're locking out other writers), copy it,
                | modify the copy, then set the new value.
               | 
               | You don't have to read-copy-update if the new values
               | don't depend on the preceding values. So my API does not
               | force read-copy-update, but you _can_ do RCU with it.
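                | 
                | In Rust terms, a sketch of that pattern using the
                | arc-swap crate as a stand-in for the thread-safe
                | variable (not the ctp C API itself; Config and its
                | methods are made up):
                | 
                |   use std::collections::HashMap;
                |   use std::sync::{Arc, Mutex};
                |   use arc_swap::ArcSwap;
                | 
                |   struct Config {
                |       // get/set-style variable holding the current map
                |       current: ArcSwap<HashMap<String, String>>,
                |       // serializes writers only; readers never take it
                |       write_lock: Mutex<()>,
                |   }
                | 
                |   impl Config {
                |       fn new() -> Self {
                |           Config {
                |               current: ArcSwap::from_pointee(
                |                   HashMap::new()),
                |               write_lock: Mutex::new(()),
                |           }
                |       }
                | 
                |       // Readers: a lock-free load of a stable value.
                |       fn get(&self) -> Arc<HashMap<String, String>> {
                |           self.current.load_full()
                |       }
                | 
                |       // Writers: read-copy-update under the lock.
                |       fn set(&self, key: &str, val: &str) {
                |           let _g = self.write_lock.lock().unwrap();
                |           let old = self.current.load_full();
                |           let mut new = (*old).clone();       // copy
                |           new.insert(key.into(), val.into()); // modify
                |           self.current.store(Arc::new(new));  // publish
                |           // old versions are freed once the last
                |           // reader drops its Arc
                |       }
                |   }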
        
               | kccqzy wrote:
               | Got it. So basically you want writers to use their own
               | lock to coordinate writes. Okay, then this makes it
               | inconvenient to deal with readers that could possibly but
               | infrequently become writers. They have to do a second
               | read under a lock to write. And if one chooses a data
               | structure for which copy isn't a cheap operation, this
               | can be slower than classic RCU.
        
               | cryptonector wrote:
               | > Okay, then this makes it inconvenient to deal with
               | readers that could possibly but infrequently become
               | writers. They have to do a second read under a lock to
               | write.
               | 
               | Well, first of all reads are real cheap, so that's not a
               | big concern. In some systems readers don't ever turn
               | around to write, so that's ok, and in others they don't
               | modify the current value. E.g., in one system I've built
               | we use TinyCDB files and when new files are renamed into
               | place we open those and set the handle on one of these
               | thread-safe variables -- no read-modify involved here.
               | 
               | > And if one chooses a data structure for which copy
               | isn't a cheap operation, this can be slower than classic
               | RCU.
               | 
               | Either way it's RCU. It's best to publish immutable data
               | with RCU and design data structures where copy-on-write
               | modifications don't have to copy the whole data structure
               | but rather just small parts of it. E.g., in a tree you
               | should only have to copy the interior nodes from the one
               | you're modifying to the root.
               | 
               | One can always build a hash table where each cell is one
               | of these thread-safe variables (or plain old RCU), though
               | one would have to make these thread-safe variables
               | lighter weight to make having lots of them realistic
               | (currently my implementation has a pthread_key_t per
               | variable, which does not scale).
               | 
               | In TFA ArcSwap is essentially indistinguishable from a
               | read-only hash table published on something like one of
               | my thread-safe variables, and for TFA this worked better
               | than the available alternatives in spite of the copy-on-
               | write burden.
               | 
               | BTW, I built this thing originally to allow global
               | configuration to be shared with otherwise-shared-nothing
               | threads and allow the service to be reconfigurable at
               | run-time. Without something like this (or plain old RCU)
               | your options are limited to things like using read-write
               | locks (which are way worse than anything you can imagine
               | being bad about RCU used with immutable data structures).
               | I was very displeased with read-write locks when I needed
               | to solve this particular problem at Sun, but I didn't
               | build this thing until after I left Sun (Oracle).
               | 
               | Nowadays RCU-ish libraries are pretty widely available.
               | OpenSSL 3.3 (or was it 3.4?) has something like this
               | after the threaded performance debacles of 3.0. In
               | particular OpenSSL has an RCU-ish hash table. You might
               | be interested in it. IIRC (but it's been a while since I
               | looked) it uses one hazard-pointer-per-thread to acquire
               | a stable reference to a key or value long enough to then
               | incref a refcount -- pretty clever, and that's
               | essentially the way to make a "thread-safe variable"
               | light-weight enough to be able to use one for each slot
               | in a hash table.
        
             | j_seigh wrote:
             | Thread local vars would be used with lazy initialization.
             | Clean up might be a little tricky depending on what
             | implementation of thread local you use. Thread local
             | support is not as great as it should be.
        
         | cryptonector wrote:
         | You don't need kernel assistance to make user-land RCU-like
         | structures work well enough. I've done it twice. The price you
         | pay for not having kernel assistance is that some threads
          | sometimes have to do extra work: signaling waiting writers;
          | looping over a short read, write to a hazard pointer, and
          | re-read until the value read is stable; or doing a small
          | garbage collection. But this price is not too heavy.
        
       | athrowaway3z wrote:
       | I like the idea of arc_swap, but I've tried skimming the code to
       | see if it fit my mental model and found it way more complex than
        | I expected.
        
       | cryptonector wrote:
       | TFA is about concurrent hashmaps that croak under heavy read load
       | because the synchronization mechanisms of those hashmaps caused
       | cache ping-pong, and the fix was to use RCU with a copy-on-write
       | data structure for the hashmap. Each write requires cloning the
       | previous hashmap, but because the readers don't perform any
       | atomic operations other than one read (for the current read-only
       | hashmap) it all works out as the cache lines don't have to ping-
       | pong given that the readers are not writing.
       | 
       | The stuff about "debt" is about the design of ArcSwap, which uses
       | "debt" instead of "hazard pointers". The idea is that every
       | thread keeps track of the last version of the hashmap that it's
       | seen, and this is "debt" (references that need to be released
       | eventually), and this debt is a thread-local object that is
       | linked with all the other threads' debt, which essentially allows
        | garbage collection of "debt" (debt collection?), or something like
       | that. See https://github.com/vorner/arc-
       | swap/blob/master/src/docs/inte...
       | 
       | I've implemented something similar myself. Instead of
       | implementing user-land RCU (which is what perhaps I should have
       | done) I implemented a "thread-safe variable" where reads are
       | lock-less and wait-free, and writers pay all the costs. My
       | thread-safe variable API is real simple and intuitive:
       | `create(value_destructor) -> thread_safe_var`,
       | `set(thread_safe_var, value)`, `get(thread_safe_var) -> value`,
       | `release(thread_safe_var)`, and `destroy(thread_safe_var)`
       | (presumably only during an atexit() handler or shared object
       | destructor call, if at all). I have two implementations, one with
       | a linked list of per-thread hazard pointers, and the other with a
       | tuple of {next/current, current/previous} "cells" and a protocol
       | that gives readers time to read the cells, figure out which is
       | "current" and then dereference and incref a refcount.
       | 
       | Regardless of how one implements this data structure, the key
       | concept is that you have to atomically compose two distinct
       | atomic operations: a read, and a dereference to atomically
       | increment a reference count -- this is remarkably tricky!
       | 
       | When you have a GC it's really easy: just read -- there is no
       | reference count to increment.
       | 
        | With a linked list of per-thread hazard pointers, all you have
        | to do is read the thing, write it into your thread-local hazard
        | pointer, then read again, looping until the second read matches
        | the first. Having the value thus safely recorded in the reader's
        | thread-local hazard pointer then allows for notional garbage
        | collection: any thread that would release an old version of the
        | value has to check all the hazard pointers to see that it's not
        | in use, and this amounts to a small garbage collection.
       | 
       | I especially like the hazard pointer approach.
       | 
       | See also MVars (Haskell), transactional memory, RCU, etc.
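        | 
        | For the curious, a sketch of that hazard-pointer read loop in
        | Rust (hypothetical types; a real implementation also needs
        | per-thread slot registration and a retire list for deferred
        | frees):
        | 
        |   use std::sync::atomic::{AtomicPtr, Ordering};
        | 
        |   struct Data { /* the published value */ }
        | 
        |   struct HazardSlot {
        |       // this thread's single hazard pointer
        |       protected: AtomicPtr<Data>,
        |   }
        | 
        |   fn read_protected(current: &AtomicPtr<Data>,
        |                     my_slot: &HazardSlot) -> *mut Data {
        |       loop {
        |           // 1. Read the shared pointer.
        |           let p = current.load(Ordering::Acquire);
        |           // 2. Publish it in our hazard slot so reclaimers
        |           //    can see we are using it.
        |           my_slot.protected.store(p, Ordering::SeqCst);
        |           // 3. Re-read: if it changed in between, the value
        |           //    we protected may already have been retired,
        |           //    so loop and try again.
        |           if current.load(Ordering::Acquire) == p {
        |               return p; // safe until we clear the slot
        |           }
        |       }
        |   }
        | 
        |   // A writer retiring an old value scans every thread's slot
        |   // and frees it only when no slot points at it -- the small
        |   // garbage collection mentioned above.
        |   fn can_free(old: *mut Data,
        |               all_slots: &[&HazardSlot]) -> bool {
        |       all_slots.iter()
        |           .all(|s| s.protected.load(Ordering::SeqCst) != old)
        |   }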
        
       | nasretdinov wrote:
        | Hate to be that person, but Go developers hit this first and
        | then implemented a (really nice, in my opinion) concurrent hash
        | map that does something similar to what is described in the
        | article, but without copying the entire hash table each time:
        | https://pkg.go.dev/sync#Map
        
         | benmmurphy wrote:
          | Concurrent maps are often easier to implement in GC languages
          | because you can use the garbage collector to clean up old
          | values, whereas without a GC you will need an alternative like
          | RCU.
        
           | nasretdinov wrote:
            | I do agree that in GC languages concurrent structures are
            | easier to implement (and I actually quite like that about Go
            | too). However, fundamentally, GC languages also implement GC
            | itself somehow :), so the only real difference, I guess, is
            | that for some (complex) data structures you _need_ GC to
            | implement them cleanly -- and that's what RCU does.
       ___________________________________________________________________
       (page generated 2025-06-13 23:01 UTC)