[HN Gopher] The concurrency trap: How an atomic counter stalled ...
___________________________________________________________________
The concurrency trap: How an atomic counter stalled a pipeline
Author : delifue
Score : 80 points
Date : 2025-06-07 11:25 UTC (6 days ago)
(HTM) web link (www.conviva.com)
(TXT) w3m dump (www.conviva.com)
| hinkley wrote:
| > However, further analysis in the perf environment revealed that
| while there was a spike in the number of active sessions, those
| gradually dropped off while the DAG processing time continued to
| stay high.
|
| Thundering herds will do that. Queuing theory. Once the queue
| builds up it takes far longer to clear than intuition suggests.
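|
| A rough illustration of why: a backlog only drains at the rate
| by which service capacity exceeds the arrival rate. If the
| system settles back to 95% utilization after the spike, the
| queue shrinks at just 5% of capacity, so a backlog that built
| up during one minute of overload can take on the order of
| twenty minutes to clear.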
| hinkley wrote:
| > The big difference between a concurrent hash map and ArcSwap is
| that ArcSwap requires swapping out the entirety of the underlying
| data with every write but trades this off with very cheap reads.
| Writers don't even have to wait for all readers to finish since a
| new epoch is created with the new version of data.
|
| This is where tree data structures win out. Git and subversion
| and pure functional languages don't replace the entire data
| structure. They make a new element and attach it to the tree by
| creating a new version of the intended parent, and a new version
| of its parent, on up to the root element.
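|
| A quick illustration of that path copying, using the im crate's
| persistent HashMap (illustrative, not from TFA; assumes im as a
| dependency):
|
|     use im::HashMap;
|
|     fn main() {
|         let v1: HashMap<u32, &str> = (0..100_000).map(|i| (i, "x")).collect();
|         // "Copying" a persistent map is O(1): v2 shares every node with v1.
|         let v2 = v1.clone();
|         // An update allocates only a new path of nodes from the changed
|         // leaf up to the root; everything else stays shared with v1/v2.
|         let v3 = v2.update(42, "y");
|         assert_eq!(v1.get(&42), Some(&"x"));
|         assert_eq!(v3.get(&42), Some(&"y"));
|     }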
|
| So somewhere out there is space for a hybrid of ArcSwap and
| DashMap that uses sharded copy on write. These guys probably
| don't need it given their low write rate, but shards should help
| with write contention for other use cases.
| masklinn wrote:
| You still need something close to ArcSwap with your persistent
| data structure, because once you've built your new tree
| (possibly using transients) you need to overwrite the old root
| with the new, atomically (and probably in a CAS loop really).
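|
| In Rust that swap is roughly what the arc_swap crate's rcu
| helper wraps up, if I recall its API right. A sketch (assumes
| arc_swap and im as dependencies; names are illustrative):
|
|     use arc_swap::ArcSwap;
|     use im::HashMap;
|
|     fn main() {
|         // Shared root: readers take one cheap atomic load; writers CAS
|         // in a new root and retry if another writer got there first.
|         let root: ArcSwap<HashMap<String, u64>> =
|             ArcSwap::from_pointee(HashMap::new());
|
|         // Reader path: no shared-memory writes at all.
|         let snapshot = root.load();
|         let _ = snapshot.get("active_sessions");
|
|         // Writer path: build the new tree from the old one (structural
|         // sharing), then publish it atomically.
|         root.rcu(|old| old.update("active_sessions".to_string(), 1));
|     }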
|
| Left-right is an option which IIRC avoids the CAS loop, as it's
| a single-writer primitive.
| cmrdporcupine wrote:
| im::HashMap is a great tool
|
| I would love, though, to have a similar persistent structure
| where the node pointers at each level were thread-sharded
| instead of using e.g. atomics. A mutation at a branch could
| record that it is "just" for the thread ID X that did the
| mutation. At some later point a "commit" could reconcile the
| storage to global, maybe.
| cryptonector wrote:
| You could easily organize the hashmap hierarchically, yes.
| deepsun wrote:
| So the root element is a single point of congestion for all
| writes, right?
| hinkley wrote:
| With filesystem-based trees that's often sufficient to help
| reduce the amount of write contention. Because you're not
| serializing the whole goddamned thing back to disk, just the
| parts that changed, which are logarithmic to the size of the
| edit history. And git uses GC to deal with writes that get to
| the end and then fail due to concurrent access collisions.
|
| For in-memory problems, there are B-trees, which are an older
| solution to many problems than most of HN's userbase,
| including myself. Then you still have the sharding which I
| did in fact point out would be part of a better hybrid
| solution.
| nick_ wrote:
| It's unclear if the author understands false sharing. The
| _excessive_ "ping-pong"ing of cache lines between cores means the
| cache line that the counter sits on is shared with other memory
| that's also being accessed under contention.
|
| Making sure the counter variable is sitting alone on its own
| cache line would completely remove the _excessive_ ping-pong
| behaviour. The cache line with the counter would, of course,
| still ping-pong as it's mutated, but not excessively.
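|
| For anyone unfamiliar with the usual fix for false sharing: over-
| align the hot counter so nothing else can land on its line. A
| minimal Rust sketch, assuming 64-byte cache lines (crossbeam's
| CachePadded wrapper does the same thing):
|
|     use std::sync::atomic::AtomicU64;
|
|     // align(64) also rounds the struct's size up to 64, so the counter
|     // gets a cache line entirely to itself.
|     #[repr(align(64))]
|     struct PaddedCounter(AtomicU64);
|
|     struct SessionState {
|         hits: PaddedCounter, // hot: written by many threads
|         config_version: u32, // cold fields no longer share its line
|         name: String,
|     }
|
| As you say, the counter's own line still bounces when it is
| mutated; padding only removes the collateral damage to the
| neighbouring data.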
| jeffbee wrote:
| Long blog, corporate formatting covered in popups and hero
| images, screenshots of the incredibly ugly ketchup and mustard
| flame graph, and a discussion (with some errors or vagueness) of
| a completely vanilla performance detail. These are the universal
| ingredients of the midsize SaaS company semitechnical marketing
| game.
| finnh wrote:
| It's also unclear how the first graph, whose y-axis is
| labelled "req/s", is showing a latency spike. Latency is not
| visible on that graph, AFAICT, only throughput.
| gpderetta wrote:
| I don't think false sharing matters here. Cacheline bouncing of
| the counter, which is written even by readers, would be enough
| to explain the bottleneck. The barrier, likely implied or
| explicit in the atomic RMW used to update the counter, will also
| stall the pipeline, preventing any other memory operation from
| happening concurrently.
|
| All threads repeatedly CASing on the same memory location is
| enough to bring any cpu to its knees.
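|
| A tiny illustration of that effect (hypothetical micro-
| benchmark, numbers obviously machine-dependent): every thread
| does an atomic RMW on the same word, so each increment has to
| pull the cache line in exclusive state and the RMWs serialize.
|
|     use std::sync::atomic::{AtomicU64, Ordering};
|     use std::sync::Arc;
|     use std::thread;
|     use std::time::Instant;
|
|     fn main() {
|         const THREADS: usize = 8;
|         const OPS: u64 = 5_000_000;
|         let counter = Arc::new(AtomicU64::new(0));
|         let start = Instant::now();
|         let handles: Vec<_> = (0..THREADS)
|             .map(|_| {
|                 let c = Arc::clone(&counter);
|                 thread::spawn(move || {
|                     for _ in 0..OPS {
|                         // Contended RMW: the line ping-pongs between cores.
|                         c.fetch_add(1, Ordering::Relaxed);
|                     }
|                 })
|             })
|             .collect();
|         for h in handles {
|             h.join().unwrap();
|         }
|         println!("{} threads, total {}, elapsed {:?}",
|                  THREADS, counter.load(Ordering::Relaxed), start.elapsed());
|     }
|
| Run it with THREADS = 1 and then 8: the per-increment cost climbs
| steeply with thread count even though each thread does the same
| amount of "work".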
|
| The critical part of the fix is that readers no longer have to
| modify any shared memory location and rely on deferred
| reclamation.
| senderista wrote:
| Why not? If enough readers are mutating the same ref counter,
| with very short critsecs, then this could definitely be a
| problem, regardless of false sharing.
| forrestthewoods wrote:
| > If enough readers are mutating the same ref counter
|
| Well if they were mutating the ref counter I'd call them
| writers not readers. =P
|
| But yes. Many people incorrectly assume that writing an
| atomic_int from many threads is super fast. I mean it's
| atomic! It couldn't be a smaller atom of work! Alas this is
| incredibly incorrect. Even atomic ints can be extremely slow.
| gpderetta wrote:
| They are logically readers, the mutation is an
| implementation detail.
| senderista wrote:
| Which relates to a fundamental principle of concurrent
| programming: "(logical) readers shouldn't (physically)
| write*".
|
| *to contended shared memory
| gpderetta wrote:
| Well yes, hence the fix.
| duskwuff wrote:
| > The cache line with the counter would, of course, still ping-
| pong as it's mutated, but not excessively.
|
| And the way you solve this is by using a separate counter for
| each core, at least a cache line apart from each other to
| ensure they all get separate cache lines. The downside is that
| you lose the ability to read the counter atomically - but if
| the counter is incrementing fast enough that you need per-CPU
| counters, a point-in-time value is basically meaningless
| anyway.
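|
| Something like this, sketched in Rust with one padded stripe per
| thread (per-thread striping being the usual user-space stand-in
| for true per-core counters; names are illustrative):
|
|     use std::sync::atomic::{AtomicU64, Ordering};
|
|     #[repr(align(64))] // one cache line per stripe
|     struct PaddedCounter(AtomicU64);
|
|     struct StripedCounter {
|         stripes: Vec<PaddedCounter>,
|     }
|
|     impl StripedCounter {
|         fn new(n: usize) -> Self {
|             Self {
|                 stripes: (0..n).map(|_| PaddedCounter(AtomicU64::new(0))).collect(),
|             }
|         }
|
|         // Each thread picks "its" stripe, so increments rarely contend.
|         fn add(&self, thread_id: usize, n: u64) {
|             self.stripes[thread_id % self.stripes.len()]
|                 .0
|                 .fetch_add(n, Ordering::Relaxed);
|         }
|
|         // Readers pay instead: the total is a sum over stripes and only
|         // a point-in-time approximation, as noted above.
|         fn approximate_total(&self) -> u64 {
|             self.stripes.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
|         }
|     }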
| gpderetta wrote:
| The "counter" here is an RWMutex (or RWSpinLock). You can't
| really make it distributed while guaranteeing mutual
| exclusion [1]. You could use a hashmap with more fine-grained
| locking, but to deal with rehashing you end up
| reinventing something like ArcSwap anyway.
|
| [1] True no-writes-on-shared-locking RWMutexes do exist, but
| are significantly more complex and require epoch detection
| like ArcSwap does, so there is no significant advantage.
| hinkley wrote:
| For popcounts this is the classical solution. Make
| aggregation lazy and worry about keeping k smaller numbers
| accurate cheaply, and let the reader worry about calculating
| a total, which is likely eventually consistent anyway.
|
| But for multiple-reader/exclusive-writer locking this is never
| going to work. Or at least not without extra silicon. Which maybe
| should be a thing. The problem of course would be that a
| multi-word atomic sum instruction would actually have to
| cross cache lines to avoid false sharing. So you'd end up
| with counters on contiguous cache lines but occupying 64
| bytes apiece, which is a lot.
|
| It would almost call for a special region of memory that has
| different cache coherency behavior, and I can't see that being
| easy, fast, or backward compatible, which is maybe why we
| don't do it.
| cryptonector wrote:
| Or maybe it just so happens that under the observed load most
| readers really wanted to frob that particular counter, in which
| case the only solution is to get rid of the counter.
| hinkley wrote:
| I think it may have been worth calling out the false sharing to
| readers who are unfamiliar with the problem domain, but I don't
| get the impression this was false sharing so much as regular
| old sharing.
| nick_ wrote:
| I think you're right.
| vrosas wrote:
| > While we knew where to look, this investigation had already
| taken weeks and things took a turn for the worse when we hit the
| issue again on February 23rd.
|
| It blows me away that an issue like this could take _weeks_ to
| track down. If I were in any leadership position at this company
| I'd be rolling heads over the lack of telemetry or domain
| knowledge for these systems.
| dblohm7 wrote:
| I can't say I'm surprised, TBH. I had a rough idea of where the
| problem might lie just by reading the title of the post. But I
| was fortunate enough to do an undergraduate degree where
| concurrency was actually taught, plus I've learned a lot over
| the years working in highly-concurrent, asynchronous
| environments.
|
| Concurrent programming has been mainstream for some time now,
| but I don't think the level of expertise of most engineers has
| kept up. That becomes most apparent when software starts
| hitting concurrency pitfalls: performance problems, deadlocks,
| UAFs, and so on...
| cryptonector wrote:
| In my career I've had to deal with two big outages that took
| over a week to diagnose and fix. One of those involved my code,
| though the conditions that caused it involved bugs in others'
| code interacting with bugs in mine such that in isolation none
| of them would have caused an outage. The first one took two
| weeks to diagnose, and I had to write specialty tools to help
| diagnose it. The second took lots of data gathering, code
| inspection, and making and testing many hypotheses. In neither
| case did heads roll. The staff who go through such a trial by
| fire come out of it more valuable than before, and that
| includes those whose "fault" it might have been.
|
| Sometimes you can't have the domain knowledge you'd want
| beforehand. The reasons vary a great deal. For example, the
| software in question might be commercial, and you might not
| have an alternative vendor to switch to. Other times your "bus
| factor" might have dropped to uncomfortably small values
| through no fault of anyone in the organization (people leaving
| too quickly for reasons having nothing to do with the way the
| org is run, people dying, headcount and budgets not allowing
| for remediating the situation, and just having way too many
| critical systems to keep all of their bus factors wide at all
| times).
| kccqzy wrote:
| I haven't seen a pure user space implementation of RCU that works
| well without kernel support. The RCU methodology is easy to
| understand, but how does a user space implementation know when
| it is safe to dispose of a piece of the data structure? Don't
| you need a primitive to run some code on each CPU?
| j_seigh wrote:
| You just need to keep track of each thread's quiescent states
| however they are implemented.
| kccqzy wrote:
| Right but isn't that difficult to implement in a library?
| cryptonector wrote:
| Not really. https://github.com/nicowilliams/ctp is mine --
| the API is not RCU, but a simpler and easier to use
| "thread-safe variable" construct. You'll find two C
| implementations there. I've been using this in production
| for almost a decade without any trouble. (For a long time I
| had a missing producer membar in the create primitive
| (oops!), but because it only happens once it never caused
| me any problems.)
|
| To be fair I thought about it for a long time first. But
| nowadays these techniques are well known, so now it's
| easier to "copy" the thinking others have done, and that
| thinking is the hard part.
| kccqzy wrote:
| I took a brief look at your link. How does this even
| qualify to be RCU? The read-copy-update paradigm requires
| that the update step be able to know whether the value to
| be updated is derived from the recent copy. The set
| function here doesn't even allow you to atomically
| compare your read version against the current version.
| Imagine you are implementing a simple array with RCU; you
| read the array, make a copy, append to the array, and
| then you need to atomically compare the array pointer
| pre-append and set the array just so that you don't lose
| writes by other threads.
| cryptonector wrote:
| The get function always returns you a stable value if at
| least one value was ever written (else you get NULL,
| natch).
|
| The set function will not invalidate any values seen by
| get that might still be referenced, but the new value
| will be seen by all subsequent gets.
|
| Old values will be destructed when no longer referenced.
|
| To read-copy-update you'll need to take a lock (that
| readers don't), get the current value (it will be the
| current one given you're locking out other writers), copy
| it, modify the copy, then set the new value.
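|
| Sketched in Rust rather than C, with arc_swap standing in for
| the thread-safe variable (an illustration of the pattern, not
| the ctp API):
|
|     use std::sync::{Arc, Mutex};
|     use arc_swap::ArcSwap;
|     use im::HashMap;
|
|     struct Shared {
|         // Readers only ever touch this: one lock-free load per get.
|         current: ArcSwap<HashMap<String, String>>,
|         // Writers serialize among themselves; readers never take it.
|         writer: Mutex<()>,
|     }
|
|     impl Shared {
|         fn read_copy_update(&self, key: String, value: String) {
|             let _lock = self.writer.lock().unwrap(); // lock out other writers
|             let old = self.current.load_full();      // read the current value
|             let new = old.update(key, value);        // copy + modify
|             self.current.store(Arc::new(new));       // publish; old versions
|                                                      // stay valid for readers
|         }
|     }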
|
| You don't have to read-copy-update if the new values
| don't depend on the preceding values. So my API does not
| force read-copy-update, but you _can_ do RCU with it.
| kccqzy wrote:
| Got it. So basically you want writers to use their own
| lock to coordinate writes. Okay, then this makes it
| inconvenient to deal with readers that could possibly but
| infrequently become writers. They have to do a second
| read under a lock to write. And if one chooses a data
| structure for which copy isn't a cheap operation, this
| can be slower than classic RCU.
| cryptonector wrote:
| > Okay, then this makes it inconvenient to deal with
| readers that could possibly but infrequently become
| writers. They have to do a second read under a lock to
| write.
|
| Well, first of all reads are real cheap, so that's not a
| big concern. In some systems readers don't ever turn
| around to write, so that's ok, and in others they don't
| modify the current value. E.g., in one system I've built
| we use TinyCDB files and when new files are renamed into
| place we open those and set the handle on one of these
| thread-safe variables -- no read-modify involved here.
|
| > And if one chooses a data structure for which copy
| isn't a cheap operation, this can be slower than classic
| RCU.
|
| Either way it's RCU. It's best to publish immutable data
| with RCU and design data structures where copy-on-write
| modifications don't have to copy the whole data structure
| but rather just small parts of it. E.g., in a tree you
| should only have to copy the interior nodes from the one
| you're modifying to the root.
|
| One can always build a hash table where each cell is one
| of these thread-safe variables (or plain old RCU), though
| one would have to make these thread-safe variables
| lighter weight to make having lots of them realistic
| (currently my implementation has a pthread_key_t per
| variable, which does not scale).
|
| In TFA ArcSwap is essentially indistinguishable from a
| read-only hash table published on something like one of
| my thread-safe variables, and for TFA this worked better
| than the available alternatives in spite of the copy-on-
| write burden.
|
| BTW, I built this thing originally to allow global
| configuration to be shared with otherwise-shared-nothing
| threads and allow the service to be reconfigurable at
| run-time. Without something like this (or plain old RCU)
| your options are limited to things like using read-write
| locks (which are way worse than anything you can imagine
| being bad about RCU used with immutable data structures).
| I was very displeased with read-write locks when I needed
| to solve this particular problem at Sun, but I didn't
| build this thing until after I left Sun (Oracle).
|
| Nowadays RCU-ish libraries are pretty widely available.
| OpenSSL 3.3 (or was it 3.4?) has something like this
| after the threaded performance debacles of 3.0. In
| particular OpenSSL has an RCU-ish hash table. You might
| be interested in it. IIRC (but it's been a while since I
| looked) it uses one hazard-pointer-per-thread to acquire
| a stable reference to a key or value long enough to then
| incref a refcount -- pretty clever, and that's
| essentially the way to make a "thread-safe variable"
| light-weight enough to be able to use one for each slot
| in a hash table.
| j_seigh wrote:
| Thread local vars would be used with lazy initialization.
| Clean up might be a little tricky depending on what
| implementation of thread local you use. Thread local
| support is not as great as it should be.
| cryptonector wrote:
| You don't need kernel assistance to make user-land RCU-like
| structures work well enough. I've done it twice. The price you
| pay for not having kernel assistance is that some threads
| sometimes have to do extra work (signal waiting writers; loop
| over a very short read, write to a hazard pointer, and read
| again until the value read is stable; or do a garbage
| collection), but this price is not too heavy.
| athrowaway3z wrote:
| I like the idea of arc_swap, but I've tried skimming the code to
| see if it fit my mental model and found it way more complex than
| I expected.
| cryptonector wrote:
| TFA is about concurrent hashmaps that croak under heavy read load
| because the synchronization mechanisms of those hashmaps caused
| cache ping-pong, and the fix was to use RCU with a copy-on-write
| data structure for the hashmap. Each write requires cloning the
| previous hashmap, but because the readers don't perform any
| atomic operations other than one read (of the pointer to the
| current read-only hashmap), it all works out: the cache lines
| don't have to ping-pong because the readers are not writing.
|
| The stuff about "debt" is about the design of ArcSwap, which uses
| "debt" instead of "hazard pointers". The idea is that every
| thread keeps track of the last version of the hashmap that it's
| seen, and this is "debt" (references that need to be released
| eventually), and this debt is a thread-local object that is
| linked with all the other threads' debt, which essentially
| allows garbage collection of "debt" (debt collection?), or
| something like that. See https://github.com/vorner/arc-
| swap/blob/master/src/docs/inte...
|
| I've implemented something similar myself. Instead of
| implementing user-land RCU (which is what perhaps I should have
| done) I implemented a "thread-safe variable" where reads are
| lock-less and wait-free, and writers pay all the costs. My
| thread-safe variable API is real simple and intuitive:
| `create(value_destructor) -> thread_safe_var`,
| `set(thread_safe_var, value)`, `get(thread_safe_var) -> value`,
| `release(thread_safe_var)`, and `destroy(thread_safe_var)`
| (presumably only during an atexit() handler or shared object
| destructor call, if at all). I have two implementations, one with
| a linked list of per-thread hazard pointers, and the other with a
| tuple of {next/current, current/previous} "cells" and a protocol
| that gives readers time to read the cells, figure out which is
| "current" and then dereference and incref a refcount.
|
| Regardless of how one implements this data structure, the key
| concept is that you have to atomically compose two distinct
| atomic operations: a read, and a dereference to atomically
| increment a reference count -- this is remarkably tricky!
|
| When you have a GC it's really easy: just read -- there is no
| reference count to increment.
|
| With a linked list of per-thread hazard pointers all you have to
| do is read the thing, write it into your thread-local hazard
| pointer, then read again, looping until the second read is the
| same as the first read. Having the value thus safely copied to
| the reader's thread-local hazard pointer then allows for notional
| garbage collection. Any thread that would release an old version
| of the value has to check all the hazard pointers to see that
| it's not in use, and this amounts to a small garbage collection.
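|
| The read side of that loop, sketched in Rust with conservative
| memory orderings (an illustration, not the code in ctp):
|
|     use std::sync::atomic::{AtomicPtr, Ordering};
|
|     struct Data; // whatever is being published
|
|     // One slot per reader thread; a writer scans every slot before
|     // freeing an old value.
|     struct HazardSlot {
|         protected: AtomicPtr<Data>,
|     }
|
|     fn protected_load(global: &AtomicPtr<Data>, slot: &HazardSlot) -> *mut Data {
|         loop {
|             let seen = global.load(Ordering::Acquire);
|             // Publish what we saw so writers know it's in use...
|             slot.protected.store(seen, Ordering::SeqCst);
|             // ...then re-read; if nothing changed in between, the value
|             // cannot be freed while it remains in our slot.
|             if global.load(Ordering::SeqCst) == seen {
|                 return seen;
|             }
|             // A writer swapped the value between our read and our
|             // publish: loop and try again.
|         }
|     }
|
| Releasing is just storing a null into the slot once the caller
| is done with the value.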
|
| I especially like the hazard pointer approach.
|
| See also MVars (Haskell), transactional memory, RCU, etc.
| nasretdinov wrote:
| Hate to be that person, but Go developers hit it first and then
| implemented (a really nice one, in my opinion) concurrent hash
| map that does stuff similar to what is described in the article, but
| without copying the entire hash table each time:
| https://pkg.go.dev/sync#Map
| benmmurphy wrote:
| Concurrent maps are often easier to implement in GC languages
| because you can use the garbage collector to clean up old
| values, whereas without a GC you will need an alternative like
| RCU.
| nasretdinov wrote:
| I do agree with the statement that in GC languages concurrent
| structures are easier to implement (and I actually quite like
| that about Go too). However, fundamentally, GC languages also
| have to implement GC itself somehow :), so the only real
| difference, I guess, is that for some (complex) data structures
| you _need_ GC to implement them cleanly, and that's essentially
| what RCU provides.
___________________________________________________________________
(page generated 2025-06-13 23:01 UTC)