[HN Gopher] Making Sense of Acquire-Release Semantics
       ___________________________________________________________________
        
       Making Sense of Acquire-Release Semantics
        
       Author : sph
       Score  : 68 points
        Date   : 2024-05-10 08:59 UTC (1 day ago)
        
 (HTM) web link (davekilian.com)
 (TXT) w3m dump (davekilian.com)
        
       | catlifeonmars wrote:
       | Minor quibble: you _can_ linearize small amounts of memory using
       | atomic access. You just need to ensure that your memory fits
        | within the size of a single atomic access. For example,
        | storing two uint32 values as a uint64 when atomic access to
        | uint64 is available.
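        | 
        | A minimal sketch of the idea in C11, assuming the platform
        | has a lock-free 64-bit atomic (the names here are mine, just
        | for illustration):
        | 
        |     #include <stdatomic.h>
        |     #include <stdint.h>
        |     
        |     /* Pack two 32-bit values into one 64-bit word so both
        |      * can be read or written in a single atomic access. */
        |     static _Atomic uint64_t pair;
        |     
        |     static void store_pair(uint32_t a, uint32_t b) {
        |         uint64_t packed = ((uint64_t)a << 32) | (uint64_t)b;
        |         atomic_store_explicit(&pair, packed,
        |                               memory_order_release);
        |     }
        |     
        |     static void load_pair(uint32_t *a, uint32_t *b) {
        |         uint64_t packed =
        |             atomic_load_explicit(&pair, memory_order_acquire);
        |         *a = (uint32_t)(packed >> 32);
        |         *b = (uint32_t)packed;
        |     }
        | 
        | atomic_is_lock_free(&pair) can be used to check that the
        | 64-bit access really is a single atomic operation on the
        | target.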
        
         | mevric wrote:
          | I am curious to understand how the following is achieved.
          | Is there any material on this?
         | 
         | "storing two uint32 as a uint64"
        
           | JonChesterfield wrote:
            | Put them next to each other, 8-byte align the first one,
            | use a compiler mechanism to disable alias analysis, and do
            | the uint64 store. __attribute__((may_alias)) is the local
            | override, -fno-strict-aliasing the global one.
           | 
           | I think C++ can now do "these bytes are now that type",
           | called something like start_lifetime_as. C probably can't,
           | though using a union might be legitimate. The language rules
           | in this area are a mess.
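            | 
            | A sketch of the union route in C (hedged, given that mess:
            | GCC and Clang document union-based punning as supported,
            | while the standards are less clear; the names are mine):
            | 
            |     #include <stdatomic.h>
            |     #include <stdint.h>
            |     
            |     /* Two 32-bit halves and a 64-bit whole sharing
            |      * storage; the atomic access goes through the
            |      * 64-bit view. */
            |     typedef union {
            |         struct { uint32_t lo, hi; } parts;
            |         uint64_t whole;
            |     } pair_t;
            |     
            |     static _Atomic uint64_t slot;
            |     
            |     static void publish(uint32_t lo, uint32_t hi) {
            |         pair_t p = { .parts = { lo, hi } };
            |         atomic_store_explicit(&slot, p.whole,
            |                               memory_order_release);
            |     }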
        
       | zevv wrote:
       | Well written article, nice and to the point. Do recommend.
       | 
       | Decades ago I declared myself too stupid to use shared memory
       | with threading; I have learned to avoid this whenever possible,
       | or abstract away the memory access under a safe layer as soon as
       | possible. One of the greatest decisions of my career.
       | 
       | Memory model semantics is one of the parts of systems programming
       | that is generally poorly understood; I have had long discussions
       | with senior programmers who have spent their careers carelessly
       | threading their code without realizing what is happening under
       | the hood. Not only did some of them not properly understand the
       | acquire/release model, but they were not even aware of its
       | existence.
       | 
        | For a more in-depth explanation, I recommend Sutter's
        | excellent talk "atomic<> Weapons":
        | https://www.youtube.com/watch?v=A8eCGOqgvH4. It is a two-hour
        | lecture, but it will be worth your time.
        
         | toast0 wrote:
         | > Decades ago I declared myself too stupid to use shared memory
         | with threading; I have learned to avoid this whenever possible,
         | or abstract away the memory access under a safe layer as soon
         | as possible. One of the greatest decisions of my career.
         | 
          | This probably qualifies you to do shared-memory threading
          | when it is needed, such as debugging those abstraction
          | layers. Knowing you don't understand it puts you in the
          | right frame of mind to do the work carefully.
        
       | newpavlov wrote:
        | For a more in-depth understanding, I recommend reading "Rust
        | Atomics and Locks": https://marabos.nl/atomics/ It uses Rust
        | for examples, but it is more or less applicable to C/C++ as
        | well.
        
       | chrisaycock wrote:
       | The first explanation that really clarified memory barriers for
       | me was from Paul E. McKenney:
       | 
       | http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...
        
         | JonChesterfield wrote:
         | That one plus the kernel notes at
         | https://www.kernel.org/doc/Documentation/memory-barriers.txt
         | did the trick for me.
        
       | king_geedorah wrote:
       | No matter how many times I read about these, I'm always left just
       | slightly confused. Acquire/release is about the ordering of
       | instructions within a thread of execution. So if I have a writer
       | writing X, then writing Y, then I need write-release to make sure
       | that the compiler actually puts the instructions for Y after the
       | instructions for X in the machine code. If I want to guarantee
       | that the results of those writes are visible to another thread,
       | then I need a memory fence to force flushing of the caches out to
       | main memory basically?
       | 
        | The author mentions that the queue as originally written is
        | technically correct for x86/x86_64. This is true only in the
        | sense that neither the producer nor the consumer can
        | experience partial reads or partial writes, right? It is still
        | possible that
       | if, say, the consumer were busy waiting for an item to be
       | available then it will spin for as long as it takes for the CPU
       | to decide to flush the producer's writes out to main memory?
       | 
       | Whenever I see these concepts discussed, it is in the context of
       | the C++ stdatomic library. If I were writing a program in C99, I
       | would assume it would still be possible to communicate the same
       | intent / restrictions to the compiler, but I'm not sure because I
       | haven't been able to find any resources that discuss doing so.
       | How might one communicate that to the compiler, assuming they are
       | on x86/x86_64 where in theory the CPU should just do the right
       | thing with the right machine code?
       | 
       | Finally, does target architecture influence the compiler's
       | behavior in this regard at all? For example, if we take
       | x86/x86_64 as having acquire/release semantics without any
       | further work, does telling the compiler that my target
       | architecture is x86/x86_64 imply that those semantics should be
       | used throughout the program?
        
         | gpderetta wrote:
         | > Acquire/release is about the ordering of instructions within
         | a thread of execution
         | 
          | Acquire/release is about visibility. Acquire and release
          | always come in pairs. You can't really reason about a
          | release or an acquire in isolation, and that's why simply
          | thinking about instruction reordering is not enough.
         | 
         | > So if I have a writer writing X, then writing Y, then I need
         | write-release to make sure that the compiler actually puts the
         | instructions for Y after the instructions for X in the machine
         | code. If I want to guarantee that the results of those writes
         | are visible to another thread, then I need a memory fence to
         | force flushing of the caches out to main memory basically
         | 
          | Whether an explicit memory fence is needed or not depends on
          | the architecture (for example, you do not need one on x86).
          | But you do not need to care: if you use the atomic operation
          | with the correct semantics, the compiler will insert any
          | required fence for you.
         | 
          | As an aside, fences typically have nothing to do with
          | caches. Once a store or a load operation hits the cache, the
          | coherence system takes care that everything works correctly.
          | If fences had to flush the cache, they would be orders of
          | magnitude slower.
         | 
          | Instead, fences (explicit or otherwise) make sure either
          | that memory operations commit (i.e. become visible at the
          | cache layer) in the expected order, or that an application
          | can't tell otherwise. That is, reordering is still permitted
          | across fences as long as conflicts can be detected and
          | repaired; typically that is only possible for loads, which
          | can be retried without side effects.
         | 
         | > Whenever I see these concepts discussed, it is in the context
         | of the C++ stdatomic library. If I were writing a program in
         | C99, I would assume it would still be possible to communicate
         | the same intent / restrictions to the compiler
         | 
          | Formally, in C99 multithreaded programs are UB. Of course,
          | other standards (POSIX, OpenMP) and implementations (the old
          | GCC __sync builtins) could give additional guarantees; but
          | only C11 gave a model defined well enough to reason in depth
          | about the overall CPU+compiler system. Before that, people
          | just had to make a lot of assumptions.
         | 
         | > Finally, does target architecture influence the compiler's
         | behavior in this regard at all? For example, if we take
         | x86/x86_64 as having acquire/release semantics without any
         | further work, does telling the compiler that my target
         | architecture is x86/x86_64 imply that those semantics should be
         | used throughout the program?
         | 
          | It does, but note that the compiler will only respect
          | acquire/release semantics for operations on atomic objects
          | with the required ordering, not for normal loads and stores.
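          | 
          | For example (a sketch; the names are mine), the same source
          | line gets whatever fence the target needs, and plain
          | accesses get none of these guarantees:
          | 
          |     #include <stdatomic.h>
          |     
          |     _Atomic int flag;
          |     int plain;
          |     
          |     void f(void) {
          |         /* Release store on an atomic object: typically a
          |          * plain mov on x86 (TSO already orders it) but a
          |          * stlr on AArch64. */
          |         atomic_store_explicit(&flag, 1,
          |                               memory_order_release);
          |     
          |         /* Plain store: no ordering guarantees at all; the
          |          * compiler may reorder, merge, or eliminate it. */
          |         plain = 1;
          |     }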
        
         | moonchild wrote:
         | > If I were writing a program in C99, I would assume it would
         | still be possible to communicate the same intent / restrictions
         | to the compiler, but I'm not sure because I haven't been able
         | to find any resources that discuss doing so
         | 
          | You cannot. See Boehm, 'Threads cannot be implemented as a
          | library'
          | (https://web.archive.org/web/20240118063106if_/https://www.hp...).
          | You can do that in C11, however, which includes functionally
          | the same facilities as C++11 (howbeit C-flavoured,
          | obviously). https://en.cppreference.com/w/c/atomic
         | 
         | > does target architecture influence the compiler's behavior in
         | this regard at all? For example, if we take x86/x86_64 as
         | having acquire/release semantics without any further work, does
         | telling the compiler that my target architecture is x86/x86_64
         | imply that those semantics should be used throughout the
         | program?
         | 
         | It does not imply that. You should completely ignore the target
         | architecture. C11 provides an abstract interface and a set of
         | abstract constraints for concurrent programs; you should
         | program against that interface, ensuring that your code is
         | correct under those constraints. The compiler is responsible
         | for making sure that the constraints are satisfied on whatever
         | target you happen to run (so your code will be portable!).
         | 
         | > if I have a writer writing X, then writing Y, then I need
         | write-release to make sure that the compiler actually puts the
         | instructions for Y after the instructions for X in the machine
         | code
         | 
          | You need Y to be a write-release if you would like it to be
          | the case that, if another thread acquire-reads and observes
          | Y, then it will also observe X. (The classic example is
          | 'message passing', where X is a message and Y is a flag
          | saying that there is a message. Obviously, it would be bad
          | if you could see the flag but not actually the message.) But
          | maybe you don't need that property.
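          | 
          | A sketch of that message-passing pattern in C11 (the names
          | are mine):
          | 
          |     #include <stdatomic.h>
          |     
          |     int msg;              /* X: the message, plain data */
          |     _Atomic int has_msg;  /* Y: the flag */
          |     
          |     void send(void) {
          |         msg = 42;                       /* write X */
          |         atomic_store_explicit(&has_msg, 1,
          |             memory_order_release);      /* write-release Y */
          |     }
          |     
          |     int try_recv(void) {
          |         /* If this acquire load observes Y == 1, it is
          |          * guaranteed to observe the write to X too. */
          |         if (atomic_load_explicit(&has_msg,
          |                 memory_order_acquire) == 1)
          |             return msg;
          |         return -1;  /* no message yet */
          |     }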
         | 
         | > If I want to guarantee that the results of those writes are
         | visible to another thread, then I need a memory fence to force
         | flushing of the caches out to main memory basically?
         | 
         | No. That's not what a fence does and that's not how caches
         | work.
        
           | king_geedorah wrote:
            | Understood. Thank you for the threads-as-a-library link;
            | very interesting.
        
         | neerajsi wrote:
          | I just wanted to clarify something about flushing caches:
          | fences do not flush the caches in any way. Inside the CPU
          | there is a data structure called the load-store queue. It
          | keeps track of pending loads and stores, of which there
          | could be many. This is done so that the processor can run
          | ahead and request things from the caches, or have them
          | populated into the caches, without having to stop dead the
          | moment it has to wait for any one access. The memory fencing
          | influences how entries in the load-store queue are allowed
          | to provide values to the rest of the CPU execution units. On
          | weakly ordered processors like ARM, the load-store queue is
          | allowed to forward values to the execution pipelines as soon
          | as they are available from the caches, except if a store and
          | a load are to the same address. x86 only allows values to go
          | from loads to the pipeline in program order. It can start
          | operations early, but if it detects that a store comes in
          | for a load that is not the oldest, it has to throw away the
          | work done based on the speculated load.
         | 
          | Stores are a little special in that the CPU can declare a
          | store complete without actually writing data to the cache
          | system. So stores go into a store buffer while the target
          | cache line is still being acquired. Loads have to check the
          | store buffer. On x86 the store buffer releases values to the
          | cache in order, and on ARM the store buffer drains in any
          | order. However, both architectures allow loads to read
          | values from the store buffer without them being in the cache
          | and without the normal load queue ordering. They also allow
          | loads to different addresses to complete before older
          | stores. So on x86 a store followed by a load can execute as
          | the load first, then the store.
         | 
         | Fences logically force the store buffer to flush and the load
         | queue to resolve values from the cache. So everything before
         | the fence is in the caching subsystem, where standard coherency
         | ensures they're visible when requested. Then new operations
         | start filling the load store queue, but they are known to be
         | later than operations before the fence.
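          | 
          | The classic store-buffering litmus test shows the effect; a
          | sketch in C11 (the names are mine):
          | 
          |     #include <stdatomic.h>
          |     
          |     /* With relaxed accesses (or plain stores/loads on
          |      * x86), r1 == 0 and r2 == 0 can both be observed,
          |      * because each store sits in the store buffer while
          |      * the following load runs. A seq_cst fence between
          |      * store and load forbids that outcome. */
          |     _Atomic int x, y;
          |     int r1, r2;
          |     
          |     void thread0(void) {
          |         atomic_store_explicit(&x, 1, memory_order_relaxed);
          |         /* atomic_thread_fence(memory_order_seq_cst); */
          |         r1 = atomic_load_explicit(&y, memory_order_relaxed);
          |     }
          |     
          |     void thread1(void) {
          |         atomic_store_explicit(&y, 1, memory_order_relaxed);
          |         /* atomic_thread_fence(memory_order_seq_cst); */
          |         r2 = atomic_load_explicit(&x, memory_order_relaxed);
          |     }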
        
           | king_geedorah wrote:
            | That clarifies fences for me a little bit more. Thanks for
            | the insight.
        
         | JonChesterfield wrote:
         | If you program without fences, instructions are reordered by
         | the compiler and/or the processor as they see fit provided the
         | single threaded semantics are unchanged. This is why alias
         | analysis is a big deal. A store can be moved before a load if
         | they are independent, i.e. do not alias. A store followed by a
         | load can be simplified to use the value already in a register
         | if there is no other store in the meantime.
         | 
         | This doesn't work if there are multiple threads and shared
         | mutable state. Whatever semantics the programmer had in mind,
         | and encoded in load/store patterns, are usually insufficient
         | for correctness under arbitrary interleaving of threads.
         | 
          | This is fixed by introducing additional constraints on which
          | instructions can be reordered. Fences constrain loads,
          | stores, or both, usually with respect to all of memory but
          | potentially only a subset of it. The point is to say that
          | moving some operation past some other one would cause
          | unacceptable behaviour for this program, so neither compiler
          | nor CPU shall do so.
         | 
          | On top of this there's the C++ memory order model, where you
          | can tag an integer add with acq_rel semantics, or specify
          | fences, all reasoned about in terms of "synchronises-with"
          | relations and a global oracle determining acceptable
          | execution sequences. I think this is grossly overcomplicated
          | and heavily obfuscates the programming model to no gain.
          | Fortunately, one can mechanically desugar it into fences and
          | reason with the result.
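          | 
          | A sketch of that desugaring in C11 (the fence forms are at
          | least as strong as the operation forms, which is enough for
          | reasoning):
          | 
          |     #include <stdatomic.h>
          |     
          |     _Atomic int y;
          |     
          |     /* Release store, operation form... */
          |     void store_release(int v) {
          |         atomic_store_explicit(&y, v, memory_order_release);
          |     }
          |     
          |     /* ...and fence form: no earlier memory operation may
          |      * move below the fence. */
          |     void store_release_fenced(int v) {
          |         atomic_thread_fence(memory_order_release);
          |         atomic_store_explicit(&y, v, memory_order_relaxed);
          |     }
          |     
          |     /* Acquire load, fence form: no later memory operation
          |      * may move above the fence. */
          |     int load_acquire_fenced(void) {
          |         int v = atomic_load_explicit(&y,
          |                                      memory_order_relaxed);
          |         atomic_thread_fence(memory_order_acquire);
          |         return v;
          |     }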
        
       | ajross wrote:
       | This bit isn't quite correct:
       | 
       | > _Acquire and release semantics don't have any meaning on Intel-
       | and AMD-brand processors. On x86 CPUs, which is to say, on just
       | about any Intel- or AMD-brand CPU you can buy right now, memory
       | operations on a single CPU core happen in program order._
       | 
       | In fact x86 CPUs do allow themselves to reorder reads around
       | other reads[1]. The rule is that no memory access is allowed to
       | cross a write operation[2]. The distinction isn't important to
       | traditional critical section analysis like the article is doing,
       | but there are lockless algorithms out there that depend on fully-
       | ordered reads. Dekker's famous (but mostly useless in practice)
       | algorithm for mutual exclusion is one.
       | 
       | [1] Note here I'm using the more traditional and frankly much
       | clearer terminology about actual hardware behavior and not the
       | frustrating abstractions embraced by the language design
       | community.
       | 
       | [2] Or one of a handful of "serializing instructions", the most
       | commonly relied on being LOCK CMPXCHG
        
         | moonchild wrote:
         | > x86 CPUs do allow themselves to reorder reads around other
         | reads. The rule is that no memory access is allowed to cross a
         | write operation
         | 
         | That's not correct. Intel manual, vol 3, sec. 9.2.2:
         | 
         | > Reads are not reordered with other reads
         | 
         | > Writes are not reordered with older reads.
         | 
         | > Writes to memory are not reordered with other writes
         | 
          | A read _may_ be reordered w.r.t. an older write (and hence
          | tfa is incorrect that po=ppo on x86), but reads are not
          | reordered with other reads. You can see in the C/C++
          | processor mappings here
          | https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
          | that load-acquire/store-release can be mapped to plain loads
          | and stores on x86.
        
           | ajross wrote:
           | Oops, indeed. Must be a bit flip error in my memory.
        
         | moonchild wrote:
         | > more traditional and frankly much clearer terminology about
         | actual hardware behavior and not the frustrating abstractions
         | embraced by the language design community
         | 
          | Axiomatic memory models were pushed as much from the
          | hardware world as from the software world, if not more so,
          | and they exist for a reason. Overfitting obligatorily
          | abstract models to the behaviour of a particular
          | microarchitecture benefits neither hardware nor software
          | writers.
        
           | ajross wrote:
           | > Overfitting obligatorily abstract models to the behaviour
           | of a particular microarchitecture benefits neither hardware
           | nor software writers
           | 
            | I'm too dumb to know what the first clause means. But the
            | second is silly: understanding how to control the ordering
            | of the actual memory operations (i.e. the mapping to
            | actual behavior as seen on a bus/interconnect somewhere)
            | can _only_ benefit hardware writers (because it tells them
            | how to implement and explain what's happening) and
            | software writers (because it's simple).
           | 
           | You don't see articles like TFA about read and write
           | barriers, because they're obvious. Acquire/release is
           | abstract egghead stuff, which means it's almost certainly a
           | bad API.
        
             | gary_0 wrote:
              | What moonchild meant is that higher-level semantics are
              | necessary because directly describing what the hardware
              | is doing won't work: almost every model of CPU does
              | something slightly different.
             | 
             | Your argument that the higher-level terminology the
             | industry settled on is confusing is valid (albeit
             | subjective).
             | 
             | > Acquire/release is abstract egghead stuff
             | 
             | The article explains why this functionality is named that
             | way, and why it's necessary. It's even kind of convenient,
             | because in one line of standardized code you tell the
             | compiler what you want, and it takes care of all the ugly
             | platform-specific gunk, guaranteed!
        
               | ajross wrote:
               | > What moonchild meant
               | 
               | To spell it out: I know very well what moonchild meant,
               | and am no stranger to lockless algorithm design or memory
               | ordering semantics. It was a turn of phrase riffing on
               | the point that "memory ordering semantics are APIs" and
               | thus should have been designed with an eye toward clarity
               | and comprehension for the working programmers who need to
               | use them.
               | 
               | Acquire/release was intended to make the job of a
               | compiler author easier, by trying to home in on the
               | differences between hardware that might be reasonably
               | abstracted as a single API. And I'm saying that's a bad
               | trade, because the compiler nerds are doing just fine and
               | what the dumb jocks in the trenches want is read and
               | write barriers.
        
         | gpderetta wrote:
         | Indeed, program ordering is too strong. x86 is TSO, which, as
         | mentioned elsethread, allows for Store/Load reordering (but not
         | Load/Load)
         | 
         | Dekker per-se is not terribly useful, but once you know that
         | pattern (write to one variable and check a separate variable to
         | make a decision) you start to recognise it in many lock-free
         | algorithms that try to be clever with plain stores and loads.
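          | 
          | A sketch of that pattern in C11 (the core of Dekker; the
          | names are mine). seq_cst is what rules out the Store/Load
          | reordering; with release/acquire alone, both threads could
          | see the other's flag as false and enter together:
          | 
          |     #include <stdatomic.h>
          |     #include <stdbool.h>
          |     
          |     _Atomic bool flag0, flag1;
          |     
          |     /* Thread 0; thread 1 is symmetric with the flags
          |      * swapped. */
          |     bool try_enter_0(void) {
          |         atomic_store_explicit(&flag0, true,
          |                               memory_order_seq_cst);
          |         if (atomic_load_explicit(&flag1,
          |                                  memory_order_seq_cst)) {
          |             /* Contended: back off. */
          |             atomic_store_explicit(&flag0, false,
          |                                   memory_order_seq_cst);
          |             return false;
          |         }
          |         return true;  /* safe to enter */
          |     }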
        
       ___________________________________________________________________
       (page generated 2024-05-11 23:00 UTC)