[HN Gopher] Making Sense of Acquire-Release Semantics
___________________________________________________________________
Making Sense of Acquire-Release Semantics
Author : sph
Score : 68 points
Date : 2024-05-10 08:59 UTC (1 day ago)
(HTM) web link (davekilian.com)
(TXT) w3m dump (davekilian.com)
| catlifeonmars wrote:
| Minor quibble: you _can_ linearize small amounts of memory using
| atomic access. You just need to ensure that your memory fits
| within the size of a single atomic access. For example, storing
| two uint32 values as a single uint64 when atomic access to
| uint64 is available.
| mevric wrote:
| I am curious to understand how the following is achieved. Is
| there any material on this?
|
| "storing two uint32 as a uint64"
| JonChesterfield wrote:
| Put them next to each other, 8-byte align the first one, use
| a compiler mechanism to disable alias analysis, then do the
| uint64 store. __attribute__((may_alias)) is the local
| override, -fno-strict-aliasing the global one.
|
| I think C++ can now do "these bytes are now that type",
| called something like std::start_lifetime_as. C probably
| can't, though using a union might be legitimate. The language
| rules in this area are a mess.
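|
| A minimal sketch of the packing idea (my own illustration,
| not the attribute-based approach above): in C++ one can
| sidestep the aliasing question entirely by packing the two
| halves by hand into a std::atomic<uint64_t>:
|
|     #include <atomic>
|     #include <cstdint>
|
|     // Both 32-bit halves travel in one atomic 64-bit access.
|     std::atomic<uint64_t> packed{0};
|
|     void store_pair(uint32_t lo, uint32_t hi) {
|         packed.store((uint64_t)hi << 32 | lo,
|                      std::memory_order_release);
|     }
|
|     void load_pair(uint32_t &lo, uint32_t &hi) {
|         uint64_t v = packed.load(std::memory_order_acquire);
|         lo = (uint32_t)v;
|         hi = (uint32_t)(v >> 32);
|     }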
| zevv wrote:
| Well written article, nice and to the point. Do recommend.
|
| Decades ago I declared myself too stupid to use shared memory
| with threading; I have learned to avoid this whenever possible,
| or abstract away the memory access under a safe layer as soon as
| possible. One of the greatest decisions of my career.
|
| Memory model semantics is one of the parts of systems programming
| that is generally poorly understood; I have had long discussions
| with senior programmers who have spent their careers carelessly
| threading their code without realizing what is happening under
| the hood. Not only did some of them not properly understand the
| acquire/release model, but they were not even aware of its
| existence.
|
| For a more in-depth explanation, I recommend Sutter's excellent
| talk "atomic<> Weapons":
| https://www.youtube.com/watch?v=A8eCGOqgvH4. It is a two-hour
| lecture, but it will be worth your time.
| toast0 wrote:
| > Decades ago I declared myself too stupid to use shared memory
| with threading; I have learned to avoid this whenever possible,
| or abstract away the memory access under a safe layer as soon
| as possible. One of the greatest decisions of my career.
|
| This probably qualifies you to do shared-memory threading when
| it is needed, such as when debugging those abstraction layers.
| Knowing you don't understand it puts you in the right frame of
| mind to carefully do the work.
| newpavlov wrote:
| For a more in-depth understanding, I recommend reading "Rust
| Atomics and Locks": https://marabos.nl/atomics/ It uses Rust for
| its examples, but it is more or less applicable to C/C++ as
| well.
| chrisaycock wrote:
| The first explanation that really clarified memory barriers for
| me was from Paul E. McKenney:
|
| http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...
| JonChesterfield wrote:
| That one plus the kernel notes at
| https://www.kernel.org/doc/Documentation/memory-barriers.txt
| did the trick for me.
| king_geedorah wrote:
| No matter how many times I read about these, I'm always left just
| slightly confused. Acquire/release is about the ordering of
| instructions within a thread of execution. So if I have a writer
| writing X, then writing Y, then I need write-release to make sure
| that the compiler actually puts the instructions for Y after the
| instructions for X in the machine code. If I want to guarantee
| that the results of those writes are visible to another thread,
| then I need a memory fence to force flushing of the caches out to
| main memory basically?
|
| The author mentions the queue as originally written is
| technically correct for x86/x86_64. This is true only in the
| sense that neither the producer nor the consumer can experience
| partial reads / partial writes, right? It is still possible
| that if, say, the consumer were busy-waiting for an item to be
| available, then it would spin for as long as it takes for the
| CPU to decide to flush the producer's writes out to main
| memory?
|
| Whenever I see these concepts discussed, it is in the context of
| the C++ stdatomic library. If I were writing a program in C99, I
| would assume it would still be possible to communicate the same
| intent / restrictions to the compiler, but I'm not sure because I
| haven't been able to find any resources that discuss doing so.
| How might one communicate that to the compiler, assuming they are
| on x86/x86_64 where in theory the CPU should just do the right
| thing with the right machine code?
|
| Finally, does target architecture influence the compiler's
| behavior in this regard at all? For example, if we take
| x86/x86_64 as having acquire/release semantics without any
| further work, does telling the compiler that my target
| architecture is x86/x86_64 imply that those semantics should be
| used throughout the program?
| gpderetta wrote:
| > Acquire/release is about the ordering of instructions within
| a thread of execution
|
| acquire/release is about visibility. Acquire and release always
| go in pairs. You can't really reason about a release or an
| acquire in isolation, and that's why simply thinking about
| instruction reordering is not enough.
|
| > So if I have a writer writing X, then writing Y, then I need
| write-release to make sure that the compiler actually puts the
| instructions for Y after the instructions for X in the machine
| code. If I want to guarantee that the results of those writes
| are visible to another thread, then I need a memory fence to
| force flushing of the caches out to main memory basically
|
| Whether an explicit memory fence is needed or not depends on
| the architecture (for example, you do not need one on x86). But
| you do not need to care: if you use the atomic operation with
| the correct semantics, the compiler will insert any required
| fence for you.
|
| As an aside, fences typically have nothing to do with caches.
| Once a store or a load operation hits the cache, the coherence
| system takes care that everything works correctly. If fences
| had to flush the cache, they would be orders of magnitude
| slower.
|
| Instead, fences (explicit or otherwise) make sure that either
| memory operations commit (i.e. become visible at the cache
| layer) in the expected order, or that an application can't tell
| otherwise. That is, reordering is still permitted across fences
| as long as conflicts can be detected and repaired; typically
| this is only possible for loads, which can be retried without
| side effects.
|
| > Whenever I see these concepts discussed, it is in the context
| of the C++ stdatomic library. If I were writing a program in
| C99, I would assume it would still be possible to communicate
| the same intent / restrictions to the compiler
|
| Formally, multithreaded programs are UB in C99. Of course,
| other standards (POSIX, OpenMP) and implementations (the old
| GCC __sync builtins) could give additional guarantees; but only
| C11 gave a model defined well enough to reason in depth about
| the overall CPU+compiler system. Before that, people just had
| to make a lot of assumptions.
|
| > Finally, does target architecture influence the compiler's
| behavior in this regard at all? For example, if we take
| x86/x86_64 as having acquire/release semantics without any
| further work, does telling the compiler that my target
| architecture is x86/x86_64 imply that those semantics should be
| used throughout the program?
|
| It does, but note that the compiler will only respect
| acquire/release semantics for operations on atomic objects with
| the required ordering, not for normal loads and stores.
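|
| As an illustrative sketch of the pairing (my own example, not
| from the article): a spinlock's lock is an acquire and its
| unlock a release, and it is the pair that makes every write
| done inside the critical section visible to the next thread
| that takes the lock:
|
|     #include <atomic>
|
|     std::atomic<bool> locked{false};
|
|     void lock() {
|         // Acquire: pairs with the release in unlock(). Once
|         // the exchange observes the `false` written by
|         // unlock(), all writes made before that unlock are
|         // visible to this thread.
|         while (locked.exchange(true, std::memory_order_acquire)) {
|             // spin
|         }
|     }
|
|     void unlock() {
|         // Release: publishes the critical section's writes.
|         locked.store(false, std::memory_order_release);
|     }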
| moonchild wrote:
| > If I were writing a program in C99, I would assume it would
| still be possible to communicate the same intent / restrictions
| to the compiler, but I'm not sure because I haven't been able
| to find any resources that discuss doing so
|
| You cannot. See Boehm, 'Threads cannot be implemented as a
| library' (https://web.archive.org/web/20240118063106if_/https:
| //www.hp...). You can do that in C11, however, which includes
| functionally the same facilities as C++11 (howbeit C-flavoured,
| obviously). https://en.cppreference.com/w/c/atomic
|
| > does target architecture influence the compiler's behavior in
| this regard at all? For example, if we take x86/x86_64 as
| having acquire/release semantics without any further work, does
| telling the compiler that my target architecture is x86/x86_64
| imply that those semantics should be used throughout the
| program?
|
| It does not imply that. You should completely ignore the target
| architecture. C11 provides an abstract interface and a set of
| abstract constraints for concurrent programs; you should
| program against that interface, ensuring that your code is
| correct under those constraints. The compiler is responsible
| for making sure that the constraints are satisfied on whatever
| target you happen to run (so your code will be portable!).
|
| > if I have a writer writing X, then writing Y, then I need
| write-release to make sure that the compiler actually puts the
| instructions for Y after the instructions for X in the machine
| code
|
| You need Y to be a write-release if you would like it to be the
| case that, if another thread acquire-reads and observes Y, then
| it will also observe X. (The classic example is 'message
| passing', where X is a message and Y is a flag saying that
| there is a message. Obviously, it would be bad if you could see
| the flag but not actually the message.) But maybe you don't
| need that property.
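|
| A minimal C++ sketch of that message-passing pattern (names
| are mine):
|
|     #include <atomic>
|
|     int message = 0;                 // X: plain data
|     std::atomic<bool> ready{false};  // Y: the flag
|
|     void producer() {
|         message = 42;                                  // X
|         ready.store(true, std::memory_order_release);  // Y
|     }
|
|     void consumer() {
|         // The acquire pairs with the release above: if we
|         // observe the flag, we observe the message too.
|         while (!ready.load(std::memory_order_acquire)) {
|             // spin
|         }
|         int m = message;  // guaranteed to read 42
|         (void)m;
|     }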
|
| > If I want to guarantee that the results of those writes are
| visible to another thread, then I need a memory fence to force
| flushing of the caches out to main memory basically?
|
| No. That's not what a fence does and that's not how caches
| work.
| king_geedorah wrote:
| Crystal clear. Thank you for the threads-as-a-library link;
| very interesting.
| neerajsi wrote:
| I just wanted to clarify something about flushing caches:
| fences do not flush the caches in any way. Inside the CPU there
| is a data structure called the load store queue. It keeps track
| of pending loads and stores, of which there could be many. This
| is done so that the processor can run ahead and request things
| from the caches, or request that they be populated into the
| caches, without having to stop dead the moment it has to wait
| for any one access. The memory fencing influences how entries
| in the load store queue are allowed to provide values to the
| rest of the CPU execution units. On weakly ordered processors
| like ARM, the load store queue is allowed to forward values to
| the execution pipelines as soon as they are available from the
| caches, except when a store and a load are to the same
| address. X86 only allows
| values to go from loads to the pipeline in program order. It
| can start operations early, but if it detects that a store
| comes in for a load that's not the oldest it has to throw away
| the work done based on the speculated load.
|
| Stores are a little special in that the CPU can declare a store
| as complete without actually writing data to the cache system.
| So the stores go into a store buffer while the target cache
| line is still being acquired. Loads have to check the store
| buffer. On x86 the store buffer releases values to the cache in
| order, and on ARM the store buffer drains in any order. However
| both CPU architectures allow loads to read values from the
| store buffer without them being in the cache and without the
| normal load queue ordering. They also allow loads to occur to
| different addresses before earlier stores. So on x86 a store
| followed by a load can execute as the load first, then the
| store.
|
| Fences logically force the store buffer to flush and the load
| queue to resolve values from the cache. So everything before
| the fence is in the caching subsystem, where standard coherency
| ensures they're visible when requested. Then new operations
| start filling the load store queue, but they are known to be
| later than operations before the fence.
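|
| A hedged sketch of that store-then-load reordering, the
| classic Dekker-style litmus test (my own example):
|
|     #include <atomic>
|
|     std::atomic<int> x{0}, y{0};
|     int r1, r2;
|
|     void thread1() {
|         x.store(1, std::memory_order_seq_cst);
|         r1 = y.load(std::memory_order_seq_cst);
|     }
|
|     void thread2() {
|         y.store(1, std::memory_order_seq_cst);
|         r2 = x.load(std::memory_order_seq_cst);
|     }
|
|     // With seq_cst, the outcome r1 == 0 && r2 == 0 is
|     // forbidden. With plain stores and loads (or even
|     // release/acquire), each load can be satisfied while the
|     // other thread's store still sits in its store buffer,
|     // so both threads may read 0.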
| king_geedorah wrote:
| That clarifies fences for me a little bit more. Thanks for
| the insight.
| JonChesterfield wrote:
| If you program without fences, instructions are reordered by
| the compiler and/or the processor as they see fit provided the
| single threaded semantics are unchanged. This is why alias
| analysis is a big deal. A store can be moved before a load if
| they are independent, i.e. do not alias. A store followed by a
| load can be simplified to use the value already in a register
| if there is no other store in the meantime.
|
| This doesn't work if there are multiple threads and shared
| mutable state. Whatever semantics the programmer had in mind,
| and encoded in load/store patterns, are usually insufficient
| for correctness under arbitrary interleaving of threads.
|
| This is fixed by introducing additional constraints on which
| instructions can be reordered. Fences affect loads, stores, or
| both, usually with respect to all memory but potentially only a
| subset of it. The point is to say that moving some operation
| past some other one would cause unacceptable behaviour for this
| program, so neither compiler nor CPU shall do so.
|
| On top of this there's the C++ memory order model, where you
| can tag an integer add with acq_rel semantics, or specify
| fences, all reasoned in terms of 'synchronises-with' and a
| global oracle determining acceptable execution sequences. I
| think this is grossly over-complicated and heavily obfuscates
| the programming model to no gain. Fortunately one can
| mechanically desugar it into the fences and reason with the
| result.
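|
| A sketch of that desugaring (my own example; per the C++
| fence rules, the fence form is at least as strong as the
| tagged-operation form):
|
|     #include <atomic>
|
|     std::atomic<bool> flag{false};
|
|     // Tagged-operation form:
|     void publish_op() {
|         flag.store(true, std::memory_order_release);
|     }
|
|     // Fence form:
|     void publish_fence() {
|         std::atomic_thread_fence(std::memory_order_release);
|         flag.store(true, std::memory_order_relaxed);
|     }
|
|     // And on the reading side:
|     bool consume_fence() {
|         bool f = flag.load(std::memory_order_relaxed);
|         std::atomic_thread_fence(std::memory_order_acquire);
|         return f;
|     }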
| ajross wrote:
| This bit isn't quite correct:
|
| > _Acquire and release semantics don't have any meaning on Intel-
| and AMD-brand processors. On x86 CPUs, which is to say, on just
| about any Intel- or AMD-brand CPU you can buy right now, memory
| operations on a single CPU core happen in program order._
|
| In fact x86 CPUs do allow themselves to reorder reads around
| other reads[1]. The rule is that no memory access is allowed to
| cross a write operation[2]. The distinction isn't important to
| traditional critical section analysis like the article is doing,
| but there are lockless algorithms out there that depend on fully-
| ordered reads. Dekker's famous (but mostly useless in practice)
| algorithm for mutual exclusion is one.
|
| [1] Note here I'm using the more traditional and frankly much
| clearer terminology about actual hardware behavior and not the
| frustrating abstractions embraced by the language design
| community.
|
| [2] Or one of a handful of "serializing instructions", the most
| commonly relied on being LOCK CMPXCHG
| moonchild wrote:
| > x86 CPUs do allow themselves to reorder reads around other
| reads. The rule is that no memory access is allowed to cross a
| write operation
|
| That's not correct. Intel manual, vol 3, sec. 9.2.2:
|
| > Reads are not reordered with other reads
|
| > Writes are not reordered with older reads.
|
| > Writes to memory are not reordered with other writes
|
| A read _may_ be reordered w.r.t. an older write (and hence TFA
| is incorrect that po=ppo on x86), but reads are not reordered
| with other reads. You can see in the C/C++ processor mappings
| here https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
| that load-acquire/store-release can be mapped to plain loads
| and stores on x86.
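|
| As an illustrative sketch of that mapping (my own example;
| the lowerings in the comments follow the cited table):
|
|     #include <atomic>
|
|     std::atomic<int> a{0};
|
|     int load_acquire() {
|         // x86: a plain MOV suffices for an acquire load.
|         return a.load(std::memory_order_acquire);
|     }
|
|     void store_release(int v) {
|         // x86: a plain MOV suffices for a release store.
|         a.store(v, std::memory_order_release);
|     }
|
|     void store_seq_cst(int v) {
|         // x86: XCHG (or MOV + MFENCE), so the store cannot
|         // be reordered with a later load.
|         a.store(v, std::memory_order_seq_cst);
|     }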
| ajross wrote:
| Oops, indeed. Must be a bit flip error in my memory.
| moonchild wrote:
| > more traditional and frankly much clearer terminology about
| actual hardware behavior and not the frustrating abstractions
| embraced by the language design community
|
| Axiomatic memory models were pushed as much from the hardware
| world as from the software world, if not more so, and they
| exist for a reason. Overfitting obligatorily abstract models to
| the behaviour of a particular microarchitecture benefits
| neither hardware nor software writers.
| ajross wrote:
| > Overfitting obligatorily abstract models to the behaviour
| of a particular microarchitecture benefits neither hardware
| nor software writers
|
| I'm too dumb to know what the first clause means. But the
| second is silly: understanding how to control the ordering of
| the actual memory operations (i.e. the mapping to actual
| behavior as seen on a bus/interconnect somewhere) can _only_
| benefit hardware writers (because it tells them how to
| implement and explain what's happening) and software writers
| (because it's simple).
|
| You don't see articles like TFA about read and write
| barriers, because they're obvious. Acquire/release is
| abstract egghead stuff, which means it's almost certainly a
| bad API.
| gary_0 wrote:
| What moonchild meant is that higher-level semantics are
| necessary because directly describing what the hardware is
| doing won't work: almost every model of CPU does something
| slightly different.
|
| Your argument that the higher-level terminology the
| industry settled on is confusing is valid (albeit
| subjective).
|
| > Acquire/release is abstract egghead stuff
|
| The article explains why this functionality is named that
| way, and why it's necessary. It's even kind of convenient,
| because in one line of standardized code you tell the
| compiler what you want, and it takes care of all the ugly
| platform-specific gunk, guaranteed!
| ajross wrote:
| > What moonchild meant
|
| To spell it out: I know very well what moonchild meant,
| and am no stranger to lockless algorithm design or memory
| ordering semantics. It was a turn of phrase riffing on
| the point that "memory ordering semantics are APIs" and
| thus should have been designed with an eye toward clarity
| and comprehension for the working programmers who need to
| use them.
|
| Acquire/release was intended to make the job of a
| compiler author easier, by trying to home in on the
| differences between hardware that might be reasonably
| abstracted as a single API. And I'm saying that's a bad
| trade, because the compiler nerds are doing just fine and
| what the dumb jocks in the trenches want is read and
| write barriers.
| gpderetta wrote:
| Indeed, program ordering is too strong. x86 is TSO, which, as
| mentioned elsethread, allows for Store/Load reordering (but not
| Load/Load).
|
| Dekker per se is not terribly useful, but once you know the
| pattern (write to one variable and check a separate variable to
| make a decision) you start to recognise it in many lock-free
| algorithms that try to be clever with plain stores and loads.
___________________________________________________________________
(page generated 2024-05-11 23:00 UTC)