[HN Gopher] Programming Language Memory Models (2021)
       ___________________________________________________________________
        
       Programming Language Memory Models (2021)
        
       Author : fanf2
       Score  : 63 points
       Date   : 2024-12-12 09:42 UTC (13 hours ago)
        
 (HTM) web link (research.swtch.com)
 (TXT) w3m dump (research.swtch.com)
        
       | pvg wrote:
       | Discussion at the time
       | https://news.ycombinator.com/item?id=27750610
        
       | kmeisthax wrote:
       | I'd like to necro one particular comment from this discussion.
       | Someone said that "x86 cannot have acquire/release semantics",
       | because by default x86 has total store order (TSO).
       | 
       | My question: What stops Intel or AMD from providing an opt-in
       | weaker memory model? Or what would the programming model look
       | like for new programs that wanted to abandon TSO for better
       | performance? Would it just be weak order prefixes for existing
       | memory-altering instructions? Would you need a process-wide bit
       | to weakly order the whole program? Would that affect loaded
       | libraries (including OS-provided libraries) too? Would programs
       | dropping TSO potentially affect kernels or hypervisors above them
       | on the privilege hierarchy of the CPU?
        
         | jcranmer wrote:
         | To be pedantic, x86 _does_ have opt-in weaker memory models.
         | When you're setting up the TLB entries, you can configure how
         | strong the cache behavior is for the memory... which means in
         | practice, you can't really access these memory models unless
         | you're doing firmware, or you're using nontemporal stores [1].
         | 
         | On a more practical level, however, it's actually disruptive to
         | give the hardware optional operations with less-restrictive
         | semantics, because the other instructions might need more
         | fences to guarantee the necessary properties. For example, on
         | x86, compilers drop all explicit store fences because they're
         | "unnecessary" on the hardware. Adding operations that make
         | those fences necessary would affect even existing code, which
         | doesn't know about instructions that don't exist yet.
         | 
         | [1] Incidentally, this means trying to read the manual to
         | figure out what a nontemporal store actually does can feel like
         | turning your brain to mush.
        
           | kmeisthax wrote:
           | > For example, on x86, compilers drop all explicit store
           | fences because they're "unnecessary" on the hardware
           | 
           | I mean specifically for new programs that are written with
           | weaker memory models in mind. So you'd have to enable an
           | -fNOTSO flag on your compiler that emits all the fences that
           | would otherwise be skipped.
           | 
           | Looking at what I can find about nontemporal stores, they
           | sound like they already have a weird kind of release
           | semantics on x86, even though their intent was to avoid cache
           | thrashing, not so much to allow greater memory reordering.
           | Are these actually used in compilers?
        
             | jcranmer wrote:
             | Compilers will generally expose a nontemporal store
             | intrinsic, but don't expect them to actually try to use it
             | automatically.
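
A minimal sketch of what calling such an intrinsic by hand can look like, assuming an x86 target with SSE2. `_mm_stream_si32` and `_mm_sfence` are real intrinsics from `<emmintrin.h>`; the function name `fill_streaming`, the `HAVE_NT_STORE` macro, and the plain-store fallback for other targets are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si32, _mm_sfence
#define HAVE_NT_STORE 1
#else
#define HAVE_NT_STORE 0
#endif

// Fill a buffer with `value`, using nontemporal stores where available.
// Hypothetical helper for illustration only.
void fill_streaming(int* dst, std::size_t n, int value) {
    assert(dst != nullptr || n == 0);
#if HAVE_NT_STORE
    for (std::size_t i = 0; i < n; ++i) {
        // Nontemporal store: hints that the data should not be kept in cache.
        _mm_stream_si32(dst + i, value);
    }
    // Streaming stores are weakly ordered; the fence orders them before
    // any later stores (for example, a flag that publishes the buffer).
    _mm_sfence();
#else
    // Fallback for targets without SSE2: ordinary stores.
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = value;
    }
#endif
}
```

The explicit `_mm_sfence()` is the part that ordinary x86 stores never need, which is one way to see why compilers leave this choice to the programmer.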
        
             | anonymousDan wrote:
             | I think they were particularly beneficial in some workloads
             | for optane persistent memory
        
           | adrian_b wrote:
           | The cache behavior regarding the ordering of memory accesses
           | cannot be configured.
           | 
           | What you can select when mapping memory is to make some
           | regions uncached (with several variants, e.g. write-
           | combining).
           | 
           | Some of the kinds of uncached memory regions, especially
           | those whose names include "write-combining", implement weaker
           | orderings of the memory accesses in comparison with how
           | cacheable memory works on x86, where 3 of the 4 kinds of
           | memory access reorderings are prohibited (i.e. all except
           | that subsequent loads can be performed before prior stores).
           | 
           | However choosing a memory type with weakly-ordered accesses
           | cannot increase the performance on x86, because the loss of
           | performance by not using the cache is much greater.
           | 
           | The weakly-ordered accesses only recover a small part of the
           | performance loss caused by uncached memory.
        
         | 4ad wrote:
         | There is this myth that having a weaker memory model can lead
         | to higher performance. It sounds plausible, and it was a good
         | idea to explore at the end of the 80s, but in the end it turned
         | out that it's simply not substantiated by facts. Note that
         | Apple Silicon is TSO.
         | 
         | Related: https://kcsrk.info/papers/pldi18-memory.pdf
        
           | fwip wrote:
           | Apple's Mx chip has an instruction to enable TSO, which
           | Rosetta uses when running x86 code. I don't believe it uses
           | TSO when running native ARM code, but I could be mistaken.
        
           | SubjectToChange wrote:
           | _Note that Apple Silicon is TSO._
           | 
           | Apple Silicon supports both Arm's relaxed memory model and
           | Intel's TSO for backwards compatibility. Microsoft is doing
           | the same for Windows on Arm.
        
         | dang wrote:
         | (We detached this from
         | https://news.ycombinator.com/item?id=42400471 since it's a good
         | top-level comment)
        
         | SubjectToChange wrote:
         | _My question: What stops Intel or AMD from providing an opt-in
         | weaker memory model?_
         | 
         | Probably for the same reason Intel/AMD doesn't get rid of the
         | rest of the cruft in x86-64, i.e. backwards compatibility.
         | Additionally, there would be issues with Intel/AMD leveraging
         | an optional weak memory model in their chips without
         | compromising the performance of legacy TSO applications. They
         | are probably better off making x86-64 perform best under TSO.
        
         | gpderetta wrote:
         | > x86 cannot have acquire/release semantics
         | 
         | As I said at the time, this is nonsense. All stores on x86
         | have release semantics and all loads have acquire semantics.
         | 
         | Atomic RMW are sequentially consistent.
         | 
         | Technically you could implement relaxed stores via non temporal
         | stores but it would be pointless.
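
For readers less familiar with the terms, here is a small message-passing sketch in C++ that relies on exactly release-store and acquire-load semantics. The function and variable names are illustrative. On x86, mainstream compilers emit plain `mov` instructions for both the release store and the acquire load, which matches the claim above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// The producer writes a plain payload, then sets a flag with a release
// store. The consumer spins on an acquire load of the flag and then reads
// the payload. Illustrative helper, not taken from the thread.
int message_passing_once() {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Release: the payload write cannot be reordered after this store.
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        // Acquire: the payload read cannot be reordered before this load.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Replacing both orderings with `std::memory_order_relaxed` makes this a data race in the C++ model, yet the x86 hardware would still order the accesses. Only compiler reordering could expose the bug there.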
        
           | kmeisthax wrote:
           | For those watching along at home, we're talking about these
           | two comments from dragontamer:
           | 
           | - https://news.ycombinator.com/item?id=27753913
           | 
           | - https://news.ycombinator.com/item?id=27762749
           | 
           | I interpreted dragontamer's comment to mean "x86 does not
           | have relaxed stores", not "x86 does not have a strong memory
           | model". The (presumed) problem is that you can't actually
           | test acquire/release code on x86, because it won't crash if
           | you get it wrong.
           | 
           | There's a second level of this in C++'s memory model:
           | consume/release. I've no idea what the difference between
           | acquire and consume is; when I look it up there's usually
           | some reference to the DEC Alpha, a 30 year old workstation
           | and server chip nobody uses today which was legendary for
           | pushing the boundaries of memory ordering. My assumption is
           | that no hardware provides memory ordering exactly as weak as
           | consume, so I shouldn't bother asking for it, since it'll
           | just get strengthened to acquire on any chip that matters.
           | 
           | Also, from you and one other person, it sounds like relaxed
           | memory models aren't actually a performance benefit? Would
           | the opposite - ARM mandating TSO in ARMv9.1-A or whatever -
           | make sense? A lot of ink and code is wasted talking about
           | memory model strengths.
        
             | jcranmer wrote:
             | > There's a second level of this in C++'s memory model:
             | consume/release. I've no idea what the difference between
             | acquire and consume is; when I look it up there's usually
             | some reference to the DEC Alpha, a 30 year old workstation
             | and server chip nobody uses today which was legendary for
             | pushing the boundaries of memory ordering. My assumption is
             | that no hardware provides memory ordering exactly as weak
             | as consume, so I shouldn't bother asking for it, since
             | it'll just get strengthened to acquire on any chip that
             | matters.
             | 
             | You have it wrong here.
             | 
             | So the basic idea behind release/consume comes from an
             | observation that, on all hardware not named Alpha [1],
             | there is no hardware fence needed to guarantee that a load
             | happens-before another load that is data-dependent on the
             | first load. So the C++ committee decided to add a memory
             | model that would preserve this guarantee. A consume load is
             | exactly like an acquire load, but only for other loads that
             | are data-dependent on that load as opposed to all loads in
             | general. So whereas an acquire load requires a fence on
             | pretty much every hardware not x86 or SPARC in TSO mode, a
             | consume load would only require a fence on Alpha.
             | 
             | For various reasons, data-dependency isn't a property that
             | compilers are in the position of guaranteeing, which means
             | every compiler ended up implementing consume as the
             | stronger acquire instead, defeating its entire design goal
             | of eliminating unnecessary fences. There have been a couple
             | of attempts within the committee to try to find a path that
             | would let something like consume work by tweaking the necessary
             | data dependence definitions, but none of them have swayed
             | the implementers, so the current state is that they've
             | given up and let consume die.
             | 
             | Of particular note is that release/consume is not designed
             | to _support_ the Alpha memory model; in fact, it's quite
             | the opposite: it's designed to support everybody _but_
             | Alpha, since for Alpha (and essentially only Alpha),
             | release/consume is not usefully weaker than
             | release/acquire. Instead, the people who benefit are the
             | PPCs and the ARMs of the world, for whom release/consume is
             | a better approximation of their native memory model and
             | would allow many fences to be omitted.
             | 
             | [1] Actually, some accelerator hardware might have memory
             | models as weak as Alpha here, but I'm far less familiar
             | with memory models as they apply to accelerators.
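
The pattern consume was designed for is pointer publication, where the reader's second load is data-dependent on the first. A hedged sketch follows; the `Config` type and the function name are invented for illustration. It is written with `memory_order_acquire`, since per the comment above that is what compilers implement consume as anyway.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Config {
    int value;
};

// The writer initializes an object and publishes a pointer to it. The
// reader loads the pointer and then dereferences it, so the second load
// depends on the result of the first. Illustrative helper only.
int read_published_value() {
    Config cfg{0};
    std::atomic<Config*> published{nullptr};
    int seen = -1;

    std::thread writer([&] {
        cfg.value = 7;
        published.store(&cfg, std::memory_order_release);
    });
    std::thread reader([&] {
        Config* p = nullptr;
        // memory_order_consume was intended for this load. Acquire is used
        // here because compilers strengthen consume to acquire in practice.
        while ((p = published.load(std::memory_order_acquire)) == nullptr) {
            std::this_thread::yield();
        }
        seen = p->value;  // data-dependent on the load of p
    });

    writer.join();
    reader.join();
    return seen;
}
```

On ARM or POWER, the acquire load costs a fence or a special load instruction, while the dependency through `p` alone would already have ordered the two loads in hardware. That gap is the saving consume was supposed to capture.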
        
           | adrian_b wrote:
           | It is correct that on x86 all normal stores have release
           | semantics and all normal loads have acquire semantics (only a
           | few instructions behave differently, i.e. the string
           | instructions and the streaming "non-temporal" store
           | instructions).
           | 
           | However the ordering properties of the x86 stores and loads
           | are stronger than that. An x86 store is not only a release
           | store; it also has an additional property that could be
           | called being an "initial store", i.e. a store that is
           | guaranteed to be performed before all subsequent stores.
           | 
           | An x86 load is not only an acquire load; it also has an
           | additional property that could be called being a "final
           | load", i.e. a load that is guaranteed to be performed after
           | all prior loads.
           | 
           | The research paper that introduced the concepts of acquire
           | loads and release stores (in May 1990) was mistaken. It
           | claimed that these 2 kinds of accesses are enough for the
           | synchronization of accesses to shared resources.
           | 
           | While it is true that acquire loads and release stores are
           | sufficient for implementing critical sections, there are
           | other algorithms for accessing shared resources that need the
           | other 2 kinds of ordered loads and stores, i.e. final loads
           | and initial stores.
           | 
           | On x86, this does not matter, because the normal loads and
           | stores provide all 4 kinds of ordered loads and stores, so
           | any algorithm can be implemented without using memory
           | barriers or other special instructions.
           | 
           | In ISAs that provide only 2 kinds of ordered loads and
           | stores, i.e. acquire loads and release stores, like Arm
           | Aarch64, the other 2 kinds of ordered loads and stores must
           | be synthesized using memory barriers, i.e. an initial store
           | is made by a weakly-ordered store followed by a store
           | barrier, while a final load is made by a weakly-ordered load
           | preceded by a load barrier.
           | 
           | Arm Aarch64 does not have load barriers, but it has stronger
           | acquire barriers, which can always be used instead of any
           | load barriers. The Arm Aarch64 acquire barrier has the
           | confusing mnemonic DMB.LD, apparently inspired by the x86
           | LFENCE, which is also an acquire barrier, not a load barrier,
           | despite its confusing mnemonic.
           | 
           | (a load barrier guarantees that all prior loads are performed
           | before all subsequent loads; an acquire barrier guarantees
           | that all prior loads are performed before all subsequent
           | loads _and_ stores; such memory barriers are stronger than
           | necessary; the weaker ordering properties provided by the 4
           | kinds of ordered loads and stores are sufficient)
        
       ___________________________________________________________________
       (page generated 2024-12-12 23:00 UTC)