[HN Gopher] Programming Language Memory Models (2021)
___________________________________________________________________
Programming Language Memory Models (2021)
Author : fanf2
Score : 63 points
Date : 2024-12-12 09:42 UTC (13 hours ago)
(HTM) web link (research.swtch.com)
(TXT) w3m dump (research.swtch.com)
| pvg wrote:
| Discussion at the time
| https://news.ycombinator.com/item?id=27750610
| kmeisthax wrote:
| I'd like to necro one particular comment from this discussion.
| Someone said that "x86 cannot have acquire/release semantics",
| because by default x86 has total store order (TSO).
|
| My question: What stops Intel or AMD from providing an opt-in
| weaker memory model? Or what would the programming model look
| like for new programs that wanted to abandon TSO for better
| performance? Would it just be weak order prefixes for existing
| memory-altering instructions? Would you need a process-wide bit
| to weakly order the whole program? Would that affect loaded
| libraries (including OS-provided libraries) too? Would programs
| dropping TSO potentially affect kernels or hypervisors above them
| on the privilege hierarchy of the CPU?
| jcranmer wrote:
| To be pedantic, x86 _does_ have opt-in weaker memory models.
| When you're setting up the TLB entries, you can configure how
| strong the cache behavior is for the memory... which means in
| practice, you can't really access these memory models unless
| you're doing firmware, or you're using nontemporal stores [1].
|
| On a more practical level, however, it's actually disruptive to
| make the hardware have optional less-restrictive semantic
| operations, because the other instructions might need more
| fences to guarantee the necessary properties. For example, on
| x86, compilers drop all explicit store fences because they're
| "unnecessary" on the hardware; adding in operations that would
| make those fences necessary would break existing code that
| doesn't know about the instructions that don't exist yet.
|
| [1] Incidentally, this means trying to read the manual to
| figure out what a nontemporal store actually does can feel like
| turning your brain to mush.
| kmeisthax wrote:
| > For example, on x86, compilers drop all explicit store
| fences because they're "unnecessary" on the hardware
|
| I mean specifically for new programs that are written with
| weaker memory models in mind. So you'd have to enable an
| -fNOTSO flag on your compiler that emits all the fences that
| would otherwise be skipped.
|
| Looking at what I can find about nontemporal stores, they
| sound like they already have a weird kind of release
| semantics on x86, even though their intent was to avoid cache
| thrashing, not so much to allow greater memory reordering.
| Are these actually used in compilers?
| jcranmer wrote:
| Compilers will generally expose a nontemporal store
| intrinsic, but don't expect them to actually try to use it
| automatically.
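|
| A minimal sketch of what using such an intrinsic looks like
| with the Intel intrinsics headers; the function name fill_nt
| and the buffer are invented for illustration:
|
|   #include <immintrin.h>
|   #include <stddef.h>
|
|   // Stream 32-bit values past the cache (MOVNTI); the
|   // SFENCE makes the weakly-ordered stores visible before
|   // any later stores.
|   void fill_nt(int *dst, int value, size_t n) {
|       for (size_t i = 0; i < n; ++i)
|           _mm_stream_si32(dst + i, value);
|       _mm_sfence();
|   }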
| anonymousDan wrote:
| I think they were particularly beneficial in some workloads
| for Optane persistent memory.
| adrian_b wrote:
| The cache behavior regarding the ordering of memory accesses
| cannot be configured.
|
| What you can select when mapping memory is to make some
| regions uncached (with several variants, e.g. write-
| combining).
|
| Some of the kinds of uncached memory regions, especially
| those whose names include "write-combining", implement weaker
| orderings of the memory accesses in comparison with how
| cacheable memory works on x86, where 3 of the 4 kinds of
| memory access reorderings are prohibited (i.e. all except
| that subsequent loads can be performed before prior stores).
|
| However, choosing a memory type with weakly-ordered accesses
| cannot increase performance on x86, because the loss of
| performance from not using the cache is much greater.
|
| The weakly-ordered accesses only recover a small part of the
| performance loss caused by uncached memory.
| 4ad wrote:
| There is this myth that having a weaker memory model can lead
| to higher performance. It sounds plausible, and it was a good
| idea to explore at the end of the 80s, but in the end it turned
| out that it's simply not substantiated by facts. Note that
| Apple Silicon is TSO.
|
| Related: https://kcsrk.info/papers/pldi18-memory.pdf
| fwip wrote:
| Apple's Mx chip has an instruction to enable TSO, which
| Rosetta uses when running x86 code. I don't believe it uses
| TSO when running native ARM code, but I could be mistaken.
| SubjectToChange wrote:
| _Note that Apple Silicon is TSO._
|
| Apple Silicon supports both Arm's relaxed memory model and
| Intel's TSO for backwards compatibility. Microsoft is doing
| the same for Windows on Arm.
| dang wrote:
| (We detached this from
| https://news.ycombinator.com/item?id=42400471 since it's a good
| top-level comment)
| SubjectToChange wrote:
| _My question: What stops Intel or AMD from providing an opt-in
| weaker memory model?_
|
| Probably for the same reason Intel/AMD doesn't get rid of the
| rest of the cruft in x86-64, i.e. backwards compatibility.
| Additionally, there would be issues with Intel/AMD leveraging
| an optional weak memory model in their chips without
| compromising the performance of legacy TSO applications. They
| are probably better off making x86-64 perform best under TSO.
| gpderetta wrote:
| > x86 cannot have acquire/release semantics
|
| As I said at the time, this is nonsense. All stores on x86
| have release semantics and all loads have acquire semantics.
|
| Atomic RMWs are sequentially consistent.
|
| Technically you could implement relaxed stores via
| non-temporal stores, but it would be pointless.
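|
| A sketch of how that cashes out for C++ atomics on x86; the
| names are invented and the instruction choices reflect
| typical compiler output as I understand it:
|
|   #include <atomic>
|
|   std::atomic<int> x{0};
|
|   void release_store() {
|       // Compiles to a plain MOV on x86: release ordering
|       // comes for free under TSO.
|       x.store(1, std::memory_order_release);
|   }
|
|   int acquire_load() {
|       // Likewise a plain MOV load already acts as acquire.
|       return x.load(std::memory_order_acquire);
|   }
|
|   void rmw() {
|       // Atomic RMW becomes LOCK XADD, which is sequentially
|       // consistent.
|       x.fetch_add(1, std::memory_order_seq_cst);
|   }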
| kmeisthax wrote:
| For those watching along at home, we're talking about these
| two comments from dragontamer:
|
| - https://news.ycombinator.com/item?id=27753913
|
| - https://news.ycombinator.com/item?id=27762749
|
| I interpreted dragontamer's comment to mean "x86 does not
| have relaxed stores", not "x86 does not have a strong memory
| model". The (presumed) problem is that you can't actually
| test acquire/release code on x86, because it won't crash if
| you get it wrong.
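|
| To make that concrete, a made-up message-passing example:
| with relaxed ordering it is wrong, but x86 hardware never
| reorders the two stores or the two loads, so unless the
| compiler itself reorders them the test quietly passes; on a
| weaker machine the reader can see flag == 1 and still read
| data == 0.
|
|   #include <atomic>
|
|   std::atomic<int> data{0}, flag{0};
|
|   void writer() {
|       data.store(42, std::memory_order_relaxed);
|       // Bug: should be memory_order_release.
|       flag.store(1, std::memory_order_relaxed);
|   }
|
|   int reader() {
|       // Bug: should be memory_order_acquire.
|       if (flag.load(std::memory_order_relaxed) == 1)
|           return data.load(std::memory_order_relaxed);
|       return -1;
|   }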
|
| There's a second level of this in C++'s memory model:
| consume/release. I've no idea what the difference between
| acquire and consume is; when I look it up there's usually
| some reference to the DEC Alpha, a 30 year old workstation
| and server chip nobody uses today which was legendary for
| pushing the boundaries of memory ordering. My assumption is
| that no hardware provides memory ordering exactly as weak as
| consume, so I shouldn't bother asking for it, since it'll
| just get strengthened to acquire on any chip that matters.
|
| Also, from you and one other person, it sounds like relaxed
| memory models aren't actually a performance benefit? Would
| the opposite - ARM mandating TSO in ARMv9.1-A or whatever -
| make sense? A lot of ink and code is wasted talking about
| memory model strengths.
| jcranmer wrote:
| > There's a second level of this in C++'s memory model:
| consume/release. I've no idea what the difference between
| acquire and consume is; when I look it up there's usually
| some reference to the DEC Alpha, a 30 year old workstation
| and server chip nobody uses today which was legendary for
| pushing the boundaries of memory ordering. My assumption is
| that no hardware provides memory ordering exactly as weak
| as consume, so I shouldn't bother asking for it, since
| it'll just get strengthened to acquire on any chip that
| matters.
|
| You have it wrong here.
|
| So the basic idea behind release/consume comes from an
| observation that, on all hardware not named Alpha [1],
| there is no hardware fence needed to guarantee that a load
| happens-before another load that is data-dependent on the
| first load. So the C++ committee decided to add a memory
| model that would preserve this guarantee. A consume load is
| exactly like an acquire load, but only for other loads that
| are data-dependent on that load as opposed to all loads in
| general. So whereas an acquire load requires a fence on
| pretty much every hardware not x86 or SPARC in TSO mode, a
| consume load would only require a fence on Alpha.
|
| For various reasons, data-dependency isn't a property that
| compilers are in the position of guaranteeing, which means
| every compiler ended up implementing consume as the
| stronger acquire instead, defeating its entire design goal
| of eliminating unnecessary fences. There have been a couple
| of attempts within the committee to try to find a path that
| would let something like consume work by tweaking the
| necessary data dependence definitions, but none of them
| have swayed the implementers, so the current state is that
| they've given up and let consume die.
|
| Of particular note is that release/consume is not designed
| to _support_ the Alpha memory model; in fact, it's quite
| the opposite: it's designed to support everybody _but_
| Alpha, since for Alpha (and essentially only Alpha),
| release/consume is not usefully weaker than
| release/acquire. Instead, the people who benefit are the
| PPCs and the ARMs of the world, for whom release/consume is
| a better approximation of their native memory model and
| would allow many fences to be omitted.
|
| [1] Actually, some accelerator hardware might have memory
| models as weak as Alpha here, but I'm far less familiar
| with memory models as they apply to accelerators.
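|
| A minimal sketch of the pointer-publication pattern consume
| was aimed at; the Node type and the function names are
| invented:
|
|   #include <atomic>
|
|   struct Node { int payload; };
|   std::atomic<Node*> head{nullptr};
|
|   void publish(Node* n) {
|       n->payload = 42;
|       head.store(n, std::memory_order_release);
|   }
|
|   int read_payload() {
|       // The payload load is data-dependent on the pointer
|       // load, so consume needs no fence on Arm/POWER; in
|       // practice compilers just strengthen it to acquire.
|       Node* n = head.load(std::memory_order_consume);
|       return n ? n->payload : -1;
|   }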
| adrian_b wrote:
| It is correct that on x86 all normal stores have release
| semantics and all normal loads have acquire semantics (only a
| few instructions behave differently, i.e. the string
| instructions and the streaming "non-temporal" store
| instructions).
|
| However the ordering properties of the x86 stores and loads
| are stronger than that. An x86 store is not only a release
| store, but it has an additional property that could be called
| of being an "initial store", i.e. a store that is guaranteed
| to be performed before all subsequent stores.
|
| An x86 load is not only an acquire load; it has the
| additional property of being what could be called a "final
| load", i.e. a load that is guaranteed to be performed after
| all prior loads.
|
| The research paper that introduced the concepts of acquire
| loads and release stores (in May 1990) was mistaken. It
| claimed that these 2 kinds of accesses are enough for
| synchronizing accesses to shared resources.
|
| While it is true that acquire loads and release stores are
| sufficient for implementing critical sections, there are
| other algorithms for accessing shared resources that need the
| other 2 kinds of ordered loads and stores, i.e. final loads
| and initial stores.
|
| On x86, this does not matter, because the normal loads and
| stores provide all 4 kinds of ordered loads and stores, so
| any algorithm can be implemented without using memory
| barriers or other special instructions.
|
| In ISAs that provide only 2 kinds of ordered loads and
| stores, i.e. acquire loads and release stores, like Arm
| Aarch64, the other 2 kinds of ordered loads and stores must
| be synthesized using memory barriers, i.e. an initial store
| is made by a weakly-ordered store followed by a store
| barrier, while a final load is made by a weakly-ordered load
| preceded by a load barrier.
|
| Arm Aarch64 does not have load barriers, but it has stronger
| acquire barriers, which can always be used instead of any
| load barriers. The Arm Aarch64 acquire barrier has the
| confusing mnemonic DMB.LD, apparently inspired by the x86
| LFENCE, which is also an acquire barrier, not a load barrier,
| despite its confusing mnemonic.
|
| (a load barrier guarantees that all prior loads are performed
| before all subsequent loads; an acquire barrier guarantees
| that all prior loads are performed before all subsequent
| loads _and_ stores; such memory barriers are stronger than
| necessary; the weaker ordering properties provided by the 4
| kinds of ordered loads and stores are sufficient)
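|
| A rough C++ rendering of that recipe, with invented names;
| C++ has no pure load-only or store-only fences, so the
| fences below are somewhat stronger than the barriers
| described:
|
|   #include <atomic>
|
|   std::atomic<int> a{0}, b{0};
|
|   void initial_store_like(int v) {
|       a.store(v, std::memory_order_relaxed);
|       // Release fence keeps the store above ahead of all
|       // later stores (stronger than a plain store barrier).
|       std::atomic_thread_fence(std::memory_order_release);
|   }
|
|   int final_load_like() {
|       // Acquire fence keeps the load below after all
|       // earlier loads (stronger than a plain load barrier).
|       std::atomic_thread_fence(std::memory_order_acquire);
|       return b.load(std::memory_order_relaxed);
|   }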
___________________________________________________________________
(page generated 2024-12-12 23:00 UTC)