[HN Gopher] Programming Language Memory Models
___________________________________________________________________
Programming Language Memory Models
Author : thinxer
Score : 94 points
Date : 2021-07-06 16:22 UTC (6 hours ago)
(HTM) web link (research.swtch.com)
(TXT) w3m dump (research.swtch.com)
| raphlinus wrote:
| A GPU followup to this article.
|
| While sequentially consistent semantics are efficient to
| implement on CPUs, that seems to be much less true on GPUs.
| Thus, Vulkan eliminates sequential consistency entirely and
| provides only acquire/release semantics[1].
|
| It is extremely difficult to reason about programs using these
| advanced memory semantics. For example, there is a discussion
| about whether a spinlock implemented in terms of acquire and
| release can be reordered in a way to introduce deadlock (see
| reddit discussion linked from [2]). I was curious enough about
| this that I tried to model it in CDSChecker, but did not get
| definitive results (the deadlock checker in that tool is enabled
| for mutexes provided by API, but not for mutexes built out of
| primitives). I'll also note that AcqRel semantics are not
| provided by the Rust version of compare_exchange_weak (perhaps a
| nit on TFA's assertion that Rust adopts the C++ memory model
| wholesale), so if acquire to lock the spinlock is not adequate,
| it's likely it would need to go to SeqCst.
|
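| For concreteness, here is roughly the construction in question,
| sketched with Java 9+ VarHandle orderings rather than C++/SPIR-V
| atomics. The class and method names are mine, and whether acquire
| on the lock path is sufficient is exactly the open question above:
|
|     import java.lang.invoke.MethodHandles;
|     import java.lang.invoke.VarHandle;
|
|     final class SpinLock {
|         private int locked; // 0 = free, 1 = held
|         private static final VarHandle LOCKED;
|         static {
|             try {
|                 LOCKED = MethodHandles.lookup()
|                     .findVarHandle(SpinLock.class, "locked", int.class);
|             } catch (ReflectiveOperationException e) {
|                 throw new ExceptionInInitializerError(e);
|             }
|         }
|
|         void lock() {
|             // Acquire on success: reads/writes in the critical section
|             // cannot be reordered before this CAS.
|             while ((int) LOCKED.compareAndExchangeAcquire(this, 0, 1) != 0) {
|                 Thread.onSpinWait();
|             }
|         }
|
|         void unlock() {
|             // Release: writes in the critical section cannot be
|             // reordered after this store.
|             LOCKED.setRelease(this, 0);
|         }
|     }
|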
| Thus, I find myself quite unsure whether this kind of spinlock
| would work on Vulkan or would be prone to deadlock. It's also
| possible it could be fixed by putting a release barrier before
| the lock loop.
|
| We have some serious experts on HN, so hopefully someone who
| knows the answer can enlighten us - mixed in of course with all
| the confidently wrong assertions that inevitably pop up in
| discussions about memory model semantics.
|
| [1]: https://www.khronos.org/blog/comparing-the-vulkan-spir-v-mem...
|
| [2]: https://rigtorp.se/spinlock/
| raphlinus wrote:
| Also: it remains difficult to fully nail down the semantics of
| sequential consistency as well, especially when it's mixed with
| other memory semantics. Very likely next time Russ updates his
| article he should add a reference to Repairing Sequential
| Consistency in C/C++11[1].
|
| [1]: https://plv.mpi-sws.org/scfix/full.pdf
| dragontamer wrote:
| GPU-spinlocks are a bad idea, unless the spinlock is applied
| over the entire Thread-group.
|
| Even then, I'm pretty sure the spinlock is a bad idea, because
| you probably should be using GPUs as a coprocessor and
| enforcing "orderings" over CUDA-Streams or OpenCL Task Graphs.
| The kernel-spawn and kernel-end mechanism provides your
| synchronization functionality ("happens-before") when you need
| it.
|
| ---------
|
| From there on out: the GPU's low-level synchronization of choice
| is the thread-barrier (which can extend out beyond a wavefront,
| but only up to a block).
|
| --------
|
| So that'd be my advice: use a thread-barrier at the lowest
| level for thread blocks (synchronization between 1024 threads
| and below). And use kernel-start / kernel-end graphs (aka: CUDA
| Stream and/or OpenCL Task Graphs) for synchronizing groups of
| more than 1024 threads together.
|
| Otherwise, I've done some experiments with acquire/release and
| basic lock/unlock mechanisms. They seem to work as expected.
| You get deadlocks immediately on older hardware because of the
| implicit SIMD-execution (so you want only thread#0 or active-
| thread#0 to perform the lock for the whole wavefront / thread
| block). You'll still want to use thread-barriers for higher
| performance synchronization.
|
| Frankly, I'm not exactly sure why you'd want to use a spinlock
| since thread-barriers are simply higher performance in the GPU
| world.
| raphlinus wrote:
| In _general_ spinlocks are a bad idea, but you do see them in
| contexts like decoupled look-back. As you say, thread
| granularity is a problem (unless you're on CUDA on Volta+
| hardware, which has independent thread scheduling), so you
| want threadgroup or workgroup granularity.
|
| In any case, I'm interested in pushing the boundaries of
| lock-free algorithms. It is of course easy to reason about
| kernel-{start/end} synchronization, but the granularity may
| be too coarse for some interesting applications.
| dragontamer wrote:
| This is the first time I've heard of the term "decoupled
| look-back". But I see that it refers to CUB's
| implementation of device-wide scan.
|
| I briefly looked at the code, and came across:
| https://github.com/NVIDIA/cub/blob/main/cub/agent/agent_scan...
|
| I'm seeing lots of calls to "CTA_SYNC()", which ends up
| being just a "__syncthreads" (a simple thread-barrier).
| See: https://github.com/NVIDIA/cub/blob/a8910accebe74ce043a13026f...
|
| I admit that I'm looking rather quickly though, but... I'm
| not exactly seeing where this mysterious "spinlock" is that
| you're talking about. I haven't tried very hard yet but
| maybe you can point out what code exactly in this
| device_scan / decoupled look-back uses a spinlock? Cause
| I'm just not seeing it.
|
| ----------
|
| And of course: a call to cub's "device scan" is innately
| ordered to kernel-start / kernel-end. So there's your
| synchronization mechanism right there and then.
| raphlinus wrote:
| I don't think CUB is doing decoupled look-back; the
| reference you want is:
| https://research.nvidia.com/publication/single-pass-parallel...
|
| It doesn't use the word "spin" but repeated polling (step
| 4 in the algorithm presented in section 4.1, particularly
| when the flag is X) is basically the same.
| dragontamer wrote:
| 3rd paragraph:
|
| > In this report, we describe the decoupled-lookback
| method of single-pass parallel prefix scan and its
| implementation within the open-source CUB library of GPU
| parallel primitives
|
| The CUB-library also states:
|
| https://nvlabs.github.io/cub/structcub_1_1_device_scan.html
|
| >> As of CUB 1.0.1 (2013), CUB's device-wide scan APIs
| have implemented our "decoupled look-back" algorithm for
| performing global prefix scan with only a single pass
| through the input data, as described in our 2016
| technical report [1]
|
| Where [1] is a footnote pointing at the exact paper you
| just linked.
|
| -----------
|
| > It doesn't use the word "spin" but repeated polling
| (step 4 in the algorithm presented in section 4.1,
| particularly when the flag is X) is basically the same.
|
| That certainly sounds spinlock-ish. At least that gives
| me what to look for in the code.
| raphlinus wrote:
| Ah, good then, I didn't know that, thanks for the cite. I
| haven't studied the CUB code on it carefully.
| knz42 wrote:
| A lot of the complexity comes from the lack of expressivity in
| languages to relate variables (or data structure fields)
| semantically to each other. If there was a way to tell the
| compiler "these variables are always accessed in tandem", the
| compiler could be smart about ordering and memory fences.
|
| The idea to extend programming languages and type systems in that
| direction is not new: folk who've been using distributed
| computing for computations have to think about this already, and
| could teach a few things to folk who use shared memory multi-
| processors.
|
| Here's an idea for ISA primitives that could help a language
| group variables together: bind/propagate operators on
| (combinations of) address ranges.
| https://pure.uva.nl/ws/files/1813114/109501_19.pdf
| smasher164 wrote:
| Even with that expressivity, someone who incorrectly relates or
| forgets to relate two variables could experience the same
| issues. It's still important to address what happens when the
| program has data races or when it is data-race-free but the
| memory model permits overreaching optimizations. The language
| and implementation should strive to make a program
| approximately correct.
| dragontamer wrote:
| That's Java's intrinsic object lock (monitor) mechanism.
|
| All variables inside of an object (aka: any class) are assumed
| to be related to each other. synchronized(foobar_object){
| baz(); } ensures that all uses of foobar_object inside the
| synchronized{} block are sequential (and therefore correct).
|
| --------
|
| The issue is that some people (a minority) are interested in
| "beating locks" and making something even more efficient.
| karmakaze wrote:
| In Java, any object can be used to synchronize any data, e.g.
| synchronized(foobar_object){ foo(); }
| synchronized(foobar_object){ bar(); }
| synchronized(foobar_object){ baz(); }
|
| This will have foo, bar, and baz well behaved with respect to
| any data they share, regardless of whether they are foobar methods or
| methods of any other class(es). It is exactly analogous to
| the S(a) -> S(a) synchronizing instruction from the article
| that establishes a happens-before edge, partitioning each thread
| into before/after the S(a).
|
| The only time synchronized(explicit_object) relates to
| anything else is when the keyword is also used as a method
| modifier, where `synchronized void foo()` is equivalent (with
| a minor performance difference) to `synchronized(this) { ... }`
| wrapping the entire body of the foo method.
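|
| A tiny illustration of that equivalence (class and method names
| are made up):
|
|     class Counter {
|         private int n;
|
|         // Both methods acquire the monitor of `this` for the whole
|         // body; they are equivalent with respect to locking.
|         synchronized void incrementA() {
|             n++;
|         }
|
|         void incrementB() {
|             synchronized (this) {
|                 n++;
|             }
|         }
|     }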
| voidnullnil wrote:
| These "memory models" are too complex for languages intended for
| dilettante developers. It was a disaster in Java/C#. No more
| than a handful of programmers in existence know in depth how
| it works, as in, can actually understand any given trivial
| concurrent program in their language. At best they know some
| vague rule like "locking prevents visibility issues". It goes far deeper
| than that though (which is also the fault of complex language
| designs like Java and C#).
|
| The common programmer does not understand that you've just
| transformed their program - for which they were taught merely
| that multiple threads need synchronization - into a new game,
| which has an entire separate specification, where every shared
| variable obeys a set of abstruse rules revolving around the
| happens-before relationship. Locks, mutexes, atomic variables are
| all one thing. Fences are a completely different thing. At least
| in the way most people intuit programs to work.
|
| Go tries to appeal to programmers as consumers (that is, when
| given a choice between cleaner design and pleasing the user who
| just wants to "get stuff done", they choose the latter), yet
| also adds in traditional complexities like this. Yes, there is
| a performance trade-off to having shared memory behave intuitively,
| but that's much better than bugs that 99% of your CHOSEN userbase
| do not know how to avoid. Also remember Go has lots of weird edge
| cases, like how sharing a slice across threads can lead to memory
| corruption (in the C / assembly sense, not merely within that
| array) despite the rest of the language being memory-safe.
| Multiply that by the "memory model".
|
| Edit: forgot spaces between paragraphs.
| filleduchaos wrote:
| > These "memory models" are too complex for languages intended
| for dilettante developers.
|
| It would be nice if sometime we stopped pretending that
| beginners are too slow to know/understand things and instead
| faced the fact that their instructors and mentors are bad at
| teaching.
| temac wrote:
| They are not too complex for "languages intended for dilettante
| developers": they can (and should) use sequential consistency
| or locks everywhere.
| voidnullnil wrote:
| How many programmers do you know who put locks around polling
| variables for seemingly no reason (as they are not cognizant
| of the memory model)?
| pjmlp wrote:
| Well, while it may sound like gatekeeping, maybe those
| dilettante developers should be using something else
| instead, like BASIC?
| formerly_proven wrote:
| CPython has a pleasant memory model from what I've heard
| :)
| electricshampo1 wrote:
| "Java and JavaScript have avoided introducing weak
| (acquire/release) synchronizing atomics, which seem tailored for
| x86."
|
| This is not true for Java; see
|
| http://gee.cs.oswego.edu/dl/html/j9mm.html
|
| https://docs.oracle.com/en/java/javase/16/docs/api/java.base...
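|
| The weaker orderings are exposed both through VarHandle and on the
| java.util.concurrent.atomic classes. A minimal release/acquire
| message-passing sketch (class and field names here are made up):
|
|     import java.util.concurrent.atomic.AtomicInteger;
|
|     class Mailbox {
|         int data;
|         final AtomicInteger ready = new AtomicInteger(0);
|
|         void publish() {                    // writer thread
|             data = 42;
|             ready.setRelease(1);            // release store
|         }
|
|         int poll() {                        // reader thread
|             if (ready.getAcquire() == 1) {  // acquire load
|                 return data;                // guaranteed to observe 42
|             }
|             return -1;
|         }
|     }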
| dragontamer wrote:
| It's not true in general. x86 CANNOT have weak acquire/release
| semantics. x86 is "too strong": you get total-store ordering by
| default.
|
| If you want to test out weaker acquire/release semantics, you
| need to buy an ARM or POWER9 processor.
| bullen wrote:
| In 100 years the main languages used will still be C on the
| client (with a C++ compiler) and Java on the server.
|
| Go has no VM but it has a GC. WASM has a VM but no GC.
|
| Everything has been tried and Java still kicks everything's ass to
| the moon on the server.
|
| Fragmentation is bad, let's stop using bad languages and focus on
| the products we build instead.
|
| "While I'm on the topic of concurrency I should mention my far
| too brief chat with Doug Lea. He commented that multi-threaded
| Java these days far outperforms C, due to the memory management
| and a garbage collector. If I recall correctly he said "only 12
| times faster than C means you haven't started optimizing"." -
| Martin Fowler https://martinfowler.com/bliki/OOPSLA2005.html
|
| "Many lock-free structures offer atomic-free read paths, notably
| concurrent containers in garbage collected languages, such as
| ConcurrentHashMap in Java. Languages without garbage collection
| have fewer straightforward options, mostly because safe memory
| reclamation is a hard problem..." - Travis Downs
| https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
| pphysch wrote:
| Pretty sure Java is/was popular (and has enormous momentum)
| because "it just works" at a time when the Internet is taking
| off, not because it is some linguistic or technological marvel.
| It will definitely stick around, just like COBOL and C and Go.
| bullen wrote:
| COBOL is not around in anything interesting, and trust me go
| is not going to be used to build anything that we'll use in
| 20 years.
| pphysch wrote:
| Do I hear "containers are a temporary fad"?
| bullen wrote:
| Yep
| skohan wrote:
| I'm sorry is this comment from 1998? I've been working in
| software for over a decade, and I haven't seen server work
| being done in Java in ages.
|
| From my perspective, Go in the context of serverless
| programming seems to currently be the best choice for server-
| side programming.
|
| In the next 20 years I expect Go will be supplanted by a
| language which is a lot like go (automatic memory management,
| simple, easy to learn & write and performant enough) but with
| the addition of algebraic data types, named parameters, and a
| _slightly_ higher level of abstraction.
| pjmlp wrote:
| And I haven't seen server work done in C++ since 2006, we
| keep replacing those systems with Java and .NET ones.
|
| To each its own.
| bullen wrote:
| Infinite growth does not exist, everything peaks at some
| point. You have to wonder, if a memory model from 2005 still
| kicks go's ass in 2021, how your prediction that "there
| will always be something new and shiny to distract us from
| the focus we need to leverage the real value of the internet"
| will play out?
|
| What have you built with go that is interesting?
| skohan wrote:
| It's not about new and shiny. Programming is still a
| relatively new field when compared to other fields. For
| instance, mathematics and physics took centuries to land on
| the right way to formalize things.
|
| C is maybe the only good programming language invented so
| far. Java was a failed attempt at improving C. I think
| we're rapidly converging on the second good programming
| language, and it's not going to have null pointer
| exceptions.
| valbaca wrote:
| > In the next 20 years I expect Go will be supplanted by a
| language which is a lot like go (automatic memory management,
| simple, easy to learn & write and performant enough) but with
| the addition of algebraic data types, named parameters, and a
| slightly higher level of abstraction.
|
| I'd love for this to be Crystal: https://crystal-lang.org/
|
| > I haven't seen server work being done in Java in ages.
|
| In the meantime, I've been doing a large amount of Java
| backend server work for the past 10 years.
| skohan wrote:
| I have a feeling it's going to have C-like syntax and
| frankly I hope so because using an `end` keyword instead of
| braces makes no sense to me.
| filleduchaos wrote:
| Arguably using curly braces to delineate blocks makes no
| inherent sense either. We just do it because that's what
| everybody else does.
| jqpabc123 wrote:
| _If thread 2 copies done into a register before thread 1
| executes, it may keep using that register for the entire loop,
| never noticing that thread 1 later modifies done._
|
| Alternative solution: Forget all the "atomic" semantics and
| simply avoid "optimization" of global variables. Access to any
| global variable should always occur directly from memory. Sure,
| this will be less than optimal in some cases but such is the
| price of using globals. Their use should be discouraged anyway.
|
| In other words, make "atomic" the sensible and logical default
| with globals. Assignment is an "atomic" operation, just don't
| circumvent it by using a local copy as an "optimization".
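|
| For reference, the failure mode the quoted passage describes,
| sketched in Java (the names are mine, not the article's code); the
| proposal above amounts to forcing the loop's read of the global to
| go to memory every iteration instead of being cached in a register:
|
|     class Example {
|         static boolean done;   // plain shared flag, no volatile/atomic
|
|         static void thread2() {
|             // The compiler may read `done` once into a register and
|             // reuse it, so this loop can spin forever without ever
|             // observing thread 1's write.
|             while (!done) { }
|         }
|
|         static void thread1() {
|             done = true;
|         }
|     }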
| dragontamer wrote:
| Removing optimizations on "global" variables will leave the bug
| in Singleton objects (which are very similar to global
| variables, but the compiler doesn't know that they're global).
|
| ---------
|
| "Volatile" is close but not good enough semantically to
| describe what we want. That's why these new Atomic-variables
| are being declared with seqcst semantics (be it in Java, C++,
| C, or whatever you program in).
|
| That's the thing: we need a new class of variables that wasn't
| known 20 years ago. Variables that follow the sequential-
| consistency requirements, for this use case.
|
| ---------
|
| Note: on ARM systems, even if the compiler doesn't mess up, the
| L1 cache could mess you up. ARM has multiple load/store
| semantics available. If you have relaxed (default) semantics on
| a load, it may return a "stale" value from DDR4.
|
| That is to say: your loop may load a value into L1 cache, then
| your core will read the variable over and over from L1 cache
| (not realizing that L3 cache has been updated to a new value).
| Not only does your compiler need to know to "not store the
| value in a register", the memory-subsystem also needs to know
| to read the data from L3 cache over-and-over again (never using
| L1 cache).
|
| Rearranging loads/stores on x86 is not allowed in this manner.
| But ARM is more "Relaxed" than x86. If reading the "Freshest"
| value is important, you need to have the appropriate memory
| barriers on ARM (or PowerPC).
| jqpabc123 wrote:
| _which are very similar to global variables, but the compiler
| doesn't know that they're global_
|
| Since as you say, they are very similar, wouldn't it be
| reasonable to assume for access purposes that they are
| effectively global?
| dragontamer wrote:
| Let's do this example in Java (but it should be simple
| enough that C#, Python, JavaScript and other programmers
| would understand it):
|
|     public void myFunction(FooObject o) {
|         o.doSomething();
|     }
|
| How does the compiler know if "FooObject o" is a singleton
| or not? That's the thing about the "Singleton" pattern, you
| have an effective "global-ish" variable, but all of your
| code is written with normal pass-the-object style.
|
| EDIT: If you're not aware, the way this works is that you
| call myFunction(getTheSingleton()), where
| "getTheSingleton()" fetches the Singleton object.
| myFunction() has no way of "knowing" it's actually
| interacting with the singleton. This is a useful pattern,
| because you can create unit-tests over the "global"
| variable by simply mocking out the Singleton for a mock
| object (or maybe an object with preset state for better
| unit testing). Among other benefits (but also similar
| downsides to using a global variable: difficult to reason about
| because you have this "shared state" being used all over
| the place)
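|
| For reference, a bare-bones version of the pattern being described
| (a sketch; only the names from above are reused):
|
|     final class FooObject {
|         private static final FooObject INSTANCE = new FooObject();
|
|         static FooObject getTheSingleton() {
|             return INSTANCE;  // effectively a global: one shared instance
|         }
|
|         void doSomething() { /* touches shared state */ }
|     }
|
|     // Call site: myFunction(FooObject.getTheSingleton());
|     // myFunction() cannot tell it was handed the shared instance.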
| mumblemumble wrote:
| Is there anything special about singletons here?
|
| In Java, there's no real difference between a singleton
| and any other object. A singleton is an object that just
| happens to have a single instance. Practically speaking,
| they're typically used as a clever design pattern to
| "work around" Java's lack of language-level support for
| global variables, so there's that. But I think that that
| fact might not be relevant to the issue at hand?
|
| The more basic issue is, if you have two different
| threads concurrently executing `myFunction`, what happens
| when they're both operating on the same instance of
| `FooObject`?
| dragontamer wrote:
| > Is there anything special about singletons here?
|
| No, aside from the fact that the root commenter clearly
| understands the issue with global variables, but not
| necessarily singletons.
|
| I'm trying to use the singleton concept as a "teaching
| bridge" moment, as the Singleton is clearly "like a
| global variable" in terms of the data-race, but
| generalizes to any object in your code.
|
| The commenter I'm replying to seems to think that global-
| variables are the only kind of variable where this
| problem occurs. He's wrong. All objects and all variables
| have this problem.
| ekiwi wrote:
| > Access to any global variable should always occur direct from
| memory.
|
| What if your function takes a pointer that might be pointing to
| a global variable? Does that mean that all accesses through a
| pointer are now exempt from optimization unless the compiler
| can prove that the pointer will never point to a global
| variable?
| jqpabc123 wrote:
| What if your function takes a pointer that might be pointing
| to an "atomic" variable?
|
| Pointers can be used to circumvent most safety measures. If
| you obscure the access, you should assume responsibility for
| the result.
| cbsmith wrote:
| The great thing about simple memory models is they work
| really well until you think about it. ;-)
| mumblemumble wrote:
| This problem isn't specific to global variables; it happens
| with all shared mutable state. I would assume that the author
| only used global variables because that lets them keep the
| working examples as short as possible, and minimize irrelevant
| details.
|
| And yes, you _can_ put a full memory fence around every access
| to a variable that is shared across threads. But doing so would
| just destroy the performance of your program. Compared to using
| a register, accessing main memory typically takes something on
| the order of 100 times as long. Given that we're talking about
| concerns that are specific to a relatively low-level approach
| to parallelism, I think it's safe to assume that performance is
| the whole point, so that would be an unacceptable tradeoff.
| dragontamer wrote:
| > And yes, you can put a full memory fence around every
| access to a variable that is shared across threads. But doing
| so would just destroy the performance of your program. Given
| that we're talking about concerns that are specific to a
| relatively low-level approach to parallelism, I think it's
| safe to assume that performance is the whole point, so that
| would be an unacceptable tradeoff.
|
| Indeed.
|
| Just a reminder to everyone: your pthread_mutex_lock() and
| pthread_mutex_unlock() functions already contain the
| appropriate compiler / cache memory barriers in the correct
| locations.
|
| This "Memory Model" discussion is only for people who want to
| build faster systems: for people searching for a "better
| spinlock", or for writing lock-free algorithms / lock-free
| data structures.
|
| This is the stuff of cutting-edge research right now: it's a
| niche subject. Your typical programmer _SHOULD_ just stick a
| typical pthread_mutex_t onto an otherwise single-threaded
| data-structure and call it a day. Locks work. They're not "the
| best", but "the best" is constantly being researched /
| developed right now. I'm pretty sure that any new lockfree
| data-structure with decent performance is pretty much
| instant Ph.D. thesis material.
|
| -----------
|
| Anyway, the reason why "single-threaded data-structure behind
| a mutex" works is because your data-structure still keeps all
| of its performance benefits (from sticking to L1 cache, or
| letting the compiler "manually cache" data to registers when
| appropriate), and then you only lose performance at the
| lock() and unlock() calls (which will innately have memory
| barriers to publish the results).
|
| That's 2 memory barriers (one barrier for lock() and one
| barrier for unlock()). The thing about lock-free algorithms
| is that they __might__ get you down to __1__ memory barrier
| per operation if you're a really, really good programmer. But
| it's not exactly easy. (Or: they might still have 2 memory
| barriers but the lockfree aspect of "always forward progress"
| and/or deadlock free might be easier to prove)
|
| Writing a low-performance but otherwise correct lock free
| algorithm isn't actually that hard. Writing a lock free
| algorithm that beats your typical mutex + data-structure,
| however, is devilishly hard.
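|
| As a concrete example of the "single-threaded data-structure
| behind a mutex" approach, here is a sketch in Java terms, with
| ReentrantLock standing in for pthread_mutex_t (names are
| illustrative):
|
|     import java.util.ArrayDeque;
|     import java.util.concurrent.locks.ReentrantLock;
|
|     // A plain single-threaded queue made thread-safe the boring,
|     // reliable way: lock() and unlock() carry the required memory
|     // barriers, so the queue itself needs no atomics and keeps its
|     // cache-friendly layout.
|     final class LockedQueue<T> {
|         private final ArrayDeque<T> queue = new ArrayDeque<>();
|         private final ReentrantLock lock = new ReentrantLock();
|
|         void push(T item) {
|             lock.lock();
|             try {
|                 queue.addLast(item);
|             } finally {
|                 lock.unlock();
|             }
|         }
|
|         T pop() {
|             lock.lock();
|             try {
|                 return queue.pollFirst();   // null if empty
|             } finally {
|                 lock.unlock();
|             }
|         }
|     }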
| voidnullnil wrote:
| > This "Memory Model" discussion is only for people who
| want to build faster systems: for people searching for a
| "better spinlock", or for writing lock-free algorithms /
| lock-free data structures.
|
| Actually, most practitioner code has bugs stemming from
| implicit assumptions that shared-variable writes are
| visible or ordered the way they think they are.
| dragontamer wrote:
| But the practitioner doesn't need to know the memory
| model (aside from "memory models are complicated").
|
| To solve that problem, the practitioner only needs to
| know that "mutex.lock()" and "mutex.unlock()" orders
| reads/writes in a clearly defined manner. If the
| practitioner is wondering about the difference between
| load-acquire and load-relaxed, they've probably gone too
| deep.
| voidnullnil wrote:
| > To solve that problem, the practitioner only needs to
| know that "mutex.lock()"
|
| This is true, but they do not know that. If you do not
| give some kind of substantiation, they will shrug it off
| and go back to "nah this thing doesn't need a mutex",
| like with a polling variable (contrived example).
| Kranar wrote:
| Can you explain what you mean by a "polling variable"
| needing a mutex? Usually polling is done using atomic
| instructions instead of a mutex. Are you referring to
| condition variables?
| mumblemumble wrote:
| In a lot of code I've seen, there are threads polling
| some variable without using any sort of special guard.
| The assumption (based, I assume, on how you really could
| get away with this back in the days of single-core,
| single-CPU computers) is that you only need to worry
| about race conditions when writing to primitive
| variables, and that simply reading them is always safe.
| Kranar wrote:
| Okay but the poster mentioned a mutex, which would not be
| a good way to go about polling a variable in Java. All
| you need to guarantee synchronization of primitive values
| in Java is the use of volatile [1]. If you need to
| compose atomic operations together, then you can use an
| atomic or a mutex, but it would not occur to me to use a
| mutex to perform a single atomic read or write on a
| variable in Java.
|
| [1] https://docs.oracle.com/javase/specs/jls/se8/html/jls-8.html...
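|
| A sketch of what that looks like (field and class names are
| made up): marking the polled flag volatile is what makes the
| write visible to the spinning reader.
|
|     class Worker {
|         // Without volatile, the JIT may hoist this read out of
|         // the loop and the polling thread can spin on a stale value.
|         private volatile boolean done;
|
|         void poll() {
|             while (!done) {
|                 // do some work, then check the flag again
|             }
|         }
|
|         void signalDone() {   // called from another thread
|             done = true;
|         }
|     }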
| wcarss wrote:
| The prior article in this series from ~a week ago is 'Hardware
| Memory Models', at https://research.swtch.com/hwmm, with some hn-
| discussion here: https://news.ycombinator.com/item?id=27684703
|
| Another somewhat recently posted (but years-old) page with
| different but related content is 'Memory Models that Underlie
| Programming Languages': http://canonical.org/~kragen/memory-models/
|
| a few previous hn discussions of that one:
|
| https://news.ycombinator.com/item?id=17099608
|
| https://news.ycombinator.com/item?id=27455509
|
| https://news.ycombinator.com/item?id=13293290
___________________________________________________________________
(page generated 2021-07-06 23:00 UTC)