[HN Gopher] Myths Programmers Believe about CPU Caches (2018)
___________________________________________________________________
Myths Programmers Believe about CPU Caches (2018)
Author : whack
Score : 135 points
Date : 2025-10-31 00:46 UTC (1 day ago)
(HTM) web link (software.rajivprab.com)
(TXT) w3m dump (software.rajivprab.com)
| yohbho wrote:
| 2018, discussed on HN in 2023:
| https://news.ycombinator.com/item?id=36333034
| Myths Programmers Believe about CPU Caches (2018) (rajivprab.com)
| 176 points by whack on June 14, 2023 | 138 comments
| breppp wrote:
| Very interesting. I'm hardly an expert, but this gives the
| impression that if it weren't for that meddling software, we
| would all live in a synchronous world.
|
| This ignores store buffers and, consequently, memory fencing,
| which is the basis for the nightmarish std::memory_order - the
| worst API documentation you will ever meet.
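|
| For readers who haven't met it: here's a minimal release/acquire
| sketch (my own illustration, not from the article; the names are
| made up). The release store "publishes" the data, and a reader
| that observes the flag with an acquire load is guaranteed to see
| the data written before it:
|
|     #include <atomic>
|     #include <cassert>
|     #include <thread>
|
|     std::atomic<bool> ready{false};
|     int data = 0;
|
|     int main() {
|         std::thread writer([] {
|             data = 42;                                    // plain store
|             ready.store(true, std::memory_order_release); // publish
|         });
|         std::thread reader([] {
|             // spin until the writer's release store is visible
|             while (!ready.load(std::memory_order_acquire)) {}
|             assert(data == 42); // guaranteed by the release/acquire pair
|         });
|         writer.join();
|         reader.join();
|     }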
| jeffbee wrote:
| If the committee had any good taste at all they would have
| thrown DEC AXP overboard before C++11, which would have cut the
| majority of the words out of the specification for
| std::memory_order. It was only the obsolete and quite
| impossible-to-program Alpha CPU that required the ultimate
| level of complexity.
| QuaternionsBhop wrote:
| Since the CPU is doing cache coherency transparently, perhaps
| there should be some sort of way to promise that an application
| is well-behaved in order to access a lower-level non-transparent
| instruction set to manually manage the cache coherency from the
| application level. Or perhaps applications can never be trusted
| with that level of control over the hardware. The MESI model
| reminded me of Rust's ownership and borrowing. The pattern also
| appears in OpenGL vs Vulkan drivers, implicit sync vs explicit
| sync. Yet another example would be the cache-management work
| involved in squeezing maximum throughput out of CUDA on an
| enterprise GPU.
| cpgxiii wrote:
| There are some knobs that newer processors give for cache
| control, mostly to partition or reserve cache space to improve
| security or reduce cache contention between processes.
|
| Actual manual cache management is way too much of an
| implementation detail for a general-purpose CPU to expose;
| doing so would deeply tie code to a specific set of processor
| behavior. Cache sizes and even hierarchies change often between
| processor generations, and some internal cache behavior has
| changed within a generation as a result of microcode and/or
| hardware steppings. Actual cache control would be like MIPS
| exposing delay slots, but so much worse: on later hardware, old
| delay-slot assumptions really only turn into performance issues,
| while stale cache-control assumptions would easily turn into
| correctness issues.
|
| Really the only way to make this work is for the final
| compilation/"specialization" step to occur on the specific
| device in question, like with a processor using binary
| translation (e.g. Transmeta, Nvidia Denver) or specialization
| (e.g. Mill) or a system that effectively enforces runtime
| compilation (e.g. runtime shader/program compilation in OpenGL
| and OpenCL).
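|
| For contrast, here's a sketch of the advisory knobs x86 does
| expose (assuming an x86 target with immintrin.h; these are hints
| or narrow tools, not real manual coherency management):
|
|     #include <immintrin.h>
|
|     void cache_hints(int* p, int v) {
|         // advisory prefetch: ask for the line in L1; may be ignored
|         _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0);
|         // non-temporal store: write around the cache hierarchy
|         _mm_stream_si32(p, v);
|         // fence to order the non-temporal store with later stores
|         _mm_sfence();
|         // explicit line eviction (mainly for persistence/security)
|         _mm_clflush(p);
|     }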
| groundzeros2015 wrote:
| Did this article share more than one myth?
|
| The reason programmers don't believe in cache coherency is that
| they have experienced a closely related phenomenon: memory
| reordering. It requires you to use a memory fence when accessing
| a shared value from multiple cores - as if the caches needed to
| be synchronized.
| Lwerewolf wrote:
| I'm pretty sure that most cases of x86 reordering issues are a
| matter of the compiler reordering things, which isn't (afaik)
| solved with just "volatile". Caveat: I haven't dealt with this
| (multicore sync without using OS primitives in general) in over
| a year.
| nly wrote:
| x86 has a Total Store Order (TSO) memory model, which
| effectively means (in a mental model where only one shared
| memory operation happens at a time and completes before the
| next) that stores are queued, while loads can execute
| immediately, even if earlier stores are still sitting in the
| store buffer.
|
| On a single core a load can be served from the store buffer
| (queue), but other cores can't see those stores yet, which is
| where all the inconsistencies come from.
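|
| That's exactly the classic "store buffering" litmus test. A
| hedged C++ sketch of it (my own; run it in a loop to actually
| catch the reordering): both threads store 1 and then load the
| other variable, and under TSO both loads may be served while the
| stores still sit in the store buffers, so r1 == 0 && r2 == 0 is
| a legal outcome even on x86.
|
|     #include <atomic>
|     #include <cstdio>
|     #include <thread>
|
|     std::atomic<int> x{0}, y{0};
|     int r1 = 0, r2 = 0;
|
|     int main() {
|         std::thread t1([] {
|             x.store(1, std::memory_order_relaxed);
|             r1 = y.load(std::memory_order_relaxed); // may see 0
|         });
|         std::thread t2([] {
|             y.store(1, std::memory_order_relaxed);
|             r2 = x.load(std::memory_order_relaxed); // may see 0
|         });
|         t1.join();
|         t2.join();
|         std::printf("r1=%d r2=%d\n", r1, r2); // "0 0" is allowed
|     }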
| fer wrote:
| The volatile keyword in Java exists literally to make the Java
| compiler insert memory barriers. That only guarantees consistent
| reads and writes; it doesn't make compound operations like
| read-modify-write thread-safe - that's what atomics are for.
|
| Also, it's not only compilers that reorder things; most
| processors nowadays do out-of-order execution (OoOE). Even if
| the order produced by the compiler is perfect in theory,
| different latencies for different instruction operands may cause
| later instructions to execute earlier so as not to stall the
| CPU.
| zozbot234 wrote:
| Note that this is _only_ true of the Java/C# volatile
| keyword. The volatile keyword in C/C++ is solely about
| direct access in hardware to memory-mapped locations, for
| such purposes as controlling external devices; it is
| entirely unrelated to the C11 memory model for concurrency,
| does not provide the same guarantees, and should never be
| used for that purpose.
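|
| A short sketch of the difference (my own illustration; the names
| are made up):
|
|     #include <atomic>
|
|     // C/C++ volatile: "really perform this access", e.g. for a
|     // memory-mapped device register. No atomicity, no ordering.
|     volatile unsigned mmio_status = 0;
|
|     // What Java/C# volatile roughly corresponds to in C++:
|     std::atomic<bool> published{false};
|
|     void cpp_way(int& payload) {
|         payload = 42;
|         // release store: orders the payload write before the flag
|         published.store(true, std::memory_order_release);
|         // storing to a plain volatile bool here instead would be
|         // a data race with no ordering guarantees
|     }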
| igtztorrero wrote:
| In Golang there are sync.Mutex, sync/atomic, and channels to
| create this fence and prevent data races. I prefer sync.Mutex.
|
| Does anyone understand how Go handles the CPU cache?
| groundzeros2015 wrote:
| Yes. Locks will use a memory fence. More advanced programs will
| need a fence without locking.
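|
| A "fence without locking", sketched in C++ (Go keeps the
| equivalent barriers hidden inside sync.Mutex, sync/atomic, and
| channel operations; this is my own illustration):
|
|     #include <atomic>
|
|     std::atomic<bool> flag{false};
|     int payload = 0;
|
|     void produce() {
|         payload = 1;
|         // standalone fence instead of a release store or a lock
|         std::atomic_thread_fence(std::memory_order_release);
|         flag.store(true, std::memory_order_relaxed);
|     }
|
|     bool consume() {
|         if (!flag.load(std::memory_order_relaxed)) return false;
|         // pairs with the release fence above
|         std::atomic_thread_fence(std::memory_order_acquire);
|         return payload == 1; // guaranteed once the flag was seen
|     }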
| daemontus wrote:
| I may be completely out of line here, but isn't the story on ARM
| very very different? I vaguely recall the whole point of having
| stuff like weak atomics being that on x86, those don't do
| anything, but on ARM they are essential for cache coherency and
| memory ordering? But then again, I may just be conflating memory
| ordering and coherency.
| jeffbee wrote:
| Well, since this is a thread about how programmers use the
| wrong words to model how they think a CPU cache works, I think
| it bears mentioning that you've used "atomics" here to mean
| something irrelevant. It is not true that x86 atomics do
| nothing. Atomic instructions or, on x86, their prefix, make a
| naturally non-atomic operation such as a read-modify-write
| atomic. The ARM ISA actually lacked such a facility until
| ARMv8.1.
|
| The instructions to which you refer are not atomics, but rather
| instructions that influence the ordering of loads and stores.
| x86 has total store ordering by design. On ARM, the program has
| to use LDAR/STLR to establish ordering.
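|
| The distinction in C++ terms (a hedged sketch of my own; the
| instruction mappings are the commonly documented ones):
|
|     #include <atomic>
|
|     std::atomic<int> counter{0};
|
|     int bump() {
|         // atomic read-modify-write: a `lock xadd` on x86; an
|         // LDXR/STXR retry loop on ARMv8.0, a single LDADD with
|         // the ARMv8.1 LSE extension
|         return counter.fetch_add(1, std::memory_order_relaxed);
|     }
|
|     int read_ordered(std::atomic<int>& a) {
|         // ordering, not an atomic RMW: LDAR on ARM, plain MOV on x86
|         return a.load(std::memory_order_acquire);
|     }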
| phire wrote:
| Everything it says about cache coherency is exactly the same on
| ARM.
|
| Memory ordering has nothing to do with cache coherency; it's
| all about what happens within the CPU pipeline. On ARM, reads
| and writes can be reordered inside the pipeline itself, before
| they ever hit the caches (which are still fully coherent).
|
| ARM still has strict memory ordering for code within a single
| core (some older processors do not), but the writes from one
| core might become visible to other cores in the wrong order.
| ashvardanian wrote:
| Here's my favorite practically applicable cache-related fact:
| even on x86 on recent server CPUs, cache-coherency protocols may
| be operating at a different granularity than the cache line size.
| A typical case with new Intel server CPUs is operating at the
| granularity of 2 consecutive cache lines. Some thread-pool
| implementations, like Crossbeam in Rust and my ForkUnion in Rust
| and C++, explicitly document that and align objects to 128 bytes
| [1]:
|
|     /**
|      * @brief Defines variable alignment to avoid false sharing.
|      * @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
|      * @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
|      *
|      * The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
|      *
|      * @code{.cpp}
|      * #if defined(__cpp_lib_hardware_interference_size)
|      * static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
|      * #else
|      * static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
|      * #endif
|      * @endcode
|      *
|      * That however results into all kinds of ABI warnings with GCC, and suboptimal alignment choice,
|      * unless you hard-code `--param hardware_destructive_interference_size=64` or disable the warning
|      * with `-Wno-interference-size`.
|      */
|     static constexpr std::size_t default_alignment_k = 128;
|
| As mentioned in the docstring above, using STL's
| `std::hardware_destructive_interference_size` won't help you. On
| ARM, this issue becomes even more pronounced, so concurrency-
| heavy code should ideally be compiled multiple times for
| different coherence protocols and leverage "dynamic dispatch",
| similar to how I & others handle SIMD instructions in libraries
| that need to run on a very diverse set of platforms.
|
| [1]
| https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...
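|
| A hypothetical usage sketch of that constant (my own, not taken
| from the library): padding per-thread state out to 128 bytes so
| neighboring entries can never share a (possibly prefetch-paired)
| cache-line granule:
|
|     #include <atomic>
|     #include <cstddef>
|
|     static constexpr std::size_t default_alignment_k = 128;
|
|     // alignas pads the struct to a full 128-byte granule
|     struct alignas(default_alignment_k) padded_counter_t {
|         std::atomic<std::size_t> value{0};
|     };
|
|     // one slot per thread; no false sharing between neighbors
|     padded_counter_t counters[64];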
| Sesse__ wrote:
| This makes attempts at cargo-culting
| __attribute__((aligned(64))) without benchmarking even more
| hilarious. :-)
| rnrn wrote:
| It's not a cargo cult if the actions directly cause cargo to
| arrive based on well understood mechanics.
|
| Regardless of whether it would be better in some situations
| to align to 128 bytes, 64 bytes really is the cache line size
| on all common x86 cpus and it is a good idea to avoid threads
| modifying the same cacheline.
| Sesse__ wrote:
| It indeed isn't, but I've seen my share of systems where
| nobody checked if cargo arrived. (The code was checked in
| without any benchmarks done, and after many years, it was
| found that the macros used were effectively no-ops :-) )
| rnrn wrote:
| > even on x86 on recent server CPUs, cache-coherency protocols
| may be operating at a different granularity than the cache line
| size. A typical case with new Intel server CPUs is operating at
| the granularity of 2 consecutive cache lines
|
| I don't think it is accurate that Intel CPUs use 2 cache lines
| / 128 bytes as the coherency protocol granule.
|
| Yes, there can be additional destructive-interference effects
| at that granularity, but that's due to prefetching (of two
| cache lines whose coherency is managed independently) rather
| than coherency operating on one 128-byte granule.
|
| AFAIK 64 bytes is still the correct granule for avoiding false
| sharing, with two cores modifying two consecutive cachelines
| having way less destructive interference than two cores
| modifying one cacheline.
| j_seigh wrote:
| A coherent cache is transparent to the memory model. So if
| someone trying to explain the memory model and ordering mentioned
| the cache as affecting the memory model, it was pretty much a
| sign they didn't fully understand what they were talking about.
| wpollock wrote:
| This post, along with the tutorial links it and the comments
| contain, provides good insights into caches, coherence, and
| related topics. I would like to add a link that I feel is also
| very good, maybe better:
|
| <https://marabos.nl/atomics/hardware.html>
|
| While the book this chapter is from is about Rust, this chapter
| is pretty much language-agnostic.
___________________________________________________________________
(page generated 2025-11-01 23:01 UTC)