[HN Gopher] Myths Programmers Believe about CPU Caches (2018)
       ___________________________________________________________________
        
       Myths Programmers Believe about CPU Caches (2018)
        
       Author : whack
       Score  : 135 points
       Date   : 2025-10-31 00:46 UTC (1 days ago)
        
 (HTM) web link (software.rajivprab.com)
 (TXT) w3m dump (software.rajivprab.com)
        
       | yohbho wrote:
       | 2018, discussed on HN in 2023:
       | https://news.ycombinator.com/item?id=36333034
        | Myths Programmers Believe about CPU Caches (2018)
        | (rajivprab.com) - 176 points by whack on June 14, 2023, 138
        | comments
        
       | breppp wrote:
        | Very interesting. I am hardly an expert, but this gives the
        | impression that if it weren't for that meddling software, we
        | would all live in a synchronous world.
        | 
        | This ignores store buffers, and consequently memory fencing,
        | which is the basis for the nightmarish std::memory_order, the
        | worst API documentation you will ever meet.
        
         | jeffbee wrote:
         | If the committee had any good taste at all they would have
         | thrown DEC AXP overboard before C++11, which would have cut the
         | majority of the words out of the specification for
          | std::memory_order. It was only the obsolete and quite
          | impossible-to-program Alpha CPU that required the ultimate
          | level of complexity.
        
       | QuaternionsBhop wrote:
       | Since the CPU is doing cache coherency transparently, perhaps
       | there should be some sort of way to promise that an application
       | is well-behaved in order to access a lower-level non-transparent
       | instruction set to manually manage the cache coherency from the
       | application level. Or perhaps applications can never be trusted
       | with that level of control over the hardware. The MESI model
       | reminded me of Rust's ownership and borrowing. The pattern also
       | appears in OpenGL vs Vulkan drivers, implicit sync vs explicit
        | sync. Yet another example would be the cache-management work
        | involved in squeezing maximum throughput out of CUDA on an
        | enterprise GPU.
        
         | cpgxiii wrote:
         | There are some knobs that newer processors give for cache
         | control, mostly to partition or reserve cache space to improve
         | security or reduce cache contention between processes.
         | 
         | Actual manual cache management is way too much of an
         | implementation detail for a general-purpose CPU to expose;
         | doing so would deeply tie code to a specific set of processor
         | behavior. Cache sizes and even hierarchies change often between
         | processor generations, and some internal cache behavior has
         | changed within a generation as a result of microcode and/or
          | hardware steppings. Actual cache control would be like MIPS
          | exposing delay slots, but so much worse: stale delay-slot
          | assumptions really only turn into performance issues, while
          | stale cache-control assumptions would easily turn into
          | correctness issues.
         | 
         | Really the only way to make this work is for the final
         | compilation/"specialization" step to occur on the specific
         | device in question, like with a processor using binary
         | translation (e.g. Transmeta, Nvidia Denver) or specialization
         | (e.g. Mill) or a system that effectively enforces runtime
         | compilation (e.g. runtime shader/program compilation in OpenGL
         | and OpenCL).
        
       | groundzeros2015 wrote:
       | Did this article share more than one myth?
       | 
        | The reason why programmers don't believe in cache coherency is
        | that they have experienced a closely related phenomenon: memory
        | reordering. This requires you to use a memory fence when
        | accessing a shared value between multiple cores - as if they
        | needed to synchronize.
        
         | Lwerewolf wrote:
          | I'm pretty sure that most cases of x86 reordering issues are a
          | matter of the compiler reordering things, which isn't (afaik)
          | solved with just "volatile". Caveat: I haven't dealt with this
          | (multicore sync without using OS primitives in general) for
          | over a year.
        
           | nly wrote:
           | x86 has a Total Store Order (TSO) memory model, which
           | effectively means (in a mental model where only 1 shared
           | memory operation happens at once and completes before the
           | next) stores are queued but loads can be executed immediately
           | even if stores are queued in the store buffer.
           | 
           | On a single core a load can be served from the store buffer
           | (queue), but other cores can't see those stores yet, which is
           | where all the inconsistencies come from.
        
           | fer wrote:
           | Literally the volatile keyword in Java is to make the Java
           | compiler aware in order to insert memory barriers. That only
           | guarantees consistent reads or writes, but it doesn't make it
           | thread safe (i.e. write after read), that's what atomics are
           | for.
           | 
            | Also, it's not only compilers that reorder things; most
            | processors nowadays do out-of-order execution (OoOE). Even
            | if the order from the compiler is perfect in theory,
            | different latencies for different instruction operands may
            | lead to later instructions executing earlier so as not to
            | stall the CPU.
        
             | zozbot234 wrote:
              | Note that this is _only_ true of the Java/C# volatile
              | keyword. The volatile keyword in C/C++ is solely about
             | direct access in hardware to memory-mapped locations, for
             | such purposes as controlling external devices; it is
             | entirely unrelated to the C11 memory model for concurrency,
             | does not provide the same guarantees, and should never be
             | used for that purpose.
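A minimal sketch of that distinction in C++ (the names here are illustrative): `volatile` only forbids the compiler from eliding or caching an access, which is what memory-mapped I/O needs; for inter-thread signalling, C++11 atomics carry the guarantees that Java's `volatile` provides.

```cpp
#include <atomic>

// Right tool: a device register that hardware can change behind the
// compiler's back. No ordering or atomicity is implied.
volatile int mmio_register = 0;

// For inter-thread publication, use std::atomic, which provides both
// atomicity and ordering (here, release/acquire).
std::atomic<bool> flag{false};

void signal_ready() {
    flag.store(true, std::memory_order_release);  // ordered publish
}

bool poll_ready() {
    return flag.load(std::memory_order_acquire);  // ordered observe
}
```

Sharing a plain `volatile bool` between threads instead would be a data race, hence undefined behavior in C and C++.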
        
         | igtztorrero wrote:
          | In Golang there are sync.Mutex, sync/atomic, and channels to
          | create this fence and prevent data races. I prefer sync.Mutex.
         | 
         | Does anyone understand how Go handles the CPU cache?
        
           | groundzeros2015 wrote:
            | Yes. Locks will use a memory fence. More advanced programs
            | will need a fence without locking.
        
       | daemontus wrote:
       | I may be completely out of line here, but isn't the story on ARM
       | very very different? I vaguely recall the whole point of having
       | stuff like weak atomics being that on x86, those don't do
       | anything, but on ARM they are essential for cache coherency and
       | memory ordering? But then again, I may just be conflating memory
       | ordering and coherency.
        
         | jeffbee wrote:
         | Well, since this is a thread about how programmers use the
         | wrong words to model how they think a CPU cache works, I think
         | it bears mentioning that you've used "atomics" here to mean
         | something irrelevant. It is not true that x86 atomics do
         | nothing. Atomic instructions or, on x86, their prefix, make a
         | naturally non-atomic operation such as a read-modify-write
         | atomic. The ARM ISA actually lacked such a facility until
         | ARMv8.1.
         | 
         | The instructions to which you refer are not atomics, but rather
         | instructions that influence the ordering of loads and stores.
         | x86 has total store ordering by design. On ARM, the program has
         | to use LDAR/STLR to establish ordering.
        
         | phire wrote:
         | Everything it says about cache coherency is exactly the same on
         | ARM.
         | 
         | Memory ordering has nothing to do with cache coherency, it's
         | all about what happens within the CPU pipeline itself. On ARM
         | reads and writes can become reordered within the CPU pipeline
         | itself, before they hit the caches (which are still fully
         | coherent).
         | 
         | ARM still has strict memory ordering for code within a single
         | core (some older processors do not), but the writes from one
         | core might become visible to other cores in the wrong order.
        
       | ashvardanian wrote:
       | Here's my favorite practically applicable cache-related fact:
       | even on x86 on recent server CPUs, cache-coherency protocols may
       | be operating at a different granularity than the cache line size.
       | A typical case with new Intel server CPUs is operating at the
        | granularity of 2 consecutive cache lines. Some thread-pool
        | implementations, like CrossBeam in Rust and my ForkUnion in Rust
        | and C++, explicitly document that and align objects to 128 bytes
       | [1]:                 /**        *  @brief Defines variable
       | alignment to avoid false sharing.        *  @see https://en.cppre
       | ference.com/w/cpp/thread/hardware_destructive_interference_size
       | *  @see https://docs.rs/crossbeam-
       | utils/latest/crossbeam_utils/struct.CachePadded.html        *
       | *  The C++ STL way to do it is to use
       | `std::hardware_destructive_interference_size` if available:
       | *        *  @code{.cpp}        *  #if
       | defined(__cpp_lib_hardware_interference_size)        *  static
       | constexpr std::size_t default_alignment_k =
       | std::hardware_destructive_interference_size;        *  #else
       | *  static constexpr std::size_t default_alignment_k =
       | alignof(std::max_align_t);        *  #endif        *  @endcode
       | *        *  That however results into all kinds of ABI warnings
       | with GCC, and suboptimal alignment choice,        *  unless you
       | hard-code `--param hardware_destructive_interference_size=64` or
       | disable the warning        *  with `-Wno-interference-size`.
       | */       static constexpr std::size_t default_alignment_k = 128;
       | 
       | As mentioned in the docstring above, using STL's
       | `std::hardware_destructive_interference_size` won't help you. On
       | ARM, this issue becomes even more pronounced, so concurrency-
       | heavy code should ideally be compiled multiple times for
       | different coherence protocols and leverage "dynamic dispatch",
       | similar to how I & others handle SIMD instructions in libraries
       | that need to run on a very diverse set of platforms.
       | 
       | [1]
       | https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...
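The 128-byte padding described above can be sketched in a few lines of portable C++ (type and variable names here are illustrative, not from ForkUnion): `alignas` both aligns each slot and pads the struct up to the alignment, so adjacent per-thread counters can never share a cache line, nor an adjacent-line-prefetch pair on the Intel server parts mentioned.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hard-coded 128 to cover two 64-byte cache lines, sidestepping the
// hardware_destructive_interference_size issues described above.
constexpr std::size_t default_alignment_k = 128;

// alignas pads sizeof(padded_counter_t) up to 128 bytes, so array
// elements land on distinct 128-byte-aligned slots.
struct alignas(default_alignment_k) padded_counter_t {
    std::atomic<long> value{0};
};

static_assert(alignof(padded_counter_t) == 128, "aligned to two lines");
static_assert(sizeof(padded_counter_t) == 128, "padded to two lines");

// One slot per worker thread: no false sharing between neighbors.
padded_counter_t per_thread_counters[4];
```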
        
         | Sesse__ wrote:
          | This makes attempts at cargo-culting
          | __attribute__((aligned(64))) without benchmarking even more
          | hilarious. :-)
        
           | rnrn wrote:
           | It's not a cargo cult if the actions directly cause cargo to
           | arrive based on well understood mechanics.
           | 
           | Regardless of whether it would be better in some situations
           | to align to 128 bytes, 64 bytes really is the cache line size
           | on all common x86 cpus and it is a good idea to avoid threads
           | modifying the same cacheline.
        
             | Sesse__ wrote:
             | It indeed isn't, but I've seen my share of systems where
             | nobody checked if cargo arrived. (The code was checked in
             | without any benchmarks done, and after many years, it was
             | found that the macros used were effectively no-ops :-) )
        
         | rnrn wrote:
         | > even on x86 on recent server CPUs, cache-coherency protocols
         | may be operating at a different granularity than the cache line
         | size. A typical case with new Intel server CPUs is operating at
         | the granularity of 2 consecutive cache lines
         | 
         | I don't think it is accurate that Intel CPUs use 2 cache lines
         | / 128 bytes as the coherency protocol granule.
         | 
         | Yes, there can be additional destructive interference effects
         | at that granularity, but that's due to prefetching (of two
         | cachelines with coherency managed independently) rather than
         | having coherency operating on one 128 byte granule
         | 
         | AFAIK 64 bytes is still the correct granule for avoiding false
         | sharing, with two cores modifying two consecutive cachelines
         | having way less destructive interference than two cores
         | modifying one cacheline.
        
       | j_seigh wrote:
        | Coherent cache is transparent to the memory model. So if someone
        | trying to explain the memory model and ordering mentioned the
        | cache as affecting the memory model, it was pretty much a sign
        | they didn't fully understand what they were talking about.
        
       | wpollock wrote:
        | This post, along with the tutorial links it and the comments
        | contain, provides good insights on caches, coherence, and
        | related topics. I would like to add a link that I feel is also
        | very good, maybe better:
       | 
       | <https://marabos.nl/atomics/hardware.html>
       | 
       | While the book this chapter is from is about Rust, this chapter
       | is pretty much language-agnostic.
        
       ___________________________________________________________________
       (page generated 2025-11-01 23:01 UTC)