[HN Gopher] Exploring How Cache Memory Works
       ___________________________________________________________________
        
       Exploring How Cache Memory Works
        
       Author : imadj
       Score  : 98 points
       Date   : 2024-06-21 18:04 UTC (5 days ago)
        
 (HTM) web link (pikuma.com)
 (TXT) w3m dump (pikuma.com)
        
       | emschwartz wrote:
        | In a similar vein, Andrew Kelley, the creator of Zig, gave a nice
       | talk about how to make use of the different speeds of different
       | CPU operations in designing programs: Practical Data-Oriented
       | Design https://vimeo.com/649009599
        
       | wyldfire wrote:
       | Drepper's "What Every Programmer Should Know About Memory" [1] is
       | a good resource on a similar topic. Not so long ago, there was an
       | analysis done on it in a series of blog posts [2] from a more
       | modern perspective.
       | 
       | [1] https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
       | 
       | [2] https://samueleresca.net/analysis-of-what-every-
       | programmer-s...
        
       | seany62 wrote:
       | Super interesting. Thank you!
        
       | eikenberry wrote:
       | In case you are wondering about your cache-line size on a Linux
       | box, you can find it in sysfs.. something like..
       | cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
        
         | Hello71 wrote:
         | grep .
         | /sys/devices/system/cpu/cpu*/cache/index*/coherency_line_size
         | 
         | would be better, but                 lscpu -C
         | 
         | is more useful.
        
           | eikenberry wrote:
           | Didn't know about 'lscpu -C'.. thanks!
        
       | dangoldin wrote:
        | Really cool stuff and a nice introduction, but I'm curious how
        | much modern compilers already do for you. Especially if you
        | shift to the JIT world: what ends up being the difference
        | between code where people optimize for this vs. code written in
        | a style optimized for readability/reuse/etc.?
        
         | tux1968 wrote:
         | JIT compilers can't compensate for poorly organized data.
          | Ultimately, understanding these low-level concepts affects
          | high-level algorithm design and selection.
         | 
          | Watching the Andrew Kelley video mentioned above really
          | drives home the point that even if your compiler
          | automatically optimizes struct ordering to minimize padding
          | and alignment issues, it can't fix other higher-level
          | decisions. An example is using two separate lists of structs
          | to maintain their state data, rather than a single list with
          | each struct having an enum to record its state.
        
         | kllrnohj wrote:
          | JIT languages tend to have the worst language-provided
          | locality, as they are often accompanied by GCs and a lack of
          | value types (there are exceptions, but it's broadly the
          | case). And a JIT cannot rearrange the heap memory layout of
          | objects, as it must be hot-swappable. This is why, despite
          | incredibly huge investments in them, such languages never
          | reach AOT performance, regardless of how much theoretical
          | advantage a JIT could have.
          | 
          | AOT'd languages _could_ rearrange a struct for better
          | locality, but the majority (if not all) rigidly require that
          | fields are laid out in the order defined, for various
          | reasons.
        
       | hinkley wrote:
       | Wait wait wait.
       | 
       | M2 processors have 128 byte wide cache lines?? That's a big deal.
       | We've been at 64 bytes since what, the Pentium?
        
         | monocasa wrote:
          | Yeah, 64 bytes is kind of an unstated x86 thing. It'd be hell
          | for them to change it; a lot of perf-conscious code aligns to
          | 64-byte boundaries to combat false sharing.
        
           | kllrnohj wrote:
           | all ARM-designed cores are also 64-bytes. It's not _just_ an
           | x86 thing
        
             | monocasa wrote:
             | The Cortex A9 had 32 byte cache lines for one prominent
             | counterexample.
             | 
             | But my point was more that the size is baked into x86 in a
             | pretty deep way these days. You'd be looking at new
             | releases from all software that cares about such things on
             | x86 to support a different cache line size without major
             | perf regressions. So all of the major kernels, probably the
             | JVM and CLR, game engines (and good luck there).
             | 
              | IMO Intel should stick a "please query the size of the
              | cache line if you care about its length" clause into APX,
              | to push code today to stop #defining CACHE_LINE_SIZE (64)
              | on x86.
        
               | jcranmer wrote:
                | > IMO Intel should stick a "please query the size of the
                | cache line if you care about its length" clause into
                | APX, to push code today to stop #defining CACHE_LINE_SIZE
                | (64) on x86.
                | 
                | CPUID EAX=1, bits 8-15 (i.e., the second byte) of EBX in
                | the result give the CLFLUSH line size in 8-byte units
                | (multiply by 8 for the size in bytes). It's been there
                | since the Pentium 4, apparently.
               | 
               | You can also get line size for each cache level with
               | CPUID EAX=4, along with the set-associativity and other
               | low-level cache parameters.
        
               | kllrnohj wrote:
               | > The Cortex A9 had 32 byte cache lines for one prominent
               | counterexample.
               | 
               | Ok, all arm-designed cores for the last 15 years then :)
        
             | 201984 wrote:
             | Some Cortex-A53s have 16-byte cachelines, which I found out
             | the hard way recently.
        
         | CyberDildonics wrote:
          | In practice, Intel CPUs have for a very long time pulled down
          | 128 bytes at a minimum when you access memory.
          | 
          | 64-byte cache lines are still there as part of other alignment
          | boundaries for things like atomics, but accessing memory pulls
          | down two cache lines at a time.
        
       | boshalfoshal wrote:
        | I think cache coherency protocols are less intuitive and less
        | talked about when people discuss caching, so it would be nice
        | to have some discussion of that too.
       | 
       | But otherwise this is a good general overview of how caching is
       | useful.
        
       | branko_d wrote:
       | Why is the natural alignment of structs equal to the size of
       | their largest member?
        
         | kllrnohj wrote:
         | To ensure that member is itself still aligned properly in
          | "global space". The start of the struct is assumed to be
          | universally aligned (malloc, etc. make that a requirement, in
          | fact) or aligned to the requirements of the struct itself
          | (e.g., in an array). Thus any offset into the struct only
          | needs to be aligned to the requirements of the largest type.
         | 
         | https://www.kernel.org/doc/html/latest/core-api/unaligned-me...
         | has a lot more general context on alignment and why it's
         | important
        
         | jcranmer wrote:
          | It's not. It's equal to the maximum alignment of their
          | members.
         | For primitive types (like integers, floating-point types and
         | pointers), size == alignment on most machines nowadays
         | (although on 32-bit machines, it can be a toss-up whether a
         | 64-bit integer is 64-bit aligned or 32-bit aligned), so it can
         | look like it's based on size though.
        
       | ThatNiceGuyy wrote:
       | Great article. I have always had an open question in my mind
       | about struct alignment and this explained it very succinctly.
        
       ___________________________________________________________________
       (page generated 2024-06-26 23:00 UTC)