[HN Gopher] Why do CPUs have multiple cache levels? (2016)
       ___________________________________________________________________
        
       Why do CPUs have multiple cache levels? (2016)
        
       Author : aragonite
       Score  : 75 points
       Date   : 2024-05-15 10:58 UTC (1 day ago)
        
 (HTM) web link (fgiesen.wordpress.com)
 (TXT) w3m dump (fgiesen.wordpress.com)
        
       | aragonite wrote:
       | Discussion at the time (121 comments):
       | 
       | https://news.ycombinator.com/item?id=12245458
        
       | jauntywundrkind wrote:
       | We do the same things with decoupling capacitors. Rather than put
       | all our store in one unit, we have multiple stores of differing
       | size to buffer different load profiles.
        
         | 5ADBEEF wrote:
         | I am not certain that this provides a lot of benefit. Worth
         | discussion. Feels like most engineers just copy the application
         | circuit nowadays though.
         | https://www.signalintegrityjournal.com/articles/1589-the-myt...
        
       | jeffbee wrote:
       | This hardly explains anything. It doesn't talk about the tags,
       | how loads and stores actually work, and how much each kind of
       | cache costs. On most Intel CPUs for example the L1 cache is
       | physically larger than the L2 cache, despite being logically 16x
       | smaller. It just costs more to make a faster cache. It has more
       | wires and gates. So that's why we have different kinds of cache
       | memory: money.
        
         | maksut wrote:
         | That is interesting. I wonder if L1 is less dense because it
         | has to have more bandwidth. But doesn't that point to a space
         | constraint rather than money? A combination of L1 & L2 would
         | have a bigger capacity in the same space, so it would be
         | faster than a pure L1 cache (for some/most workloads)?
         | 
         | I always thought cache layering was because of locality, but
         | that is my imagination :) The article talks about the
         | different access patterns of the cache layers, which makes
         | sense in my mind.
         | 
         | It also mentions density briefly:
         | 
         | > Only what misses L1 needs to be passed on to higher cache
         | levels, which don't need to be nearly as fast, nor have as much
         | bandwidth. They can worry more about power efficiency and
         | density instead.
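         |
         | For intuition, here is a minimal two-level lookup sketch of
         | that "only misses go up" flow. It is purely illustrative: the
         | sizes and names (lookup, L1_LINES, ...) are made up here, not
         | from the article, and real caches add ways, state bits,
         | prefetchers and so on.
         |
         |   #include <stdbool.h>
         |   #include <stdint.h>
         |   #include <stdio.h>
         |
         |   /* Toy direct-mapped levels: check L1 first; only a miss
         |    * ever touches the slower L2 or, failing that, memory.  */
         |   typedef struct { uint64_t tag; bool valid; } line_t;
         |
         |   #define L1_LINES  512         /* e.g. 32 KiB of 64B lines  */
         |   #define L2_LINES 4096         /* e.g. 256 KiB of 64B lines */
         |
         |   static line_t l1[L1_LINES], l2[L2_LINES];
         |
         |   static int lookup(uint64_t addr)  /* returns level that hit */
         |   {
         |       uint64_t block = addr / 64;   /* 64-byte cache lines    */
         |       line_t *a = &l1[block % L1_LINES];
         |       if (a->valid && a->tag == block)
         |           return 1;                 /* L1 hit: fast path      */
         |
         |       /* miss: go up a level, then fill both on the way back  */
         |       line_t *b = &l2[block % L2_LINES];
         |       int level = (b->valid && b->tag == block) ? 2 : 3;
         |       *b = (line_t){ .tag = block, .valid = true };
         |       *a = *b;
         |       return level;                 /* 3 means main memory    */
         |   }
         |
         |   int main(void)
         |   {
         |       printf("%d %d\n", lookup(64), lookup(64));   /* 3 1 */
         |   }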
        
           | bhaney wrote:
           | > doesn't that point to a space constraint rather than money?
           | 
           | The space constraints are also caused by money. The reason we
           | don't just add more L1 cache is that it would take up a lot
           | of space, necessitating larger dies, which lowers yields and
           | significantly increases the cost per chip.
        
             | jeffbee wrote:
             | Also it draws a huge amount of power.
        
             | IcyWindows wrote:
             | I would say it's physics, not money.
             | 
             | Space constraints are caused by power and latency limits
             | even with infinite money.
        
             | Symmetry wrote:
             | That isn't true at all. The limited speed at which a signal
             | can propagate across a chip and the added levels of muxing
             | necessarily mean that there's a limit to how low the
             | latency of a cache can be, and that limit is roughly
             | proportional to the square root of its capacity.
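             |
             | As a rough worked example of that scaling (the numbers are
             | illustrative, not taken from any datasheet): if latency
             | goes as t ~ k*sqrt(C), then growing a 32 KiB L1 at ~1 ns
             | into a 2 MiB cache of the same design multiplies C by 64
             | and sqrt(C) by 8, so roughly 8 ns before any extra muxing
             | or tag overhead is even counted.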
        
               | bhaney wrote:
               | It actually is true. You're also right that physics would
               | eventually constrain die size, but it isn't the
               | bottleneck that's keeping CPUs at their typical current
               | size. This should be pretty obvious from the existence of
               | (rare and expensive) specialty CPU dies that are much
               | larger than typical ones. They're clearly not physically
               | impossible to produce, just very expensive. The current
               | bottleneck holding back die sizes is in fact cost, since
               | with a larger die each inevitable blemish ruins a larger
               | chunk of your silicon wafer, cratering yields.
               | 
               | > added levels of muxing necessarily mean that there's a
               | limit to how low the latency of a cache can be
               | 
               | L1 cache avoids muxing as much as possible, which is why
               | it takes up so much die space in the first place.
        
         | Symmetry wrote:
         | Are you including address translation in the area you're
         | ascribing to the L1$ there? I haven't looked at detailed area
         | numbers for recent Intel designs but having equal area for L1$
         | and L2$ seems really weird to me based on numbers from other
         | designs I've seen.
         | 
         | I'm having a hard time mentally coming up with a way a larger
         | L1$ could be _faster_. Have more ways, sure. Or more read
         | ports. Or more bandwidth. And I'm given to understand that
         | Intel tags (or used to tag) instruction boundaries in their
         | L1I$. But how do you reduce latency? Physically larger
         | transistors can more quickly overcome the capacitance of the
         | lines to their destinations and the capacitance of the
         | destinations themselves, but a large cache makes the distance
         | and line capacitance correspondingly larger. You can have
         | speed differences with 6T versus 8T SRAM cell designs, but as
         | I understand it Intel went to 8T everywhere in Nehalem for
         | energy efficiency reasons. I guess maybe changes in transistor
         | technology could have made them revisit that, but 8
         | transistors isn't that much physically larger than 6.
         | 
         | But in general there are a lot of complicated things about
         | how the memory subsystem of a CPU works that are important to
         | performance and add a floor to how low a first-level cache's
         | latency can be, but they don't really contradict anything
         | that was said in the article.
        
       | charleshn wrote:
       | This answer by Peter Cordes is much more complete:
       | https://stackoverflow.com/a/38549736
        
         | wahern wrote:
         | No answer is complete without getting into fundamental physics,
         | information theory, and black holes:
         | https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
         | 
         | TL;DR: Time complexity of memory access scales O(√N), not
         | O(1), where N is your data set size. This applies even without
         | hitting theoretical information density limits, whatever your
         | best process size and latency. For optimal performance on any
         | particular task you'll always need to consider locality of your
         | working data set, i.e. memory hierarchy, caching, etc. And to
         | be clear, the implication is that temporal and spatial locality
         | are fundamentally related.
         | 
         | Yes, some cache levels have lower intrinsic latency than
         | others, but at scale (even at-home, DIY, toy project scales
         | these days) it's simpler to think of that as a consequence of
         | space-time physics, not a property of manufacturing processes,
         | packaging techniques, or algorithms. This is liberating in the
         | sense that you can predict and design against both today's and
         | tomorrow's systems without fixating on specific technology or
         | benchmarks.
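         |
         | If you want to see that behaviour on your own machine, a
         | pointer-chasing loop over growing working sets is the usual
         | trick. The sketch below is only a rough illustration (it
         | assumes a POSIX clock_gettime(); the sizes, hop count and use
         | of rand() are arbitrary choices, not anything from the linked
         | article):
         |
         |   #include <stdio.h>
         |   #include <stdlib.h>
         |   #include <time.h>
         |
         |   /* Average ns per dependent load while chasing a random
         |    * single-cycle permutation (Sattolo's algorithm), so the
         |    * prefetcher can't guess the next address.              */
         |   static double chase_ns(size_t n)
         |   {
         |       size_t *next = malloc(n * sizeof *next);
         |       for (size_t i = 0; i < n; i++) next[i] = i;
         |       for (size_t i = n - 1; i > 0; i--) {
         |           size_t j = (size_t)rand() % i;   /* crude, but ok */
         |           size_t t = next[i]; next[i] = next[j]; next[j] = t;
         |       }
         |       volatile size_t p = 0;         /* keeps the loop alive */
         |       size_t hops = 10 * 1000 * 1000;
         |       struct timespec t0, t1;
         |       clock_gettime(CLOCK_MONOTONIC, &t0);
         |       for (size_t i = 0; i < hops; i++) p = next[p];
         |       clock_gettime(CLOCK_MONOTONIC, &t1);
         |       free(next);
         |       double ns = (t1.tv_sec - t0.tv_sec) * 1e9
         |                 + (t1.tv_nsec - t0.tv_nsec);
         |       return ns / hops;
         |   }
         |
         |   int main(void)
         |   {
         |       /* 8 KiB up to 128 MiB of 8-byte elements */
         |       for (size_t n = 1 << 10; n <= 1 << 24; n <<= 2)
         |           printf("%9zu elems: %.2f ns/load\n", n, chase_ns(n));
         |   }
         |
         | Plotting ns/load against working set size shows a step each
         | time the set spills out of a cache level, which is the same
         | picture the article and the link above describe.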
        
       | nomel wrote:
       | Related, here's a neat discussion about how this poisons Big O
       | notation: https://news.ycombinator.com/item?id=38337989
        
         | hi-v-rocknroll wrote:
         | Big O notation is theoretical worst-case analysis of runtime
         | or space that rarely/never maps to actual performance. It's
         | nice to wax about, but what really matters is quality
         | benchmark data of actual code run on real systems.
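         |
         | A small example of why the asymptotics alone mislead here:
         | both loops below are O(n) sums over the same array, but only
         | one walks it in cache-friendly order. This is just a sketch
         | (it assumes a POSIX clock_gettime() and uses a crude rand()
         | shuffle); the measured gap comes from the memory hierarchy,
         | not from the complexity class:
         |
         |   #include <stdio.h>
         |   #include <stdlib.h>
         |   #include <time.h>
         |
         |   static double ms(struct timespec a, struct timespec b)
         |   {
         |       return (b.tv_sec - a.tv_sec) * 1e3
         |            + (b.tv_nsec - a.tv_nsec) / 1e6;
         |   }
         |
         |   int main(void)
         |   {
         |       enum { N = 1 << 24 };               /* 16M longs      */
         |       long   *data = malloc(N * sizeof *data);
         |       size_t *ord  = malloc(N * sizeof *ord);
         |       for (size_t i = 0; i < N; i++) {
         |           data[i] = (long)i;
         |           ord[i]  = i;
         |       }
         |       for (size_t i = N - 1; i > 0; i--) { /* shuffle order */
         |           size_t j = (size_t)rand() % (i + 1);
         |           size_t t = ord[i]; ord[i] = ord[j]; ord[j] = t;
         |       }
         |
         |       volatile long sink = 0;
         |       struct timespec t0, t1, t2;
         |       clock_gettime(CLOCK_MONOTONIC, &t0);
         |       for (size_t i = 0; i < N; i++) sink += data[i];
         |       clock_gettime(CLOCK_MONOTONIC, &t1);
         |       for (size_t i = 0; i < N; i++) sink += data[ord[i]];
         |       clock_gettime(CLOCK_MONOTONIC, &t2);
         |
         |       printf("sequential %.1f ms, shuffled %.1f ms\n",
         |              ms(t0, t1), ms(t1, t2));
         |   }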
        
           | VS1999 wrote:
           | It's overvalued but still useful for the average person to
           | have an easy way to think about whether they're about to
           | run 1000 operations or 1000^3 operations on something.
        
       | neglesaks wrote:
       | Working set size, latency and pesky laws of economics.
        
       | tdsanchez wrote:
       | CPUs have multiple cache levels because a machine cycle at the
       | CPU die is ~500ps, while writing to main memory and then
       | reading it back at a similar latency is going to be around
       | 200ns, during which the CPU sits idle.
       | 
       | To mask this, we write back to cache and rely on cache coherency
       | algorithms and multiway, multilevel caches to make sure main
       | memory is written back to and read when cache tags are
       | invalidated.
       | 
       | tl;dr - Current process technologies make SRAM very much
       | faster than DRAM, and multiple levels of multiway caches create
       | a time-based interface to maximise memory throughput to the CPU
       | registers while maintaining coherent memory write-backs.
       | 
       | It's worth noting that Apple Silicon is fast because its DRAM
       | bandwidth and latency are much closer, in machine cycles, to
       | those of the APU cores' caches and registers.
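       |
       | A minimal sketch of the write-back behaviour described above
       | (the names and numbers here are made up for illustration; real
       | hardware adds ways, MESI-style coherence states, write buffers
       | and so on):
       |
       |   #include <stdbool.h>
       |   #include <stdint.h>
       |   #include <string.h>
       |
       |   /* One toy direct-mapped, write-back line: stores land in the
       |    * SRAM-speed line, and slow DRAM is only touched when a dirty
       |    * line has to be evicted to make room for another address.  */
       |   #define LINE 64
       |
       |   struct line {
       |       uint64_t tag; bool valid, dirty; uint8_t d[LINE];
       |   };
       |
       |   static uint8_t dram[1 << 20];      /* stand-in main memory   */
       |
       |   static void store_byte(struct line *l, uint64_t a, uint8_t v)
       |   {
       |       uint64_t tag = a / LINE;    /* a assumed < sizeof dram   */
       |       if (!(l->valid && l->tag == tag)) {     /* miss          */
       |           if (l->valid && l->dirty)           /* evict ~200 ns */
       |               memcpy(&dram[l->tag * LINE], l->d, LINE);
       |           memcpy(l->d, &dram[tag * LINE], LINE); /* line fill  */
       |           l->tag = tag; l->valid = true; l->dirty = false;
       |       }
       |       l->d[a % LINE] = v;     /* hit path: SRAM speed, ~1 ns   */
       |       l->dirty = true;        /* DRAM is updated only on evict */
       |   }
       |
       |   int main(void)
       |   {
       |       static struct line l1;    /* one line standing in for L1 */
       |       store_byte(&l1, 0, 1);    /* miss: line fill, then store */
       |       store_byte(&l1, 1, 2);    /* hit: no DRAM traffic at all */
       |       store_byte(&l1, 4096, 3); /* conflict: dirty line evicted */
       |   }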
        
       | hi-v-rocknroll wrote:
       | Sigh. Memory/storage hierarchy vs. cost tradeoffs is why.
        
       ___________________________________________________________________
       (page generated 2024-05-16 23:01 UTC)