[HN Gopher] Why do CPUs have multiple cache levels? (2016)
___________________________________________________________________
Why do CPUs have multiple cache levels? (2016)
Author : aragonite
Score : 75 points
Date   : 2024-05-15 10:58 UTC (1 day ago)
(HTM) web link (fgiesen.wordpress.com)
(TXT) w3m dump (fgiesen.wordpress.com)
| aragonite wrote:
| Discussion at the time (121 comments):
|
| https://news.ycombinator.com/item?id=12245458
| jauntywundrkind wrote:
| We do the same thing with decoupling capacitors. Rather than
| putting all our storage in one unit, we have multiple stores of
| differing sizes to buffer different load profiles.
| 5ADBEEF wrote:
| I am not certain that this provides a lot of benefit. Worth
| discussion. Feels like most engineers just copy the application
| circuit nowadays though.
| https://www.signalintegrityjournal.com/articles/1589-the-myt...
| jeffbee wrote:
| This hardly explains anything. It doesn't talk about the tags,
| how loads and stores actually work, or how much each kind of
| cache costs. On most Intel CPUs, for example, the L1 cache is
| physically larger than the L2 cache, despite being logically 16x
| smaller. It just costs more to make a faster cache. It has more
| wires and gates. So that's why we have different kinds of cache
| memory: money.
| maksut wrote:
| That is interesting. I wonder if L1 is less dense because it
| has to have more bandwidth. But doesn't that point to a space
| constraint rather than money? A combination of L1 & L2 in the
| same space would have a bigger capacity, so it would be faster
| than pure L1 cache (for some/most workloads)?
|
| I always thought cache layers were about locality, but that is
| my imagination :) The article talks about the different access
| patterns of the cache layers, which makes sense to me.
|
| It also mentions density briefly:
|
| > Only what misses L1 needs to be passed on to higher cache
| levels, which don't need to be nearly as fast, nor have as much
| bandwidth. They can worry more about power efficiency and
| density instead.
| bhaney wrote:
| > doesn't that point to a space constraint rather than money?
|
| The space constraints are also caused by money. The reason we
| don't just add more L1 cache is that it would take up a lot
| of space, necessitating larger dies, which lowers yields and
| significantly increases the cost per chip.
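|
| A rough back-of-the-envelope sketch of that effect, using a
| simple Poisson yield model (all of these numbers, wafer cost,
| defect density, die areas, are assumed for illustration and not
| taken from any real process):
|
|     /* Toy cost model: bigger dies mean fewer candidates per
|      * wafer and lower yield, so cost per good die grows much
|      * faster than die area.  Assumed: 300 mm wafer, $10k per
|      * wafer, 0.1 defects/cm^2, yield Y = exp(-D * A). */
|     #include <math.h>
|     #include <stdio.h>
|
|     int main(void) {
|         const double pi = 3.14159265358979;
|         const double wafer_diam_mm   = 300.0;
|         const double wafer_cost_usd  = 10000.0;  /* assumed */
|         const double defects_per_cm2 = 0.1;      /* assumed */
|         const double wafer_area = pi * (wafer_diam_mm / 2) *
|                                        (wafer_diam_mm / 2);
|         const double die_mm2[] = { 100, 200, 400, 800 };
|
|         for (int i = 0; i < 4; i++) {
|             double a = die_mm2[i];
|             double dies  = wafer_area / a;  /* ignores edge loss */
|             double yield = exp(-defects_per_cm2 * a / 100.0);
|             printf("%4.0f mm^2: ~%4.0f dies, yield %4.1f%%, "
|                    "~$%.0f per good die\n", a, dies, 100 * yield,
|                    wafer_cost_usd / (dies * yield));
|         }
|         return 0;
|     }
|
| With those made-up numbers an 8x larger die costs roughly 16x
| more per good chip, which is the "lowers yields and increases
| the cost per chip" point in compressed form.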
| jeffbee wrote:
| Also it draws a huge amount of power.
| IcyWindows wrote:
| I would say it's physics, not money.
|
| Space constraints are caused by power and latency limits
| even with infinite money.
| Symmetry wrote:
| That isn't true at all. The limited speed at which a signal
| can propagate across a chip and the added levels of muxing
| necessarily mean that there's a lower bound on how low the
| latency of a cache can be, roughly proportional to the square
| root of its capacity.
| bhaney wrote:
| It actually is true. You're also right that physics would
| eventually constrain die size, but it isn't the
| bottleneck that's keeping CPUs at their typical current
| size. This should be pretty obvious from the existence of
| (rare and expensive) specialty CPU dies that are much
| larger than typical ones. They're clearly not physically
| impossible to produce, just very expensive. The current
| bottleneck holding back die sizes is in fact cost, since
| larger dies mean each inevitable blemish ruins a larger
| chunk of the silicon wafer, cratering yields.
|
| > added levels of muxing necessarily mean that there's a
| limit to how low the latency of a cache can be
|
| L1 cache avoids muxing as much as possible, which is why
| it takes up so much die space in the first place.
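|
| To put the square-root scaling mentioned above in concrete
| terms, here's a toy latency model (the constants are invented;
| only the shape of the curve matters): treat the cache data
| array as roughly square, so the worst-case wire run grows with
| the side length, i.e. with the square root of the capacity.
|
|     /* Toy model: fixed logic/tag time plus a wire term that
|      * grows with the side of a square data array, i.e. with
|      * sqrt(capacity).  Both constants are made up. */
|     #include <math.h>
|     #include <stdio.h>
|
|     int main(void) {
|         const double t_logic_ps  = 300.0; /* decode + tag + mux */
|         const double k_wire_ps   = 2.0;   /* ps per sqrt(byte)  */
|         const double sizes_kib[] = { 32, 256, 2048, 32768 };
|
|         for (int i = 0; i < 4; i++) {
|             double bytes = sizes_kib[i] * 1024.0;
|             double t = t_logic_ps + k_wire_ps * sqrt(bytes);
|             printf("%6.0f KiB: ~%5.0f ps\n", sizes_kib[i], t);
|         }
|         return 0;
|     }
|
| Quadrupling the capacity only doubles the wire term, but that
| term dominates once the array gets big, which is why the lower
| bound tracks sqrt(capacity) rather than capacity.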
| Symmetry wrote:
| Are you including address translation in the area you're
| ascribing to the L1$ there? I haven't looked at detailed area
| numbers for recent Intel designs but having equal area for L1$
| and L2$ seems really weird to me based on numbers from other
| designs I've seen.
|
| I'm having a hard time mentally coming up with a way a larger
| L1$ could be _faster_. Have more ways, sure. Or more read
| ports. Or more bandwidth. And I'm given to understand that
| Intel tags (or used to tag) instruction boundaries in their
| L1I$. But how do you reduce latency? Physically larger
| transistors can more quickly overcome the capacitance of the
| lines to their destinations and of the destinations themselves,
| but a larger cache makes the distances and line capacitances
| correspondingly larger. You can have speed differences with 6T
| versus 8T SRAM cell designs, but as I understand it Intel went
| to 8T everywhere in Nehalem for energy efficiency reasons. I
| guess maybe changes in transistor technology could have made
| them revisit that, but 8 transistors isn't that much physically
| larger than 6.
|
| But in general there are a lot of complicated things about how
| the memory subsystem of a CPU works that are important to
| performance and add a floor to how low a first-level cache's
| latency can be, but they don't really contradict anything that
| was said in the article.
| charleshn wrote:
| This answer by Peter Cordes is much more complete:
| https://stackoverflow.com/a/38549736
| wahern wrote:
| No answer is complete without getting into fundamental physics,
| information theory, and black holes:
| https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
|
| TL;DR: Time complexity of memory access scales as O(√N), not
| O(1), where N is your data set size. This applies even without
| hitting theoretical information density limits, whatever your
| best process size and latency. For optimal performance on any
| particular task you'll always need to consider locality of your
| working data set, i.e. memory hierarchy, caching, etc. And to
| be clear, the implication is that temporal and spatial locality
| are fundamentally related.
|
| Yes, some cache levels have lower intrinsic latency than
| others, but at scale (even at-home, DIY, toy project scales
| these days) it's simpler to think of that as a consequence of
| space-time physics, not a property of manufacturing processes,
| packaging techniques, or algorithms. This is liberating in the
| sense that you can predict and design against both today's and
| tomorrow's systems without fixating on specific technology or
| benchmarks.
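|
| If you want to see that curve on your own machine, the usual
| trick is a dependent pointer chase over growing working sets.
| A minimal sketch (single run per size, random cycle to defeat
| the prefetcher, not a rigorous benchmark):
|
|     /* Minimal pointer-chasing sketch: average latency of a
|      * dependent random walk over buffers of increasing size.
|      * Expect steps upward as the working set falls out of L1,
|      * L2, L3 and finally into DRAM. */
|     #define _POSIX_C_SOURCE 199309L
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <time.h>
|
|     static double now_sec(void) {
|         struct timespec ts;
|         clock_gettime(CLOCK_MONOTONIC, &ts);
|         return ts.tv_sec + ts.tv_nsec * 1e-9;
|     }
|
|     int main(void) {
|         const size_t steps = 1u << 24;   /* loads per run */
|         for (size_t kb = 16; kb <= 256 * 1024; kb *= 4) {
|             size_t n = kb * 1024 / sizeof(size_t);
|             size_t *next = malloc(n * sizeof(size_t));
|             if (!next) return 1;
|
|             /* Sattolo's algorithm: a random single cycle, so
|              * every load depends on the previous one and the
|              * prefetcher sees no regular stride. */
|             for (size_t i = 0; i < n; i++) next[i] = i;
|             for (size_t i = n - 1; i > 0; i--) {
|                 size_t j = (size_t)rand() % i;
|                 size_t t = next[i]; next[i] = next[j]; next[j] = t;
|             }
|
|             size_t p = 0;
|             double t0 = now_sec();
|             for (size_t s = 0; s < steps; s++) p = next[p];
|             double t1 = now_sec();
|
|             printf("%8zu KiB: %5.1f ns/load (sink %zu)\n", kb,
|                    (t1 - t0) * 1e9 / (double)steps, p);
|             free(next);
|         }
|         return 0;
|     }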
| nomel wrote:
| Related, here's a neat discussion about how this poisons Big O
| notation: https://news.ycombinator.com/item?id=38337989
| hi-v-rocknroll wrote:
| Big O notation is theoretical worst-case analysis of runtime or
| space that rarely/never maps to actual performance. It's nice
| to wax about, but what really matters is quality benchmark data
| from actual code run on real systems.
| VS1999 wrote:
| It's overvalued, but it's still useful for the average person
| to have an easy way to think about whether they're about to
| run 1000 operations or 1000^3 operations on something.
| neglesaks wrote:
| Working set size, latency and pesky laws of economics.
| tdsanchez wrote:
| CPUs have multiple cache levels because a machine cycle on the
| CPU die is ~500 ps, while a write to main memory followed by a
| read back at DRAM latency takes around 200 ns, during which the
| CPU would sit idle.
|
| To mask this, we write to cache instead and rely on cache
| coherency protocols and multiway, multilevel caches to make
| sure main memory is written back and re-read when cache lines
| are invalidated.
|
| tl;dr - Current process technologies make SRAM very much faster
| than DRAM, and multiple levels of multiway caches bridge that
| timing gap to maximise memory throughput to the CPU registers
| while maintaining coherent memory write-backs.
|
| It's worth noting that Apple Silicon is fast partly because its
| DRAM bandwidth and latency are much closer to the machine cycle
| of the APU cores' caches and registers.
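|
| Putting rough numbers on that gap with the usual average memory
| access time (AMAT) arithmetic; the hit rates and per-level
| latencies below are assumed for illustration, not measured on
| any particular chip:
|
|     /* AMAT sketch: a 0.5 ns cycle against ~200 ns DRAM, with
|      * assumed hit rates and latencies for L1/L2/L3. */
|     #include <stdio.h>
|
|     int main(void) {
|         const double cycle_ns = 0.5;
|         const double l1_ns = 2.0,  l1_hit = 0.95;
|         const double l2_ns = 6.0,  l2_hit = 0.90; /* of L1 misses */
|         const double l3_ns = 20.0, l3_hit = 0.80; /* of L2 misses */
|         const double dram_ns = 200.0;
|
|         double amat = l1_ns
|             + (1 - l1_hit) * (l2_ns
|             + (1 - l2_hit) * (l3_ns
|             + (1 - l3_hit) * dram_ns));
|
|         printf("raw DRAM round trip: ~%.0f cycles stalled\n",
|                dram_ns / cycle_ns);
|         printf("average access with caches: %.2f ns (~%.1f cycles)\n",
|                amat, amat / cycle_ns);
|         return 0;
|     }
|
| With those made-up rates the average access costs a handful of
| cycles instead of ~400, which is the whole point of the
| hierarchy.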
| hi-v-rocknroll wrote:
| Sigh. Memory/storage hierarchy vs. cost tradeoffs is why.
___________________________________________________________________
(page generated 2024-05-16 23:01 UTC)