[HN Gopher] Memory Bandwidth
       ___________________________________________________________________
        
       Memory Bandwidth
        
       Author : luu
       Score  : 26 points
       Date   : 2021-08-10 01:05 UTC (21 hours ago)
        
 (HTM) web link (fgiesen.wordpress.com)
 (TXT) w3m dump (fgiesen.wordpress.com)
        
       | dragontamer wrote:
       | This is a good start to the discussion.
       | 
        | Let me add my $0.02. CPUs seem to hide memory issues with more
        | and more cache. The most recent AMD EPYC chips have 256MB of
        | L3 cache. Intel has less L3 cache, but now has something like
        | 1MB of L2 cache per core (!!!). AMD has shown us the future:
        | 64MB of SRAM grafted onto these chips to maybe add +64MB per
        | die (or maybe 768MB of L3 cache in a hypothetical future
        | EPYC).
       | 
        | GPUs have a different design. GPUs have almost no caches worth
        | discussing. Instead, GPUs have manually managed scratch RAM,
        | also known as __shared__ memory (which runs at roughly
        | L1-cache speeds). __shared__ is pretty small though, maybe
        | only 64kB or so. The really important thing about __shared__
        | is that any SIMD lane can read/write to it at high speed, so
        | it serves as a good area for inter-thread communication.
        | ("Thread" is a bit ambiguous, but... let's just simplify and
        | call them cuda-threads.)
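        | 
        | To make that concrete, here's a minimal sketch of a
        | block-level reduction (a toy kernel of my own, not from the
        | article) that uses __shared__ as the inter-thread
        | communication area:
        | 
        |     // Each block reduces 256 floats. Lanes exchange partial
        |     // sums through the manually managed __shared__ scratch
        |     // area instead of round-tripping through DRAM.
        |     __global__ void blockSum(const float* in, float* out)
        |     {
        |         __shared__ float scratch[256];
        |         unsigned t = threadIdx.x;
        |         scratch[t] = in[blockIdx.x * 256u + t];  // one DRAM load
        |         __syncthreads();
        | 
        |         // Tree reduction entirely inside __shared__.
        |         for (unsigned s = 128; s > 0; s >>= 1) {
        |             if (t < s) scratch[t] += scratch[t + s];
        |             __syncthreads();
        |         }
        |         if (t == 0) out[blockIdx.x] = scratch[0];
        |     }
        | 
        | Launched as blockSum<<<numBlocks, 256>>>(d_in, d_out), every
        | cuda-thread touches DRAM once and does the rest of its
        | communication through the scratch area.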
       | 
        | The other thing GPUs have is a ridiculous number of registers.
        | AMD RDNA has 1024 registers, of which up to 256 can be
        | allocated to a kernel. (The device driver of modern GPUs will
        | further parallelize loads by requesting only some of the
        | registers per invocation. A well-written memcpy kernel may
        | only need 64 vector registers, for example... so the driver
        | may load 16 parallel instances of this hypothetical kernel to
        | run in parallel on the GPU core, so that all 1024 registers
        | are used.)
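        | 
        | In CUDA terms you can see that trade-off directly. A hedged
        | sketch (my own toy kernel; the register cap is a number picked
        | purely for illustration):
        | 
        |     // Hypothetical memcpy-style kernel. Compiling it with
        |     // nvcc -maxrregcount=64 caps each thread at 64 registers,
        |     // so the scheduler can keep more instances resident per
        |     // core and overlap their memory stalls.
        |     __global__ void copyKernel(const float4* __restrict__ src,
        |                                float4* __restrict__ dst,
        |                                size_t n)
        |     {
        |         size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        |         size_t stride = (size_t)gridDim.x * blockDim.x;
        |         for (; i < n; i += stride)   // grid-stride loop
        |             dst[i] = src[i];         // 16-byte vector copies
        |     }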
       | 
        | These 1024 registers are 32 bits and 32 SIMD lanes wide on AMD
        | RDNA, or 128kB of registers per SIMD32 (and a WGP, the
        | workgroup processor that serves as an AMD RDNA "core",
        | contains four such SIMDs). I kid you not: AMD RDNA has more
        | registers per SIMD than Intel Skylake had L1 cache per core!!
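        | 
        | The arithmetic: 1024 regs x 32 lanes x 4 bytes = 131,072
        | bytes = 128kB per SIMD32, so roughly 512kB of registers per
        | WGP, versus 32kB of L1 data cache per Skylake core.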
       | 
       | ---------------
       | 
        | As such, we can immediately see how modern GPUs and CPUs get
        | around the memory-bandwidth issue. CPUs try to hide it with
        | increasingly large caches... while GPU register files are so
        | large that it borders on insanity.
        
         | sephamorr wrote:
          | Isn't this a bit dated? I certainly agree that the shared-
          | memory semantics are a critical distinction, but in addition
          | to up to 96kB of per-SM shared memory, NVIDIA Volta GPUs
          | have a 128kB combined L1$/shared block per SM and a unified
          | 6MB L2$. I don't think these caches are entirely /not worth
          | discussing/.
        
           | dragontamer wrote:
           | > I don't think these caches are entirely /not worth
           | discussing/
           | 
            | Hmmm... NVidia is clearly aiming at just punching through
            | the memory-bandwidth problem with GDDR6x (2 bits per clock
            | tick, since it's got 4-level PAM4 encoding).
            | 
            | That's the thing: NVidia isn't really pushing memory-
            | bandwidth limits on the L2 or even the L1 cache side, IMO.
            | Even the 128kB L1$ per SM is only roughly the size of the
            | SM's register space. Their most interesting move really is
            | GDDR6x, which is the brute-force way to solve the
            | problem...
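            | 
            | Illustrative arithmetic (RTX-3090-class numbers, quoted
            | from memory): PAM4 moves 2 bits per pin per symbol, so
            | 19.5 Gbps/pin x 384-bit bus / 8 = ~936 GB/s, versus
            | 14 Gbps/pin x 384 / 8 = 672 GB/s for the same bus width
            | on plain GDDR6.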
           | 
           | --------
           | 
           | AMD's "Infinity Cache" on RDNA2 is 128MB of L3$, but AMD is
           | using only standard GDDR6 (1-bit per clock tick transferred).
           | AMD's RDNA2 is very strange: L0, L1, L2, and L3 caches, when
           | AMD GCN was just L1 and L2 layers of cache.
           | 
           | That "infinity cache" is worth talking about I guess... its
           | large enough to be relevant in a number of gaming situations.
           | 
           | ------
           | 
            | I guess AMD and NVidia are both using HBM at the high end
            | for 1TB/s to 2TB/s bandwidths. But those chips aren't in
            | the consumer realm anymore. The ultimate brute-force
            | solution: spend more money.
        
       | ksec wrote:
        | The golden rule is _latency_ for CPUs and _bandwidth_ for
        | GPUs. That is why having larger L2 and L3 caches is much more
        | important than, say, raw DRAM bandwidth. The extra bandwidth
        | on an SoC / APU is mostly for GPU purposes.
        | 
        | And DDR5 / LPDDR5 seems to have enough headroom for another 5
        | years. So we seem to be doing fine (for now).
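        | 
        | A toy illustration of the distinction (my own sketch, nothing
        | from the article): a dependent pointer chase is latency-bound
        | because each load must retire before the next address is even
        | known, while a streaming sum is bandwidth-bound because the
        | loads are independent:
        | 
        |     #include <cstddef>
        | 
        |     // Latency-bound: a serial dependency chain, so the core
        |     // mostly waits on memory round trips. Big caches (the
        |     // CPU's game) shorten each trip.
        |     size_t chase(const size_t* next, size_t start,
        |                  size_t steps) {
        |         size_t i = start;
        |         for (size_t s = 0; s < steps; ++s)
        |             i = next[i];
        |         return i;
        |     }
        | 
        |     // Bandwidth-bound: independent loads that prefetchers and
        |     // out-of-order execution keep in flight; GPUs win this
        |     // one by running thousands of such streams at once.
        |     double streamSum(const double* a, size_t n) {
        |         double acc = 0.0;
        |         for (size_t i = 0; i < n; ++i)
        |             acc += a[i];
        |         return acc;
        |     }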
       | 
       | Edit: Read Dragontamer's post above / below.
        
       ___________________________________________________________________
       (page generated 2021-08-10 23:00 UTC)