[HN Gopher] Memory Bandwidth
___________________________________________________________________
Memory Bandwidth
Author : luu
Score : 26 points
Date : 2021-08-10 01:05 UTC (21 hours ago)
(HTM) web link (fgiesen.wordpress.com)
(TXT) w3m dump (fgiesen.wordpress.com)
| dragontamer wrote:
| This is a good start to the discussion.
|
| Let me add my $0.02. CPUs seem to hide memory issues with more-
| and-more cache. The most recent AMD EPYC chips have 256MB of L3
| cache. Intel has less L3 cache, but has something like 1MB of L2
| cache per core now (!!!). AMD has shown us the future: 64MB of
| stacked SRAM grafted onto each die to add +64MB of L3 (or maybe
| 768MB of L3 cache in a hypothetical future EPYC).
|
| GPUs have a different design. GPUs have almost no caches worth
| discussing. Instead, GPUs have manually managed scratch RAM, also
| known as __shared__ memory (which runs at roughly L1-cache
| speeds). __shared__ is pretty small though, maybe only 64kB or
| so. The really important thing about __shared__ is that any SIMD
| lane can read/write to it at high speed, so it serves as a good
| area for inter-thread communication. ("Thread" is a bit ambiguous
| here, but... let's simplify and call them cuda-threads.)
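|
| A minimal CUDA sketch of that inter-thread communication,
| assuming a 256-thread block that sums its inputs through
| __shared__ scratch (the kernel name and block size are purely
| illustrative):
|
|     // Each block reduces 256 floats through __shared__ scratch
|     // instead of bouncing partial sums through DRAM.
|     __global__ void blockSum(const float *in, float *out, int n) {
|         __shared__ float buf[256];     // manually managed scratch
|         int tid = threadIdx.x;
|         int idx = blockIdx.x * blockDim.x + tid;
|         buf[tid] = (idx < n) ? in[idx] : 0.0f;
|         __syncthreads();               // publish writes to the block
|         for (int s = blockDim.x / 2; s > 0; s >>= 1) {
|             if (tid < s)
|                 buf[tid] += buf[tid + s];  // lanes exchange data in
|             __syncthreads();               // scratch, not in DRAM
|         }
|         if (tid == 0)
|             out[blockIdx.x] = buf[0];  // one DRAM write per block
|     }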
|
| The other thing GPUs have is a ridiculous number of registers.
| AMD RDNA has 1024 vector registers, of which up to 256 can be
| allocated to a kernel. (Modern GPUs extract further parallelism
| by allocating only as many registers as each invocation actually
| needs. A well-written memcpy kernel may only need 64 vector
| registers, for example... so the hardware can keep 16 instances
| of that hypothetical kernel running in parallel on the same GPU
| core, using all 1024 registers.)
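|
| The same trade-off in CUDA terms, as a sketch (NVIDIA-side
| vocabulary; the kernel, the 256-thread blocks, and the "8 blocks
| per SM" target are all just illustrative):
|
|     // Grid-stride copy kernel. __launch_bounds__ asks the
|     // compiler to budget registers so that at least 8 blocks of
|     // 256 threads can be resident per SM at once (nvcc's
|     // -maxrregcount flag is the blunter, whole-file version of
|     // the same request).
|     __global__ void __launch_bounds__(256, 8)
|     copyKernel(const float4 *src, float4 *dst, size_t n) {
|         size_t i      = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
|         size_t stride = (size_t)gridDim.x * blockDim.x;
|         for (; i < n; i += stride)  // few live values => few registers
|             dst[i] = src[i];        // => more warps can be resident
|     }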
|
| These registers are 32 bits wide across 32 SIMD lanes on AMD
| RDNA, so 1024 x 32 x 4 bytes = 128kB of register file per SIMD32,
| and a WGP (workgroup processor, an AMD RDNA "core") packs four of
| those SIMD32s. I kid you not: AMD RDNA has more register space
| per WGP than Intel Skylake had L1 cache per core!!
|
| ---------------
|
| As such, we can immediately see how modern GPUs and CPUs get
| around the memory bandwidth issue. CPUs try to hide it with
| increasingly large caches... while GPU register files are so
| large that it borders on insanity.
| sephamorr wrote:
| Isn't this a bit dated? I certainly agree that the shared
| memory semantics are a critical distinction, but in addition to
| the 256Kb per SM shared memory, NVIDIA Volta GPUs have 128kB
| L1$ per SM, and a unified 6MB L2$. I don't think these caches
| are entirely /not worth discussing/.
| dragontamer wrote:
| > I don't think these caches are entirely /not worth
| discussing/
|
| Hmmm... NVidia is clearly aiming at just punching through the
| memory-bandwidth problem with GDDR6X (2 bits per transfer,
| since it's got 4-level PAM4 encoding).
|
| That's the thing: NVidia isn't really pushing memory bandwidth
| limits on the L2 or even L1 cache side, IMO. Even the 128kB L1$
| per SM is no bigger than the SM's register file. Their most
| interesting move really is GDDR6X, which is the brute-force way
| to solve that problem...
|
| --------
|
| AMD's "Infinity Cache" on RDNA2 is 128MB of L3$, but AMD is
| using only standard GDDR6 (1-bit per clock tick transferred).
| AMD's RDNA2 is very strange: L0, L1, L2, and L3 caches, when
| AMD GCN was just L1 and L2 layers of cache.
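|
| Back-of-the-envelope peak bandwidth for the two approaches
| (host-side arithmetic; the board configs are just assumed
| examples, a 3090-class GDDR6X setup vs. a 6800 XT-class GDDR6
| setup):
|
|     #include <cstdio>
|
|     // Peak DRAM bandwidth in GB/s: per-pin data rate (Gbit/s)
|     // times bus width (bits), divided by 8 bits per byte.
|     static double peak_gbs(double gbps_per_pin, double bus_bits) {
|         return gbps_per_pin * bus_bits / 8.0;
|     }
|
|     int main() {
|         printf("GDDR6X 19.5 Gbps x 384-bit: %.0f GB/s\n",
|                peak_gbs(19.5, 384));  // ~936 GB/s
|         printf("GDDR6  16.0 Gbps x 256-bit: %.0f GB/s\n",
|                peak_gbs(16.0, 256));  // ~512 GB/s
|         return 0;
|     }
|
| The gap between those two numbers is roughly what that big L3 is
| there to paper over.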
|
| That "infinity cache" is worth talking about I guess... its
| large enough to be relevant in a number of gaming situations.
|
| ------
|
| I guess AMD and NVidia are both using HBM at the high end for
| 1TB/s to 2TB/s bandwidths. But those chips aren't in the
| consumer realm anymore. The ultimate brute-force solution:
| spend more money.
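|
| For scale, here's how HBM gets there (per-stack figures are
| approximate, assuming HBM2e at ~3.2 Gbps/pin):
|
|     #include <cstdio>
|
|     int main() {
|         // Each HBM2e stack has a very wide (1024-bit) but
|         // relatively slow (~3.2 Gbps/pin) interface; high-end
|         // parts put 4-6 stacks next to the GPU die.
|         double per_stack = 3.2 * 1024 / 8.0;             // ~410 GB/s
|         printf("1 stack:  %.0f GB/s\n", per_stack);
|         printf("4 stacks: %.0f GB/s\n", 4 * per_stack);  // ~1.6 TB/s
|         printf("5 stacks: %.0f GB/s\n", 5 * per_stack);  // ~2.0 TB/s
|         return 0;
|     }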
| ksec wrote:
| The golden rule is _Latency_ for CPUs and Bandwidth for GPUs.
| Which is why having larger L2 and L3 caches is much more
| important for a CPU than, say, raw DRAM bandwidth. The extra
| bandwidth on SoCs / APUs is mostly there for GPU purposes.
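|
| A rough host-side illustration of why those are different limits
| (sizes and the permutation trick are just for demonstration):
| the pointer chase below makes every load wait on the previous
| one, so it's bound by latency no matter how fat the DRAM bus is,
| while the streaming sum issues independent loads the hardware
| can overlap, so it's bound by bandwidth:
|
|     #include <cstddef>
|     #include <cstdio>
|     #include <numeric>
|     #include <random>
|     #include <utility>
|     #include <vector>
|
|     int main() {
|         const size_t N = 1 << 24;  // ~16M elements, arbitrary
|
|         // Latency-bound: walk a single-cycle random permutation
|         // (Sattolo's shuffle), so every load depends on the last
|         // and almost always misses cache.
|         std::vector<size_t> next(N);
|         std::iota(next.begin(), next.end(), (size_t)0);
|         std::mt19937_64 rng(42);
|         for (size_t k = N - 1; k > 0; --k) {
|             std::uniform_int_distribution<size_t> pick(0, k - 1);
|             std::swap(next[k], next[pick(rng)]);
|         }
|         size_t i = 0;
|         for (size_t h = 0; h < N; ++h) i = next[i];
|
|         // Bandwidth-bound: independent, sequential loads that the
|         // memory system can keep many of in flight at once.
|         std::vector<double> v(N, 1.0);
|         double s = 0.0;
|         for (double x : v) s += x;
|
|         printf("%zu %.0f\n", i, s);  // keep both results live
|         return 0;
|     }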
|
| And DDR5 / LPDDR5 seem to have enough headroom for us for
| another 5 years. So we seem to be doing fine (for now).
|
| Edit: Read Dragontamer's post above / below.
___________________________________________________________________
(page generated 2021-08-10 23:00 UTC)