[HN Gopher] Zen, CUDA, and Tensor Cores, Part I: The Silicon
       ___________________________________________________________________
        
       Zen, CUDA, and Tensor Cores, Part I: The Silicon
        
       Author : throwaway71271
       Score  : 146 points
       Date   : 2024-09-03 19:34 UTC (4 days ago)
        
 (HTM) web link (www.computerenhance.com)
 (TXT) w3m dump (www.computerenhance.com)
        
       | downvotetruth wrote:
        | I refused to buy the chips determined to be defective, even
        | if they represented better value, because if the intent was
        | truly to maximize yield then there should be, for Ryzen for
        | example, good 7-core versions with only 1 core found to be
        | defective. Since no 7-core Zens exist, at least some of the
        | CPUs with 6-core CCDs have intentionally had 1 of the cores
        | destroyed for reasons unknown, which could be to meet volume
        | targets. If this is because Ryzen cores can only be disabled
        | in pairs, then it boggles my mind that, given the difference
        | of tens to hundreds of dollars between the 6- and 8-core
        | versions, it would not be economic to add the circuits to
        | allow each core to be individually fused off and allow
        | further product differentiation, especially considering how
        | much effort and how many SKUs have been put forth with
        | frequency binning on AM4 (5700x, 5800, 5800x, 5800xt, etc.),
        | rather than bigger market segmentation jumps.
        
         | AnthonyMouse wrote:
         | > if the intent was truly to try and max yield then there
         | should be for Ryzen for example good 7 core versions with only
         | 1 core that was found to be defective. Since no 7 core zens
         | exist
         | 
         | There are Zen processors that use 7 cores per CCD, e.g. Epyc
         | 7663, 7453, 9634.
         | 
         | The difference between Ryzen and Epyc is the I/O die. The CCDs
         | are the same so that's presumably where they go.
         | 
         | Another reason you might not see this on the consumer chips is
         | that they have higher base clocks. If you have a CCD where one
         | core is bad and another isn't exactly bad but can't hit the
         | same frequencies as the other six, it doesn't take a lot of
         | difference before it makes more sense to turn off the slowest
         | than lower the base clock for the whole processor. 6 x 4.7GHz
         | is faster than 7 x 4.0GHz, much less 7 x 2.5GHz.
         | 
         | In theory you could let that one core run at a significantly
         | lower speed than the others, but there is _a lot_ of naive
         | software that will misbehave in that context. Whereas the base
         | clock for the Epyc 9634 is 2.25GHz, because it has twelve
          | 7-core CCDs so it's nearly 300W, and doesn't want to be nearly
         | 1300W regardless of whether or not most of the cores could do
         | >4GHz.
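          | 
          | A rough sketch of that aggregate-throughput arithmetic (the
          | per-core clocks are the hypothetical figures from this
          | comment, not official specs):
          | 
          |     # "core-GHz" if per-core throughput scaled linearly with clock
          |     def aggregate_ghz(cores, clock_ghz):
          |         return cores * clock_ghz
          | 
          |     print(aggregate_ghz(6, 4.7))  # ~28.2
          |     print(aggregate_ghz(7, 4.0))  # 28.0 -- barely ahead of 6 cores
          |     print(aggregate_ghz(7, 2.5))  # 17.5 -- clearly behind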
        
           | downvotetruth wrote:
            | To correct the example for the Epyc line, models appear
            | to exist with 1 through 8 cores per CCD enabled, except
            | for 5.
        
             | AnthonyMouse wrote:
             | The Epyc models with lower core counts per CCD probably
             | don't exist because of yields though. The 73F3 has two
             | cores per CCD, so with eight CCDs it only has 16 cores. The
             | 7303 also has 16 cores but two CCDs, so all eight cores per
             | CCD are active. The 73F3 costs more than five times as
             | much. That's weird if the 73F3 is the dumping ground for
             | broken dice. Not so weird when you consider that it has
             | four times as much L3 cache and higher clock speeds.
             | 
             | The extra cores in the 73F3 aren't necessarily bad, they're
             | disabled so the others can have their L3 cache and so they
             | can pick the two cores from each CCD that hit the highest
             | clock speeds. Doing that is expensive, especially if the
              | other cores _aren't_ all bad, but then you get better
             | performance per core. Which some people will pay a premium
             | for, so they offer models like that even if yields are good
             | and there aren't that many CCDs with that many bad cores.
             | 
             | At which point your premise is invalid because processors
             | are being sold with cores disabled for performance reasons
             | rather than yield reasons.
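              | 
              | A hedged sketch of the two 16-core configurations being
              | compared (the 32 MB of L3 per CCD is the usual Zen 3
              | figure, assumed here rather than taken from a spec
              | sheet):
              | 
              |     def epyc_config(ccds, cores_per_ccd, l3_per_ccd_mb=32):
              |         return ccds * cores_per_ccd, ccds * l3_per_ccd_mb
              | 
              |     print(epyc_config(8, 2))  # 73F3: (16 cores, 256 MB L3)
              |     print(epyc_config(2, 8))  # 7303: (16 cores, 64 MB L3)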
        
               | downvotetruth wrote:
               | > they're disabled so the others can have their L3 cache
               | and so they can pick the two cores from each CCD that hit
               | the highest clock speeds
               | 
                | Where does that follow from? One can take a CCD with
                | 2+ cores, pin a process to a set of the fastest cores
                | based on profiling them, and those 2+ cores could use
                | the L3 cache as needed. Disabling cores at the
                | hardware level is the waste: if they were not
                | disabled, other processes would be able to benefit
                | from more than 2 cores when desired. The latter
                | point, disabling cores for "better [frequency]
                | performance per core which some people will pay a
                | premium for," is dubious, especially for the Epyc
                | server line. If that were true, then there should at
                | least be 4-core or fewer SKUs for the desktop Ryzen
                | variants, where apps like games are more likely to
                | benefit from the higher clock.
        
         | tverbeure wrote:
          | That first sentence is a spectacular non sequitur.
        
         | MobiusHorizons wrote:
          | I would guess that there is a desire not to create too many
          | product tiers. I believe 6-core parts are made from two
          | 3-core CCXs (rather than 4 and 2), so only one core is
          | disabled per CCX.
        
           | cinnamonteal wrote:
           | Current Ryzen and EPYC processors have 8 core CCXs. The 6
           | core parts used to be as you described, but are now a single
           | CCX. The Zen C dies have two CCXs, but they are still 8 core
           | CCXs, and are always symmetrical in core count.
           | 
           | The big exception is that the new Zen 5 Strix Point chip has
           | a 4 core CCX for the non-C cores. I think the Zen 4 based Z1
           | has a similar setup but don't remember and couldn't quickly
           | find the actual information to confirm.
        
             | wtallis wrote:
             | The Ryzen Z1 was a weird one: two Zen4 cores plus four
             | Zen4c cores all in one cluster, sharing the same 16MB L3
             | cache.
        
           | Symmetry wrote:
            | It would be sort of cool if they could do direct-to-
            | consumer sales with every core running at whatever its
            | maximum speed is, or turned off if too disrupted. But
            | that's not something you could do through existing
            | distribution channels; everyone presumes a fairly limited
            | number of SKUs.
        
       | fulafel wrote:
       | The answer to the leading question "What's the difference between
       | a Zen core, a CUDA core, and a Tensor core?" is not covered in
       | Part 1, so you may want to wait if this interests you more than
       | chip layouts.
        
         | raphlinus wrote:
         | Here's my quick take.
         | 
         | A top of the line Zen core is a powerful CPU with wide SIMD
         | (AVX-512 is 16 lanes of 32 bit quantities), significant
         | superscalar parallelism (capable of issuing approximately 4
         | SIMD operations per clock), and a high clock rate (over 5GHz).
         | There isn't a lot of confusion about what constitutes a "core,"
         | though multithreading can inflate the "thread" count. See [1]
         | for a detailed analysis of the Zen 5 line.
         | 
         | A single Granite Ridge core has peak 32 bit multiply-add
         | performance of about 730 GFLOPS.
         | 
         | Nvidia, by contrast, uses the marketing term "core" to refer to
         | a single SIMD lane. Their GPUs are organized as 32 SIMD lanes
         | grouped into each "warp," and 4 warps grouped into a Streaming
         | Multiprocessor (SM). CPU and GPU architectures can't be
         | directly compared, but just going by peak floating point
         | performance, the most comparable granularity to a CPU core is
         | the SM. A warp is in some ways more powerful than a CPU core
         | (generally wider SIMD, larger register file, more local SRAM,
         | better latency hiding) but in other ways less (much less
         | superscalar parallelism, lower clock, around 2.5GHz). A 4090
         | has 128 SMs, which is a lot and goes a long way to explaining
         | why a GPU has so much throughput. A 1080, by contrast, has 20
         | SMs - still a goodly number but not mind-meltingly bigger than
         | a high end CPU. See the Nvidia Ada whitepaper [2] for an
         | extremely detailed breakdown of 4090 specs (among other
         | things).
         | 
         | A single Nvidia 4090 "core" has peak 32 bit multiply-add
         | performance of about 5 GFLOPS, while an SM has 640 GFLOPS.
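          | 
          | A back-of-the-envelope sketch of those numbers, assuming an
          | FMA counts as 2 flops and a ~2.5GHz boost clock (an
          | approximation, not an official spec):
          | 
          |     CLOCK_GHZ = 2.5
          |     FLOPS_PER_FMA = 2
          | 
          |     per_lane_gflops = FLOPS_PER_FMA * CLOCK_GHZ   # ~5 per "CUDA core"
          |     per_sm_gflops = 128 * per_lane_gflops         # 128 lanes/SM -> ~640
          |     per_4090_tflops = 128 * per_sm_gflops / 1000  # 128 SMs -> ~82
          |     print(per_lane_gflops, per_sm_gflops, per_4090_tflops)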
         | 
         | I don't know anybody who counts tensor cores by core count, as
         | the capacity of a "core" varies pretty widely by generation.
         | It's almost certainly best just to compare TFLOPS - also a bit
         | of a slippery concept, as that depends on the precision and
         | also whether the application can make use of the sparsity
         | feature.
         | 
         | I'll also note that not all GPU vendors follow Nvidia's lead in
         | counting individual SIMD lanes as "cores." Apple Silicon, by
         | contrast, uses "core" to refer to a grouping of 128 SIMD lanes,
         | similar to an Nvidia SM. A top of the line M2 Ultra contains 76
         | such cores, for 9728 SIMD lanes. I found Philip Turner's Metal
         | benchmarks [3] useful for understanding the quantitative
         | similarities and differences between Apple, AMD, and Nvidia
         | GPUs.
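          | 
          | The same sort of sketch for the M2 Ultra figures (the
          | ~1.4GHz GPU clock is an assumption for illustration):
          | 
          |     lanes = 76 * 128                      # 9728 SIMD lanes
          |     peak_tflops = lanes * 2 * 1.4 / 1000  # ~27 FP32 TFLOPS
          |     print(lanes, round(peak_tflops, 1))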
         | 
         | [1]:
         | http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo...
         | 
         | [2]: https://images.nvidia.com/aem-dam/Solutions/Data-
         | Center/l4/n...
         | 
         | [3]: https://github.com/philipturner/metal-benchmarks
        
           | JonChesterfield wrote:
            | An x64 core roughly corresponding to an SM, or in the
            | amdgpu world a compute unit (CU), seems right. It's in the
            | same ballpark for power consumption and represents the
            | component handling an instruction pointer, a local
            | register file, and so forth.
            | 
            | A really big CPU is a couple of hundred cores; a big GPU
            | is a few hundred SMs / CUs. Some low-power chips are 8 x64
            | cores and 8 CUs on the same package. It all roughly lines
            | up.
        
             | openrisk wrote:
             | If SIMD lanes come to vastly dominate the composition of a
             | typical computer chip (in terms, e.g., of where power is
             | consumed) will the distinction between CPU/GPU continue to
             | be meaningful?
             | 
             | For decades the GPU was "special purpose" hardware
             | dedicated to the math of screen graphics. If the type of
             | larger-scale numerical computation that has been
                | popularised with LLMs is now deemed "typical use", then
             | the distinction may be becoming irrelevant (and even
             | counterproductive from a software development perspective).
        
               | JonChesterfield wrote:
                | The x64 cores putting more hardware into the vector
                | units, and amdgpu changing from 64-wide to 32-wide
                | SIMD (at least for some chips), look like convergent
                | evolution to me. My personal belief is that the
                | speculation-and-pipelining approach is worse than
                | having many tasks and swapping between them.
               | 
               | I think the APU designs from AMD are the transition
               | pointing to the future. The GPU cores will gain
               | increasing access to the raw hardware and the user
               | interface until the CPU cores are optional and ultimately
               | discarded.
        
               | adrian_b wrote:
               | There is little relationship between the reasons that
               | determine the width of SIMD in CPUs and GPUs, so there is
               | no convergence between them.
               | 
               | In the Intel/AMD CPUs, the 512-bit width, i.e. 64 bytes
               | or 16 FP32 numbers, matches the width of the cache line
               | and the width of a DRAM burst transfer, which simplifies
               | the writing of optimized programs. This SIMD width also
               | provides a good ratio between the power consumed in the
               | execution units and the power wasted in the control part
               | of the CPU (around 80% of the total power consumption
               | goes to the execution units, which is much more than when
               | using narrower SIMD instructions).
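                | 
                | Just restating that width arithmetic as a small
                | sketch:
                | 
                |     simd_bits = 512
                |     vector_bytes = simd_bits // 8   # 64 bytes = one x86 cache line
                |     fp32_lanes = vector_bytes // 4  # 16 FP32 numbers per operation
                |     print(vector_bytes, fp32_lanes)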
               | 
               | Increasing the SIMD width more than that in CPUs would
               | complicate the interaction with the cache memories and
               | with the main memory, while providing only a negligible
               | improvement in the energy efficiency, so there is no
               | reason to do this. At least in the following decade it is
               | very unlikely that any CPU would increase the SIMD width
               | beyond 16 FP32 numbers per operation.
               | 
               | On the other hand, the AMD GPUs before RDNA had a SIMD
               | width of 64 FP32 numbers, but the operations were
               | pipelined and executed in 4 clock cycles, so only 16 FP32
               | numbers were processed per clock cycle.
               | 
               | RDNA has doubled the width of the SIMD execution,
               | processing 32 FP32 numbers per clock cycle. For this,
               | SIMD instructions with a reduced width of 32 FP32 have
               | been introduced, but they are executed in one clock cycle
               | versus the old 64 FP32 instructions that were executed in
               | four clock cycles. For backwards compatibility, RDNA has
               | kept 64 FP32 instructions, which are executed in two
               | clock cycles, but these were not recommended for new
               | programs.
               | 
                | RDNA 3 has changed all this again, because now the 64
                | FP32 instructions can sometimes be executed in a
                | single clock cycle, so they may again be preferable
                | to the 32 FP32 instructions. However, it is possible
                | to take advantage of the increased width of the RDNA
                | 3 SIMD execution units even when using 32 FP32
                | instructions, if certain new instructions that encode
                | dual operations are used.
               | 
               | So the AMD GPUs have continuously evolved towards wider
               | SIMD execution units, from 16 FP32 before RDNA, to 32
               | FP32 in RDNA and finally to 64 FP32 in RDNA 3.
               | 
               | The distance from CPUs has been steadily increasing,
               | there is no convergence.
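                | 
                | A sketch of the per-SIMD-unit FP32 throughput
                | evolution described above (lanes processed per clock
                | = wavefront width / cycles per instruction):
                | 
                |     def fp32_per_clock(wave_width, cycles):
                |         return wave_width // cycles
                | 
                |     print(fp32_per_clock(64, 4))  # pre-RDNA: wave64 over 4 cycles -> 16
                |     print(fp32_per_clock(32, 1))  # RDNA: wave32 in 1 cycle -> 32
                |     print(fp32_per_clock(64, 1))  # RDNA 3: wave64 in 1 cycle -> 64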
        
               | Symmetry wrote:
                | There are still a lot of differences, even if you put
                | a lot more SIMD lanes into the CPU. CPUs keep their
                | execution resources fed by aggressive caching,
                | prefetching, and out-of-order execution, while GPUs
                | rely on having lots of threads around so that if one
                | stalls, another is able to execute.
        
               | bee_rider wrote:
               | Xeon Phi was that, a bunch of little cores with a ton of
               | SIMD lanes each.
               | 
               | It didn't really work out, in part because it was too far
               | from a regular old Xeon to run your normal code well
               | without optimizing. On the other side, Intel couldn't
               | keep up with NVIDIA on the sort of metrics people care
               | about for these compute accelerators: memory bandwidth
               | mostly. If you are going to have to refactor your whole
               | project anyway to use a compute accelerator, you probably
                | want a pretty big reward. It isn't obvious (to me at
                | least) whether this was because the Phi cores, simple
                | as they were, were still a lot more complex than a
                | GPU "core" - maybe the design just had various hidden
                | bottlenecks that were too hard to work out due to
                | that complexity - or because Intel just wasn't
                | executing very well at the time, especially compared
                | to NVIDIA (it was Intel's dark age vs NVIDIA's golden
                | age, really). The programmer's "logical or" joke is a
                | possible answer here.
               | 
               | But, you can't do everything in parallel. It is a shame
               | the Phi didn't survive into the age where Intel is also
               | doing big/little cores (in a single chip). A big Xeon
               | core (for latency) surrounded by a bunch of little Phi
               | cores (for throughput) could have been a really
               | interesting device.
        
               | dahart wrote:
               | The special purpose graphics distinction is already
               | mostly irrelevant and has been for 10 or 20 years for
               | anyone doing High Performance Computing (HPC) or AI. It
               | predates LLMs. For a while we had the acronym GPGPU -
               | General Purpose computing on Graphics Processing Units
               | [1]. But even that is now an anachronism, it started
               | dying in 2007 when CUDA was released. With CUDA and
               | OpenCL and compute shaders all being standard, it is now
               | widely understood that today's GPUs are used for general
               | purpose compute and might not do any graphics. The bulk
               | of chip area is general purpose and has been for some
               | time. From a software development perspective GPU is just
               | a legacy name but is not causing productivity problems or
               | confusion.
               | 
               | To be fair, yes most GPUs still do come with things like
               | texture units, video transcode units, ray tracing cores,
               | and a framebuffer and video output. But that's already
               | changing and you have, for example, some GPUs with ray
               | tracing, and some without that are more designed for data
               | centers. And you don't have to use the graphics
               | functionality; for GPU supercomputers it's common for the
               | majority of GPU nodes to be compute-only.
               | 
                | In the meantime we now have CPUs with embedded GPUs (aka
               | iGPUs), GPUs with embedded CPUs, GPUs that come paired
               | with CPUs and a wide interconnect (like Nvidia Grace
               | Hopper), CPU-GPU chips (like Apple M1), and yes CPUs in
               | general have more and more SIMD.
               | 
               | It's useful to have a name or a way to distinguish
               | between a processor that mostly uses a single threaded
               | SISD programming model and has a small handful of
               | hardware threads, versus a processor that uses a
               | SIMD/SIMT model and has tens of thousands of threads.
               | That might be mainly a question of workloads and
               | algorithms, but the old line between CPU and GPU is very
               | blurry, headed towards extinction, and the "graphics"
               | part has already lost meaning.
               | 
               | [1] https://en.wikipedia.org/wiki/General-
               | purpose_computing_on_g...
        
               | adrian_b wrote:
               | The display controller, which handles the frame buffers
               | and the video outputs, and the video decoding/encoding
               | unit are two blocks that are usually well separated from
               | the remainder of the GPU.
               | 
               | In many systems-on-a-chip, the 3 blocks, GPU in the
               | strict sense, video decoder/encoder and display
               | controller may even be licensed from different IP vendors
               | and then combined in a single chip. Also in the CPUs with
                | integrated GPU, like Intel Lunar Lake and AMD Strix Point,
               | these 3 blocks can be found in well separated locations
               | on the silicon die.
               | 
                | Of the graphics-specific functions, the ones that do
                | belong in the GPU proper are the ray-tracing units,
                | the texture units and the rasterization units,
                | because these perform operations that are mixed with
                | the general-purpose computations done by shaders.
        
           | Remnant44 wrote:
           | Hi Raph, first of all thank you for all of your contributions
           | and writings - I've learned a ton from reading your blog!
           | 
           | A minor quibble amidst your good comparison above ;)
           | 
            | For a Zen 5 core, we have 16-wide SIMD with 4 pipes; 2
            | are FMA (2 flops each) and 2 are FADD, at ~5GHz. I math
            | that out to 16 * 6 * 5 = 480 GFLOP/core... am I missing
            | something?
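            | 
            | That arithmetic as a quick sketch (per core, FP32, clock
            | assumed ~5GHz as above):
            | 
            |     lanes = 16                       # 512-bit SIMD / 32-bit floats
            |     flops_per_lane = 2 * 2 + 2 * 1   # 2 FMA pipes + 2 FADD pipes = 6
            |     clock_ghz = 5
            |     print(lanes * flops_per_lane * clock_ghz)  # 480 GFLOP/s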
        
             | raphlinus wrote:
             | Thanks for the kind words and the clarification. I'm sure
             | you're right; I was just multiplying things together
             | without taking into account the different capabilities of
             | the different execution units. Hopefully that doesn't
             | invalidate the major points I was making.
        
             | adrian_b wrote:
              | According to the initial reviews, it appears that when
              | 512-bit instructions are executed at the maximum rate,
              | the power consumption increases enough that the clock
              | frequency drops to around 4 GHz on a 9950X.
             | 
             | So a 9950X can do 256 FMA + 256 FADD for FP64 or 512 FMA +
             | 512 FADD for FP32, per clock cycle.
             | 
             | Using FP32, because it can be compared with the GPUs, there
             | are 1536 Flop per clock cycle, therefore about 6 FP32
             | Tflop/s @ 4 GHz for a 9950X (around 375 FP32 Gflop/s per
             | core, but this number is irrelevant, because a single
             | active core would go to a much higher clock frequency,
             | probably over 5 GHz). For an application that uses only
             | FMA, like matrix multiplication, the throughput would drop
             | to around 4 FP32 Tflop/s or 2 FP64 Tflop/s.
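              | 
              | A sketch of that peak-throughput arithmetic (16 cores,
              | 2 FMA + 2 FADD 512-bit pipes per core, ~4 GHz all-core,
              | as estimated above):
              | 
              |     cores, clock_ghz = 16, 4
              |     fp32_lanes = 512 // 32                  # 16 lanes per 512-bit op
              |     fma = cores * 2 * fp32_lanes * 2        # FMA counts as 2 flops
              |     fadd = cores * 2 * fp32_lanes * 1
              |     print((fma + fadd) * clock_ghz / 1000)  # ~6.1 FP32 Tflop/s
              |     print(fma * clock_ghz / 1000)           # ~4.1 Tflop/s, FMA only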
             | 
             | The values for the FP32 throughput are similar to those of
             | the best integrated GPUs that exist at this time. Therefore
             | doing graphics rendering on the CPU on a 9950X might be
             | similarly fast to doing graphics rendering on the iGPU on
              | the best mobile CPUs. Graphics rendering on a 9950X can
              | still leverage the graphics- and video-specific blocks
              | contained in the anemic GPU included in the 9950X; its
              | only problem is that it has a very small number of
              | compute shaders, but their functions can be augmented
              | by the strong CPU.
        
           | tjoff wrote:
           | For those of us not fluent in codenames:
           | 
           | Granite Ridge core = Zen 5 core.
        
           | bee_rider wrote:
           | > It's almost certainly best just to compare TFLOPS - also a
           | bit of a slippery concept, as that depends on the precision
           | 
           | Agreed. Some quibbles about the slipperiness of the concept.
           | 
           | flops are floating point operations. IMO it should not be
           | confusing at all, just count single precision floating point
           | operations, which all devices can do, and which are
           | explicitly defined in the IEEE standard.
           | 
           | Half precision flops are interesting but should be called out
           | for the non-standard metric they are. Anyone using half
           | precision flops as a flop is either being intentionally
           | misleading or is confused about user expectations.
           | 
           | On the other side, lots of scientific computing folks would
           | rather have doubles, but IMO we should get with the times and
           | learn to deal with less precision. It is fun, you get to make
           | some trade-offs and you can see if your algorithms are really
           | as robust as you expected. A free 2x speed up even on CPUs is
           | pretty nice.
           | 
           | > and also whether the application can make use of the
           | sparsity feature
           | 
           | Eh, I don't like it. Flops are flops. Avoiding a computation
            | by exploiting sparsity is not a flop. If we want to take credit
           | for flops not executed via sparsity, there's a whole
           | ecosystem of mostly-CPU "sparse matrix" codes to consider. Of
           | course, GPUs have this nice 50% sparse feature, but nobody
           | wants to compete against PARDISO or iterative solvers for
           | _really_ sparse problems, right? Haha.
        
             | leogao wrote:
             | In domains like ML, people care way more about the half
             | precision FLOPs than single precision.
        
               | bee_rider wrote:
               | They don't have much application outside ML, at least as
               | far as I know. Just call them ML ops, and then they can
               | include things like those funky shared exponent floating
                | point formats, and/or stuff with ints.
               | 
               | Or they could be measured in bits per second.
               | 
               | Actually I'm pretty interested in figuring out if we can
               | use them for numerical linear algebra stuff, but I think
               | it'd take some doing.
        
           | dundarious wrote:
           | > It's almost certainly best just to compare TFLOPS
           | 
           | Depends on what you're comparing with what, and the context,
           | of course.
           | 
           | Casey is doing education, so that people learn how best to
           | program these devices. A mere comparison of TFLOPS of CPU vs
           | GPU would be useless towards those ends. Similarly, just a
           | bare comparison of TFLOPS between different GPUs even of the
           | same generation would mask architectural differences in how
           | to in practice achieve those theoretical TFLOPS upper bounds.
           | 
            | I think Casey believes most people _don't_ know how to
           | program well for these devices/architectures. In that
           | context, I think it's appropriate to be almost dismissive of
           | TFLOPS comparison talk.
        
       | diabllicseagull wrote:
       | It was a good read. I wonder what hot takes he'll have in the
       | second part if any.
        
       | kvemkon wrote:
       | > Each of the tiles on the CPU side is actually a Zen 4 core,
       | complete with its dedicated L2 cache.
       | 
        | Perhaps it could be more interesting to compare without the
        | L2 cache.
        
         | Symmetry wrote:
         | Or maybe a CUDA core versus one of Zen's SIMD ports.
        
         | adrian_b wrote:
         | The L2 really belongs to the core, a comparison without it does
         | not make much sense.
         | 
         | The GPU cores (in the classic sense, i.e. not what NVIDIA names
         | as "cores") also include cache memories and also local memories
         | that are directly addressable.
         | 
         | The only confusion is caused by the fact that first NVIDIA, and
         | then ATI/AMD too, have started to use an obfuscated terminology
         | where they have replaced a large number of terms that had been
         | used for decades in the computing literature with other terms.
         | 
         | For maximum confusion, many terms that previously had clear
         | meanings, like "thread" or "core", have been reused with new
         | meanings and ATI/AMD has invented a set of terms corresponding
         | to those used by NVIDIA but with completely different word
         | choices.
         | 
          | I hate the employees of NVIDIA and ATI/AMD who thought that
          | it was a good idea to replace all the traditional terms
          | without having any reason for this.
         | 
         | The traditional meaning of a thread is that for each thread
         | there exists a distinct program counter a.k.a. instruction
         | pointer, which is used to fetch and execute instructions from a
         | program stored in the memory.
         | 
         | The traditional meaning of a core is that it is a block that is
         | equivalent with a traditional independent processor, i.e.
         | equivalent with a complete computer minus the main memory and
         | the peripherals.
         | 
         | A core may have only one program counter, when it can execute a
         | single thread at a time, or it may have multiple program
         | counters (with associated register sets) when it can execute
         | multiple threads, using either FGMT (fine-grained
         | multithreading) or SMT (simultaneous multithreading).
         | 
         | The traditional terms were very clear and they have direct
         | correspondents in GPUs, but NVIDIA and AMD use other words for
         | those instead of "thread" and "core" and they reuse the words
         | "thread" and "core" for very different things, for maximum
         | obfuscation. For instance, NVIDIA uses "warp" instead of
         | "thread", while AMD uses "wavefront" instead of "thread".
         | NVIDIA uses "thread" to designate what was traditionally named
         | the body of a "parallel for" a.k.a. "parallel do" program
         | structure (which when executed on a GPU or multi-core CPU is
         | unrolled and distributed over cores, threads and SIMD lanes).
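          | 
          | As a toy illustration of that last point: the loop body
          | below is what NVIDIA calls a "thread", while a group of 32
          | such iterations executed in lockstep (a "warp") is closer
          | to a traditional thread running on one SIMD core.
          | 
          |     def saxpy(a, x, y):
          |         out = [0.0] * len(x)
          |         for i in range(len(x)):       # unrolled over SIMD lanes
          |             out[i] = a * x[i] + y[i]  # one NVIDIA "thread"
          |         return out
          | 
          |     print(saxpy(2.0, [1.0, 2.0], [3.0, 4.0]))  # [5.0, 8.0]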
        
       ___________________________________________________________________
       (page generated 2024-09-07 23:00 UTC)