[HN Gopher] Zen, CUDA, and Tensor Cores, Part I: The Silicon
___________________________________________________________________
Zen, CUDA, and Tensor Cores, Part I: The Silicon
Author : throwaway71271
Score : 146 points
Date : 2024-09-03 19:34 UTC (4 days ago)
(HTM) web link (www.computerenhance.com)
(TXT) w3m dump (www.computerenhance.com)
| downvotetruth wrote:
| I refused to buy these supposedly defective chips even when they
| represented better value, because if the intent were truly to
| maximize yield then there should be, for Ryzen for example, good
| 7-core versions with only 1 core found to be defective. Since no
| 7-core Zens exist, at least some of the CPUs with 6-core CCDs
| must have had 1 of the cores intentionally destroyed for reasons
| unknown, which could be to meet volume targets. If the reason is
| that Ryzen cores can only be disabled in pairs, then it boggles
| my mind that, given the price difference of tens to hundreds of
| dollars between the 6- and 8-core versions, it would not be
| economic to add the circuits that allow each core to be
| individually fused off and enable further product
| differentiation, especially considering how much effort and how
| many SKUs have gone into frequency binning on AM4 (5700X, 5800,
| 5800X, 5800XT, etc.) rather than bigger market segmentation
| jumps.
| AnthonyMouse wrote:
| > if the intent was truly to try and max yield then there
| should be for Ryzen for example good 7 core versions with only
| 1 core that was found to be defective. Since no 7 core zens
| exist
|
| There are Zen processors that use 7 cores per CCD, e.g. Epyc
| 7663, 7453, 9634.
|
| The difference between Ryzen and Epyc is the I/O die. The CCDs
| are the same so that's presumably where they go.
|
| Another reason you might not see this on the consumer chips is
| that they have higher base clocks. If you have a CCD where one
| core is bad and another isn't exactly bad but can't hit the
| same frequencies as the other six, it doesn't take a lot of
| difference before it makes more sense to turn off the slowest
| than lower the base clock for the whole processor. 6 x 4.7GHz
| is faster than 7 x 4.0GHz, much less 7 x 2.5GHz.
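|
| A toy version of that arithmetic (using the illustrative clock
| figures above, not measured specs):
|
|     # Aggregate throughput proxy: active cores x base clock (GHz).
|     for cores, ghz in [(6, 4.7), (7, 4.0), (7, 2.5)]:
|         print(f"{cores} x {ghz} GHz = {cores * ghz:.1f} core-GHz")
|     # 6 x 4.7 = 28.2 edges out 7 x 4.0 = 28.0 and dwarfs 7 x 2.5 = 17.5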
|
| In theory you could let that one core run at a significantly
| lower speed than the others, but there is _a lot_ of naive
| software that will misbehave in that context. Whereas the base
| clock for the Epyc 9634 is 2.25GHz, because it has twelve
| 7-core CCDs so it's nearly 300W, and doesn't want to be nearly
| 1300W regardless of whether or not most of the cores could do
| >4GHz.
| downvotetruth wrote:
| To correct the example for the Epyc line, models appear to
| exist with 1 through 8 cores per CCD available, except for 5.
| AnthonyMouse wrote:
| The Epyc models with lower core counts per CCD probably
| don't exist because of yields though. The 73F3 has two
| cores per CCD, so with eight CCDs it only has 16 cores. The
| 7303 also has 16 cores but two CCDs, so all eight cores per
| CCD are active. The 73F3 costs more than five times as
| much. That's weird if the 73F3 is the dumping ground for
| broken dice. Not so weird when you consider that it has
| four times as much L3 cache and higher clock speeds.
|
| The extra cores in the 73F3 aren't necessarily bad, they're
| disabled so the others can have their L3 cache and so they
| can pick the two cores from each CCD that hit the highest
| clock speeds. Doing that is expensive, especially if the
| other cores _aren't_ all bad, but then you get better
| performance per core. Which some people will pay a premium
| for, so they offer models like that even if yields are good
| and there aren't that many CCDs with that many bad cores.
|
| At which point your premise is invalid because processors
| are being sold with cores disabled for performance reasons
| rather than yield reasons.
| downvotetruth wrote:
| > they're disabled so the others can have their L3 cache
| and so they can pick the two cores from each CCD that hit
| the highest clock speeds
|
| What does that follow from? One can take a CCD with 2+ good
| cores and pin a process to a set of the fastest cores based
| on profiling (a sketch of this is below), and those cores
| could use the L3 cache as needed. Disabling cores at the
| hardware level is the waste: if they were not disabled, other
| processes could benefit from more than 2 cores when desired.
| The latter point, disabling cores for "better [frequency]
| performance per core, which some people will pay a premium
| for," is dubious, especially for the Epyc server line. If
| that were true, there should at least be 4-core or fewer SKUs
| for the desktop Ryzen variants, where apps like games are
| more likely to benefit from the higher clock.
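|
| A minimal sketch of that software-level alternative (Linux
| only; the core IDs are hypothetical, as if chosen after
| profiling):
|
|     import os
|
|     # Suppose profiling showed cores 2 and 5 are the fastest
|     # in the CCD.
|     fastest_cores = {2, 5}
|
|     # Pin the current process (pid 0) to just those cores. The
|     # others stay usable by other processes, unlike cores that
|     # were fused off in hardware.
|     os.sched_setaffinity(0, fastest_cores)
|     print(os.sched_getaffinity(0))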
| tverbeure wrote:
| That first sentence is a spectacular non sequitur.
| MobiusHorizons wrote:
| I would guess that there is a desire not to create too many
| product tiers. I believe 6-core parts are made from two
| 3-core CCXs (rather than 4 and 2), so only one core is
| disabled per CCX.
| cinnamonteal wrote:
| Current Ryzen and EPYC processors have 8 core CCXs. The 6
| core parts used to be as you described, but are now a single
| CCX. The Zen C dies have two CCXs, but they are still 8 core
| CCXs, and are always symmetrical in core count.
|
| The big exception is that the new Zen 5 Strix Point chip has
| a 4 core CCX for the non-C cores. I think the Zen 4 based Z1
| has a similar setup but don't remember and couldn't quickly
| find the actual information to confirm.
| wtallis wrote:
| The Ryzen Z1 was a weird one: two Zen4 cores plus four
| Zen4c cores all in one cluster, sharing the same 16MB L3
| cache.
| Symmetry wrote:
| It would be sort of cool if they could do direct-to-consumer
| sales with every core running at whatever its maximum speed
| is, or turned off if too disrupted. But that's not something
| you could do through existing distribution channels; everyone
| presumes a fairly limited number of SKUs.
| fulafel wrote:
| The answer to the leading question "What's the difference between
| a Zen core, a CUDA core, and a Tensor core?" is not covered in
| Part 1, so you may want to wait if this interests you more than
| chip layouts.
| raphlinus wrote:
| Here's my quick take.
|
| A top of the line Zen core is a powerful CPU with wide SIMD
| (AVX-512 is 16 lanes of 32 bit quantities), significant
| superscalar parallelism (capable of issuing approximately 4
| SIMD operations per clock), and a high clock rate (over 5GHz).
| There isn't a lot of confusion about what constitutes a "core,"
| though multithreading can inflate the "thread" count. See [1]
| for a detailed analysis of the Zen 5 line.
|
| A single Granite Ridge core has peak 32 bit multiply-add
| performance of about 730 GFLOPS.
|
| Nvidia, by contrast, uses the marketing term "core" to refer to
| a single SIMD lane. Their GPUs are organized as 32 SIMD lanes
| grouped into each "warp," and 4 warps grouped into a Streaming
| Multiprocessor (SM). CPU and GPU architectures can't be
| directly compared, but just going by peak floating point
| performance, the most comparable granularity to a CPU core is
| the SM. A warp is in some ways more powerful than a CPU core
| (generally wider SIMD, larger register file, more local SRAM,
| better latency hiding) but in other ways less (much less
| superscalar parallelism, lower clock, around 2.5GHz). A 4090
| has 128 SMs, which is a lot and goes a long way to explaining
| why a GPU has so much throughput. A 1080, by contrast, has 20
| SMs - still a goodly number but not mind-meltingly bigger than
| a high end CPU. See the Nvidia Ada whitepaper [2] for an
| extremely detailed breakdown of 4090 specs (among other
| things).
|
| A single Nvidia 4090 "core" has peak 32 bit multiply-add
| performance of about 5 GFLOPS, while an SM has 640 GFLOPS.
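|
| A back-of-the-envelope sketch of where those numbers come from
| (assuming one FMA counts as 2 flops per lane per clock and the
| ~2.5 GHz clock mentioned above):
|
|     lanes_per_warp = 32
|     warps_per_sm = 4       # 128 FP32 lanes ("CUDA cores") per SM
|     sms = 128              # RTX 4090
|     clock_ghz = 2.5
|     flops_per_lane = 2     # fused multiply-add = 2 flops
|
|     per_core = flops_per_lane * clock_ghz               # ~5 GFLOPS
|     per_sm = lanes_per_warp * warps_per_sm * per_core   # ~640 GFLOPS
|     total_tflops = per_sm * sms / 1000                  # ~82 TFLOPS
|     print(per_core, per_sm, total_tflops)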
|
| I don't know anybody who counts tensor cores by core count, as
| the capacity of a "core" varies pretty widely by generation.
| It's almost certainly best just to compare TFLOPS - also a bit
| of a slippery concept, as that depends on the precision and
| also whether the application can make use of the sparsity
| feature.
|
| I'll also note that not all GPU vendors follow Nvidia's lead in
| counting individual SIMD lanes as "cores." Apple Silicon, by
| contrast, uses "core" to refer to a grouping of 128 SIMD lanes,
| similar to an Nvidia SM. A top of the line M2 Ultra contains 76
| such cores, for 9728 SIMD lanes. I found Philip Turner's Metal
| benchmarks [3] useful for understanding the quantitative
| similarities and differences between Apple, AMD, and Nvidia
| GPUs.
|
| [1]:
| http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo...
|
| [2]: https://images.nvidia.com/aem-dam/Solutions/Data-
| Center/l4/n...
|
| [3]: https://github.com/philipturner/metal-benchmarks
| JonChesterfield wrote:
| An x64 core roughly corresponding to an SM, or in the amdgpu
| world a compute unit (CU), seems right. It's in the same
| ballpark for power consumption, and represents the component
| handling an instruction pointer, a local register file and
| so forth.
|
| A really big CPU is a couple of hundred cores, a big GPU is a
| few hundred SM / CUs. Some low power chips are 8 x64 cores
| and 8 CUs on the same package. All roughly lines up.
| openrisk wrote:
| If SIMD lanes come to vastly dominate the composition of a
| typical computer chip (in terms, e.g., of where power is
| consumed) will the distinction between CPU/GPU continue to
| be meaningful?
|
| For decades the GPU was "special purpose" hardware
| dedicated to the math of screen graphics. If the type of
| larger-scale numerical computation that has been
| popularised with LLM's is now deemed "typical use", then
| the distinction may be becoming irrelevant (and even
| counterproductive from a software development perspective).
| JonChesterfield wrote:
| The x64 cores putting more hardware into the vector units
| and amdgpu changing from 64 wide to 32 wide simd (at
| least for some chips) looks like convergent evolution to
| me. My personal belief is that the speculation and
| pipelining approach is worse than the many tasks and
| swapping between them.
|
| I think the APU designs from AMD are the transition
| pointing to the future. The GPU cores will gain
| increasing access to the raw hardware and the user
| interface until the CPU cores are optional and ultimately
| discarded.
| adrian_b wrote:
| There is little relationship between the reasons that
| determine the width of SIMD in CPUs and GPUs, so there is
| no convergence between them.
|
| In the Intel/AMD CPUs, the 512-bit width, i.e. 64 bytes
| or 16 FP32 numbers, matches the width of the cache line
| and the width of a DRAM burst transfer, which simplifies
| the writing of optimized programs. This SIMD width also
| provides a good ratio between the power consumed in the
| execution units and the power wasted in the control part
| of the CPU (around 80% of the total power consumption
| goes to the execution units, which is much more than when
| using narrower SIMD instructions).
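|
| The width match in numbers (restating the figures above):
|
|     simd_bits = 512
|     print(simd_bits // 8)   # 64 bytes: one cache line / DRAM burst
|     print(simd_bits // 32)  # 16 FP32 lanes per AVX-512 operation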
|
| Increasing the SIMD width more than that in CPUs would
| complicate the interaction with the cache memories and
| with the main memory, while providing only a negligible
| improvement in the energy efficiency, so there is no
| reason to do this. At least in the following decade it is
| very unlikely that any CPU would increase the SIMD width
| beyond 16 FP32 numbers per operation.
|
| On the other hand, the AMD GPUs before RDNA had a SIMD
| width of 64 FP32 numbers, but the operations were
| pipelined and executed in 4 clock cycles, so only 16 FP32
| numbers were processed per clock cycle.
|
| RDNA has doubled the width of the SIMD execution,
| processing 32 FP32 numbers per clock cycle. For this,
| SIMD instructions with a reduced width of 32 FP32 have
| been introduced, but they are executed in one clock cycle
| versus the old 64 FP32 instructions that were executed in
| four clock cycles. For backwards compatibility, RDNA has
| kept 64 FP32 instructions, which are executed in two
| clock cycles, but these were not recommended for new
| programs.
|
| RDNA 3 has changed all this again, because now the 64 FP32
| instructions can sometimes be executed in a single clock
| cycle, so they may again be preferable to the 32 FP32
| instructions. However, it is possible to take advantage of
| the increased width of the RDNA 3 SIMD execution units even
| when using 32 FP32 instructions, if certain new instructions
| that encode dual operations are used.
|
| So the AMD GPUs have continuously evolved towards wider
| SIMD execution units, from 16 FP32 before RDNA, to 32
| FP32 in RDNA and finally to 64 FP32 in RDNA 3.
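|
| A compact restatement of that per-SIMD throughput (the RDNA 3
| single-cycle wave64 case assumes the dual-issue encodings):
|
|     lanes_per_clock = {
|         "GCN (pre-RDNA)": 64 / 4,  # wave64 pipelined over 4 clocks
|         "RDNA / RDNA 2":  32 / 1,  # wave32 in one clock
|         "RDNA 3":         64 / 1,  # wave64 / dual-issue wave32
|     }
|     for gen, lanes in lanes_per_clock.items():
|         print(f"{gen}: {lanes:.0f} FP32 per clock")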
|
| The distance from CPUs has been steadily increasing,
| there is no convergence.
| Symmetry wrote:
| There are still a lot of differences, even if you put in
| a lot more SIMD lanes to the CPU. CPUs keep their
| execution resources fed by by aggressive caching,
| prefetching, and out of order execution while GPUs rely
| on having lots of threads around so that if one stalls
| another is able to execute.
| bee_rider wrote:
| Xeon Phi was that, a bunch of little cores with a ton of
| SIMD lanes each.
|
| It didn't really work out, in part because it was too far
| from a regular old Xeon to run your normal code well
| without optimizing. On the other side, Intel couldn't
| keep up with NVIDIA on the sort of metrics people care
| about for these compute accelerators: memory bandwidth
| mostly. If you are going to have to refactor your whole
| project anyway to use a compute accelerator, you probably
| want a pretty big reward. It isn't obvious (to me at
| least) if this is the result of the fact that the Phi
| cores, simple as they were, were still a lot more complex
| than a GPU "core," maybe the design just had various
| hidden bottlenecks that were too hard to work out due to
| that complexity. Or if it is because Intel just wasn't
| executing very well at the time, especially compared to
| NVIDIA (it is Intel's dark age vs NVIDIA's golden age,
| really). The programmer's "logical or" joke is a possible
| answer here.
|
| But, you can't do everything in parallel. It is a shame
| the Phi didn't survive into the age where Intel is also
| doing big/little cores (in a single chip). A big Xeon
| core (for latency) surrounded by a bunch of little Phi
| cores (for throughput) could have been a really
| interesting device.
| dahart wrote:
| The special purpose graphics distinction is already
| mostly irrelevant and has been for 10 or 20 years for
| anyone doing High Performance Computing (HPC) or AI. It
| predates LLMs. For a while we had the acronym GPGPU -
| General Purpose computing on Graphics Processing Units
| [1]. But even that is now an anachronism, it started
| dying in 2007 when CUDA was released. With CUDA and
| OpenCL and compute shaders all being standard, it is now
| widely understood that today's GPUs are used for general
| purpose compute and might not do any graphics. The bulk
| of chip area is general purpose and has been for some
| time. From a software development perspective GPU is just
| a legacy name but is not causing productivity problems or
| confusion.
|
| To be fair, yes most GPUs still do come with things like
| texture units, video transcode units, ray tracing cores,
| and a framebuffer and video output. But that's already
| changing and you have, for example, some GPUs with ray
| tracing, and some without that are more designed for data
| centers. And you don't have to use the graphics
| functionality; for GPU supercomputers it's common for the
| majority of GPU nodes to be compute-only.
|
| In the mean time we now have CPUs with embedded GPUs (aka
| iGPUs), GPUs with embedded CPUs, GPUs that come paired
| with CPUs and a wide interconnect (like Nvidia Grace
| Hopper), CPU-GPU chips (like Apple M1), and yes CPUs in
| general have more and more SIMD.
|
| It's useful to have a name or a way to distinguish
| between a processor that mostly uses a single threaded
| SISD programming model and has a small handful of
| hardware threads, versus a processor that uses a
| SIMD/SIMT model and has tens of thousands of threads.
| That might be mainly a question of workloads and
| algorithms, but the old line between CPU and GPU is very
| blurry, headed towards extinction, and the "graphics"
| part has already lost meaning.
|
| [1] https://en.wikipedia.org/wiki/General-
| purpose_computing_on_g...
| adrian_b wrote:
| The display controller, which handles the frame buffers
| and the video outputs, and the video decoding/encoding
| unit are two blocks that are usually well separated from
| the remainder of the GPU.
|
| In many systems-on-a-chip, the 3 blocks, GPU in the
| strict sense, video decoder/encoder and display
| controller may even be licensed from different IP vendors
| and then combined in a single chip. Also in the CPUs with
| integrated GPU, like Intel Lunar Lake and AMD Strix Point,
| these 3 blocks can be found in well separated locations
| on the silicon die.
|
| The graphics-specific functions that do belong in the GPU
| proper, because they perform operations that are mixed with
| the general-purpose computations done by shaders, are the
| ray-tracing units, the texture units and the rasterization
| units.
| Remnant44 wrote:
| Hi Raph, first of all thank you for all of your contributions
| and writings - I've learned a ton from reading your blog!
|
| A minor quibble amidst your good comparison above ;)
|
| For a Zen 5 core, we have 16-wide SIMD with 4 pipes; 2 are FMA
| (2 flops each) and 2 are FADD, at ~5 GHz. I math that out to
| 16 * 6 * 5 = 480 GFLOPS/core... am I missing something?
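|
| Spelled out with the same assumptions (2 FMA pipes at 2 flops
| per lane plus 2 FADD pipes at 1 flop per lane, 16 FP32 lanes,
| ~5 GHz):
|
|     lanes = 16                       # 512 bits / 32-bit floats
|     flops_per_lane = 2 * 2 + 2 * 1   # 2 FMA pipes + 2 FADD pipes
|     clock_ghz = 5.0
|     print(lanes * flops_per_lane * clock_ghz)   # 480 GFLOPS/core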
| raphlinus wrote:
| Thanks for the kind words and the clarification. I'm sure
| you're right; I was just multiplying things together
| without taking into account the different capabilities of
| the different execution units. Hopefully that doesn't
| invalidate the major points I was making.
| adrian_b wrote:
| According to the initial reviews, it appears that when
| 512-bit instructions are executed at the maximum rate, this
| increases the power consumption enough so that the clock
| frequency drops to around 4 GHz for a 9950X.
|
| So a 9950X can do 256 FMA + 256 FADD for FP64 or 512 FMA +
| 512 FADD for FP32, per clock cycle.
|
| Using FP32, because it can be compared with the GPUs, there
| are 1536 Flop per clock cycle, therefore about 6 FP32
| Tflop/s @ 4 GHz for a 9950X (around 375 FP32 Gflop/s per
| core, but this number is irrelevant, because a single
| active core would go to a much higher clock frequency,
| probably over 5 GHz). For an application that uses only
| FMA, like matrix multiplication, the throughput would drop
| to around 4 FP32 Tflop/s or 2 FP64 Tflop/s.
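|
| The same arithmetic at the chip level (figures from this
| comment, not measurements):
|
|     cores = 16                     # 9950X
|     lanes = 16                     # FP32 lanes per 512-bit op
|     flops_all = 2 * 2 + 2 * 1      # 2 FMA + 2 FADD pipes per core
|     flops_fma = 2 * 2              # FMA-only workloads
|     clock_ghz = 4.0                # all-core AVX-512 clock above
|
|     print(cores * lanes * flops_all * clock_ghz / 1000)  # ~6.1 TFLOP/s FP32
|     print(cores * lanes * flops_fma * clock_ghz / 1000)  # ~4.1 TFLOP/s FP32
|     print(cores * lanes * flops_fma * clock_ghz / 2000)  # ~2.0 TFLOP/s FP64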
|
| The values for the FP32 throughput are similar to those of
| the best integrated GPUs that exist at this time. Therefore
| doing graphics rendering on the CPU on a 9950X might be
| similarly fast to doing graphics rendering on the iGPU on
| the best mobile CPUs. Doing graphics rendering on a 9950X
| can still leverage the graphics- and video-specific blocks
| contained in the anemic GPU included in the 9950X, whose
| only problem is that it has a very small number of compute
| shaders, but their functions can be augmented by the strong
| CPU.
| tjoff wrote:
| For those of us not fluent in codenames:
|
| Granite Ridge core = Zen 5 core.
| bee_rider wrote:
| > It's almost certainly best just to compare TFLOPS - also a
| bit of a slippery concept, as that depends on the precision
|
| Agreed. Some quibbles about the slipperiness of the concept.
|
| flops are floating point operations. IMO it should not be
| confusing at all, just count single precision floating point
| operations, which all devices can do, and which are
| explicitly defined in the IEEE standard.
|
| Half precision flops are interesting but should be called out
| for the non-standard metric they are. Anyone using half
| precision flops as a flop is either being intentionally
| misleading or is confused about user expectations.
|
| On the other side, lots of scientific computing folks would
| rather have doubles, but IMO we should get with the times and
| learn to deal with less precision. It is fun, you get to make
| some trade-offs and you can see if your algorithms are really
| as robust as you expected. A free 2x speed up even on CPUs is
| pretty nice.
|
| > and also whether the application can make use of the
| sparsity feature
|
| Eh, I don't like it. Flops are flops. Avoiding a computation
| exploiting sparsity is not a flop. If we want to take credit
| for flops not executed via sparsity, there's a whole
| ecosystem of mostly-CPU "sparse matrix" codes to consider. Of
| course, GPUs have this nice 50% sparse feature, but nobody
| wants to compete against PARDISO or iterative solvers for
| _really_ sparse problems, right? Haha.
| leogao wrote:
| In domains like ML, people care way more about the half
| precision FLOPs than single precision.
| bee_rider wrote:
| They don't have much application outside ML, at least as
| far as I know. Just call them ML ops, and then they can
| include things like those funky shared exponent floating
| point formats, and or stuff with ints.
|
| Or they could be measured in bits per second.
|
| Actually I'm pretty interested in figuring out if we can
| use them for numerical linear algebra stuff, but I think
| it'd take some doing.
| dundarious wrote:
| > It's almost certainly best just to compare TFLOPS
|
| Depends on what you're comparing with what, and the context,
| of course.
|
| Casey is doing education, so that people learn how best to
| program these devices. A mere comparison of TFLOPS of CPU vs
| GPU would be useless towards those ends. Similarly, just a
| bare comparison of TFLOPS between different GPUs even of the
| same generation would mask architectural differences in how
| to in practice achieve those theoretical TFLOPS upper bounds.
|
| I think Casey believes most people _don't_ know how to
| program well for these devices/architectures. In that
| context, I think it's appropriate to be almost dismissive of
| TFLOPS comparison talk.
| diabllicseagull wrote:
| It was a good read. I wonder what hot takes he'll have in the
| second part if any.
| kvemkon wrote:
| > Each of the tiles on the CPU side is actually a Zen 4 core,
| complete with its dedicated L2 cache.
|
| Perhaps it could be more interesting to compare without the
| L2 cache.
| Symmetry wrote:
| Or maybe a CUDA core versus one of Zen's SIMD ports.
| adrian_b wrote:
| The L2 really belongs to the core; a comparison without it
| does not make much sense.
|
| The GPU cores (in the classic sense, i.e. not what NVIDIA names
| as "cores") also include cache memories and also local memories
| that are directly addressable.
|
| The only confusion is caused by the fact that first NVIDIA, and
| then ATI/AMD too, have started to use an obfuscated terminology
| where they have replaced a large number of terms that had been
| used for decades in the computing literature with other terms.
|
| For maximum confusion, many terms that previously had clear
| meanings, like "thread" or "core", have been reused with new
| meanings and ATI/AMD has invented a set of terms corresponding
| to those used by NVIDIA but with completely different word
| choices.
|
| I hate the employees of NVIDIA and ATI/AMD who thought it was
| a good idea to replace all the traditional terms without any
| reason for doing so.
|
| The traditional meaning of a thread is that for each thread
| there exists a distinct program counter a.k.a. instruction
| pointer, which is used to fetch and execute instructions from a
| program stored in the memory.
|
| The traditional meaning of a core is that it is a block that is
| equivalent with a traditional independent processor, i.e.
| equivalent with a complete computer minus the main memory and
| the peripherals.
|
| A core may have only one program counter, when it can execute a
| single thread at a time, or it may have multiple program
| counters (with associated register sets) when it can execute
| multiple threads, using either FGMT (fine-grained
| multithreading) or SMT (simultaneous multithreading).
|
| The traditional terms were very clear and they have direct
| correspondents in GPUs, but NVIDIA and AMD use other words for
| those instead of "thread" and "core" and they reuse the words
| "thread" and "core" for very different things, for maximum
| obfuscation. For instance, NVIDIA uses "warp" instead of
| "thread", while AMD uses "wavefront" instead of "thread".
| NVIDIA uses "thread" to designate what was traditionally named
| the body of a "parallel for" a.k.a. "parallel do" program
| structure (which when executed on a GPU or multi-core CPU is
| unrolled and distributed over cores, threads and SIMD lanes).
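|
| A rough glossary of the mapping described above (my reading of
| this comment plus the usual OpenCL-style AMD names, not vendor
| documentation):
|
|     terminology = {
|         # traditional term:   (NVIDIA name,  AMD name)
|         "core":               ("SM",         "Compute Unit"),
|         "thread":             ("warp",       "wavefront"),
|         "SIMD lane":          ("CUDA core",  "stream processor"),
|         "parallel-for body":  ("thread",     "work-item"),
|     }
|     for old, (nv, amd) in terminology.items():
|         print(f"{old:18s} NVIDIA: {nv:10s} AMD: {amd}")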
___________________________________________________________________
(page generated 2024-09-07 23:00 UTC)