[HN Gopher] Zen 5's AVX-512 Frequency Behavior
___________________________________________________________________
Zen 5's AVX-512 Frequency Behavior
Author : matt_d
Score : 183 points
Date : 2025-03-01 04:10 UTC (18 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| kristianp wrote:
| I find it irritating that they are comparing clock scaling to the
| venerable Skylake-X. Surely Sapphire Rapids has been out for
| almost 2 years by now.
| fuhsnn wrote:
| I think it's mostly the lack of comparable research other than
| the Skylake-X one by Travis Downs. I too would like to see how
| Zen 4 behaves in the situation with its double-pumping.
| eqvinox wrote:
| Seemed appropriate to me as comparing the "first core to use
| full-width AVX-512 datapaths"; my interpretation is that AMD
| threw more R&D into this than Intel before shipping it to
| customers...
|
| (It's also not really a comparative article at all? Skylake-X
| is mostly just introduction...)
| kristianp wrote:
| > my interpretation is that AMD threw more R&D into this than
| Intel before shipping it to customers
|
| AMD had the benefit of learning from Intel's mistakes in
| their first generation of AVX-512 chips. It seemed unfair to
| compare an Intel chip that's so old (albeit long-lasting due
| to Intel's scaling problems). Skylake-X chips were released
| in 2017! [1]
|
| [1] https://en.wikipedia.org/wiki/Skylake_(microarchitecture)
| #Hi...
| eqvinox wrote:
| sure, but AMD's decision to start with a narrower datapath
| happened without insight from Intel's mistakes and could
| very well have backfired (if Intel had managed to produce a
| better-working implementation faster, that could've cost
| AMD a lot of market share). Intel had the benefit of
| designing the instructions along with the implementation as
| well, and also the choice of starting on a 2x256
| datapath...
|
| And again, yeah it's not great were it a comparison, but it
| really just doesn't read as a comparison to me at all. It's
| a _reference_.
| adrian_b wrote:
| AMD did not start with a narrower datapath, even if this
| is a widespread myth. It only had a narrower path between
| the inner CPU core and the L1 data cache memory.
|
| The most recent Intel and AMD CPU cores (Lion Cove and
| Zen 5) have identical vector datapath widths, but for
| many years, for 256-bit AVX Intel had a narrower datapath
| than AMD, 768-bit for Intel (3 x 256-bit) vs. 1024-bit
| for AMD (4 x 256-bit).
|
| Only when executing 512-bit AVX-512 instructions was
| Intel's vector datapath extended to 1024 bits (2 x
| 512-bit), matching the datapath AMD used for all vector
| instructions.
|
| Intel's AVX-512 had only 2 advantages over AMD executing
| AVX or over the initial AVX-512 implementation in Zen 4.
|
| The first was that some Intel CPU models, but only the
| more expensive SKUs, i.e. most of the Gold and all of the
| Platinum, had 2 x 512-bit FMA units, while the cheap
| Intel CPUs and AMD Zen 4 had only one 512-bit FMA unit
| (but AMD Zen 4 still had 2 x 512-bit FADD units).
| Therefore Intel could do 2 FMUL or FMA per clock cycle,
| while Zen 4 could do only 1 FMUL or FMA (+ 1 FADD).
|
| The second was that Intel had a double-width link to the
| L1 cache, so it could do 2 x 512-bit loads + 1 x 512-bit
| store per clock cycle, while Zen 4 could do only 1 x
| 512-bit load per cycle + 1 x 512-bit store every other
| cycle. (In a balanced CPU core design the throughput for
| vector FMA and for vector loads from the L1 cache must be
| the same, which is true for both old and new Intel and
| AMD CPU cores.)
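|
| A rough back-of-envelope of what "balanced" means here,
| assuming a streaming kernel where each FMA reads one fresh
| 512-bit operand from L1: Intel's 2 x 512-bit FMA per cycle
| needs about 2 x 512-bit loads per cycle, which its L1 path
| provides, while Zen 4's single 512-bit FMA per cycle needs
| about 1 x 512-bit load per cycle, which matches its
| narrower L1 path.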
|
| With the exception of vector load/store and FMUL/FMA, Zen
| 4 had the same or better AVX-512 throughput for most
| instructions compared with Intel Sapphire Rapids/Emerald
| Rapids. There were a few instructions with a poor
| implementation on Zen 4 and a few with a poor
| implementation on Intel, where one vendor was
| significantly better than the other.
| eqvinox wrote:
| > AMD did not start with a narrower datapath, even if
| this is a widespread myth. It only had a narrower path
| between the inner CPU core and the L1 data cache memory.
|
| https://www.mersenneforum.org/node/21615#post614191
|
| "Thus as many of us predicted, 512-bit instructions are
| split into 2 x 256-bit of the same instruction. And
| 512-bit is always half-throughput of their 256-bit
| versions."
|
| Is that wrong?
|
| There's a lot of it being described as "double pumped"
| going around...
|
| (tbh I couldn't care less about how wide the interface
| buses are, as long as they can deliver in sum total a
| reasonable bandwidth at a reasonable latency...
| especially on the further out cache hierarchies the
| latency overshadows the width so much it doesn't matter
| if it comes down to 1x512 or 2x256. The question at hand
| here is the total width of the ALUs and effective IPC.)
| adrian_b wrote:
| Sorry, but you did not read that good article carefully,
| nor the AMD documentation and the Intel documentation.
|
| For several generations, through Zen 4, the AMD Zen cores
| have had four vector execution units with a width of 256
| bits, i.e. a total datapath width of 1024 bits.
|
| On a 1024-bit datapath, you can execute either four
| 256-bit instructions per clock cycle or two 512-bit
| instructions per clock cycle.
|
| While the number of instructions executed per cycle
| varies, the data processing throughput is the same, 1024
| bits per clock cycle, as determined by the datapath
| width.
|
| The use of the word "double-pumped" by the AMD CEO has
| been a very unfortunate choice, because it has been
| completely misunderstood by most people, who have never
| read the AMD technical documentation and who have never
| tested the behavior of the micro-architecture of the Zen
| CPU cores.
|
| On Zen 4, the advantage of using AVX-512 does not come
| from a different throughput; it comes from a better
| instruction set and from avoiding bottlenecks in the
| CPU core front-end, at instruction fetching, decoding,
| renaming and dispatching.
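|
| For illustration only (a minimal sketch, not code from the
| article or from AMD): summing floats with AVX2 vs AVX-512
| intrinsics. Even if the back-end throughput in bits/cycle
| were identical, the 256-bit version pushes twice as many
| instructions per element through fetch/decode/rename:
|
|   #include <immintrin.h>
|   #include <stddef.h>
|
|   /* AVX2: two 256-bit loads and two adds cover 16 floats,
|      so the front-end sees twice the instruction count.
|      (Tail elements are ignored for brevity.) */
|   float sum_avx2(const float *a, size_t n) {
|       __m256 acc0 = _mm256_setzero_ps();
|       __m256 acc1 = _mm256_setzero_ps();
|       for (size_t i = 0; i + 16 <= n; i += 16) {
|           acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
|           acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
|       }
|       __m256 acc = _mm256_add_ps(acc0, acc1);
|       __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
|                             _mm256_extractf128_ps(acc, 1));
|       s = _mm_hadd_ps(s, s);
|       s = _mm_hadd_ps(s, s);
|       return _mm_cvtss_f32(s);
|   }
|
|   /* AVX-512: one 512-bit load and one add cover the same
|      16 floats per iteration. */
|   float sum_avx512(const float *a, size_t n) {
|       __m512 acc = _mm512_setzero_ps();
|       for (size_t i = 0; i + 16 <= n; i += 16)
|           acc = _mm512_add_ps(acc, _mm512_loadu_ps(a + i));
|       return _mm512_reduce_add_ps(acc);
|   }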
|
| On the Intel P cores before Lion Cove, the datapath for
| 256-bit instructions had a width of 768 bits, as they had
| three 256-bit execution units. For most 256-bit
| instructions the throughput was 768 bits per clock
| cycle. However, the three execution units were not
| identical, so some 256-bit instructions had a
| throughput of only 512 bits per cycle.
|
| When the older Intel P cores executed 512-bit
| instructions, the instructions with a 512 bit/cycle
| throughput remained at that throughput, but most of the
| instructions with a 768 bit/cycle throughput had their
| throughput increased to 1024 bit/cycle, matching the AMD
| throughput, by using an additional 256-bit datapath
| section that stayed unused when executing 256-bit or
| narrower instructions.
|
| While what is said above applies to most vector
| instructions, floating-point multiplication and FMA have
| different rules, because their throughput is not
| determined by the width of the datapath, but it may be
| smaller, being determined by the number of available
| floating-point multipliers.
|
| Cheap Intel CPUs and AMD Zen 2/Zen 3/Zen 4 had FP
| multipliers with a total throughput of 512 bits of
| results per clock cycle, while the expensive Xeon Gold
| and Platinum had FP multipliers with a total throughput
| of 1024 bits of results per clock cycle.
|
| The "double-pumped" term is applicable only to FP
| multiplication, where Zen 4 and cheap Intel CPUs require
| double the number of clock cycles to produce the same
| results as expensive Intel CPUs. It may also be applied,
| though even less appropriately, to vector load and
| store, where the path to the L1 data cache was narrower
| in Zen 4 than in Intel CPUs.
|
| The "double-pumped" term is not applicable to the very
| large number of other AVX-512 instructions, whose
| throughput is determined by the width of the vector
| datapath, not by the width of the FP multipliers or by
| the L1 data cache connection.
|
| Zen 5 doubles the vector datapath width to 2048 bits, so
| many 512-bit AVX-512 instructions have a 2048 bit/cycle
| throughput, except FMUL/FMA, which have a 1024 bit/cycle
| throughput, determined by the width of the FP
| multipliers. (Because there are only 4 execution units,
| 256-bit instructions cannot use the full datapath.)
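|
| Translated into instructions (my arithmetic): 2048 bit/cycle
| is up to four 512-bit operations per cycle for instructions
| that can use all four units, while the 1024 bit/cycle
| FMUL/FMA limit works out to two 512-bit FMAs per cycle.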
|
| Intel Diamond Rapids, expected by the end of 2026, is
| likely to have the same vector throughput as Zen 5. Until
| then, the Lion Cove cores in consumer CPUs, like Arrow
| Lake S, Arrow Lake H and Lunar Lake, are crippled by a
| half-width 1024-bit datapath, which cannot compete with a
| Zen 5 executing AVX-512 instructions.
| crest wrote:
| Isn't it misleading to just add up the output width of
| all SIMD ALU pipelines and call the sum "datapath width",
| because you can't freely mix and match when the available
| ALU pipelines determine what operations you can compute
| at full width?
| adrian_b wrote:
| You are right that in most CPUs the 3 or 4 vector
| execution units are not completely identical.
|
| Therefore some operations may use the entire datapath
| width, while others may use only a fraction, e.g. only a
| half or only two thirds or only three quarters.
|
| However you cannot really discuss these details without
| listing all such instructions, i.e. reproducing the
| tables from the Intel or AMD optimization guides or from
| Agner Fog's optimization documents.
|
| For the purpose of this discussion thread, these details
| are not really relevant, because Intel and AMD classify
| the instructions mostly the same way: cheap instructions,
| like addition operations, can be executed in all execution
| units, using the entire datapath width, while certain more
| expensive operations, like multiplication/division/square
| root/shuffle, may be done only in a subset of the execution
| units, so they can use only a fraction of the datapath
| width (but when possible they will be paired with simple
| instructions using the remainder of the datapath,
| maintaining a total throughput equal to the datapath
| width).
|
| Because most instructions are classified by cost in the
| same way by AMD and Intel, the throughput ratio between
| AMD and Intel is typically the same both for instructions
| using the full datapath width and for those using only a
| fraction.
|
| As I have said, with very few exceptions (including
| FMUL/FMA/LD/ST), the throughput for 512-bit instructions
| has been the same for Zen 4 and the Intel CPUs with
| AVX-512 support, as determined by the common 1024-bit
| datapath width, including for the instructions that could
| use only a half-width 512-bit datapath.
| dzaima wrote:
| Wouldn't it be 1536-bit for 2 256-bit FMA/cycle, with FMA
| taking 3 inputs? (applies equally to both so doesn't
| change anything materially; And even goes back to
| Haswell, which too is capable of 2 256-bit FMA/cycle)
| adrian_b wrote:
| That is why I have written the throughput "for results",
| to clarify the meaning (the throughput for output results
| is determined by the number of execution units; it does
| not depend on the number of input operands).
|
| The vector register file has a number of read and write
| ports, e.g. 10 x 512-bit read ports for recent AMD CPUs
| (i.e. 10 ports can provide the input operands for 2 x FMA
| + 2 FADD, when no store instructions are done
| simultaneously).
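|
| The arithmetic behind that figure: 2 FMA x 3 input operands
| + 2 FADD x 2 input operands = 6 + 4 = 10 x 512-bit reads
| per cycle.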
|
| So a detailed explanation of the "datapath widths" would
| have to take into account the number of read and write
| ports, because some combinations of instructions cannot
| be executed simultaneously, even when execution units are
| available, since the paths between the register file and
| the execution units are occupied.
|
| To complicate things further, some combinations of
| instructions that would be prohibited by not having enough
| register read and write ports can actually be executed
| simultaneously, because there are bypass paths between the
| execution units that allow the sharing of some input
| operands or the direct use of output operands as input
| operands, without passing through the register file.
|
| The structure of the Intel vector execution units, with 3
| x 256-bit execution units, 2 of which can do FMA, goes
| indeed back to Haswell, as you say.
|
| The Lion Cove core launched in 2024 is the first Intel
| core that uses the enhanced structure used by AMD Zen for
| many years, with 4 execution units, where 2 can do
| FMA/FMUL, but all 4 can do FADD.
|
| Starting with the Skylake Server CPUs, the Intel CPUs
| with AVX-512 support retain the Haswell structure when
| executing 256-bit or narrower instructions, but when
| executing 512-bit instructions, 2 x 256-bit execution
| units are paired to make a 512-bit execution unit, while
| the third 256-bit execution unit is paired with an
| otherwise unused 256-bit execution unit to make a second
| 512-bit execution unit.
|
| Of these 2 x 512-bit execution units, only one can do
| FMA. Certain Intel SKUs add a second 512-bit FMA unit, so
| in those both 512-bit execution units can do FMA (this
| fact is mentioned where applicable in the CPU
| descriptions from the Intel Ark site).
| dzaima wrote:
| So the 1024-bit number is the number of vector output
| bits per cycle, i.e. 2xFMA+2xFADD = (2+2)x256-bit? Is the
| term "datapath width" used for that anywhere else? (I
| guess you've prefixed that with "total " in some places,
| which makes much more sense)
| dzaima wrote:
| oops, haswell has only 3 SIMD ALUs, i.e. 768 bits of
| output per cycle, not 1024.
| eigenform wrote:
| > Sorry, but you did not read carefully that good article
| and you did not read the AMD documentation and the Intel
| documentation.
|
| I think _you_ are the one who hasn't read documentation
| or tested the behavior of Zen cores. Read literally any
| AMD material about Zen4: it mentions that the AVX512
| implementation is done over two cycles because there are
| 256-bit datapaths.
|
| On page 34 of the Zen4 Software Optimization Guide[^1],
| it literally says:
|
| > Because the data paths are 256 bits wide, the scheduler
| uses two consecutive cycles to issue a 512-bit operation.
|
| [^1]: https://www.amd.com/content/dam/amd/en/documents/pr
| ocessor-t...
| crest wrote:
| AMD also has a long history of using half-width, double-
| pumped SIMD implementations. It worked each time. With
| Zen 4 the surprise wasn't that it worked at all, but how
| well it worked and how little the core grew in any
| relevant metric to support it: size, static and dynamic
| power.
|
| It makes me wonder why Intel didn't pair two efficiency
| cores with half width SIMD data paths per core sharing a
| single full width permute unit between them.
| adrian_b wrote:
| True, but who would bother to pay a lot of money for a CPU that
| is known to be inferior to alternatives, only to be able to
| test the details of its performance?
| crest wrote:
| A well funded investigative tech journalist?
| adrian_b wrote:
| With prices in the range of $2,000 - $20,000 for the CPU,
| plus a couple grand for at least a motherboard, cooler,
| memory and PSU, the journalist must be very well funded to
| spend that much to publish one article analyzing the CPU.
|
| I would like to read such an article or to be able to test
| such a CPU myself, but my curiosity is not great enough to
| make me spend that kind of money.
|
| For now, the best one can do is to examine the results of
| general-purpose benchmarks published on sites like:
|
| https://www.servethehome.com/
|
| https://www.phoronix.com/
|
| where Intel, AMD or Ampere send some review systems with
| Xeon, Epyc or Arm-based server CPUs.
|
| These sites are useful, but a more thorough micro-
| architectural investigation would have been nice.
| kristianp wrote:
| AWS, dedicated EC2 instance. A few dollars an hour.
| formerly_proven wrote:
| Cascade Lake improved the situation a bit, but then you had Ice
| Lake where iirc the hard cutoffs were gone and you were just
| looking at regular power and thermal steering. IIRC, that was
| the first generation where we enabled AVX512 for all workloads.
| Earw0rm wrote:
| Depending on which CPU category. I think Intel HEDT stops at
| Cascade Lake, which is essentially Skylake-X Refresh from 2019?
|
| Whereas AMD has full-fat AVX512 even in gaming laptop CPUs.
| Remnant44 wrote:
| It's interesting that running Zen 5's FPUs in full 512-bit wide
| mode doesn't actually seem to cause any trouble, but that
| lighting up the load/store units does. I don't know enough about
| hardware-level design to know if this would be "expected".
|
| The full investigation in this article is really interesting,
| but the TL;DR is: light up enough of the core, and frequencies
| will have to drop to stay within the power envelope. The
| transition period is handled very smartly, but it still exists.
| But as opposed to the old Intel AVX-512 cores that got endless
| (deserved?) bad press for their transition behavior, this is
| more or less seamless.
| eqvinox wrote:
| Reading the section under "Load Another FP Pipe?" I'm coming
| away with the impression that it's not the LSU but rather total
| overall load that causes trouble. While that section is focused
| on transition time, the end steady state is also slower...
| tanelpoder wrote:
| I haven't read the article yet, but back when I tried to get
| to over 100 GB/s IO rate from a bunch of SSDs on Zen4 (just
| fio direct IO workload without doing anything with the data),
| I ended up disabling Core Boost states (or maybe something
| else in BIOS too), to give more thermal allowance for the IO
| hub on the chip. As RAM load/store traffic goes through the
| IO hub too, maybe that's it?
| eqvinox wrote:
| I don't think these things are related; this is talking
| about the LSU right inside the core. I'd also expect
| oscillations if there were a thermal problem like you're
| describing, i.e. core clocks up when IO hub delivers data,
| IO hub stalls, causes core to stall as well, IO hub can run
| again delivering data, repeat from beginning.
|
| (Then again, boost clocks are an intentional oscillation
| anyway...)
| tanelpoder wrote:
| Ok, I just read through the article. As I understand it,
| their tests were designed to run entirely on data in the
| local core's cache? I only see L1d mentioned there.
| eqvinox wrote:
| Yes, that's my understanding of "Zen 5 also doubles L1D
| load bandwidth, and I'm exercising that by having each
| FMA instruction source an input from the data cache."
| Also, considering the author's other work, I'm pretty
| sure they can isolate load-store performance from cache
| performance from memory interface performance.
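|
| Presumably the kernel has roughly this shape (my guess at a
| minimal sketch; the names and exact instruction mix are
| assumptions, not the author's actual code): independent
| accumulators, with each FMA taking one operand from a small
| L1D-resident buffer, so the FMA pipes and the L1D load
| ports are exercised together:
|
|   #include <immintrin.h>
|   #include <stddef.h>
|
|   /* Each FMA sources one 512-bit operand from memory; with
|      a buffer that fits in L1D, this stresses FMA throughput
|      and L1D load bandwidth at the same time. */
|   void fma_load_kernel(const float *p, size_t len, __m512 *out) {
|       __m512 acc0 = _mm512_set1_ps(1.0f);
|       __m512 acc1 = _mm512_set1_ps(2.0f);
|       __m512 c    = _mm512_set1_ps(1.0001f);
|       for (size_t i = 0; i + 32 <= len; i += 32) {
|           /* one 512-bit load feeds each FMA */
|           acc0 = _mm512_fmadd_ps(_mm512_loadu_ps(p + i), c, acc0);
|           acc1 = _mm512_fmadd_ps(_mm512_loadu_ps(p + i + 16), c, acc1);
|       }
|       *out = _mm512_add_ps(acc0, acc1);  /* keep results live */
|   }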
| yaantc wrote:
| On the L/S unit impact: data movement is expensive, computation
| is cheap (relatively).
|
| In "Computer Architecture, A Quantitative Approach" there are
| numbers for the now old TSMC 45nm process: a 32-bit FP
| multiplication takes 3.7 pJ, and a 32-bit read from an 8
| kB SRAM takes 5 pJ. This is a basic SRAM, not a cache with its
| tag comparison and LRU logic (more expensive).
|
| Then I have some 2015 numbers for an Intel 22nm process, also
| old. A 64-bit FP multiplication takes 6.4 pJ, a 64-bit
| read/write from a small 8 kB SRAM 4.2 pJ, and from a larger
| 256 kB SRAM 16.7 pJ. Basic SRAM here too, not a more
| expensive cache.
|
| The cost of a multiplication grows quadratically with operand
| width, while access cost should be closer to linear, so the
| computation cost in the second example is relatively much
| heavier (compare the mantissa sizes; that's what gets
| multiplied).
|
| The trend gets even worse with more advanced processes. Data
| movement is usually what matters the most now, except for
| workloads with very high arithmetic intensity where computation
| will dominate (in practice: large enough matrix
| multiplications).
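|
| A crude extrapolation from those 22nm numbers (ignoring
| process differences, cache overheads and everything else): a
| 512-bit FP64 multiply is 8 lanes, so roughly 8 x 6.4 pJ ~= 51
| pJ of compute, while reading its 512 bits of input from a 256
| kB SRAM is roughly 8 x 16.7 pJ ~= 134 pJ. The data movement
| alone can easily outweigh the arithmetic.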
| Remnant44 wrote:
| Appreciate the detail! That explains a lot of what is going
| on. It also dovetails with some interesting facts I remember
| reading about the relative power consumption of the Zen
| cores versus the Infinity Fabric connecting them - the
| percentage of package power used simply to run the fabric
| interconnect was shocking.
| formerly_proven wrote:
| Random logic has also had much better area scaling than SRAM
| since EUV, which implies that the gap continues to widen at a
| faster rate.
| eigenform wrote:
| AFAIK you have to think about how many different 512b paths
| are being driven when this happens, like each cycle in the
| steady-state case is simultaneously (in the case where you
| can do _two_ vfmadd132ps per cycle):
|
| - Capturing 2x512b from the L1D cache
|
| - Sending 2x512b to the vector register file
|
| - Capturing 4x512b values from the vector register file
|
| - Actually multiplying 4x512b values
|
| - Sending 2x512b results to the vector register file
|
| .. and probably more?? That's already like 14*512 wires
| [switching constantly at 5Ghz!!], and there are probably even
| more intermediate stages?
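|
| (Tallying that list: 2 + 2 + 4 + 4 + 2 = 14 x 512b paths,
| counting the multiply stage's 4 x 512b inputs as their own
| set of wires.)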
| Earw0rm wrote:
| Right, but a SIMD single-precision mul is linear (or even
| sub-linear) in cost relative to its scalar cousin. So a 16x32,
| 512-bit MUL won't even be 16x the cost of a scalar mul; the
| decoder, for example, only has to do the same amount of work.
| kimixa wrote:
| The calculations within each unit may scale that way, true,
| but routing and data transfer are probably the biggest
| limiting factor on a modern chip. It should be clear that
| placing 16x units of non-trivial size means that the average
| unit will likely be further away from the data source than a
| single unit, and transmitting data over distances can have
| greater-than-linear increasing costs (not just
| resistance/capacitance losses, but to hit timing targets you
| need faster switching, which means higher voltages etc.)
| rayiner wrote:
| It seems even more interesting than the power envelope. It
| looks like the core is limited by the ability of the power
| supply to ramp up. So the dispatch rate drops momentarily and
| then goes back up to allow power delivery to catch up.
| deaddodo wrote:
| To be clear, the problem with the Skylake implementation was
| that triggering AVX-512 would downclock the entirety of the
| CPU. It didn't do anything smart; it was fairly binary.
|
| This AMD implementation instead seems to be better optimized
| and to plug into the normal thermal management of the CPU for
| better scaling.
| bayindirh wrote:
| > but as opposed to the old Intel AVX-512 cores that got endless
| (deserved?) bad press for their transition behavior, this is
| more or less seamless.
|
| The problem with Intel was that the AVX frequencies were
| _secrets_. They were never disclosed in later cores where the
| power envelope got tight, and using AVX-512 killed performance
| throughout the chip. This meant that if one core was using
| AVX-512, the other cores in the same socket throttled down due
| to the thermal load and the power cap on the core. This caused
| every process on the same socket to suffer. Which is a big
| no-no for cloud or HPC workloads where nodes are shared by
| many users.
|
| Secrecy and downplaying of this effect made Intel's AVX-512
| frequency behavior infamous.
|
| Oh, and doing your own benchmarks on your own hardware which
| you paid for and releasing the results to the public was
| _verboten_, btw.
| Szpadel wrote:
| I'm curious how changing some OC parameters would affect those
| results. If it's caused by voltage drop, how does load-line
| calibration affect it? If it's a power constraint, then how
| would PBO affect it?
| menaerus wrote:
| I don't understand why 2x FMAs in a CPU design pose such a
| challenge when GPUs literally have hundreds of such ALUs. Both
| operate at similar TDP, so where's the catch? Much lower GPU
| clock frequency?
| eqvinox wrote:
| It's not 2 FMAs, it's AVX-512 (and going with 32-bit words) =
| 2*512/32 = 32 FMAs per core, 256 on an 8-core CPU. The unit
| counts for GPUs - depending on which number you look at - count
| these separately.
|
| CPUs also have much more complicated program flow control,
| versatility, and AFAIK latency (= flow control cost) of
| individual instructions. GPUs are optimized for raw calculation
| throughput meanwhile.
|
| Also note that modern GPUs and CPUs don't have a clear pricing
| relationship anymore, e.g. a desktop CPU is much cheaper than a
| high-end GPU, and large server CPUs are more expensive than
| either.
| menaerus wrote:
| 1x 512-bit FMA or 2x 256-bit FMAs or 4x 128-bit FMAs is
| irrelevant here - it's still a single physical unit in a CPU
| that consumes 512 bits of data bandwidth. The question is why
| the CPU budget allows for 2x 512-bit or 4x 256-bit while the
| H100, for example, has 14592 FP32 CUDA cores - in AVX
| terminology that would translate, if I am not mistaken, to
| 912x 512-bit or 1824x 256-bit FMAs per clock cycle. Even
| considering the obvious differences between GPUs and CPUs,
| this is still a large difference. Since GPU cores operate at
| much lower frequencies than CPU cores, that is what made me
| believe this is where the biggest difference comes from.
| eqvinox wrote:
| AIUI an FP32 core is only 32 bits wide, but this is outside
| my area of expertise really. Also note that CPUs also have
| additional ALUs that can't do FMAs, FMA is just the most
| capable one.
|
| You're also repeating 2x512 / 4x256 -- that's per core, you
| need to multiply by CPU core count.
|
| [also, note e.g. an 8-core CPU is much cheaper than an H100
| card ;) -- if anything you'd be comparing the highest end
| server CPUs here. A 192-core Zen5c is 8.2~10.5kEUR open
| retail, an H100 is 32~35kEUR...]
|
| [reading through some random docs, a CPU core seems vaguely
| comparable to a SM; a SM might have 128 or 64 lanes (=FP32
| cores) while a CPU only has 16 with AVX-512, but there is
| indeed also a notable clock difference and far more
| flexibility otherwise in the CPU core (which consumes
| silicon area)]
| adrian_b wrote:
| Like another poster already said, the power budget of a
| consumer CPU like the 9950X, running programs at roughly
| double the clock frequency of a GPU, allows for 16 cores x 2
| execution units x 16 lanes = 512 FP32 FMA per clock cycle.
| That provides the same throughput as a 1024 FP32 FMA per
| clock cycle iGPU from the best laptop CPUs, while consuming 3
| times less power than a datacenter GPU, so its power budget
| and performance are like those of a datacenter GPU with 3072
| FP32 FMA per clock cycle.
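|
| To put a number on that (my arithmetic, assuming an all-core
| clock around 5 GHz): 512 FMA/clock x 5 GHz ~= 2.5 x 10^12
| FMA/s, or about 5 TFLOPS FP32 if an FMA is counted as two
| operations.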
|
| However, because of its high clock frequency a consumer CPU
| has high performance per dollar, but low performance per
| watt.
|
| Server CPUs with many cores have much better energy
| efficiency, e.g. around 3 times higher than a desktop CPU
| and about the same as the most efficient laptop CPUs. For
| many generations of NVIDIA GPUs and Intel Xeon CPUs, until
| about 5-6 years ago, the ratio between their floating-point
| FMA throughput per watt was only about 3.
|
| This factor of 3 is mainly due to the overhead of various
| tricks used by CPUs to extract instruction-level
| parallelism from programs that do not use enough concurrent
| threads or array operations, e.g. superscalar out-of-order
| execution, register renaming, etc.
|
| In recent years, starting with NVIDIA Volta, followed later
| by AMD and Intel GPUs, the GPUs have made a jump in
| performance that has increased the gap between their
| throughput and that of CPUs, by supplementing the vector
| instructions with matrix instructions, i.e. what NVIDIA
| calls tensor instructions.
|
| However this current greater gap in performance between
| CPUs and GPUs could easily be removed and the performance
| per watt ratio could be brought back to a factor unlikely
| to be greater than 3, by adding matrix instructions to the
| CPUs.
|
| Intel has introduced the AMX instruction set, besides AVX,
| but for now it is supported only in expensive server CPUs
| and Intel has defined only instructions for low-precision
| operations used for AI/ML. If AMX were extended with FP32
| and FP64 operations, then the performance would be much
| more competitive with GPUs.
|
| ARM is more advanced in this direction, with SME (Scalable
| Matrix Extension) defined besides SVE (Scalable Vector
| Extension). SME is already available in recent Apple CPUs
| and it is expected to be also available in the new Arm
| cores that will be announced in a few months from now, which
| should become available in the smartphones of 2026, and
| presumably also in future Arm-based CPUs for servers and
| laptops.
|
| The current Apple CPUs do not have strong SME accelerators,
| because they also have an iGPU that can perform the
| operations whose latency is less important.
|
| On the other hand, an Arm-based server CPU could have a
| much bigger SME accelerator, providing a performance much
| closer to a GPU.
| menaerus wrote:
| I appreciate the response with a lot of interesting
| details; however, I don't believe it answers the question
| I had. My question was why the CPU design suffers from
| clock frequency issues in AVX-512 workloads whereas GPUs,
| which have much more compute power, do not.
|
| I assumed that it was due to the fact that GPUs run at
| much lower clock frequencies and therefore have more
| power budget available, but as I also discussed with
| another commenter above, this was probably a premature
| conclusion, since we don't have enough evidence showing
| that GPUs indeed do not suffer from the same type of
| issues. They likely do, but nobody has measured it yet?
| adrian_b wrote:
| The low clock frequency when executing AVX-512 workloads
| is a frequency where the CPU operates efficiently, with a
| low energy consumption per operation executed.
|
| For such a workload that executes a very large number of
| operations per second, the CPU cannot afford to operate
| inefficiently because it will overheat.
|
| When a CPU core has many execution units that are idle,
| so they do not consume power, like when executing only
| scalar operations or only operations with narrow 128-bit
| vectors, it can afford to raise the clock frequency e.g.
| by 50%, even if that would increase the energy
| consumption per operation e.g. 3 times. By executing 4 or
| 8 times fewer operations per clock cycle, even if the
| energy consumption per operation is 3 times higher, the
| total power consumption is smaller, so the CPU does not
| overheat. The desktop owner does not care that completing
| the same workload requires much more energy, because the
| owner likely cares more about the time to completion.
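|
| A rough way to see it, using first-order CMOS scaling (my
| numbers, not from the article): dynamic power is about C x
| V^2 x f, and the required voltage rises roughly with
| frequency, so energy per operation grows roughly with f^2
| and total power roughly with f^3. A 50% higher clock can
| therefore cost around 2-3x the energy per operation, which
| is only affordable when most of the wide execution units
| sit idle.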
|
| The clock frequency of a GPU also varies continuously
| depending on the workload, in order to maintain the power
| consumption within the limits. However a GPU is not
| designed to be able to increase the clock frequency as
| much as a CPU. The fastest GPUs have clock frequencies
| under 3 GHz, while the fastest CPUs exceed 6 GHz.
|
| The reason is that normally one never launches a GPU
| program that would use only a small fraction of the
| resources of a GPU allowing a higher clock frequency, so
| it makes no sense to design a GPU for this use case.
|
| Designing a chip for a higher clock frequency greatly
| increases the size of the chip, as shown by the
| comparison between a normal Zen core designed for 5.7 GHz
| and a Zen compact core, designed e.g. for 3.3 GHz, a
| frequency not much higher than that of a GPU.
|
| On Zen compact cores and on normal Zen cores configured
| for server CPUs with a large number of cores, e.g. 128
| cores (with a total of 4096 FP32 ALUs, like a low-to-mid-
| range desktop GPU, or like a top desktop GPU of 5 years
| ago; a Zen compact server CPU can have 6144 FP32 ALUs,
| more than a RTX 4070), the clock frequency variation
| range is small, very similar to the clock variation range
| of a GPU.
|
| In conclusion, it is not the desktop/laptop CPUs which
| drop their clock frequency, but it is the GPUs which
| never raise their clock frequency much, the same as the
| server CPUs, because neither GPUs nor server CPUs are
| normally running programs that keep most of their
| execution units idle, to allow higher clock frequencies
| without overheating.
| atq2119 wrote:
| It's not the execution of FMAs that's the challenge, it's the
| ramp up / down.
|
| And I assure you GPUs do have challenges with that as well.
| That's just less well known because (1) in GPUs, _all_
| workloads are vector workloads, and so there was never a stark
| contrast between scalar and vector regimes like in Intel's
| AVX-512 implementation and (2) GPU performance characteristics
| are in general less broadly known.
| menaerus wrote:
| Yes, I agree that it was premature to say that GPUs aren't
| suffering from the same symptoms. There's just not enough
| evidence but the differences in the compute power are still
| large.
| dzaima wrote:
| Zen 5 still clocks way higher than GPUs even with the
| penalties. Additionally, CPUs typically target much lower
| latency for operations even per-clock, which adds a ton of
| silicon cost for the same throughput, and especially so at high
| clock frequency.
|
| The difficulty with transitions that Skylake-X suffered
| especially from just has no equivalent on GPU; if you always
| stay in the transitioned-to-AVX512 state on Skylake-X, things
| are largely normal; GPUs just are always unconditionally in
| such a state, but that'd be awful on CPUs, as it'd make scalar-
| only code (not a thing on GPUs, but the main target for CPUs)
| unnecessarily slow. And so Intel decided that the transitions
| are worth the improved clocks for code not utilizing AVX-512.
| adgjlsfhk1 wrote:
| It's GPU frequency: 5.5 GHz is ~4x the heat and power of the
| 2.5 GHz that GPUs run at.
| sylware wrote:
| They should put forward the fact that 512 bits is the "sweet
| spot", as it is exactly one data cache line!
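|
| (512 bits = 64 bytes, which is exactly one x86 cache line, so
| an aligned 512-bit load or store touches exactly one line.)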
| ksec wrote:
| I wonder if this will be improved or fixed in Zen 6. Although
| personally I'd much rather they focus on IPC.
| Remnant44 wrote:
| Nothing to fix here. The behavior in the transition regimes is
| already quite good.
|
| The overall throttling is dynamic and reactive based on heat
| and power draw - this is unavoidable and in fact desirable (the
| alternative is to simply run slower all the time, not to
| somehow be immune to physics and run faster all the time)
| mgaunard wrote:
| In practice everyone turns off AVX512 because they're afraid of
| the frequency throttling.
|
| The damage was done by Skylake-X and won't be healed for years.
| yaro330 wrote:
| Everyone who? Surely anyone interested would do the research
| after buying a shiny new CPU.
___________________________________________________________________
(page generated 2025-03-01 23:01 UTC)