[HN Gopher] Zen 5's AVX-512 Frequency Behavior
___________________________________________________________________
Zen 5's AVX-512 Frequency Behavior
Author : matt_d
Score : 183 points
Date : 2025-03-01 04:10 UTC (18 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| kristianp wrote:
| I find it irritating that they are comparing clock scaling to the
| venerable Skylake-X. Surely Sapphire Rapids has been out for
| almost 2 years by now.
| fuhsnn wrote:
| I think it's mostly the lack of comparable research other than
| the Skylake-X one by Travis Downs. I too would like to see how
| Zen 4 behaves in the situation with its double-pumping.
| eqvinox wrote:
| Seemed appropriate to me as comparing the "first core to use
| full-width AVX-512 datapaths"; my interpretation is that AMD
| threw more R&D into this than Intel before shipping it to
| customers...
|
| (It's also not really a comparative article at all? Skylake-X
| is mostly just introduction...)
| kristianp wrote:
| > my interpretation is that AMD threw more R&D into this than
| Intel before shipping it to customers
|
| AMD had the benefit of learning from Intel's mistakes in
| their first generation of AVX-512 chips. It seemed unfair to
| compare an Intel chip that's so old (albeit long-lasting due
| to Intel's scaling problems). Skylake-X chips were released
| in 2017! [1]
|
| [1] https://en.wikipedia.org/wiki/Skylake_(microarchitecture)
| #Hi...
| eqvinox wrote:
| sure, but AMD's decision to start with a narrower datapath
| happened without insight from Intel's mistakes and could
| very well have backfired (if Intel had managed to produce a
| better-working implementation faster, that could've cost
| AMD a lot of market share). Intel had the benefit of
| designing the instructions along with the implementation as
| well, and also the choice of starting on a 2x256
| datapath...
|
| And again, yeah it's not great were it a comparison, but it
| really just doesn't read as a comparison to me at all. It's
| a _reference_.
| adrian_b wrote:
| AMD did not start with a narrower datapath, even if this
| is a widespread myth. It only had a narrower path between
| the inner CPU core and the L1 data cache memory.
|
| The most recent Intel and AMD CPU cores (Lion Cove and
| Zen 5) have identical vector datapath widths, but for
| many years, for 256-bit AVX Intel had a narrower datapath
| than AMD, 768-bit for Intel (3 x 256-bit) vs. 1024-bit
| for AMD (4 x 256-bit).
|
| Only when executing 512-bit AVX-512 instructions was
| Intel's vector datapath extended to 1024 bits (2 x
| 512-bit), matching the datapath AMD used for all vector
| instructions.
|
| Intel's AVX-512 had only 2 advantages over AMD executing
| AVX or over the initial AVX-512 implementation in Zen 4.
|
| The first was that some Intel CPU models, but only the
| more expensive SKUs, i.e. most of the Gold and all of the
| Platinum, had 2 x 512-bit FMA units, while the cheap
| Intel CPUs and AMD Zen 4 had only one 512-bit FMA unit
| (but AMD Zen 4 still had 2 x 512-bit FADD units).
| Therefore Intel could do 2 FMUL or FMA per clock cycle,
| while Zen 4 could do only 1 FMUL or FMA (+ 1 FADD).
|
| The second was that Intel had a double-width link to the
| L1 cache, so it could do 2 x 512-bit loads + 1 x 512-bit
| store per clock cycle, while Zen 4 could do only 1 x
| 512-bit load per cycle + 1 x 512-bit store every other
| cycle. (In a balanced CPU core design the throughput for
| vector FMA and for vector loads from the L1 cache must be
| the same, which is true for both old and new Intel and
| AMD CPU cores.)
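|
| A rough back-of-envelope of what "balanced" means here,
| assuming a streaming kernel where each FMA reads one fresh
| 512-bit operand from L1: Intel's 2 x 512-bit FMA per cycle
| needs about 2 x 512-bit loads per cycle, which its L1 path
| provides, while Zen 4's single 512-bit FMA per cycle needs
| about 1 x 512-bit load per cycle, which matches its
| narrower L1 path.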
|
| With the exception of vector load/store and FMUL/FMA, Zen
| 4 had the same or better AVX-512 throughput for most
| instructions compared with Intel Sapphire Rapids/Emerald
| Rapids. There were a few instructions with a poor
| implementation on Zen 4 and a few with a poor
| implementation on Intel, where one vendor was
| significantly better than the other.
| eqvinox wrote:
| > AMD did not start with a narrower datapath, even if
| this is a widespread myth. It only had a narrower path
| between the inner CPU core and the L1 data cache memory.
|
| https://www.mersenneforum.org/node/21615#post614191
|
| "Thus as many of us predicted, 512-bit instructions are
| split into 2 x 256-bit of the same instruction. And
| 512-bit is always half-throughput of their 256-bit
| versions."
|
| Is that wrong?
|
| There's a lot of it being described as "double pumped"
| going around...
|
| (tbh I couldn't care less about how wide the interface
| buses are, as long as they can deliver in sum total a
| reasonable bandwidth at a reasonable latency...
| especially on the further out cache hierarchies the
| latency overshadows the width so much it doesn't matter
| if it comes down to 1x512 or 2x256. The question at hand
| here is the total width of the ALUs and effective IPC.)
| adrian_b wrote:
| Sorry, but you did not read that good article carefully,
| nor the AMD documentation and the Intel documentation.
|
| For several generations, through Zen 4, the AMD Zen cores
| have had four vector execution units with a width of 256
| bits, i.e. a total datapath width of 1024 bits.
|
| On a 1024-bit datapath, you can execute either four
| 256-bit instructions per clock cycle or two 512-bit
| instructions per clock cycle.
|
| While the number of instructions executed per cycle
| varies, the data processing throughput is the same, 1024
| bits per clock cycle, as determined by the datapath
| width.
|
| The use of the word "double-pumped" by the AMD CEO has
| been a very unfortunate choice, because it has been
| completely misunderstood by most people, who have never
| read the AMD technical documentation and who have never
| tested the behavior of the micro-architecture of the Zen
| CPU cores.
|
| On Zen 4, the advantage of using AVX-512 does not come
| from a different throughput; it comes from a better
| instruction set and from avoiding bottlenecks in the
| CPU core front-end, at instruction fetching, decoding,
| renaming and dispatching.
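|
| For illustration only (a minimal sketch, not code from the
| article or from AMD): summing floats with AVX2 vs AVX-512
| intrinsics. Even if the back-end throughput in bits/cycle
| were identical, the 256-bit version pushes twice as many
| instructions per element through fetch/decode/rename:
|
|   #include <immintrin.h>
|   #include <stddef.h>
|
|   /* AVX2: two 256-bit loads and two adds cover 16 floats,
|      so the front-end sees twice the instruction count.
|      (Tail elements are ignored for brevity.) */
|   float sum_avx2(const float *a, size_t n) {
|       __m256 acc0 = _mm256_setzero_ps();
|       __m256 acc1 = _mm256_setzero_ps();
|       for (size_t i = 0; i + 16 <= n; i += 16) {
|           acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i));
|           acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
|       }
|       __m256 acc = _mm256_add_ps(acc0, acc1);
|       __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
|                             _mm256_extractf128_ps(acc, 1));
|       s = _mm_hadd_ps(s, s);
|       s = _mm_hadd_ps(s, s);
|       return _mm_cvtss_f32(s);
|   }
|
|   /* AVX-512: one 512-bit load and one add cover the same
|      16 floats per iteration. */
|   float sum_avx512(const float *a, size_t n) {
|       __m512 acc = _mm512_setzero_ps();
|       for (size_t i = 0; i + 16 <= n; i += 16)
|           acc = _mm512_add_ps(acc, _mm512_loadu_ps(a + i));
|       return _mm512_reduce_add_ps(acc);
|   }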
|
| On the Intel P cores before Lion Cove, the datapath for
| 256-bit instructions had a width of 768 bits, as they had
| three 256-bit execution units. For most 256-bit
| instructions the throughput was 768 bits per clock
| cycle. However, the three execution units were not
| identical, so some 256-bit instructions had a
| throughput of only 512 bits per cycle.
|
| When the older Intel P cores executed 512-bit
| instructions, the instructions with a 512 bit/cycle
| throughput remained at that throughput, but most of the
| instructions with a 768 bit/cycle throughput had their
| throughput increased to 1024 bit/cycle, matching the AMD
| throughput, by using an additional 256-bit datapath
| section that stayed unused when executing 256-bit or
| narrower instructions.
|
| While what is said above applies to most vector
| instructions, floating-point multiplication and FMA have
| different rules, because their throughput is not
| determined by the width of the datapath, but it may be
| smaller, being determined by the number of available
| floating-point multipliers.
|
| Cheap Intel CPUs and AMD Zen 2/Zen 3/Zen 4 had FP
| multipliers with a total throughput of 512 bits of
| results per clock cycle, while the expensive Xeon Gold
| and Platinum had FP multipliers with a total throughput
| of 1024 bits of results per clock cycle.
|
| The "double-pumped" term is applicable only to FP
| multiplication, where Zen 4 and cheap Intel CPUs require
| double the number of clock cycles to produce the same
| results as expensive Intel CPUs. It may also be applied,
| though even less appropriately, to vector load and
| store, where the path to the L1 data cache was narrower
| in Zen 4 than in Intel CPUs.
|
| The "double-pumped" term is not applicable to the very
| large number of other AVX-512 instructions, whose
| throughput is determined by the width of the vector
| datapath, not by the width of the FP multipliers or by
| the L1 data cache connection.
|
| Zen 5 doubles the vector datapath width to 2048 bits, so
| many 512-bit AVX-512 instructions have a 2048 bit/cycle
| throughput, except FMUL/FMA, which have a 1024 bit/cycle
| throughput, determined by the width of the FP
| multipliers. (Because there are only 4 execution units,
| 256-bit instructions cannot use the full datapath.)
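|
| Translated into instructions (my arithmetic): 2048 bit/cycle
| is up to four 512-bit operations per cycle for instructions
| that can use all four units, while the 1024 bit/cycle
| FMUL/FMA limit works out to two 512-bit FMAs per cycle.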
|
| Intel Diamond Rapids, expected by the end of 2026, is
| likely to have the same vector throughput as Zen 5. Until
| then, the Lion Cove cores in consumer CPUs, like Arrow
| Lake S, Arrow Lake H and Lunar Lake, are crippled by a
| half-width 1024-bit datapath, which cannot compete with a
| Zen 5 executing AVX-512 instructions.
| crest wrote:
| Isn't it misleading to just add up the output width of
| all SIMD ALU pipelines and call the sum "datapath width",
| because you can't freely mix and match when the available
| ALU pipelines determine what operations you can compute
| at full width?
| adrian_b wrote:
| You are right that in most CPUs the 3 or 4 vector
| execution units are not completely identical.
|
| Therefore some operations may use the entire datapath
| width, while others may use only a fraction, e.g. only a
| half or only two thirds or only three quarters.
|
| However you cannot really discuss these details without
| listing all such instructions, i.e. reproducing the
| tables from the Intel or AMD optimization guides or from
| Agner Fog's optimization documents.
|
| For the purpose of this discussion thread, these details
| are not really relevant, because Intel and AMD classify
| the instructions mostly the same way: cheap instructions,
| like addition operations, can be executed in all execution
| units, using the entire datapath width, while certain more
| expensive operations, like multiplication/division/square
| root/shuffle, may be done only in a subset of the execution
| units, so they can use only a fraction of the datapath
| width (but when possible they will be paired with simple
| instructions using the remainder of the datapath,
| maintaining a total throughput equal to the datapath
| width).
|
| Because most instructions are classified by cost in the
| same way by AMD and Intel, the throughput ratio between
| AMD and Intel is typically the same both for instructions
| using the full datapath width and for those using only a
| fraction.
|
| As I have said, with very few exceptions (including
| FMUL/FMA/LD/ST), the throughput for 512-bit instructions
| has been the same for Zen 4 and the Intel CPUs with
| AVX-512 support, as determined by the common 1024-bit
| datapath width, including for the instructions that could
| use only a half-width 512-bit datapath.
| dzaima wrote:
| Wouldn't it be 1536-bit for 2 256-bit FMA/cycle, with FMA
| taking 3 inputs? (applies equally to both so doesn't
| change anything materially; And even goes back to
| Haswell, which too is capable of 2 256-bit FMA/cycle)
| adrian_b wrote:
| That is why I have written the throughput "for results",
| to clarify the meaning (the throughput for output results
| is determined by the number of execution units; it does
| not depend on the number of input operands).
|
| The vector register file has a number of read and write
| ports, e.g. 10 x 512-bit read ports for recent AMD CPUs
| (i.e. 10 ports can provide the input operands for 2 x FMA
| + 2 FADD, when no store instructions are done
| simultaneously).
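|
| The arithmetic behind that figure: 2 FMA x 3 input operands
| + 2 FADD x 2 input operands = 6 + 4 = 10 x 512-bit reads
| per cycle.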
|
| So a detailed explanation of the "datapath widths" would
| have to take into account the number of read and write
| ports, because some combinations of instructions cannot
| be executed simultaneously, even when execution units are
| available, since the paths between the register file and
| the execution units are occupied.
|
| To complicate things further, some combinations of
| instructions that would be prohibited by not having enough
| register read and write ports can actually be executed
| simultaneously, because there are bypass paths between the
| execution units that allow the sharing of some input
| operands or the direct use of output operands as input
| operands, without passing through the register file.
|
| The structure of the Intel vector execution units, with 3
| x 256-bit execution units, 2 of which can do FMA, goes
| indeed back to Haswell, as you say.
|
| The Lion Cove core launched in 2024 is the first Intel
| core that uses the enhanced structure used by AMD Zen for
| many years, with 4 execution units, where 2 can do
| FMA/FMUL, but all 4 can do FADD.
|
| Starting with the Skylake Server CPUs, the Intel CPUs
| with AVX-512 support retain the Haswell structure when
| executing 256-bit or narrower instructions, but when
| executing 512-bit instructions, 2 x 256-bit execution
| units are paired to make a 512-bit execution unit, while
| the third 256-bit execution unit is paired with an
| otherwise unused 256-bit execution unit to make a second
| 512-bit execution unit.
|
| Of these 2 x 512-bit execution units, only one can do
| FMA. Certain Intel SKUs add a second 512-bit FMA unit, so
| in those both 512-bit execution units can do FMA (this
| fact is mentioned where applicable in the CPU
| descriptions from the Intel Ark site).
| dzaima wrote:
| So the 1024-bit number is the number of vector output
| bits per cycle, i.e. 2xFMA+2xFADD = (2+2)x256-bit? Is the
| term "datapath width" used for that anywhere else? (I
| guess you've prefixed that with "total " in some places,
| which makes much more sense)
| dzaima wrote:
| oops, haswell has only 3 SIMD ALUs, i.e. 768 bits of
| output per cycle, not 1024.
| eigenform wrote:
| > Sorry, but you did not read carefully that good article
| and you did not read the AMD documentation and the Intel
| documentation.
|
| I think _you_ are the one who hasn't read documentation
| or tested the behavior of Zen cores. Read literally any
| AMD material about Zen4: it mentions that the AVX512
| implementation is done over two cycles because there are
| 256-bit datapaths.
|
| On page 34 of the Zen4 Software Optimization Guide[^1],
| it literally says:
|
| > Because the data paths are 256 bits wide, the scheduler
| uses two consecutive cycles to issue a 512-bit operation.
|
| [^1]: https://www.amd.com/content/dam/amd/en/documents/pr
| ocessor-t...
| crest wrote:
| AMD also has a long history of using half-width, double-
| pumped SIMD implementations. It worked each time. With
| Zen 4 the surprise wasn't that it worked at all, but how
| well it worked and how little the core grew in any
| relevant metric to support it: size, static and dynamic
| power.
|
| It makes me wonder why Intel didn't pair two efficiency
| cores with half width SIMD data paths per core sharing a
| single full width permute unit between them.
| adrian_b wrote:
| True, but who would bother to pay a lot of money for a CPU that
| is known to be inferior to alternatives, only to be able to
| test the details of its performance?
| crest wrote:
| A well funded investigative tech journalist?
| adrian_b wrote:
| With prices in the range of $2,000 - $20,000 for the CPU,
| plus a couple grand for at least a motherboard, cooler,
| memory and PSU, the journalist must be very well funded to
| spend that much to publish one article analyzing the CPU.
|
| I would like to read such an article or to be able to test
| such a CPU myself, but my curiosity is not great enough to
| make me spend that kind of money.
|
| For now, the best one can do is to examine the results of
| general-purpose benchmarks published on sites like:
|
| https://www.servethehome.com/
|
| https://www.phoronix.com/
|
| where Intel, AMD or Ampere send some review systems with
| Xeon, Epyc or Arm-based server CPUs.
|
| These sites are useful, but a more thorough micro-
| architectural investigation would have been nice.
| kristianp wrote:
| AWS, dedicated EC2 instance. A few dollars an hour.
| formerly_proven wrote:
| Cascade Lake improved the situation a bit, but then you had Ice
| Lake where iirc the hard cutoffs were gone and you were just
| looking at regular power and thermal steering. IIRC, that was
| the first generation where we enabled AVX512 for all workloads.
| Earw0rm wrote:
| Depending on which CPU category. I think Intel HEDT stops at
| Cascade Lake, which is essentially Skylake-X Refresh from 2019?
|
| Whereas AMD has full-fat AVX512 even in gaming laptop CPUs.
| Remnant44 wrote:
| It's interesting that running Zen 5's FPUs in full 512-bit wide
| mode doesn't actually seem to cause any trouble, but that
| lighting up the load/store units does. I don't know enough about
| hardware-level design to know if this would be "expected".
|
| The full investigation in this article is really interesting,
| but the TL;DR is: light up enough of the core, and frequencies
| will have to drop to stay within the power envelope. The
| transition period is handled very smartly, but it still exists.
| But as opposed to the old Intel AVX-512 cores that got endless
| (deserved?) bad press for their transition behavior, this is
| more or less seamless.
| eqvinox wrote:
| Reading the section under "Load Another FP Pipe?" I'm coming
| away with the impression that it's not the LSU but rather total
| overall load that causes trouble. While that section is focused
| on transition time, the end steady state is also slower...
| tanelpoder wrote:
| I haven't read the article yet, but back when I tried to get
| to over 100 GB/s IO rate from a bunch of SSDs on Zen4 (just
| fio direct IO workload without doing anything with the data),
| I ended up disabling Core Boost states (or maybe something
| else in BIOS too), to give more thermal allowance for the IO
| hub on the chip. As RAM load/store traffic goes through the
| IO hub too, maybe that's it?
| eqvinox wrote:
| I don't think these things are related; this is talking
| about the LSU right inside the core. I'd also expect
| oscillations if there were a thermal problem like you're
| describing, i.e. core clocks up when IO hub delivers data,
| IO hub stalls, causes core to stall as well, IO hub can run
| again delivering data, repeat from beginning.
|
| (Then again, boost clocks are an intentional oscillation
| anyway...)
| tanelpoder wrote:
| Ok, I just read through the article. As I understand it,
| their tests were designed to run entirely on data in the
| local core's cache? I only see L1d mentioned there.
| eqvinox wrote:
| Yes, that's my understanding of "Zen 5 also doubles L1D
| load bandwidth, and I'm exercising that by having each
| FMA instruction source an input from the data cache."
| Also, considering the author's other work, I'm pretty
| sure they can isolate load-store performance from cache
| performance from memory interface performance.
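|
| Presumably the kernel has roughly this shape (my guess at a
| minimal sketch; the names and exact instruction mix are
| assumptions, not the author's actual code): independent
| accumulators, with each FMA taking one operand from a small
| L1D-resident buffer, so the FMA pipes and the L1D load
| ports are exercised together:
|
|   #include <immintrin.h>
|   #include <stddef.h>
|
|   /* Each FMA sources one 512-bit operand from memory; with
|      a buffer that fits in L1D, this stresses FMA throughput
|      and L1D load bandwidth at the same time. */
|   void fma_load_kernel(const float *p, size_t len, __m512 *out) {
|       __m512 acc0 = _mm512_set1_ps(1.0f);
|       __m512 acc1 = _mm512_set1_ps(2.0f);
|       __m512 c    = _mm512_set1_ps(1.0001f);
|       for (size_t i = 0; i + 32 <= len; i += 32) {
|           /* one 512-bit load feeds each FMA */
|           acc0 = _mm512_fmadd_ps(_mm512_loadu_ps(p + i), c, acc0);
|           acc1 = _mm512_fmadd_ps(_mm512_loadu_ps(p + i + 16), c, acc1);
|       }
|       *out = _mm512_add_ps(acc0, acc1);  /* keep results live */
|   }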
| yaantc wrote:
| On the L/S unit impact: data movement is expensive, computation
| is cheap (relatively).
|
| In "Computer Architecture, A Quantitative Approach" there are
| numbers for the now old TSMC 45nm process: a 32-bit FP
| multiplication takes 3.7 pJ, and a 32-bit read from an 8
| kB SRAM takes 5 pJ. This is a basic SRAM, not a cache with its
| tag comparison and LRU logic (more expensive).
|
| Then I have some 2015 numbers for an Intel 22nm process, also
| old. A 64-bit FP multiplication takes 6.4 pJ, a 64-bit
| read/write from a small 8 kB SRAM 4.2 pJ, and from a larger
| 256 kB SRAM 16.7 pJ. Basic SRAM here too, not a more
| expensive cache.
|
| The cost of a multiplication grows quadratically with operand
| width, while access cost should be closer to linear, so the
| computation cost in the second example is relatively much
| heavier (compare the mantissa sizes; that's what gets
| multiplied).
|
| The trend gets even worse with more advanced processes. Data
| movement is usually what matters the most now, except for
| workloads with very high arithmetic intensity where computation
| will dominate (in practice: large enough matrix
| multiplications).
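|
| A crude extrapolation from those 22nm numbers (ignoring
| process differences, cache overheads and everything else): a
| 512-bit FP64 multiply is 8 lanes, so roughly 8 x 6.4 pJ ~= 51
| pJ of compute, while reading its 512 bits of input from a 256
| kB SRAM is roughly 8 x 16.7 pJ ~= 134 pJ. The data movement
| alone can easily outweigh the arithmetic.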
| Remnant44 wrote:
| Appreciate the detail! That explains a lot of what is going
| on. It also dovetails with some interesting facts I remember
| reading about the relative power consumption of the Zen
| cores versus the Infinity Fabric connecting them - the
| percentage of package power used simply to run the fabric
| interconnect was shocking.
| formerly_proven wrote:
| Random logic has also had much better area scaling than SRAM
| since EUV, which implies that the gap continues to widen at a
| faster rate.
| eigenform wrote:
| AFAIK you have to think about how many different 512b paths
| are being driven when this happens, like each cycle in the
| steady-state case is simultaneously (in the case where you
| can do _two_ vfmadd132ps per cycle):
|
| - Capturing 2x512b from the L1D cache
|
| - Sending 2x512b to the vector register file
|
| - Capturing 4x512b values from the vector register file
|
| - Actually multiplying 4x512b values
|
| - Sending 2x512b results to the vector register file
|
| .. and probably more?? That's already like 14*512 wires
| [switching constantly at 5Ghz!!], and there are probably even
| more intermediate stages?
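|
| (Tallying that list: 2 + 2 + 4 + 4 + 2 = 14 x 512b paths,
| counting the multiply stage's 4 x 512b inputs as their own
| set of wires.)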
| Earw0rm wrote:
| Right, but a SIMD single-precision mul is linear (or even
| sub-linear) in cost relative to its scalar cousin. So a 16x32,
| 512-bit MUL won't even be 16x the cost of a scalar mul; the
| decoder, for example, only has to do the same amount of work.
| kimixa wrote:
| The calculations within each unit may scale that way, true,
| but routing and data transfer are probably the biggest
| limiting factor on a modern chip. It should be clear that
| placing 16x units of non-trivial size means that the average
| unit will likely be further away from the data source than a
| single unit, and transmitting data over distances can have
| greater-than-linear increasing costs (not just
| resistance/capacitance losses, but to hit timing targets you
| need faster switching, which means higher voltages etc.)
| rayiner wrote:
| It seems even more interesting than the power envelope. It
| looks like the core is limited by the ability of the power
| supply to ramp up. So the dispatch rate drops momentarily and
| then goes back up to allow power delivery to catch up.
| deaddodo wrote:
| To be clear, the problem with the Skylake implementation was
| that triggering AVX-512 would downclock the entirety of the
| CPU. It didn't do anything smart; it was fairly binary.
|
| This AMD implementation instead seems to be better optimized
| and to plug into the normal thermal management of the CPU for
| better scaling.
| bayindirh wrote:
| > but as opposed to the old Intel AVX-512 cores that got endless
| (deserved?) bad press for their transition behavior, this is
| more or less seamless.
|
| The problem with Intel was that the AVX frequencies were
| _secrets_. They were never disclosed in later cores where the
| power envelope got tight, and using AVX-512 killed performance
| throughout the chip. This meant that if one core was using
| AVX-512, the other cores in the same socket throttled down due
| to the thermal load and the power cap on the core. This caused
| every process on the same socket to suffer. Which is a big
| no-no for cloud or HPC workloads where nodes are shared by
| many users.
|
| Secrecy and downplaying of this effect made Intel's AVX-512
| frequency behavior infamous.
|
| Oh, and doing your own benchmarks on your own hardware which
| you paid for and releasing the results to the public was
| _verboten_, btw.
| Szpadel wrote:
| I'm curious how changing some OC parameters would affect those
| results. If it's caused by voltage drop, how does load-line
| calibration affect it? If it's a power constraint, then how
| would PBO affect it?
| menaerus wrote:
| I don't understand why 2x FMAs in a CPU design pose such a
| challenge when GPUs literally have hundreds of such ALUs. Both
| operate at similar TDP, so where's the catch? Much lower GPU
| clock frequency?
| eqvinox wrote:
| It's not 2 FMAs, it's AVX-512 (and going with 32-bit words) =
| 2*512/32 = 32 FMAs per core, 256 on an 8-core CPU. The unit
| counts for GPUs - depending on which number you look at - count
| these separately.
|
| CPUs also have much more complicated program flow control,
| versatility, and AFAIK latency (= flow control cost) of
| individual instructions. GPUs are optimized for raw calculation
| throughput meanwhile.
|
| Also note that modern GPUs and CPUs don't have a clear pricing
| relationship anymore, e.g. a desktop CPU is much cheaper than a
| high-end GPU, and large server CPUs are more expensive than
| either.
| menaerus wrote:
| 1x 512-bit FMA or 2x 256-bit FMAs or 4x 128-bit FMAs is
| irrelevant here - it's still a single physical unit in a CPU
| that consumes 512 bits of data bandwidth. The question is why
| the CPU budget allows for 2x 512-bit or 4x 256-bit while the
| H100, for example, has 14592 FP32 CUDA cores - in AVX
| terminology that would translate, if I am not mistaken, to
| 912x 512-bit or 1824x 256-bit FMAs per clock cycle. Even
| considering the obvious differences between GPUs and CPUs,
| this is still a large difference. Since GPU cores operate at
| much lower frequencies than CPU cores, that is what made me
| believe this is where the biggest difference comes from.
| eqvinox wrote:
| AIUI an FP32 core is only 32 bits wide, but this is outside
| my area of expertise really. Also note that CPUs also have
| additional ALUs that can't do FMAs, FMA is just the most
| capable one.
|
| You're also repeating 2x512 / 4x256 -- that's per core, you
| need to multiply by CPU core count.
|
| [also, note e.g. an 8-core CPU is much cheaper than an H100
| card ;) -- if anything you'd be comparing the highest end
| server CPUs here. A 192-core Zen5c is 8.2~10.5kEUR open
| retail, an H100 is 32~35kEUR...]
|
| [reading through some random docs, a CPU core seems vaguely
| comparable to a SM; a SM might have 128 or 64 lanes (=FP32
| cores) while a CPU only has 16 with AVX-512, but there is
| indeed also a notable clock difference and far more
| flexibility otherwise in the CPU core (which consumes
| silicon area)]
| adrian_b wrote:
| Like another poster already said, the power budget of a
| consumer CPU like the 9950X, running programs at roughly
| double the clock frequency of a GPU, allows for 16 cores x 2
| execution units x 16 lanes = 512 FP32 FMA per clock cycle.
| That provides the same throughput as a 1024 FP32 FMA per
| clock cycle iGPU from the best laptop CPUs, while consuming 3
| times less power than a datacenter GPU, so its power budget
| and performance are like those of a datacenter GPU with 3072
| FP32 FMA per clock cycle.
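|
| To put a number on that (my arithmetic, assuming an all-core
| clock around 5 GHz): 512 FMA/clock x 5 GHz ~= 2.5 x 10^12
| FMA/s, or about 5 TFLOPS FP32 if an FMA is counted as two
| operations.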
|
| However, because of its high clock frequency a consumer CPU
| has high performance per dollar, but low performance per
| watt.
|
| Server CPUs with many cores have much better energy
| efficiency, e.g. around 3 times higher than a desktop CPU
| and about the same as the most efficient laptop CPUs. For
| many generations of NVIDIA GPUs and Intel Xeon CPUs, until
| about 5-6 years ago, the ratio between their floating-point
| FMA throughput per watt was only about 3.
|
| This factor of 3 is mainly due to the overhead of various
| tricks used by CPUs to extract instruction-level
| parallelism from programs that do not use enough concurrent
| threads or array operations, e.g. superscalar out-of-order
| execution, register renaming, etc.
|
| In recent years, starting with NVIDIA Volta, followed later
| by AMD and Intel GPUs, the GPUs have made a jump in
| performance that has increased the gap between their
| throughput and that of CPUs, by supplementing the vector
| instructions with matrix instructions, i.e. what NVIDIA
| calls tensor instructions.
|
| However this current greater gap in performance between
| CPUs and GPUs could easily be removed and the performance
| per watt ratio could be brought back to a factor unlikely
| to be greater than 3, by adding matrix instructions to the
| CPUs.
|
| Intel has introduced the AMX instruction set, besides AVX,
| but for now it is supported only in expensive server CPUs
| and Intel has defined only instructions for low-precision
| operations used for AI/ML. If AMX were extended with FP32
| and FP64 operations, then the performance would be much
| more competitive with GPUs.
|
| ARM is more advanced in this direction, with SME (Scalable
| Matrix Extension) defined besides SVE (Scalable Vector
| Extension). SME is already available in recent Apple CPUs
| and it is expected to be also available in the new Arm
| cores that will be announced in a few months from now, which
| should become available in the smartphones of 2026, and
| presumably also in future Arm-based CPUs for servers and
| laptops.
|
| The current Apple CPUs do not have strong SME accelerators,
| because they also have an iGPU that can perform the
| operations whose latency is less important.
|
| On the other hand, an Arm-based server CPU could have a
| much bigger SME accelerator, providing a performance much
| closer to a GPU.
| menaerus wrote:
| I appreciate the response with a lot of interesting
| details; however, I don't believe it answers the question
| I had. My question was why the CPU design suffers from
| clock frequency issues in AVX-512 workloads whereas GPUs,
| which have much more compute power, do not.
|
| I assumed that it was due to the fact that GPUs run at
| much lower clock frequencies and therefore have more
| power budget available, but as I also discussed with
| another commenter above, this was probably a premature
| conclusion, since we don't have enough evidence showing
| that GPUs indeed do not suffer from the same type of
| issues. They likely do, but nobody has measured it yet?
| adrian_b wrote:
| The low clock frequency when executing AVX-512 workloads
| is a frequency where the CPU operates efficiently, with a
| low energy consumption per operation executed.
|
| For such a workload that executes a very large number of
| operations per second, the CPU cannot afford to operate
| inefficiently because it will overheat.
|
| When a CPU core has many execution units that are idle,
| so they do not consume power, like when executing only
| scalar operations or only operations with narrow 128-bit
| vectors, it can afford to raise the clock frequency e.g.
| by 50%, even if that would increase the energy
| consumption per operation e.g. 3 times. By executing 4 or
| 8 times fewer operations per clock cycle, even if the
| energy consumption per operation is 3 times higher, the
| total power consumption is smaller, so the CPU does not
| overheat. The desktop owner does not care that completing
| the same workload requires much more energy, because the
| owner likely cares more about the time to completion.
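|
| A rough way to see it, using first-order CMOS scaling (my
| numbers, not from the article): dynamic power is about C x
| V^2 x f, and the required voltage rises roughly with
| frequency, so energy per operation grows roughly with f^2
| and total power roughly with f^3. A 50% higher clock can
| therefore cost around 2-3x the energy per operation, which
| is only affordable when most of the wide execution units
| sit idle.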
|
| The clock frequency of a GPU also varies continuously
| depending on the workload, in order to maintain the power
| consumption within the limits. However a GPU is not
| designed to be able to increase the clock frequency as
| much as a CPU. The fastest GPUs have clock frequencies
| under 3 GHz, while the fastest CPUs exceed 6 GHz.
|
| The reason is that normally one never launches a GPU
| program that would use only a small fraction of the
| resources of a GPU allowing a higher clock frequency, so
| it makes no sense to design a GPU for this use case.
|
| Designing a chip for a higher clock frequency greatly
| increases the size of the chip, as shown by the
| comparison between a normal Zen core designed for 5.7 GHz
| and a Zen compact core, designed e.g. for 3.3 GHz, a
| frequency not much higher than that of a GPU.
|
| On Zen compact cores and on normal Zen cores configured
| for server CPUs with a large number of cores, e.g. 128
| cores (with a total of 4096 FP32 ALUs, like a low-to-mid-
| range desktop GPU, or like a top desktop GPU of 5 years
| ago; a Zen compact server CPU can have 6144 FP32 ALUs,
| more than a RTX 4070), the clock frequency variation
| range is small, very similar to the clock variation range
| of a GPU.
|
| In conclusion, it is not the desktop/laptop CPUs which
| drop their clock frequency, but it is the GPUs which
| never raise their clock frequency much, the same as the
| server CPUs, because neither GPUs nor server CPUs are
| normally running programs that keep most of their
| execution units idle, to allow higher clock frequencies
| without overheating.
| atq2119 wrote:
| It's not the execution of FMAs that's the challenge, it's the
| ramp up / down.
|
| And I assure you GPUs do have challenges with that as well.
| That's just less well known because (1) in GPUs, _all_
| workloads are vector workloads, and so there was never a stark
| contrast between scalar and vector regimes like in Intel's
| AVX-512 implementation and (2) GPU performance characteristics
| are in general less broadly known.
| menaerus wrote:
| Yes, I agree that it was premature to say that GPUs aren't
| suffering from the same symptoms. There's just not enough
| evidence but the differences in the compute power are still
| large.
| dzaima wrote:
| Zen 5 still clocks way higher than GPUs even with the
| penalties. Additionally, CPUs typically target much lower
| latency for operations even per-clock, which adds a ton of
| silicon cost for the same throughput, and especially so at high
| clock frequency.
|
| The difficulty with transitions that Skylake-X suffered
| especially from just has no equivalent on GPU; if you always
| stay in the transitioned-to-AVX512 state on Skylake-X, things
| are largely normal; GPUs just are always unconditionally in
| such a state, but that'd be awful on CPUs, as it'd make scalar-
| only code (not a thing on GPUs, but the main target for CPUs)
| unnecessarily slow. And so Intel decided that the transitions
| are worth the improved clocks for code not utilizing AVX-512.
| adgjlsfhk1 wrote:
| It's GPU frequency: 5.5 GHz is ~4x the heat and power of the
| 2.5 GHz that GPUs run at.
| sylware wrote:
| They should put forward the fact that 512 bits is the "sweet
| spot", as it is exactly one data cache line!
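|
| (512 bits = 64 bytes, which is exactly one x86 cache line, so
| an aligned 512-bit load or store touches exactly one line.)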
| ksec wrote:
| I wonder if this will be improved or fixed in Zen 6. Although
| personally I'd much rather they focus on IPC.
| Remnant44 wrote:
| Nothing to fix here. The behavior in the transition regimes is
| already quite good.
|
| The overall throttling is dynamic and reactive based on heat
| and power draw - this is unavoidable and in fact desirable (the
| alternative is to simply run slower all the time, not to
| somehow be immune to physics and run faster all the time)
| mgaunard wrote:
| In practice everyone turns off AVX512 because they're afraid of
| the frequency throttling.
|
| The damage was done by Skylake-X and won't be healed for years.
| yaro330 wrote:
| Everyone who? Surely anyone interested would do the research
| after buying a shiny new CPU.
___________________________________________________________________
(page generated 2025-03-01 23:01 UTC)