[HN Gopher] Zen4's AVX512 Teardown
___________________________________________________________________
Zen4's AVX512 Teardown
Author : dragontamer
Score : 326 points
Date : 2022-09-26 14:17 UTC (8 hours ago)
(HTM) web link (www.mersenneforum.org)
(TXT) w3m dump (www.mersenneforum.org)
| bufo wrote:
| The BF16 and VNNI instructions are finally going to make AMD
| competitive for neural network inference.
| smat wrote:
| Very interesting read. The author notes that double pumping the
| 512 bit instructions to 256 bit execution units appears to be a
| good trade-off.
|
| As far as I understood, ARM's new SIMD instruction set is able
| to map to execution units of arbitrary width. So it sounds to me
| like ARM is ahead of x86 in flexibility here and might be able
| to benefit from that in the future.
|
| Maybe somebody with more in-depth knowledge could respond whether
| my understanding is correct.
| adrian_b wrote:
| With any traditional ISA with wide registers and instructions,
| a.k.a. SIMD instructions, it is possible to implement the
| execution units with any width desired, regardless which is the
| architectural register and instruction width.
|
| Obviously, it only makes sense for the width of the execution
| units to be a divisor of the architectural width, otherwise
| they would not be used efficiently.
|
| Thus it is possible to choose various compromises between the
| cost and the performance of the execution units.
|
| However, if the ISA specifies e.g. 32 512-bit registers, then
| even the cheapest implementation must include at least that
| amount of physical registers, even if the execution units may
| be much narrower.
|
| What is new in the ARM SVE/SVE2 and which gives the name
| "Scalable" to that vector extension, is that here the register
| width is not fixed by the ISA, but it may be different between
| implementations.
|
| Thus a cheap smartphone CPU may have 128-bit registers, while
| an expensive server CPU for scientific computation applications
| might have 1024-bit registers.
|
| With SVE/SVE2, it is possible to write a program without
| knowing which will be the width of the registers on the target
| CPU.
|
| Nevertheless, the scalability feature is not perfect: some
| programs can still be made faster by assuming a specific
| register width before compilation, which may then make them run
| slower than possible on a CPU that in fact has wider registers
| than assumed.
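To make the width-agnostic idea concrete, here is a scalar C sketch of the loop shape SVE encourages. `vl_doubles` is a hypothetical stand-in for the hardware's runtime vector-length query (svcntd in real SVE), not an actual API:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for SVE's runtime vector-length query: a real CPU would
 * report its hardware width here (this value is made up). */
size_t vl_doubles(void) { return 4; /* e.g. a 256-bit implementation */ }

/* Vector-length-agnostic loop shape: the same source code works
 * whether the hardware processes 2, 4, or 16 doubles per "vector". */
void scale(double *x, size_t n, double a) {
    size_t vl = vl_doubles();
    for (size_t i = 0; i < n; i += vl) {
        /* Predication model: only lanes with i+j < n are active,
         * which is what SVE's whilelt predicate expresses. */
        for (size_t j = 0; j < vl && i + j < n; j++)
            x[i + j] *= a;
    }
}
```

The point is that the element count per iteration is discovered at run time, so the binary need not be recompiled for a wider implementation.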
| bee_rider wrote:
| ARM's SVE is definitely interesting, but I do wonder if it is
| slowly homing in on CRAY-style vector processing. Which is
| definitely a cool idea, but a little different from the now-
| popular fixed-width SIMD. I don't know that it makes sense to
| call one ahead of the other yet -- ARM's documentation is clear
| that SVE2 doesn't replace NEON. "Mostly scalar but let's
| sprinkle in some SIMD" coding will probably always be with us
| (until ML somehow turns all programs into dot products I
| guess!)
|
| RISC-V also has a variable length vector extension.
| brigade wrote:
| There's not really any reason for modern general-purpose CPUs
| to specialize for IPC lower than 1 like what Cray did. CPUs
| need wide frontends to execute existing scalar code as fast
| as we're used to, and if you're not reusing most of that
| width for vectors then the design is just wasting power.
| dis-sys wrote:
| AMD EPYC Genoa will be a killing machine, with almost 1TBytes/sec
| memory bandwidth and this avx512 extension...
|
| good luck for intel's xeon.
| xani_ wrote:
| Genuinely good luck, it's never good when there is no
| competition
| sliken wrote:
| I'm hearing 12 channels @ 5200 MT/sec or so. Sounds like
| 500GB/sec, not 1TB/sec. Oh maybe you meant in a dual socket
| config?
| sekh60 wrote:
| I've updated most of my home lab to AMD EPYC Rome processors.
| Really can't beat the core counts for private cloud and the
| price is amazing compared to Intel. Looking forward to Genoa
| myself, though moving past Rome will be a ways away for my lab.
| xani_ wrote:
| Sounds like a hell of a lab! What are you doing on it, machine
| learning?
| causi wrote:
| Exciting stuff. AVX512 isn't just for specialized work projects.
| It's also a huge performance boost for game console emulation.
| mmastrac wrote:
| When I was doing some work on Dolphin's JIT, AVX
| implementations were always back of mind. It's a massive
| tradeoff in so many cases but having access to these is
| amazing.
| stagger87 wrote:
| That sounds like a specialized project :)
| marginalia_nu wrote:
| Any project becomes specialized if you work long enough on
| it.
| dtech wrote:
| Interesting! Any reason why they're specifically good for
| emulation?
| xani_ wrote:
| https://whatcookie.github.io/posts/why-is-avx-512-useful-
| for...
| causi wrote:
| I don't have a deep understanding of the implementation, but
| it gets you a 30% performance boost in Playstation 3
| emulation.
|
| https://www.tomshardware.com/news/ps3-emulator-
| avx-512-30-pe...
| mastax wrote:
| > And it is basically impossible to hit the 230W power limit at
| stock without direct die or sub-ambient.
|
| Almost, but not quite. In GamersNexus' review they recorded
| 250.8W measured at the EPS12V cables, while using an Arctic
| Cooling Liquid Freezer II 360mm AIO with the fans at 100%. At
| 230W/1.5V=153A a good VRM will generate about 17W of heat. That
| leaves you a few watts for board power plane and socket resistive
| losses (I don't have an estimate for that).
|
| Not a very practical cooling solution for a day-to-day
| workstation, but I do wonder if you could reduce the fan speeds a
| bit while still maxing out the power limit.
| philjohn wrote:
| Then again, I have 2 420mm Black Ice Nemesis radiators in my
| custom loop - even at relatively low speeds it can keep the
| 5800X in there and 3080 Ti cool under constant high loads.
| pclmulqdq wrote:
| My mini-ITX work desktop has no problems with a 5900x and a
| Radeon VII pro running Rocm work, using only a tiny heatsink
| on the 5900x (and some high-airflow fans, but nothing too
| incredibly loud). It doesn't thermal throttle, but tops out
| around 80-90 degrees C.
|
| The 7000-series seems to be a different story: you really
| need a big cooler for those chips.
| adrian_b wrote:
| I have the same CPU + GPU combination, but used on an ATX
| MB with a Noctua cooler with a double 120-mm fan.
|
| While the larger case and cooler makes the cooling easier,
| the fans are normally inaudible and the CPU stays under 45
| degrees Celsius when not doing heavy work, and the temperature
| may rise to a little over 60 degrees Celsius when 100% busy.
|
| From what I have seen until now, cooling will no longer be
| so easy for the 7000 series, unless you choose to run them
| in the Eco mode.
| snvzz wrote:
| Amusingly, you still seem to get about 70% of the performance
| when limiting the power to 65W.
|
| This means the default power limits are not reasonable, and
| only there to win the release day benchmarks.
| ignaloidas wrote:
| Worth noting that GN measured the power before the VRMs, while
| the limit is applied after the VRMs. Assuming a 90% efficiency,
| what GN measured would be 225.7W at the socket. Close, but
| still not quite.
| mastax wrote:
| I accounted for VRM efficiency losses in the second sentence,
| using data from a real X570 VRM.
| fefe23 wrote:
| I find the vpmullq part the most stunning.
|
| This instruction is used in some bignum code, for example if you
| are implementing RSA. Yet AMD implemented it three times faster
| than Intel.
|
| I'm also fascinated by AMD now making AVX512 worthwhile on
| consumer devices (whereas until quite recently it would
| artificially slow down the Intel CPUs that had it), which
| presumably will lead to widespread adoption where it matters.
| Intel's strategy of turning off AVX512 in its recent consumer
| devices, because their energy-efficient cores don't have it, may
| turn out to be a monumental mistake.
| ComputerGuru wrote:
| No one is going to be able to seriously use and support AVX512
| (or be sufficiently motivated to implement support for it in
| their libraries and especially applications) until Intel
| finally gets its act together with regards to AVX512 and
| decides it actually wants to commit to it being a thing.
|
| The AVX2 rollout was (comparatively) flawless. The gains AVX512
| brings over AVX2 are, for most people (specialty libraries
| excluded), not worth dealing with the terrible CPU support. And
| Intel just keeps making the situation worse, taking one step
| forward and two back.
| bayindirh wrote:
| The biggest problem is not support for the instruction set in
| the silicon, but the performance penalty it brings.
|
| SIMD hardware is the most power-hungry block on Intel CPUs, and
| the frequency penalty it brings is never completely disclosed in
| the tech docs. Sometimes Intel won't share that information even
| with serious customers.
|
| In HPC world, no instruction is too obscure or niche to use.
| However, when you use these instructions too frequently, the
| heat load it generates can slow you down instead of
| accelerating you over the course of your job, so AVX512 is a
| pretty mixed case in Intel CPUs.
|
| Regardless of this penalty, numeric code benefits from wider
| SIMD pipelines in most cases. At worst, you see no speedup,
| but you're investing for the future.
|
| On the other hand, we have seen applications which run faster
| on previous generation hardware due to over-optimization.
| coder543 wrote:
| > The biggest problem is not support for the instruction
| set in the silicon, but the performance penalty it brings.
|
| Why is that sentence present tense instead of past tense?
| Why does your entire comment make no mention of these
| problems being specific to Intel? With the introduction of
| Zen 4, your entire comment appears to be based on outdated
| information. Zen 4 apparently implements AVX-512
| efficiently, without the problems Intel implementations
| experienced. That's what this whole discussion is about,
| and that's what Phoronix found as well.[0]
|
| [0]: https://www.phoronix.com/review/amd-zen4-avx512/6
| Tuna-Fish wrote:
| > However, when you use these instructions too frequently,
| the heat load it generates can slow you down
|
| It's not the heat load that slows you down. If you are
| using them enough that you produce enough heat that you
| have to downclock, it's still a win because the
| instructions improved your throughput more than what you
| lost in clocks.
|
| The problem with Intel's initial AVX-512 implementation was
| that they didn't clock down because of heat, they clocked
| down pre-emptively and substantially whenever the CPU
| executed even a single AVX-512 instruction, even if there
| was no added heat load, and stayed on the lower clocks for
| a long period. This worked fine for proper SIMD loads, but was
| crushing in any situation where there was just a handful of
| AVX-512 ops between long stretches of scalar code, such as when
| using an AVX-512 optimized version of some library function.
| bayindirh wrote:
| > [T]hey clocked down pre-emptively and substantially
| whenever the CPU executed even a single AVX-512
| instruction...
|
| Because you were hitting the power envelope limits in the
| CPU in these cases too. You might not see the heat, but
| the CPU cannot carry the power required to keep that core
| at non-AVX speeds with these power-hungry blocks operated
| at full speed.
|
| As I said, to add insult to injury, Intel didn't share the
| exact details of its AVX implementations and the frequency
| ranges it operates at, either.
|
| Ah, and publicly sharing your findings is/was forbidden too.
| adrian_b wrote:
| No, as the above poster said, Intel slows down the CPU before
| any actual increase in power consumption or temperature occurs,
| out of fear that its power-limit and temperature controller
| will not be able to react fast enough when the power increase
| eventually happens.
|
| Whatever control mechanism is used in the AMD Zen CPUs is
| better than Intel's: they downclock only when the power
| consumption really increases, and the clock frequency recovers
| when the power consumption decreases, so there is no penalty
| for using some 512-bit instructions sporadically, unlike on the
| Intel CPUs.
| jackmott42 wrote:
| Imagine next gen consoles, suppose they stick with AMD. Then
| every game studio and game engine studio is going to _love_
| flinging some AVX-512 around. Developers will get more
| experience with it, any game that runs on PC and Console is
| going to look slow on PC if you have intel cpus with bad
| support. More libraries and tools will get created that
| people will want to use.
|
| Adoption could accelerate quickly!
| kllrnohj wrote:
| Next-next gen consoles are probably still a good 5+ years
| away. AVX-512 for consumers will either have already become
| "a thing" or it'll be dead & buried by then.
| jackmott42 wrote:
| People said that about it 5 years ago too. Yet here we are.
| Nobody is going to just get rid of it; servers are already
| using it.
| pbsd wrote:
| vpmullq is not that useful; in bignum code you also want the
| upper part of the product, and there is no corresponding
| vpmulhq instruction to get that.
|
| On the other hand, vpmadd52luq and vpmadd52huq do give you
| access to the lower and upper parts of a 52x52->104 bit
| product, and those instructions perform well in the Intel
| chips, 3x faster than vpmullq.
| oxxoxoxooo wrote:
| > This instruction is used in some bignum code
|
| Could you be more specific? I think for that to work one would
| also need the upper half of 64x64 multiplication and `vpmullq`
| provides only the lower half. You could break one 64x64
| multiplication into four 32x32 multiplications (i.e. emulate
| the full 64x64 = 128 bits multiplication) but I was under the
| impression that this was slow.
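For reference, the 64x64 -> 128-bit decomposition into four 32x32 products mentioned above can be sketched in portable C. This is a scalar model of the arithmetic, not the vectorized bignum code itself:

```c
#include <assert.h>
#include <stdint.h>

/* Portable 64x64 -> 128-bit multiply built from four 32x32 -> 64-bit
 * products, plus the carry handling between the partial products. */
typedef struct { uint64_t lo, hi; } u128;

u128 mul64x64(uint64_t a, uint64_t b) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;  /* contributes to bits   0..63  */
    uint64_t p01 = a0 * b1;  /* contributes to bits  32..95  */
    uint64_t p10 = a1 * b0;  /* contributes to bits  32..95  */
    uint64_t p11 = a1 * b1;  /* contributes to bits  64..127 */

    /* Sum the middle 32-bit columns; this cannot overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    u128 r;
    r.lo = (mid << 32) | (uint32_t)p00;
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}
```

The four multiplies and the carry bookkeeping are why emulating a full 64x64 product this way is noticeably slower than a single widening multiply.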
| adrian_b wrote:
| I assume that as you say, whoever used this instruction was
| using it for multiplying 32-bit numbers.
|
| On AMD Zen 4 and Intel Cannon Lake or newer (when AVX-512 is
| supported), the fastest method to multiply big numbers is to
| use the IFMA instructions, which reuse the floating-point
| multipliers to generate 104-bit products of 52-bit numbers.
| boundchecked wrote:
| Not many people realize that recent glibc releases brought
| AVX-512 optimized str* and mem* functions to the ifunc dispatch
| table - your C code may have been using fancy mask registers on
| someone's Intel laptop!
| formerly_proven wrote:
| > For all practical purposes, under a suitable all-core load, the
| 7950X will be running at 95C Tj.Max all the time. If you throw a
| bigger cooler on it, it will boost to higher clock speeds to get
| right back to Tj.Max. Because of this, the performance of the
| chip is dependent on the cooling. And it is basically impossible
| to hit the 230W power limit at stock without direct die or sub-
| ambient.
|
| > If 95C sounds scary, wait to you see the voltages involved. AMD
| advertises 5.7 GHz. In reality, a slightly higher value of 5.75
| GHz seems to be the norm - often across half the cores
| simultaneously. So it's not just a single core load. The Fmax is
| 5.85 GHz, but I have never personally seen it go above 5.75.
|
| 5.75 GHz is reached with 1.5 V Vcore.
|
| The +50 MHz bump over advertised boost clocks was also present in
| Zen 3, likely in response to the poor reception of Zen 2
| behavior, which would usually fail to achieve the advertised
| clocks.
| loser777 wrote:
| I'm genuinely curious of the details of how the 1.5v vCore
| measurement was obtained. CPU-Z and software measurements in
| general don't have the greatest reputation of being accurate,
| especially with just-released generations of CPUs. Conventional
| wisdom has been with newer manufacturing processes, less voltage
| is required (and tolerated), and 1.5v vCore sounds truly insane
| in 2022 for a "4nm" chip. For reference, I haven't heard of 1.5v
| being a safe "24/7" voltage since the days of 90nm-130nm+ CPUs
| circa 2005-2006. IIRC casual overclockers in the forums weren't
| really comfortable with 1.5v even with 65nm Core 2, and this was
| back when it was common to e.g., safely overclock your 2.4 GHz
| Core 2 Quad to 3.4 GHz.
| xani_ wrote:
| Probably obtained from the same registers as on previous
| generations.
|
| Would be simple to confirm by probing the CPU power rails with
| a scope.
| magila wrote:
| The problem is the CPU itself isn't the one measuring
| voltage, it gets that information from the motherboard's VRM
| controller. The accuracy of the reported value can vary
| depending on the controller, how it's configured by the
| motherboard's firmware, and the physical circuit design.
|
| That being said, with new motherboards generally using fully
| digital VRM controllers the reported value should be pretty
| close in most cases.
| dragontamer wrote:
| Excellent Teardown by "Mysticial" from mersenneforum.org.
|
| Cliffnotes:
|
| * Zen4 AVX512 is mostly double-pumped: native 256-bit hardware
| processes the two halves of a 512-bit register in turn.
|
| * No throttling observed
|
| * 512-bit shuffle pipeline (!!). A powerful exception to the
| "double-pumping" found in most other AVX512 instructions.
|
| * AMD seemingly handles the AVX512 mask registers better than
| Intel.
|
| * Gather/Scatter slow on AMD's Zen4 implementation.
|
| * Intel's 512-bit native load/store unit has clear advantages
| over AMD's 256-bit load-store unit when reading/writing to L1
| cache and beyond.
| celrod wrote:
| Looks like SIMD implementations that use LUTs should favor
| small tables that fit in registers, doing lookups with
| `vpermi2pd` rather than larger tables + gather.
|
| With 64 bits, you still get a LUT size of 16 (shuffle indexes
| into two 8xdouble vectors), which can be good enough for
| functions like log and exp.
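A scalar model of the small-table idea: an in-register permute selects, per lane, an entry from a 16-double table held in two registers. The function below is illustrative only, modeling the semantics rather than using intrinsics:

```c
#include <assert.h>

/* Scalar model of a 16-entry in-register LUT lookup: the permute picks,
 * per output lane, table[idx[i]] from two 8-double registers that
 * together hold `table`. No memory gather is involved. */
void lut_lookup(const double table[16], const int idx[8], double out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = table[idx[i] & 15]; /* index wraps, modeling the hardware
                                        using only the low index bits */
}
```

With 4-bit indices extracted from each input (e.g. exponent or mantissa bits for log/exp), one permute replaces eight separate table loads.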
| daniel-cussen wrote:
| Shuffle is SIMD's killer app. It's apparently an interesting
| but expensive circuit, but it's smart to prioritize it.
| Absolute best instruction, hands down. So double-pumping isn't
| full speed (meaning single-cycle), but it increases
| compatibility with AVX512 code. I guess a program that changes
| its execution path based on CPUID at runtime might not benefit,
| and of course there's all kinds of...but for pedestrian
| purposes, meaning everything on github, it's a step. Hey, 40%
| speedup on Cinebench, that's good.
| dragontamer wrote:
| > Shuffle is the SIMD's killer app
|
| A shame that AVX512 only has pshufb (aka permute) and is
| missing the GPU instruction "bpermute", aka backwards permute.
|
| pshufb is effectively a "gather" instruction over an AVX
| register, equivalent to GPU permutes.
|
| bpermute, in GPU land, is a "scatter" instruction over a vector
| register. There's no CPU / AVX equivalent of it. But I keep
| coming up with good uses of the bpermute instruction (much like
| pshufb is crazy flexible, its inverse, the backwards permute,
| is also crazy flexible).
|
| --------
|
| Almost any code that finds itself "gathering" data across a
| vector register will inevitably "scatter" the data back at some
| point.
|
| Much like how "pext" is the "gather" instruction for 64 bits,
| you need pdep to handle the equal-and-opposite case. It's
| incredibly silly that AVX / AVX512 has implemented only one
| half of this concept (gather / pshufb / aka permute).
|
| I wish for the day that Intel/AMD implements (scatter /
| backwards-pshufb / aka Backwards-Permute).
|
| -------
|
| Fortunately, I got Vega64 and NVidia Graphics Cards with both
| permute and bpermute instructions for high-speed shuffling of
| data. But CPU-space should benefit from this concept too.
| daniel-cussen wrote:
| OK that's cool, didn't know about bpermute. Made sense
| there should be a counterpart. Well when you only have
| pshufb, it works OK, yeah there's tons of gaps but if
| you're clever and...and if you compromise speed...thanks
| for telling me about bpermute!
| giyanani wrote:
| Why do you say shuffle is "SIMD's killer app"? I've only
| dabbled in vector instructions from a learning perspective,
| and seen others mention it's important too, but have yet to
| understand why.
| demindiro wrote:
| I use PSHUFB to convert 24-bit RGB to 32-bit RGBX or BGRX.
| Without a shuffle instruction it'd be quite a bit harder.
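A scalar model of that conversion: a PSHUFB-style control mask names, for each output byte, which source byte to copy, with a negative entry producing zero (the real instruction zeroes a lane when the index byte's high bit is set):

```c
#include <assert.h>

/* Expand 24-bit RGB pixels to 32-bit RGBX, four pixels (12 -> 16 bytes)
 * per "vector", using pshufb semantics on a byte-shuffle mask.
 * npixels is assumed to be a multiple of 4 for simplicity. */
void rgb24_to_rgbx32(const unsigned char *src, unsigned char *dst,
                     int npixels) {
    static const signed char mask[16] = {
        0, 1, 2, -1,  3, 4, 5, -1,  6, 7, 8, -1,  9, 10, 11, -1
    };
    for (int p = 0; p + 4 <= npixels; p += 4) {
        const unsigned char *in = src + p * 3;  /* 12 source bytes  */
        unsigned char *out = dst + p * 4;       /* 16 output bytes  */
        for (int i = 0; i < 16; i++)
            out[i] = mask[i] < 0 ? 0 : in[mask[i]]; /* pshufb lane rule */
    }
}
```

The inner 16-iteration loop is what a single PSHUFB does in one instruction per 16-byte block.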
| MrBuddyCasino wrote:
| This Rust issue [0] was the best short summary I could find of
| what a SIMD shuffle is:
|
| "A 'shuffle', in SIMD terms, takes a SIMD vector (or possibly
| two vectors) and a pattern of source lane indexes (usually as
| an immediate), and then produces a new SIMD vector where the
| output is the source lane values in the pattern given."
|
| [0] https://github.com/rust-lang/portable-simd/issues/11
| magicalhippo wrote:
| It's basically several moves for the price of one. Given
| that you operate on multiple values at once, being able to
| shuffle or duplicate values comes up all the time.
|
| For example if you're filtering four image lines at a time
| using a 1D filter kernel, you'll want to replicate the
| filter coefficient to each SIMD element, so that you can
| multiply each of the four pixel values with the same
| coefficient. Shuffle lets you replicate a single
| coefficient value into all the elements of a register in
| one instruction.
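A scalar sketch of that splat-then-multiply pattern (illustrative only; the two loops model a broadcast followed by a fused multiply-add across four lanes):

```c
#include <assert.h>

#define LANES 4

/* One tap of a 1D filter applied to four image rows at once: replicate
 * the coefficient into every lane (the broadcast/splat), then
 * multiply-accumulate each row sample against the same tap. */
void fir_tap(const float rows[LANES], float coeff, float acc[LANES]) {
    float splat[LANES];
    for (int i = 0; i < LANES; i++)
        splat[i] = coeff;                 /* models the broadcast */
    for (int i = 0; i < LANES; i++)
        acc[i] += rows[i] * splat[i];     /* models the vector FMA */
}
```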
| daniel-cussen wrote:
| Which is the point of SIMD. Several moves for the price
| of one.
| zX41ZdbW wrote:
| Here is an overview of the usage of the shuffle instruction
| to speed up decompression in ClickHouse:
| https://habr.com/ru/company/yandex/blog/457612/
| Veliladon wrote:
| Because you can do things using bitmasks and single
| instructions instead of brute forcing using multiple
| instructions.
|
| Let's say you have a whole heap of 8-bit numbers you want
| to multiply by 2 and you have a set of 256-bit registers
| and a nice SIMD multiply command. If you don't have a
| shuffle you need to assemble your series of 2s for the
| second operand for each lane before you can even start.
| This is going to take hundreds of instructions and hundreds
| of clocks. Shuffle means you load up lane 0 with the "2"
| and then splat the contents of lane 0 across the other 31
| lanes in two instructions and a few clocks using the
| shuffle unit.
|
| N.B. Shuffle isn't just about splatting. There's a whole
| heap of different operations it can do that are useful. I
| just picked a simple example with an obvious massive
| performance increase for illustrative purposes.
| Dylan16807 wrote:
| I think that example is too simple to show the benefit of
| shuffle. It's like explaining the benefit of an adder by
| showing how you can move a value with X = Y + 0.
| Especially since there's also a (much simpler) piece of
| hardware dedicated to ultra-fast splat/broadcast (under
| the right conditions).
| stabbles wrote:
| You're not talking about shuffle, you're talking about
| broadcast. A shuffle instruction is one where you take one or
| two vectors and output a third with elements from any index of
| the input. So for example `out = [in[2], in[1]]` is a shuffle
| of a vector of length 2.
|
| It's useful, for example, if you have RGB color data stored
| contiguously in memory as RGBRGBRGBRGB..., and you want to
| vectorize operations on R, G and B separately. You can load a
| few registers like [RGBR][GBRG][BRGB], and then shuffle them to
| [RRRR][GGGG][BBBB]. In fact it's not entirely trivial how to
| shuffle optimally; it takes a few shuffles to get there.
|
| More generally, if you have an array of structs, you
| often need to go to struct of arrays to do vectorized
| operations on the array, before returning to an array of
| struct again.
|
| Another example is fast matrix transpose (in fact you can think
| of the RGB example as a 3 by N matrix transposed to N by 3,
| where N is the vector width -- AoS -> SoA is a transpose too,
| in a sense). For a matrix of size N by N, where N is the vector
| width, you need N lg N shuffles to transpose it.
| janwas wrote:
| Indeed a great article, well worth reading in full for anyone
| who uses AVX-512.
|
| Two other things that jumped out at me: VPCONFLICT is 10x as
| fast, compressstoreu is >10x slower. Those might be enough to
| warrant a Zen4-specific codepath in Highway.
| celrod wrote:
| The Intel optimization manual has a fun example where they
| use vpconflict for vectorizing sparse dot products:
| https://github.com/intel/optimization-
| manual/blob/main/chap1...
|
| I benchmarked it on Intel, and it was indeed quite fast/a
| good improvement over the scalar version. Will be interesting
| to try that on AMD.
| sitkack wrote:
| I think it is important to note that while double-pumped, using
| 512-bit registers puts lower pressure on decode and enables the
| pipelines to fill. So use 512-bit if you can.
| celrod wrote:
| Yeah, the claim was that this is why it hits higher clock
| speeds. The front end will be hard pressed to hit/maintain 4
| IPC, while 2 IPC is much easier.
| adrian_b wrote:
| It should also be noted that believing that Zen 4 is "double-
| pumped" and the Intel CPUs are not "double-pumped" is
| completely misleading.
|
| On most Intel CPUs with AVX-512 support, there are 2 classes
| of 512-bit instructions: instructions executed by combining a
| pair of 256-bit units, thus having an equal throughput for
| 512-bit instructions and 256-bit instructions, and the second
| class of instructions, which are executed by combining a pair
| of 256-bit execution units and also by extending to 512 bits
| another 256-bit execution unit.
|
| For the second class of instructions the Intel CPUs have a
| throughput of two 512-bit instructions per cycle vs. three
| 256-bit instructions per cycle.
|
| Compared to the cheaper models of Intel CPUs, Zen 4 - which
| keeps the same total throughput as Zen 3, i.e. two 512-bit
| instructions per cycle where Zen 3 did four 256-bit
| instructions per cycle - either matches or exceeds the
| throughput of the Intel CPUs with AVX-512. Moreover, Zen 4
| allows 1 FMA + 1 FADD per cycle, while on those Intel CPUs
| only 1 FMA per cycle can be executed.
|
| The only important advantage of Intel appears in the most
| expensive models of the server and workstation CPUs, i.e. in
| most Xeon Gold, all Xeon Platinum and all of the Xeon W
| models that have AVX-512 support.
|
| In these more expensive models, there is a second 512-bit FMA
| unit, which enables a double FMA throughput compared to Zen
| 4. These models with double FMA throughput are also helped by
| a double throughput for the loads from the L1 cache, which is
| matched to the FMA throughput.
|
| So the AVX-512 implementation in Zen 4 is superior to that in
| the cheaper CPUs like Tiger Lake, even without taking into
| account the few new execution units added in Zen 4, like the
| 512-bit shuffle unit.
|
| Only the Xeon Platinum and the like of the future Sapphire
| Rapids will have a definitely greater throughput for the
| floating-point operations than Zen 4, but they will also have
| a significantly lower all-core clock frequency (due to the
| inferior manufacturing process), so the higher throughput per
| clock cycle is not certain to overcome the deficit in clock
| frequency.
| pella wrote:
| phoronix:AMD Zen 4 AVX-512 Performance Analysis On The Ryzen 9
| 7950X
|
| https://www.phoronix.com/review/amd-zen4-avx512
|
| _" On average for the tested AVX-512 workloads, making use of
| the AVX-512 instructions led to around 59% higher performance
| compared to when artificially limiting the Ryzen 9 7950X to AVX2
| / no-AVX512.
|
| From these results I am rather impressed by the AVX-512
| performance out of the AMD Ryzen 9 7950X. While initially being
| disappointed when hearing of their "double pumping" approach
| rather than going for a 512-bit data path, these benchmark
| results speak for themselves. For software that can effectively
| make use of AVX-512 (and compiled so), there is significant
| performance uplift to enjoy while no negative impact in terms of
| reduced CPU clock speeds / higher power consumption (with oneDNN
| being one of the only exceptions seen so far in terms of higher
| power draw).
|
| AVX-512 is looking good on the Ryzen 7000 series and I'll
| continue running more benchmarks over the weeks ahead. These
| AVX-512 results make me all the more excited for AMD EPYC "Genoa"
| where AVX-512 can be a lot more widely-used among HPC/server
| workloads. "_
| phire wrote:
| I wonder how much of that 59% gain comes from the 512bit
| registers/instructions themselves, and how much comes from the
| new instructions and modes that come with AVX-512, and can
| still be used with the narrower 256bit and 128bit registers.
|
| Would be interesting to modify some of the benchmarks to be
| limited to 256bit AVX-512 and see how they compare.
| TinkersW wrote:
| Mysticial's report indicates much of it does come from the
| wider instructions, because they saturate the core more
| easily. Zen 3 was front-end bottlenecked, so Zen 4 running
| AVX512 can more often hit 4x256. The new instructions are
| useful and some help perf, but mostly only for pretty
| specialized stuff. Masking is nice, but I think people really
| exaggerate the improvement from it; vblend was only 2 cycles.
| paulmd wrote:
| Haha, as someone who has been shouting "no, really, AVX-512 is
| good, even if it's double-pumped, just wait for it guys" into the
| void for years now, glad to see it finally hit the desktop for
| real and that the AVX people are already leaning into it.
|
| Years and years of "nobody needs AVX-512" and "linus says it's
| just for benchmarks, he worked at transmeta two decades ago, he
| knows better than Lisa Su" hot takes down the tubes ;)
___________________________________________________________________
(page generated 2022-09-26 23:00 UTC)