[HN Gopher] Zen5's AVX512 Teardown and More
___________________________________________________________________
Zen5's AVX512 Teardown and More
Author : todsacerdoti
Score : 101 points
Date : 2024-08-07 15:28 UTC (7 hours ago)
(HTM) web link (www.numberworld.org)
(TXT) w3m dump (www.numberworld.org)
| PaulHoule wrote:
| Intel's handling of SIMD is representative of Intel's value-
| subtracting principles that have caused Intel to stagnate in the
| past 15 years or so.
|
| It is an intrinsic problem with SIMD as we know it that you have
| to recode your application to support new instructions, which is
| a big hassle. Most people and companies will give up on
| supporting the latest and greatest and will ship binaries that
| meet the lowest common denominator. For instance it took forever
| for Microsoft to rely on instructions that were available almost
| 15 years ago.
|
| As Charlie Demerjian has pointed out for years consumers are
| waking up to the fact that a 2024 craptop isn't much better than
| a 2014 crapbook and there is zero credibility in claims about
| "Ultrabooks", "AI PCs", etc. What could make a difference is a
| coordinated effort end-to-end to widely deploy the latest
| developments as quickly as possible across as much of the product
| line as possible, to get tooling support for them, and drive
| developers to adopt them as quickly as possible. As it is Intel
| will boast about how the national labs are blowing up H-bombs in
| VR faster than they ever had, and Facebook is profiling users
| more efficiently than before and not realize that customers don't
| believe these advances are going to make a difference for them so
| instead of buying a new PC which might deliver better performance
| when software (maybe) catches up in 7-8 years they are going to
| hold on to old machines longer.
| happycube wrote:
| I'd say 2018 ~= 2024 instead of 2014 (those craptops all too
| often had dual-core i5's and i7's, 15" TN 768 screens, and
| HDDs), but yeah things have slowed down a bit.
| jsheard wrote:
| Worse still is Intel's rollout of AVX512 specifically, which
| started nearly a decade ago but to this day it's still not
| available across their whole product stack, so the countdown to
| it becoming ubiquitous _hasn't even started yet._ They painted
| themselves into a corner by making 512bit vectors a mandatory
| feature, which they then decided isn't feasible to support in
| their small E-cores, so now they're walking it all back with a
| new "AVX10" spec which is just a redux of AVX512 except 512bit
| vectors are optional this time.
|
| Then we'll have to wait _another_ decade or so for AVX10 to
| become baseline, so AVX2 will probably be old enough to drink
| (in the US) before it's fully phased out.
| Vecr wrote:
| I don't think you can phase out AVX2, it's the base of
| AVX512, because you can't always go 512 wide, and you'd have
| no backwards compatibility.
| jsheard wrote:
| I know AVX2 will continue to exist in hardware forever for
| backwards compatibility, by "fully phased out" I mean the
| eventual point when software no longer has to maintain a
| dedicated path for hardware which supports AVX2 but doesn't
| support AVX10, because all relevant hardware supports
| AVX10.
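The "dedicated path" burden described above can be sketched roughly as follows. This is a minimal illustration, not how any particular library does it: the kernel functions are hypothetical placeholders that all compute the same sum, and the flag strings simply mirror the names Linux prints in /proc/cpuinfo. Real code would query CPUID and call separately compiled kernels.

```python
# Hypothetical dispatch table: every row is one more code path that has to
# be written, tested and shipped until the older hardware stops mattering.

def sum_scalar(values):
    # Baseline path that runs on any CPU.
    return sum(values)

def sum_avx2(values):
    # Placeholder for a kernel compiled for AVX2 (same result, other path).
    return sum(values)

def sum_avx512(values):
    # Placeholder for a kernel compiled for AVX-512.
    return sum(values)

# Ordered from most to least capable; None marks the unconditional baseline.
DISPATCH_TABLE = [
    ("avx512f", sum_avx512),
    ("avx2", sum_avx2),
    (None, sum_scalar),
]

def select_kernel(cpu_flags):
    """Return the first kernel whose required flag the CPU reports."""
    for required_flag, kernel in DISPATCH_TABLE:
        if required_flag is None or required_flag in cpu_flags:
            return kernel
    raise RuntimeError("dispatch table has no baseline entry")
```

"Fully phased out" in this picture means the AVX2 row can be deleted without anyone noticing.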
| the8472 wrote:
| EVEX prefix can address XMM/YMM/ZMM registers. So you can
| apply the AVX512 instruction set to 128bit and 256bit
| registers too.
| xeeeeeeeeeeenu wrote:
| While it does seem that AVX10 was mainly designed for
| consumer CPUs so they could use modern vector instructions
| without 512-bit vectors, the upcoming Arrow Lake will _not_
| have it.[1]
|
| I guess we will have to wait for at least one more
| generation.
|
| [1] - According to Intel® Architecture Instruction Set
| Extensions Programming Reference:
| https://cdrdv2-public.intel.com/826290/architecture-
| instruct...
| jsheard wrote:
| Intel is making Xeons out of E-cores (up to 288 of them on
| one chip) so I assume those will also be motivating the
| rollout of AVX10, not just their consumer parts.
| xeeeeeeeeeeenu wrote:
| Well, they don't support it either. According to the
| document I linked, neither the just-released Sierra
| Forest, nor the planned Clearwater Forest support AVX10.
| kllrnohj wrote:
| But surely they could just double pump like AMD does on
| Zen 4(c) and also on (some?) Zen 5c.
|
| It's weird to see an Intel so... Broke? That they are
| seemingly forced to recycle old architectures endlessly
| jsheard wrote:
| I think Intel's E-cores are quite a bit smaller than the
| Zen 4c/5c cores, maybe at that scale it's prohibitive to
| even double up the register file? That's required even if
| the logic is double-pumped. AIUI the small Zen cores are
| mostly the same design as the big ones, just with less
| cache, silicon layout retuned for density rather than
| speed, and the removal of the 3D Cache stacking vias,
| while Intel's small cores are clean-sheet designs with
| next to nothing in common with their big cores so they
| have the opportunity to shrink them a lot more.
| suresk wrote:
| My non-expert brain immediately jumped to double-pumping
| + maybe working with their thread director to have tasks
| using a lot of AVX512 instructions prefer P cores more.
| It feels like such an obvious solution to a really dumb
| problem that I assumed there was something simple I was
| missing.
|
| The register file size makes sense, I didn't think they
| were that much of the die on those processors but I guess
| they had to be pretty aggressive to meet power goals?
| jsheard wrote:
| > The register file size makes sense, I didn't think they
| were that much of the die on those processors
|
| https://i.imgur.com/WdMPX8S.jpeg
|
| According to this, Zen4's FP register file is almost as
| big as its FP execution units. It's a pretty sizable
| chunk of silicon.
| suresk wrote:
| I was having trouble finding an E Core die shot, but that
| helps put it into perspective a bit anyway. Thanks!
| celrod wrote:
| Skymont little cores have 4x 128-bit execution. They
| could quadruple-pump.
|
| But looks more like they're giving up on people writing
| code for wide vectors, instead settling on trying to make
| the existing code faster.
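A toy model of "double-pumping" and "quadruple-pumping" as discussed in this subthread: one 512-bit operation is pushed through a narrower execution unit in several passes. The function name and parameters are made up for illustration, and the model ignores scheduling, latency, and the register file (which, as noted above, still has to hold full 512-bit values).

```python
def pumped_add(a, b, vector_bits=512, unit_bits=128, lane_bits=32):
    """Add two vectors lane by lane using an execution unit narrower than
    the vector. Returns (result, passes)."""
    lanes = vector_bits // lane_bits
    lanes_per_pass = unit_bits // lane_bits
    if len(a) != lanes or len(b) != lanes:
        raise ValueError(f"expected {lanes} lanes per operand")

    result = []
    passes = 0
    # Each loop iteration stands for one trip through the narrow unit.
    for start in range(0, lanes, lanes_per_pass):
        stop = start + lanes_per_pass
        result.extend(x + y for x, y in zip(a[start:stop], b[start:stop]))
        passes += 1
    return result, passes
```

With `unit_bits=256` the model needs two passes (double-pumped), and with `unit_bits=128` it needs four (quadruple-pumped); the result is identical either way, only the number of passes changes.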
| wtallis wrote:
| AVX10 is still pretty much in the _proposal_ phase, and has
| been recently updated based on feedback Intel has received.
| It takes several years to get from that stage to shipping
| hardware.
| jeffbee wrote:
| I don't think you can make a credible case that a 2024 PC is
| equal to a 2014 one. In 2014 you could get 4 Haswell cores in a
| 65W TDP, for 410 2014 US dollars. For the same power and less
| money in 2024 you get a web browser platform that is around 6x
| faster, or a code compilation platform that is 5-20x faster.
| antisthenes wrote:
| > In 2014 you could get 4 Haswell cores in a 65W TDP, for 410
| 2014 US dollars.
|
| Is there a typo here? I bought a 4-core Haswell in 2014 for
| around $200.
|
| It definitely didn't cost anywhere near $410, except maybe
| for the 6-core HEDT cpu?
| jeffbee wrote:
| I was just looking at the launch MSRP of one of the higher-
| end hyperthreaded ones, but sure there was a range
| including $182 for the i5-4460S without hyperthreading,
| $224 for the i5-4690S, etc.
| jauntywundrkind wrote:
| The n100 that's out is on par-ish both in performance and
| only a little bit more power saving than an i5-6500t.
|
| A 2013 core. https://www.intel.com/content/www/us/en/products
| /sku/88183/i...
|
| The price point is far better at least (iff we look at new
| inventory only).
| michaelt wrote:
| A lot of the performance gains of the past 10 years aren't
| obvious from the headline specs.
|
| The desktop PC I built in 2015 for £1082.98
|   i7-4790K (4 cores, "4.00 GHz base 4.40 GHz Turbo")
|   32GB DDR3 RAM
|   Samsung 250GB SSD
|   nvidia GTX 960 2GB GPU
|
| The desktop PC Dell will sell me, today, for £1,174.80 [1]
|   i7-14700 (8 performance cores, 12 efficient cores,
|   "2.1 GHz Base, 5.4 GHz Turbo")
|   32GB DDR5 RAM
|   512 GB SSD
|   Intel Integrated Graphics
|
| Sure, it's better. But _on paper_ those spec changes don't
| look game-changing, considering it's been an entire decade.
| Especially if you're mistrustful of efficiency cores.
|
| But what those headline specs don't mention is that the RAM
| is 4x faster, the SSD is now nvme and faster, the pcie lanes
| are 4x faster, and the CPU cache has quadrupled.
|
| [1] https://www.dell.com/en-uk/shop/desktop-computers/new-
| optipl...
| adgjlsfhk1 wrote:
| The other big difference is if you are comparing DIY vs
| DIY, you can now get a 2TB SSD for $100 which is pretty
| great compared to a decade ago.
| BearOso wrote:
| A good 2TB TLC NVMe 4.0 drive went for 100 USD when the
| prices bottomed out, but Samsung and friends cut the
| supply.
|
| You can get 2TB for cheap now, but those are DRAM-less
| QLC drives. I suggest paying an extra 30-50 USD to get
| one with the TLC and RAM.
| jeffbee wrote:
| On paper those specs definitely look game-changing to me.
| The newer one has three of the older CPUs on the side, for
| free. The i9 is even more ridiculous, having what amounts
| to 4 quad-core Skylake CPUs as coprocessors (efficiency
| cores, in the parlance).
|
| But people are underestimating the compound effect of 10
| years of 15% generational improvements. The CPUs in the
| article will run your web browser 4x faster than the older
| CPU you mentioned, about 2x faster than a Ryzen 5 5500 that
| is only 2 years old.
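The compounding claim is easy to check with a few lines of arithmetic. This sketch assumes one generation per year, so ten generations in ten years; that cadence is an assumption, not something stated in the comment.

```python
def compound_speedup(per_generation_gain, generations):
    """Total speedup after repeated multiplicative gains."""
    return (1.0 + per_generation_gain) ** generations

# Ten generations at 15% each (assumed: one generation per year).
speedup = compound_speedup(0.15, 10)
print(f"{speedup:.2f}x")  # prints 4.05x
```

That lines up with the roughly 4x browser figure quoted above.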
| michaelt wrote:
| _> On paper those specs definitely look game-changing to
| me._
|
| The reality is that the newer processor is better in all
| benchmarks. Like 4-5x on the benchmarks I looked up, yes.
|
| But back in the day, the 'turbo' frequency was only
| available for a few seconds for thermal and power
| reasons, and the 'base' clock speed was how it would
| actually perform on big, compute-intensive tasks.
|
| If you only consider "8 performance cores, 2.1 GHz Base"
| vs "4 cores, 4.0 GHz base" and you also discount the
| 'efficiency cores', a person might think performance had
| barely changed.
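The "barely changed" reading can be reproduced with deliberately naive arithmetic: multiply core count by base clock and ignore everything else (IPC, turbo behaviour, efficiency cores, cache, memory). The metric below is a made-up illustration of that reading, not a real benchmark.

```python
def naive_core_ghz(cores, base_ghz):
    """Cores multiplied by base clock: a crude 'headline spec' number."""
    return cores * base_ghz

old = naive_core_ghz(4, 4.0)   # i7-4790K: 4 cores at 4.0 GHz base
new = naive_core_ghz(8, 2.1)   # i7-14700: 8 performance cores at 2.1 GHz base
ratio = new / old              # about 1.05, i.e. roughly a 5% "improvement"
```

By this measure a decade bought about 5%, against the 4-5x seen in the benchmarks mentioned above.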
| imtringued wrote:
| Those are the specs you'd expect from a $500 mini PC. Step
| up to $700 and you will get a better CPU plus 2TB nvme SSD.
|
| https://www.geekom.de/geekom-a5-mini-pc/
| kolbe wrote:
| Mystical (the author) does such fantastic work for the CS
| community. I really like that guy. A compilation of his stack
| overflow answers on SIMD would be better than any available book.
|
| His intelligence and openness, despite no one paying him for it,
| shine such a bad light on the terrible state of academia. That
| he was considered a "bad student" is near-proof in and of itself
| that our system judges people catastrophically poorly.
| RMarcus wrote:
| Looks like he got a master's degree from UIUC and did some
| research on FFT implementations. Seems to have been successful.
| What makes you say he was 'considered a "bad student"'?
|
| (This is a genuine question. I've never met Alex in person, but
| if an applicant to my lab spent their free time diving into
| SIMD implementations and breaking records for computing
| mathematical constants, I'd rush to hire them. Not that either
| of those two things is a requirement, of course.)
| kolbe wrote:
| It's in his bio.
|
| "However, ever since grade school, I've always sucked in
| terms of grades and standardized tests. I graduated from Palo
| Alto High School in the bottom quartile among all the
| college-bound students. My GPA was barely a 3.0 at
| graduation, so it was somewhat miraculous that I got accepted
| into Northwestern University at all."[1]
|
| I know he has a masters and all, but he is spectacular.
| Hundreds of thousands of people have masters degrees. He is
| more impressive than 99% of professors, and academia doesn't
| even acknowledge him as a peer of them.
|
| [1] http://www.numberworld.org/about/ayee/
| nocoiner wrote:
| The fact that he got into Northwestern is proof in itself
| that the "system" didn't consider him to be a bad student.
| It actually seems like the system did a pretty good job of
| identifying sheer intellectual horsepower and potential
| despite the self-professed low GPA and standardized test
| scores.
| kolbe wrote:
| In the grand hierarchy of college admissions committees
| ranking people, "getting into northwestern" means roughly
| "the 20,000th best student in his class year."
| dahinds wrote:
| and that was a "catastrophic" outcome? When I think of
| "catastrophic", it would be something like ending up
| institutionalized, or dead, not ranked in the top 1% of
| college applicants.
| ComputerGuru wrote:
| Great article. It really drives home what a damn shame Intel's
| persistent mishandling of AVX512 has been ever since its
| introduction. I don't even know if it has a future outside of
| extremely niche libraries given how scattered hardware support
| for it is on Intel's side.
|
| On a completely different topic: I wasn't expecting the redacted
| portion of the article due to AMD's embargo to take away too much
| of the article but the first half (up until the discussion about
| AVX512) would clearly be much more interesting with the censored
| out parts. I guess someone will have to resubmit this come August
| 14th!
| gpapilion wrote:
| Mishandling aside, the issue I've seen is there really isn't
| consumer demand for this. Prior to AMD having AVX512, most of
| the comments were around wasting the silicon on SIMD, rather
| than improving other aspects of the CPU. I'm pretty sure there
| was good reason to think it was largely a dark area of the
| chip.
|
| From what I've seen, but haven't heard discussed much, the
| naive implementation vs AVX512 is a huge gain, but AVX2 vs
| AVX512 was not very impressive for the application I was
| looking at. The complexity this code added, and the cases where
| we needed it to run on AMD (for other reasons), basically made
| taking advantage of the feature undesirable for a single digit
| gain.
|
| Things like VNNI or AMX are better wins, but they are only
| needed in very specific cases. VNNI in particular looked to be
| a 30% improvement in a BERT workload.
| Avamander wrote:
| Isn't it a bit weird to expect consumer demand for CPU
| instruction set extensions?
|
| Obviously there's very little of that, but what should matter
| is the developer uptake and thus better end-user experience
| that can be delivered? (I'd also hope for even better
| autovectorization in compilers.)
|
| It's in my opinion kind-of insane that we're still building
| so much software for ancient baselines and leaving quite a
| bit of performance on the table across the entire system.
| (How much has Apple won in terms of performance by forcing
| everyone to build for new ARM targets using new toolchains?)
| drewg123 wrote:
| The most interesting bit about this article for me is the
| "transition time" to get the power needed to use AVX-256 or AVX-512
| which is present on Intel, but not AMD zen4/zen5. It explains
| some behavior that I saw years ago when implementing kTLS on
| FreeBSD, and validates our design of having per-core kTLS crypto
| worker threads, rather than doing the crypto in the context of
| sosend() or sendfile's tcp_usr_ready().
| alberth wrote:
| Any chance you'd move to using Intel for your content servers
| in the foreseeable future?
|
| Or this further cements the use of AMD?
| drewg123 wrote:
| It doesn't matter so much anymore, since we use kTLS offload
| NICs.
| altairprime wrote:
| Ugh, half of this article is
|
| > This section has been redacted until August 14.
|
| Could you repost it then?
| kolbe wrote:
| I think you need to wait one year before being able to post an
| exact link again. You can ask dang to reset it if you want, but
| maybe take some initiative for yourself rather than giving
| other people attitude, then asking them to do you favors?
| ipsum2 wrote:
| The part of the article I found most amusing:
|
| "Intel added AVX512-VP2INTERSECT to Tiger Lake. But it was really
| slow. (microcoded ~25 cycles/46 uops) It was so slow that someone
| found a better way to implement its functionality without using
| the instruction itself. Intel deprecates the instruction and
| removes it from all processors after Tiger Lake. (ignoring the
| fact that early Alder Lake unofficially also had it) AMD adds it
| to Zen5. So just as Intel kills off VP2INTERSECT, AMD shows up
| with it. Needless to say, Zen5 had probably already taped out by
| the time Intel deprecated the instruction. So VP2INTERSECT made
| it into Zen5's design and wasn't going to be removed.
|
| But how good is AMD's implementation? Let's look at AIDA64's
| dumps for Granite Ridge:
|
| AVX512_VP2INTERSECT :VP2INTERSECTQ k1+1, zmm, zmm L: [diff. reg.
| set] T: 0.23ns= 1.00c
|
| Yes, that's right. 1 cycle throughput. ONE cycle. I can't... I
| just can't...
|
| Intel was so bad at this that they dropped the instruction. And
| now AMD finally appears and shows them how it's done - 2 years
| too late."
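For readers who have not met the instruction: VP2INTERSECT compares every element of one vector against every element of the other and produces two masks, one per input, marking which elements found a match. A plain Python reference model of that behaviour follows (the function name is made up, and this says nothing about how either vendor implements it in hardware):

```python
def vp2intersect(a, b):
    """Reference model of the two output masks.

    Bit i of the first mask is set if a[i] equals any element of b.
    Bit j of the second mask is set if b[j] equals any element of a.
    """
    if len(a) != len(b):
        raise ValueError("both vectors must have the same number of lanes")
    mask_a = 0
    mask_b = 0
    # All-pairs comparison: 8 x 8 = 64 compares for the 512-bit qword form.
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                mask_a |= 1 << i
                mask_b |= 1 << j
    return mask_a, mask_b
```

For example, `vp2intersect([1, 2, 3, 4, 5, 6, 7, 8], [8, 0, 0, 2, 0, 0, 0, 0])` sets bits 1 and 7 in the first mask and bits 0 and 3 in the second. The all-pairs comparison is why a one-cycle throughput is notable.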
| fuhsnn wrote:
| It's in fact very common to microcode instructions at early
| iterations of an arch. https://uops.info/table.html is a nice
| place if certain instructions being slow brings joy to your
| life.
| Remnant44 wrote:
| AMD's avx512 implementation is just lovely and they seem to be
| firing on all cylinders for it. Zen4 was already great, 'double
| pumped' or no.
|
| It looks like Zen5's support is essentially the dream - all EUs
| and load/store are expanded to 512 bit, so you can sustain 2 512
| FMAs and 2 512 Adds every cycle. There also appears to be
| essentially no transition penalty to the full-power state which
| is incredible.
|
| The only thing sad here is that all this work to enable full-
| width AVX512 is going to be mostly wasted as approximately 0% of
| all client software will get recompiled to an AVX512 baseline for
| decades if ever. But if you can compile for your own targets, or
| JIT for it.. it looks really good.
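A back-of-the-envelope for what "2 512 FMAs and 2 512 Adds every cycle" means as a peak rate, counting an FMA as two floating-point operations. This is a theoretical upper bound under the assumption that all four pipes issue every cycle, as the comment describes; the helper name is made up.

```python
def peak_flops_per_cycle(vector_bits, element_bits, fma_pipes, add_pipes):
    """Upper bound on floating-point operations per cycle per core."""
    lanes = vector_bits // element_bits
    # An FMA does a multiply and an add per lane, so it counts twice.
    return lanes * (2 * fma_pipes + add_pipes)

fp32_peak = peak_flops_per_cycle(512, 32, fma_pipes=2, add_pipes=2)  # 96
fp64_peak = peak_flops_per_cycle(512, 64, fma_pipes=2, add_pipes=2)  # 48
```

Multiplying by clock speed and core count gives a per-chip ceiling; sustained numbers depend on memory bandwidth and the instruction mix.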
| stagger87 wrote:
| Seeing 2 consumer CPU generations in a row not only support but
| improve AVX512 capabilities will hopefully go a long ways towards
| regaining the confidence of the developers that use AVX512 in the
| consumer space. I know I personally have been holding back as I
| watched Intel fumble AVX512 for the last 10 years. With their
| even more recent fumbles there could be a near future where AMD
| CPUs have majority market share in both desktop and mobile. Great
| news for developers that can use AVX512.
| Remnant44 wrote:
| Agreed. My current best-case-scenario hope is that the success
| of the Zen4/5/etc processors will force Intel to adapt their
| strategy towards AMD's, and finally move us out of the avx512
| mess they've segmented us into.
| vardump wrote:
| Assuming Intel is changing direction right now, unfortunately
| they will face 2-3 years of latency to implement that.
| Manabu-eo wrote:
| AMD fixed vpcompressd in Zen5:
|
| > Hazards Fixed:
|
| > V(P)COMPRESS store to memory is fixed. (3 cycles/store to non-
| overlapping addresses)
|
| > The super-alignment hazard is fixed.
|
| I initially tried searching for the string, but the () thwarted
| that.
|
| It used to be 142 cycles/instruction for "vpcompressd [mem]{k},
| zmm" in zen4.
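What the memory form of the instruction does, as a plain Python reference model: the elements whose mask bit is set are packed together and written contiguously to the destination, and nothing past the last written element is touched. The function name and the list standing in for memory are assumptions for illustration.

```python
def compress_store(memory, offset, mask, src):
    """Model of a masked compress-store.

    Writes the src elements selected by mask contiguously into memory
    starting at offset. Returns the number of elements written.
    """
    selected = [value for i, value in enumerate(src) if (mask >> i) & 1]
    if offset < 0 or offset + len(selected) > len(memory):
        raise IndexError("store would fall outside the modelled memory")
    # Only len(selected) slots change; later slots keep their old contents.
    memory[offset:offset + len(selected)] = selected
    return len(selected)
```

Because the number of elements written depends on the mask, the store has a data-dependent size, unlike an ordinary full-width vector store.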
| pixelpoet wrote:
| What an excellent writeup, thx for sharing
| andy_xor_andrew wrote:
| out of curiosity, what applications might I see this used for?
___________________________________________________________________
(page generated 2024-08-07 23:01 UTC)