[HN Gopher] Zen5's AVX512 Teardown and More
       ___________________________________________________________________
        
       Zen5's AVX512 Teardown and More
        
       Author : todsacerdoti
       Score  : 101 points
       Date   : 2024-08-07 15:28 UTC (7 hours ago)
        
 (HTM) web link (www.numberworld.org)
 (TXT) w3m dump (www.numberworld.org)
        
       | PaulHoule wrote:
       | Intel's handling of SIMD is representative of Intel's value-
       | subtracting principles that have caused Intel to stagnate in the
       | past 15 years or so.
       | 
       | It is an intrinsic problem with SIMD as we know it that you have
       | to recode your application to support new instructions, which is
       | a big hassle. Most people and companies will give up on
       | supporting the latest and greatest and will ship binaries that
       | meet the lowest common denominator. For instance it took forever
       | for Microsoft to rely on instructions that were available almost
       | 15 years ago.
       | 
       | As Charlie Demerjian has pointed out for years, consumers are
       | waking up to the fact that a 2024 craptop isn't much better than
       | a 2014 crapbook and there is zero credibility in claims about
       | "Ultrabooks", "AI PCs", etc. What could make a difference is a
       | coordinated effort end-to-end to widely deploy the latest
       | developments as quickly as possible across as much of the product
       | line as possible, to get tooling support for them, and drive
       | developers to adopt them as quickly as possible. As it is, Intel
       | will boast about how the national labs are blowing up H-bombs in
       | VR faster than they ever have, and how Facebook is profiling
       | users more efficiently than before, and not realize that
       | customers don't believe these advances are going to make a
       | difference for them. So instead of buying a new PC, which might
       | deliver better performance when software (maybe) catches up in
       | 7-8 years, they are going to hold on to old machines longer.
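The recoding burden described above is commonly softened with runtime dispatch: ship several versions of a hot routine and pick the widest one the CPU reports. The sketch below only illustrates the selection logic. The kernel names are hypothetical stand-ins (a real project would compile each one separately for AVX-512, AVX2, and scalar), and the flag names `avx512f` and `avx2` follow the spelling Linux uses in /proc/cpuinfo.

```python
from typing import Callable, Iterable, List

# Hypothetical kernels. In a real project each would be a separately
# compiled routine; here they are plain Python stand-ins that differ
# only in name so the selection logic can be shown.
def sum_scalar(xs: List[float]) -> float:
    return sum(xs)

def sum_avx2(xs: List[float]) -> float:
    return sum(xs)

def sum_avx512(xs: List[float]) -> float:
    return sum(xs)

# Ordered from most to least demanding instruction set.
KERNELS = [
    ("avx512f", sum_avx512),
    ("avx2", sum_avx2),
]

def pick_kernel(cpu_flags: Iterable[str]) -> Callable[[List[float]], float]:
    """Return the widest kernel whose required flag the CPU reports."""
    flags = set(cpu_flags)
    for required, kernel in KERNELS:
        if required in flags:
            return kernel
    return sum_scalar  # lowest common denominator
```

The cost is the one the thread complains about: every extra entry in `KERNELS` is another code path to write, test, and maintain.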
        
         | happycube wrote:
         | I'd say 2018 ~= 2024 instead of 2014 (those craptops all too
         | often had dual-core i5's and i7's, 15" TN 768 screens, and
         | HDDs), but yeah things have slowed down a bit.
        
         | jsheard wrote:
         | Worse still is Intel's rollout of AVX512 specifically, which
         | started nearly a decade ago but to this day it's still not
         | available across their whole product stack, so the countdown to
         | it becoming ubiquitous _hasn't even started yet._ They painted
         | themselves into a corner by making 512bit vectors a mandatory
         | feature, which they then decided isn't feasible to support in
         | their small E-cores, so now they're walking it all back with a
         | new "AVX10" spec which is just a redux of AVX512 except 512bit
         | vectors are optional this time.
         | 
         | Then we'll have to wait _another_ decade or so for AVX10 to
         | become baseline, so AVX2 will probably be old enough to drink
         | (in the US) before it's fully phased out.
        
           | Vecr wrote:
           | I don't think you can phase out AVX2, it's the base of
           | AVX512, because you can't always go 512 wide, and you'd have
           | no backwards compatibility.
        
             | jsheard wrote:
             | I know AVX2 will continue to exist in hardware forever for
             | backwards compatibility; by "fully phased out" I mean the
             | eventual point when software no longer has to maintain a
             | dedicated path for hardware which supports AVX2 but doesn't
             | support AVX10, because all relevant hardware supports
             | AVX10.
        
             | the8472 wrote:
             | EVEX prefix can address XMM/YMM/ZMM registers. So you can
             | apply the AVX512 instruction set to 128bit and 256bit
             | registers too.
        
           | xeeeeeeeeeeenu wrote:
           | While it does seem that AVX10 was mainly designed for
           | consumer CPUs so they could use modern vector instructions
           | without 512-bit vectors, the upcoming Arrow Lake will _not_
           | have it.[1]
           | 
           | I guess we will have to wait for at least one more
           | generation.
           | 
           | [1] - According to Intel® Architecture Instruction Set
           | Extensions Programming Reference:
           | https://cdrdv2-public.intel.com/826290/architecture-
           | instruct...
        
             | jsheard wrote:
             | Intel is making Xeons out of E-cores (up to 288 of them on
             | one chip) so I assume those will also be motivating the
             | rollout of AVX10, not just their consumer parts.
        
               | xeeeeeeeeeeenu wrote:
               | Well, they don't support it either. According to the
               | document I linked, neither the just-released Sierra
               | Forest, nor the planned Clearwater Forest support AVX10.
        
               | kllrnohj wrote:
               | But surely they could just double pump like AMD does on
               | Zen 4(c) and also on (some?) Zen 5c.
               | 
               | It's weird to see an Intel so... Broke? That they are
               | seemingly forced to recycle old architectures endlessly
        
               | jsheard wrote:
               | I think Intel's E-cores are quite a bit smaller than the
               | Zen 4c/5c cores; maybe at that scale it's prohibitive to
               | even double up the register file? That's required even if
               | the logic is double-pumped. AIUI the small Zen cores are
               | mostly the same design as the big ones, just with less
               | cache, silicon layout retuned for density rather than
               | speed, and the removal of the 3D Cache stacking vias,
               | while Intel's small cores are clean-sheet designs with
               | next to nothing in common with their big cores so they
               | have the opportunity to shrink them a lot more.
        
               | suresk wrote:
               | My non-expert brain immediately jumped to double-pumping
               | + maybe working with their thread director to have tasks
               | using a lot of AVX512 instructions prefer P cores more.
               | It feels like such an obvious solution to a really dumb
               | problem that I assumed there was something simple I was
               | missing.
               | 
               | The register file size makes sense, I didn't think they
               | were that much of the die on those processors but I guess
               | they had to be pretty aggressive to meet power goals?
        
               | jsheard wrote:
               | > The register file size makes sense, I didn't think they
               | were that much of the die on those processors
               | 
               | https://i.imgur.com/WdMPX8S.jpeg
               | 
               | According to this, Zen4's FP register file is almost as
               | big as its FP execution units. It's a pretty sizable
               | chunk of silicon.
        
               | suresk wrote:
               | I was having trouble finding an E Core die shot, but that
               | helps put it into perspective a bit anyway. Thanks!
        
               | celrod wrote:
               | Skymont little cores have 4x 128-bit execution. They
               | could quadruple-pump.
               | 
               | But it looks more like they're giving up on people writing
               | code for wide vectors, instead settling on trying to make
               | the existing code faster.
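The double-pumping and quadruple-pumping idea discussed in this subthread can be modelled in a few lines: one architectural 512-bit instruction is executed as several passes through a narrower datapath. This is a toy model, not a description of any real core. The function name and the list-based "register" are assumptions made for illustration. Note that the inputs are still full 512-bit vectors, which matches the point made earlier in the subthread that the register file has to hold the full width even when the execution units do not.

```python
from typing import List, Tuple

def pumped_add(a: List[float], b: List[float], datapath_bits: int,
               vector_bits: int = 512,
               element_bits: int = 32) -> Tuple[List[float], int]:
    """Add two vectors by feeding them through a narrower datapath.

    Returns (result, passes), where passes is how many trips through
    the execution unit the single architectural instruction needed.
    """
    lanes = vector_bits // element_bits    # 16 FP32 lanes in 512 bits
    chunk = datapath_bits // element_bits  # lanes handled per pass
    if len(a) != lanes or len(b) != lanes:
        raise ValueError("inputs must fill the whole vector register")
    result: List[float] = []
    passes = 0
    for start in range(0, lanes, chunk):
        # One pass: the hardware works on `chunk` lanes at a time.
        result.extend(x + y for x, y in zip(a[start:start + chunk],
                                            b[start:start + chunk]))
        passes += 1
    return result, passes
```

With a 256-bit datapath the model needs two passes (double-pumping); with a 128-bit datapath it needs four, which is the quadruple-pumping case suggested for Skymont.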
        
             | wtallis wrote:
             | AVX10 is still pretty much in the _proposal_ phase, and has
             | been recently updated based on feedback Intel has received.
             | It takes several years to get from that stage to shipping
             | hardware.
        
         | jeffbee wrote:
         | I don't think you can make a credible case that a 2024 PC is
         | equal to a 2014 one. In 2014 you could get 4 Haswell cores in a
         | 65W TDP, for 410 2014 US dollars. For the same power and less
         | money in 2024 you get a web browser platform that is around 6x
         | faster, or a code compilation platform that is 5-20x faster.
        
           | antisthenes wrote:
           | > In 2014 you could get 4 Haswell cores in a 65W TDP, for 410
           | 2014 US dollars.
           | 
           | Is there a typo here? I bought a 4-core Haswell in 2014 for
           | around $200.
           | 
           | It definitely didn't cost anywhere near $410, except maybe
           | for the 6-core HEDT cpu?
        
             | jeffbee wrote:
             | I was just looking at the launch MSRP of one of the higher-
             | end hyperthreaded ones, but sure there was a range
             | including $182 for the i5-4460S without hyperthreading,
             | $224 for the i5-4690S, etc.
        
           | jauntywundrkind wrote:
           | The N100 that's out now is roughly on par in performance with,
           | and only a little more power-efficient than, an i5-6500T.
           | 
           | A 2015 core. https://www.intel.com/content/www/us/en/products
           | /sku/88183/i...
           | 
           | The price point is far better at least (iff we look at new
           | inventory only).
        
           | michaelt wrote:
           | A lot of the performance gains of the past 10 years aren't
           | obvious from the headline specs.
           | 
           | The desktop PC I built in 2015 for £1082.98:
           | 
           |     i7-4790K (4 cores, "4.00 GHz base, 4.40 GHz Turbo")
           |     32GB DDR3 RAM
           |     Samsung 250GB SSD
           |     nvidia GTX 960 2GB GPU
           | 
           | The desktop PC Dell will sell me, today, for £1,174.80 [1]:
           | 
           |     i7-14700 (8 performance cores, 12 efficient cores,
           |     "2.1 GHz Base, 5.4 GHz Turbo")
           |     32GB DDR5 RAM
           |     512 GB SSD
           |     Intel Integrated Graphics
           | 
           | Sure, it's better. But _on paper_ those spec changes don't
           | look game-changing, considering it's been an entire decade.
           | Especially if you're mistrustful of efficiency cores.
           | 
           | But what those headline specs don't mention is that the RAM
           | is 4x faster, the SSD is now nvme and faster, the pcie lanes
           | are 4x faster, and the CPU cache has quadrupled.
           | 
           | [1] https://www.dell.com/en-uk/shop/desktop-computers/new-
           | optipl...
        
             | adgjlsfhk1 wrote:
             | The other big difference is if you are comparing DIY vs
             | DIY, you can now get a 2TB SSD for $100 which is pretty
             | great compared to a decade ago.
        
               | BearOso wrote:
               | A good 2TB TLC NVMe 4.0 drive went for 100 USD when the
               | prices bottomed out, but Samsung and friends cut the
               | supply.
               | 
               | You can get 2TB for cheap now, but those are DRAM-less
               | QLC drives. I suggest paying an extra 30-50 USD to get
               | one with the TLC and RAM.
        
             | jeffbee wrote:
             | On paper those specs definitely look game-changing to me.
             | The newer one has three of the older CPUs on the side, for
             | free. The i9 is even more ridiculous, having what amounts
             | to 4 quad-core Skylake CPUs as coprocessors (efficiency
             | cores, in the parlance).
             | 
             | But people are underestimating the compound effect of 10
             | years of 15% generational improvements. The CPUs in the
             | article will run your web browser 4x faster than the older
             | CPU you mentioned, about 2x faster than a Ryzen 5 5500 that
             | is only 2 years old.
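The compounding claim is easy to check with arithmetic. The sketch below assumes one CPU generation per year, which the comment does not state, so treat the generation count as an assumption.

```python
def compound_speedup(per_generation_gain: float, generations: int) -> float:
    """Total speedup after compounding the same gain every generation."""
    return (1.0 + per_generation_gain) ** generations

# Assumption: 10 generations in 10 years, each 15% faster than the last.
total = compound_speedup(0.15, 10)
print(f"{total:.2f}x")  # roughly 4x
```

Ten rounds of 15% compound to a little over 4x, which is in line with the "4x faster" browser figure quoted above.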
        
               | michaelt wrote:
               | _> On paper those specs definitely look game-changing to
               | me._
               | 
               | The reality is that the newer processor is better in all
               | benchmarks. Like 4-5x on the benchmarks I looked up, yes.
               | 
               | But back in the day, the 'turbo' frequency was only
               | available for a few seconds for thermal and power
               | reasons, and the 'base' clock speed was how it would
               | actually perform on big, compute-intensive tasks.
               | 
               | If you only consider "8 performance cores, 2.1 GHz Base"
               | vs "4 cores, 4.0 GHz base" and you also discount the
               | 'efficiency cores', a person might think performance had
               | barely changed.
        
             | imtringued wrote:
             | Those are the specs you'd expect from a $500 mini PC. Step
             | up to $700 and you will get a better CPU plus 2TB nvme SSD.
             | 
             | https://www.geekom.de/geekom-a5-mini-pc/
        
       | kolbe wrote:
       | Mystical (the author) does such fantastic work for the CS
       | community. I really like that guy. A compilation of his stack
       | overflow answers on SIMD would be better than any available book.
       | 
       | His intelligence and openness, despite no one paying him for it,
       | shines such a bad light on the terrible state of academia. That
       | he was considered a "bad student" is near-proof in and of itself
       | that our system judges people catastrophically poorly.
        
         | RMarcus wrote:
         | Looks like he got a master's degree from UIUC and did some
         | research on FFT implementations. Seems to have been successful.
         | What makes you say he was 'considered a "bad student"'?
         | 
         | (This is a genuine question. I've never met Alex in person, but
         | if an applicant to my lab spent their free time diving into
         | SIMD implementations and breaking records for computing
         | mathematical constants, I'd rush to hire them. Not that either
         | of those two things is a requirement, of course.)
        
           | kolbe wrote:
           | It's in his bio.
           | 
           | "However, ever since grade school, I've always sucked in
           | terms of grades and standardized tests. I graduated from Palo
           | Alto High School in the bottom quartile among all the
           | college-bound students. My GPA was barely a 3.0 at
           | graduation, so it was somewhat miraculous that I got accepted
           | into Northwestern University at all."[1]
           | 
           | I know he has a master's and all, but he is spectacular.
           | Hundreds of thousands of people have master's degrees. He is
           | more impressive than 99% of professors, and academia doesn't
           | even acknowledge him as a peer.
           | 
           | [1] http://www.numberworld.org/about/ayee/
        
             | nocoiner wrote:
             | The fact that he got into Northwestern is proof in itself
             | that the "system" didn't consider him to be a bad student.
             | It actually seems like the system did a pretty good job of
             | identifying sheer intellectual horsepower and potential
             | despite the self-professed low GPA and standardized test
             | scores.
        
               | kolbe wrote:
               | In the grand hierarchy of college admissions committees
               | ranking people, "getting into northwestern" means roughly
               | "the 20,000th best student in his class year."
        
               | dahinds wrote:
               | and that was a "catastrophic" outcome? When I think of
               | "catastrophic", it would be something like ending up
               | institutionalized, or dead, not ranked in the top 1% of
               | college applicants.
        
       | ComputerGuru wrote:
       | Great article. It really drives home what a damn shame Intel's
       | persistent mishandling of AVX512 has been ever since its
       | introduction. I don't even know if it has a future outside of
       | extremely niche libraries given how scattered hardware support
       | for it is on Intel's side.
       | 
       | On a completely different topic: I wasn't expecting the redacted
       | portion of the article due to AMD's embargo to take away too much
       | of the article but the first half (up until the discussion about
       | AVX512) would clearly be much more interesting with the censored
       | out parts. I guess someone will have to resubmit this come August
       | 14th!
        
         | gpapilion wrote:
         | Mishandling aside, the issue I've seen is there really isn't
         | consumer demand for this. Prior to AMD having AVX512, most of
         | the comments were around wasting the silicon on SIMD, rather
         | than improving other aspects of the CPU. I'm pretty sure there
         | was good reason to think it was largely a dark area of the
         | chip.
         | 
         | From what I've seen, but haven't heard discussed much, the
         | naive implementation vs AVX512 is a huge gain, but AVX2 vs
         | AVX512 was not very impressive for the application I was
         | looking at. The complexity this code added, and the cases where
         | we needed it to run on AMD (for other reasons), basically made
         | taking advantage of the feature undesirable for a single digit
         | gain.
         | 
         | Things like VNNI or AMX are better wins, but they are only
         | needed in very specific cases. VNNI in particular looked to be
         | a 30% improvement in a BERT workload.
        
           | Avamander wrote:
           | Isn't it a bit weird to expect consumer demand for CPU
           | instruction set extensions?
           | 
           | Obviously there's very little of that, but what should matter
           | is the developer uptake and thus better end-user experience
           | that can be delivered? (I'd also hope for even better
           | autovectorization in compilers.)
           | 
           | It's in my opinion kind-of insane that we're still building
           | so much software for ancient baselines and leaving quite a
           | bit of performance on the table across the entire system.
           | (How much has Apple won in terms of performance by forcing
           | everyone to build for new ARM targets using new toolchains?)
        
       | drewg123 wrote:
       | The most interesting bit about this article for me is the
       | "transition time" to get the power needed to use AVX-256 or
       | AVX-512, which is present on Intel but not on AMD Zen4/Zen5. It
       | explains some behavior that I saw years ago when implementing
       | kTLS on FreeBSD, and validates our design of having per-core
       | kTLS crypto worker threads, rather than doing the crypto in the
       | context of sosend() or sendfile's tcp_usr_ready().
        
         | alberth wrote:
         | Any chance you'd move to using Intel for your content servers
         | in the foreseeable future?
         | 
         | Or this further cements the use of AMD?
        
           | drewg123 wrote:
           | It doesn't matter so much anymore, since we use kTLS offload
           | NICs.
        
       | altairprime wrote:
       | Ugh, half of this article is
       | 
       | > This section has been redacted until August 14.
       | 
       | Could you repost it then?
        
         | kolbe wrote:
         | I think you need to wait one year before being able to post an
         | exact link again. You can ask dang to reset it if you want, but
         | maybe take some initiative for yourself rather than giving
         | other people attitude, then asking them to do you favors?
        
       | ipsum2 wrote:
       | The part of the article I found most amusing:
       | 
       | "Intel added AVX512-VP2INTERSECT to Tiger Lake. But it was really
       | slow. (microcoded ~25 cycles/46 uops) It was so slow that someone
       | found a better way to implement its functionality without using
       | the instruction itself. Intel deprecates the instruction and
       | removes it from all processors after Tiger Lake. (ignoring the
       | fact that early Alder Lake unofficially also had it) AMD adds it
       | to Zen5. So just as Intel kills off VP2INTERSECT, AMD shows up
       | with it. Needless to say, Zen5 had probably already taped out by
       | the time Intel deprecated the instruction. So VP2INTERSECT made
       | it into Zen5's design and wasn't going to be removed.
       | 
       | But how good is AMD's implementation? Let's look at AIDA64's
       | dumps for Granite Ridge:
       | 
       | AVX512_VP2INTERSECT :VP2INTERSECTQ k1+1, zmm, zmm L: [diff. reg.
       | set] T: 0.23ns= 1.00c
       | 
       | Yes, that's right. 1 cycle throughput. ONE cycle. I can't... I
       | just can't...
       | 
       | Intel was so bad at this that they dropped the instruction. And
       | now AMD finally appears and shows them how it's done - 2 years
       | too late."
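For readers who have not met the instruction in the quoted passage: VP2INTERSECT compares every element of one vector against every element of another and produces a pair of masks. The scalar model below is a sketch of those semantics only. The function name and the use of Python lists and integers in place of ZMM and mask registers are assumptions for illustration, and it says nothing about how Intel or AMD implement the instruction.

```python
from typing import List, Tuple

def vp2intersect(a: List[int], b: List[int]) -> Tuple[int, int]:
    """Scalar model of the two masks VP2INTERSECT produces.

    Bit i of the first mask is set if a[i] equals any element of b;
    bit j of the second mask is set if b[j] equals any element of a.
    """
    if len(a) != len(b):
        raise ValueError("both vectors must have the same lane count")
    k_a = 0
    k_b = 0
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                k_a |= 1 << i
                k_b |= 1 << j
    return k_a, k_b

# Eight 64-bit lanes, as in the VP2INTERSECTQ zmm form quoted above.
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 9, 10, 2, 11, 12, 13, 14]
k_a, k_b = vp2intersect(a, b)
print(bin(k_a), bin(k_b))  # 0b10000010 0b1001
```

The all-pairs comparison is 64 compares for eight lanes, which gives a sense of why a one-cycle throughput figure drew such a reaction.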
        
         | fuhsnn wrote:
         | It's in fact very common to microcode instructions at early
         | iterations of an arch. https://uops.info/table.html is a nice
         | place if certain instructions being slow brings joy to your
         | life.
        
       | Remnant44 wrote:
       | AMD's avx512 implementation is just lovely and they seem to be
       | firing on all cylinders for it. Zen4 was already great, 'double
       | pumped' or no.
       | 
       | It looks like Zen5's support is essentially the dream - all EUs
       | and load/store are expanded to 512 bit, so you can sustain 2 512
       | FMAs and 2 512 Adds every cycle. There also appears to be
       | essentially no transition penalty to the full-power state which
       | is incredible.
       | 
       | The only thing sad here is that all this work to enable full-
       | width AVX512 is going to be mostly wasted as approximately 0% of
       | all client software will get recompiled to an AVX512 baseline for
       | decades if ever. But if you can compile for your own targets, or
       | JIT for it.. it looks really good.
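Taking the figures in this comment at face value (2 FMA pipes and 2 add pipes, all 512 bits wide), the peak per-core arithmetic rate follows directly. The helper below is a back-of-the-envelope sketch: the function name is made up, and counting an FMA as two floating-point operations is the usual convention rather than something stated in the thread.

```python
def peak_flops_per_cycle(vector_bits: int, element_bits: int,
                         fma_pipes: int, add_pipes: int) -> int:
    """Peak floating-point operations per cycle for one core.

    Each FMA lane counts as 2 operations (multiply + add), each add
    lane as 1.
    """
    lanes = vector_bits // element_bits
    return lanes * (2 * fma_pipes + add_pipes)

print(peak_flops_per_cycle(512, 32, fma_pipes=2, add_pipes=2))  # 96 (FP32)
print(peak_flops_per_cycle(512, 64, fma_pipes=2, add_pipes=2))  # 48 (FP64)
```

Halving `vector_bits` to 256 with the same pipe counts halves both numbers, which is the gap a full-width datapath closes compared with a hypothetical 256-bit design.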
        
       | stagger87 wrote:
       | Seeing 2 consumer CPU generations in a row not only support but
       | improve AVX512 capabilities will hopefully go a long ways towards
       | regaining the confidence of the developers that use AVX512 in the
       | consumer space. I know I personally have been holding back as I
       | watched Intel fumble AVX512 for the last 10 years. With their
       | even more recent fumbles there could be a near future where AMD
       | CPUs have majority market share in both desktop and mobile. Great
       | news for developers that can use AVX512.
        
         | Remnant44 wrote:
         | Agreed. My current best-case-scenario hope is that the success
         | of the Zen4/5/etc processors will force Intel to adapt their
         | strategy towards AMD's, and finally move us out of the avx512
         | mess they've segmented us into.
        
           | vardump wrote:
           | Assuming Intel is changing direction right now, unfortunately
           | they will face 2-3 years of latency to implement that.
        
       | Manabu-eo wrote:
       | AMD fixed vpcompressd in Zen5:
       | 
       | > Hazards Fixed:
       | 
       | > V(P)COMPRESS store to memory is fixed. (3 cycles/store to non-
       | overlapping addresses)
       | 
       | > The super-alignment hazard is fixed.
       | 
       | I initially tried searching for the string, but the () thwarted
       | that.
       | 
       | It used to be 142 cycles/instruction for "vpcompressd [mem]{k},
       | zmm" in zen4.
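As a reminder of what the memory-destination form of the instruction does: the elements whose mask bit is set are packed together and stored contiguously, and nothing beyond the packed elements is written. The model below is a sketch of those semantics under stated assumptions. The function name, the Python list standing in for memory, and the integer mask are all illustrative, not part of any real API.

```python
from typing import List

def compress_store(memory: List[int], offset: int,
                   src: List[int], mask: int) -> int:
    """Model of a masked compress-store such as vpcompressd [mem]{k}, zmm.

    Elements of src whose mask bit is set are written contiguously
    starting at memory[offset]. Returns the number of elements stored.
    """
    out = offset
    for lane, value in enumerate(src):
        if (mask >> lane) & 1:
            memory[out] = value
            out += 1
    return out - offset

mem = [0] * 8
stored = compress_store(mem, 2, [10, 11, 12, 13], 0b1010)
print(stored, mem)  # 2 [0, 0, 11, 13, 0, 0, 0, 0]
```

This is the building block for filtering a stream in place, so a per-store cost in the hundreds of cycles versus a handful makes a large difference to that kind of loop.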
        
       | pixelpoet wrote:
       | What an excellent writeup, thx for sharing
        
       | andy_xor_andrew wrote:
       | out of curiosity, what applications might I see this used for?
        
       ___________________________________________________________________
       (page generated 2024-08-07 23:01 UTC)