[HN Gopher] Using the most unhinged AVX-512 instruction to make ...
___________________________________________________________________
Using the most unhinged AVX-512 instruction to make fastest phrase
search algo
Author : cmcollier
Score : 214 points
Date : 2025-01-23 21:38 UTC (4 days ago)
(HTM) web link (gab-menezes.github.io)
(TXT) w3m dump (gab-menezes.github.io)
| nxobject wrote:
| Spoiler if you don't want to read through the (wonderful but many)
| paragraphs of exposition: the instruction is `vp2intersectq k,
| zmm, zmm`.
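|
| A minimal sketch (mine, not the article's code) of what that
| instruction computes, via the C intrinsic that maps to it
| (needs something like -mavx512vp2intersect): it intersects two
| vectors of 64-bit IDs and hands back *two* lane masks, one per
| input, marking which lanes matched.
|
|     #include <immintrin.h>
|     #include <stdio.h>
|
|     int main(void) {
|         __m512i a = _mm512_setr_epi64(1, 4, 9, 16, 25, 36, 49, 64);
|         __m512i b = _mm512_setr_epi64(2, 4, 6, 8, 10, 25, 64, 99);
|         __mmask8 ka, kb;
|         /* the intrinsic exposes both result masks at once */
|         _mm512_2intersect_epi64(a, b, &ka, &kb);
|         /* ka marks lanes of a found in b; kb marks lanes of b in a */
|         printf("ka=0x%02x kb=0x%02x\n", (unsigned)ka, (unsigned)kb);
|         return 0;
|     }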
| bri3d wrote:
| And, as noted in the article, that's an instruction which only
| works on two desktop CPU architectures (Tiger Lake and Zen 5),
| including one where it's arguably slower than not using it
| (Tiger Lake).
|
| Meaning... this entire effort was for something that's faster
| on only a single kind of CPU (Zen 5).
|
| This article is honestly one of the best I've read in a long
| time. It's esoteric and the result is 99.5% pointless
| objectively, but in reality it's incredibly useful and a
| wonderful guide to low-level x86 optimization end to end. The
| sections on cache alignment and uiCA + analysis notes are a
| perfect illustration of "how it's done."
| tpm wrote:
| Presumably Zen 5 cores will also get used in Threadripper and
| EPYC processors.
| josephg wrote:
| Yep. And the feature will probably be available on all AMD
| CPUs manufactured from here on.
|
| It might be an esoteric feature today. But if it becomes a
| ubiquitous feature in a few years, it's nice to learn about
| using it.
| nine_k wrote:
| Not just that, but the fact that Intel CPUs execute it 20-30
| times slower than AMD Zen 5 CPUs.
|
| Also, the fact that it's deprecated by Intel.
| mycall wrote:
| Now the question is if Intel will revive it now that Zen 5
| has it.
| iamnotagenius wrote:
| Imo the most "unhinged" CPUs for AVX-512 are the early batches
| of Alder Lake, which is the only CPU family with nearly full
| coverage of all existing AVX-512 subsets.
| fuhsnn wrote:
| Do they cover anything Sapphire Rapids Xeons don't? I thought
| they share the same arch (Golden Cove).
| iamnotagenius wrote:
| Yes, you are right; I meant "consumer grade cpu".
| suzumer wrote:
| According to this [1] wikipedia article, the only feature
| Sapphire Rapids doesn't support is VP2INTERSECT.
|
| [1]:https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
| nextaccountic wrote:
| It seems that there are faster alternatives to it
|
| https://arxiv.org/abs/2112.06342
|
| https://www.reddit.com/r/asm/comments/110pld0/fasterthannat
| i...
| hogwarts2025 wrote:
| Or Zen 5. :-p
| janwas wrote:
| Note that the article mentions using both outputs of the
| instruction, whereas the emulation is only able to
| compute one output efficiently.
| Namidairo wrote:
| It's a shame that Intel seemed to really not want people to use
| it, given they started disabling the ability to use it in
| future microcode, and fused it off in later parts.
| Aurornis wrote:
| > It's a shame that Intel seemed to really not want people to
| use it
|
| AVX-512 was never part of the specification for those CPUs.
| It was never advertised as a feature or selling point. You
| had to disable the E cores to enable AVX-512, assuming your
| motherboard even supported it.
|
| Alder Lake AVX-512 has reached mythical status, but I think
| the number of people angry about it is far higher than the
| number of people who ever could have taken advantage of it
| and benefitted from it. For general purpose workloads, having
| the E cores enabled (and therefore AVX-512 disabled) was
| faster. You had to have an extremely specific workload that
| didn't scale well with additional cores and also had hot
| loops that benefitted from AVX-512, which was not very
| common.
|
| So you're right: They never wanted people to use it. It
| wasn't advertised and wasn't usable without sacrificing all
| of the E cores and doing a lot of manual configuration work.
| I suspect they didn't want people using it because they never
| validated it. AVX-512 mode increased the voltages, which
| would impact things like failure rate and warranty returns.
| They probably meant to turn it off but forgot in the first
| versions.
| Dylan16807 wrote:
| The reason you had to disable the E cores was... also an
| artificial barrier imposed by Intel. Enabling AVX-512 only
| looks like a problem when inside that false dichotomy. You
| can have both with a bit of scheduler awareness.
| 867-5309 wrote:
| https://www.reddit.com/r/rpcs3/comments/tqt1ko/clearing_up_
| s...
| adrian_b wrote:
| They had to disable AVX-512 only because Microsoft was too
| lazy to rewrite their thread scheduler to handle
| heterogeneous CPU cores.
|
| The Intel-AMD x86-64 architecture is full of horrible
| things, starting with the System Management Mode added in
| 1990, which Intel added only because Microsoft repeatedly
| refused to update Windows, expecting the hardware vendors
| to do the work instead so that Windows could continue to
| run on newer hardware, even when that caused various
| disadvantages for the customers.
|
| Moreover, even if Intel had not said that Alder Lake would
| support AVX-512, they also had not said that the P-cores of
| Alder Lake would not support it.
|
| Therefore everybody expected Intel to continue providing
| backward compatibility, as it always had before, so that
| the P-cores of Alder Lake would continue to support any
| instruction subset that had been supported by Rocket Lake,
| Tiger Lake, Ice Lake and Cannon Lake.
|
| The failure to be compatible with their previous products
| came as a surprise to everybody.
| p_l wrote:
| Windows can work without SMM, especially NT - the problem
| is that SMM was created for a world where the majority used
| DOS, and the idea of using OS services instead of every
| possible quirk of the IBM PC was anathema to developers.
|
| Thus, SMM, because there was no other way to hook power
| management on a 386 laptop running "normal" DOS.
| kccqzy wrote:
| If Windows could work without SMM, is there a historical
| reason why SMM mode didn't just die and become disused
| after Windows became popular and nobody used DOS any
| more? There are plenty of features in x86 that are
| disused.
| cesarb wrote:
| > Thus, SMM, because there was no other way to hook power
| management on a 386 laptop running "normal" DOS
|
| In theory, there was: you could have a separate
| microcontroller, accessed through some of the I/O ports,
| doing the power management; it's mostly how it's done
| nowadays, with the EC (Embedded Controller) on laptops
| (and nowadays there's also the PSP or ME, which is a
| separate processor core doing startup and power
| management for the main CPU cores). But back then, it
| would also be more expensive (a whole other chip) than
| simply adding an extra mode to the single CPU core
| (multiple cores back then usually required multiple CPU
| chips).
| kccqzy wrote:
| > the P-cores of Alder Lake would continue to support any
| instruction subset that had been supported by Rocket Lake,
| Tiger Lake, Ice Lake and Cannon Lake
|
| Wait. I thought the article says only Tiger Lake supports
| the vp2intersect instruction. Is that not true then?
| adrian_b wrote:
| Tiger Lake is the only one with vp2intersect, but before
| Alder Lake there had already been 3 generations of
| consumer CPUs with AVX-512 support (Cannon Lake in
| 2018/2019, Ice Lake in 2019/2020 and Tiger Lake + Rocket
| Lake in 2020/2021).
|
| So it was expected that any future Intel CPUs would remain
| compatible. Removing an important instruction subset had
| never happened before in Intel's history.
|
| Only AMD has removed some instructions when passing from
| a 32-bit ISA to a 64-bit ISA, most of which were obsolete
| (except that removing interrupt on overflow was bad and
| it does not greatly simplify a CPU core, since there are
| many other sources of precise exceptions that must still
| be supported; the only important effect of removing INTO
| is that many instructions can be retired earlier than
| otherwise, which reduces the risk of filling up the
| retirement queue).
| Symmetry wrote:
| I don't know if I'd call Microsoft lazy. Are there any
| existing operating systems that allow preemptive
| scheduling across cores with different ISA subsets? I'd
| sort of assume Microsoft research has a proof of concept
| for something like that but putting it into a production
| OS is a different kettle of fish.
| ack_complete wrote:
| The problem with the validation argument is that the
| P-cores were advertising AVX-512 via CPUID with the E-cores
| disabled. If the AVX-512 support was not validated and not
| meant to be used, it would not have been a good idea to set
| that CPUID bit, or even allow the instructions to be
| executed without faulting. It's strange that it launched
| with any AVX-512 support at all and there were rumors that
| the decision to drop AVX-512 support officially was made at
| the last minute.
|
| As for the downsides of disabling the E-cores, there were
| Alder Lake SKUs that were P-core only and had no E-cores.
|
| Not all workloads are widely parallelizable and AVX-512 has
| features that are also useful for highly serialized
| workloads such as decompression, even at narrower than
| 512-bit width. Part of the reason AVX-512 has seen limited
| usage is that Intel set back its widespread adoption by half
| a decade by dropping it again from their consumer SKUs, with
| only AVX10/256 due to return starting in ~2026.
| adgjlsfhk1 wrote:
| What about Zen5?
| irthomasthomas wrote:
| I think I have two of these sitting in a box, one prototype
| with avx512 and one retail without. Is it worth me breaking
| these out for ML experiments and such?
| jonstewart wrote:
| The most unhinged AVX-512 instruction is GF2P8AFFINEQB.
| pclmulqdq wrote:
| What about GF2P8AFFINEINVQB?
| jonstewart wrote:
| potato, potato, tomato, tomato
| LarsKrimi wrote:
| It has a fixed polynomial, so not really that useful for
| anything but AES
|
| The only case where I've had use of GF(2^8) inverses is in
| FEC algorithms (Forney's algorithm) and then you need some
| kind of weird polynomial. But all of those needs are rarely
| in the hot path, and the FEC algos are way outdated.
| pclmulqdq wrote:
| I think the AFFINE and AFFINEINV instructions are
| specifically for FEC and maybe compression algorithms. I
| also think they smell like something requested by one of
| the big customers of Intel (e.g. the government).
| LarsKrimi wrote:
| Hmm of course erasure codes would always need to solve
| these problems. Not sure what modern applications need
| that in the x86 world.
|
| I really think it's only AES since that's the only place
| I've seen that polynomial used. But of course maybe
| there's an obscure tape backup FEC algo used somewhere in
| datacenters?
| Sesse__ wrote:
| The forward affine matrix is useful for all sorts of bit
| manipulation, e.g. something as simple as a bit reversal.
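|
| For the curious, a minimal sketch (mine, not Sesse__'s code) of
| that bit-reversal trick: the 8x8 bit matrix packed into the
| constant 0x8040201008040201 reverses the bits of every byte.
| Assumes a GFNI + AVX-512 build (e.g. -mgfni -mavx512f).
|
|     #include <immintrin.h>
|     #include <stdio.h>
|
|     int main(void) {
|         /* per-qword matrix that maps bit i to bit 7-i */
|         __m512i rev = _mm512_set1_epi64(0x8040201008040201ULL);
|         __m512i x   = _mm512_set1_epi8(0x01);
|         /* affine transform with matrix rev and constant b = 0 */
|         __m512i y   = _mm512_gf2p8affine_epi64_epi8(x, rev, 0);
|         /* 0x01 comes back as 0x80: bit 0 moved to bit 7 */
|         printf("0x%02x\n", _mm512_cvtsi512_si32(y) & 0xff);
|         return 0;
|     }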
| bri3d wrote:
| There's a pretty good list of weird off-label uses for the
| Galois Field instructions here:
| https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1...
| genewitch wrote:
| I think I actually need that instruction and have a use case
| for it, and it does something with a matrix transpose so I
| might finally find a real world useful demonstration of a
| matrix operation I can cite to people who don't know what
| those mean.
| mrandish wrote:
| From my 1980s 8-bit CPU perspective, the instruction is
| unhinged based solely on the number of letters. Compared to
| LDA, STA, RTS, that's not an assembler mnemonic, it's a novel.
| :-)
| pclmulqdq wrote:
| "Load accumulator" (LDA)
|
| vs
|
| "Galois Field 2^8 affine transform on quad binary words"
| (GF2P8AFFINEQB)
|
| The compression factor isn't quite the same on character
| count, but it's still abbreviated. :)
| mananaysiempre wrote:
| Incidentally, how is it a GF(2^8) affine transform? As best
| as I can tell, it's a GF(2)^8 affine transform, i.e. an
| affine transform of vectors of bits with normal XOR
| addition and AND multiplication, and the polynomial
| defining GF(2^8) just does not enter anywhere. It does
| enter into GF2P8AFFINEINVQB, but I'm having difficulties
| finding a geometric description for that one at all.
| pclmulqdq wrote:
| I believe that the polynomial for GF2P8AFFINEQB is user-
| defined. One argument is an 8x8 matrix in GF(2) and the
| result is [A.x + b] in GF(2)^8 for each 8-bit section.
| Don't quote me on this, but I believe that matrix
| multiply in GF(2)^8 gets you a transform in GF(2^8).
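|
| A scalar sketch of the per-byte operation as the SDM pseudocode
| describes it (my code, for illustration, using the gcc/clang
| parity builtin): each output bit is the parity of (matrix row
| AND input byte), XORed with the constant. No reduction
| polynomial appears in this forward form; only the INV variant
| bakes in the AES polynomial 0x11B.
|
|     #include <stdint.h>
|     #include <stdio.h>
|
|     static uint8_t affine_byte(uint64_t A, uint8_t x, uint8_t b) {
|         uint8_t out = 0;
|         for (int i = 0; i < 8; i++) {
|             /* row for result bit i is byte 7-i of the matrix qword */
|             uint8_t row = (uint8_t)(A >> (8 * (7 - i)));
|             out |= (uint8_t)(__builtin_parity(row & x) << i);
|         }
|         return out ^ b;
|     }
|
|     int main(void) {
|         /* 0x0102040810204080 is the identity matrix: A*x+0 == x */
|         printf("0x%02x\n", affine_byte(0x0102040810204080ULL, 0xA5, 0));
|         return 0;
|     }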
| inopinatus wrote:
| Sometimes I read through the intrinsics guide just to play the
| game of spotting instructions defined primarily because certain
| cryptologic agencies asked for them.
| camel-cdr wrote:
| Here is Knuth introducing the MMIX instruction MXOR, which
| Intel later defined on vector registers under the name
| vgf2p8affineqb.
|
| https://www.youtube.com/watch?v=r_pPF5npnio&t=3300 (55:00)
|
| "This is an instruction that doesn't exist in any computer
| right now, so why should I put it in a machine, if it's
| supposed to be realistic? Well, it's because it's ahead of
| time."
| mschuster91 wrote:
| MMIX? Now that's something I haven't heard in a long time...
| nine_k wrote:
| What a post. It should have taken a week just to write it, never
| mind the amount of time it took to actually come up with all this
| stuff and overcome all the obstacles mentioned. What a dedication
| to improving the performance of phrase search.
| LoganDark wrote:
| I think you meant to write "it must have" rather than "it
| should have"?
| nine_k wrote:
| Let's agree on "It likely has".
| rkagerer wrote:
| Is the first example under _The genius idea_ heading missing
| entry #3 below?
|
|     mary:
|       docs:
|         - 0:
|             posns: [0, 8]
|         - 1:
|             posns: [2]
|         - 3:
|             posns: [1]
| rstuart4133 wrote:
| I thought it was missing. However, he does introduce it with:
|
| > The inverted index will look something like this:
|
| He isn't wrong. It is indeed "something like".
| nemoniac wrote:
| Fascinating blog post. Having said that, it may seem like
| nitpicking but I have to take issue with the point about
| recursion, which is often far too easily blamed for inefficiency.
|
| The blog post mentions it as one of the reasons for the
| inefficiency of the conventional algorithm.
|
| A glance at the algorithm shows that the recursion in question is
| a tail call. This means that any overhead can be readily
| eliminated using a technique known for nearly fifty years
| already.
|
| Steele, Guy Lewis (1977). "Debunking the "expensive procedure
| call" myth or, procedure call implementations considered harmful
| or, LAMBDA: The Ultimate GOTO". Proceedings of the 1977 annual
| conference on - ACM '77.
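|
| As a toy illustration of the point (my sketch, not the blog's
| code): a search like this has its recursive calls in tail
| position, and gcc/clang at -O2 routinely turn them into plain
| jumps, so no stack grows at all.
|
|     #include <stdio.h>
|
|     /* index of the first element >= key in sorted a[lo..hi);
|        both recursive calls are tail calls */
|     static size_t lower_bound(const long *a, size_t lo, size_t hi,
|                               long key) {
|         if (lo == hi)
|             return lo;
|         size_t mid = lo + (hi - lo) / 2;
|         if (a[mid] < key)
|             return lower_bound(a, mid + 1, hi, key);
|         return lower_bound(a, lo, mid, key);
|     }
|
|     int main(void) {
|         long ids[] = {2, 3, 5, 8, 13, 21};
|         printf("%zu\n", lower_bound(ids, 0, 6, 7)); /* prints 3 */
|         return 0;
|     }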
| queuebert wrote:
| Dumb question: does modern stack layout randomization affect
| the efficiency of recursion? On first glance I would be worried
| about cache misses.
| bri3d wrote:
| Not specifically address space layout randomization in the
| way it's usually implemented; ASLR as applied in most modern
| production OSes randomizes each stack's base address, but
| each stack is still laid out and used in a normal way. There
| are some research projects towards actual stack layout
| randomization (involving stack rearrangement via static
| analysis, randomly sized stack padding frames, and other
| techniques) which would also definitely blow up cache, but
| none that are mainstream in a production system as far as I
| know.
|
| However, for the naive case where the recursion is a full-
| blown function call, without some kind of optimization, other
| security mitigations than ASLR will significantly affect the
| efficiency of recursion by adding function call overhead (and
| possible cache side effects) - for example, the stack cookie
| will still be verified and control-flow guard checks and the
| shadow/return stack will still be in play, if present.
| slashdev wrote:
| A lot of modern programming languages do not do tail call
| optimization, often citing keeping accurate stack history for
| debugging as an excuse.
|
| Regardless of how valid the excuse is, for such an obvious and
| old optimization, it's very poorly supported.
| cesarb wrote:
| The main problem with tail call optimization is that it's
| unreliable; small apparently unrelated changes elsewhere in
| the function, a difference in the compiler command line
| flags, or a different compiler version, could all make a tail
| call become a non-tail call. Some languages have proposed
| explicit markers to force a call to be a tail call (and
| generate a compilation error if it can't), but I don't think
| these proposals have been adopted yet.
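|
| Clang does ship one such marker today as a compiler extension
| (the musttail statement attribute): the call below is either
| compiled as a guaranteed tail call or rejected with an error.
| A sketch of mine, assuming clang 13 or newer:
|
|     static long countdown(long n, long acc) {
|         if (n == 0)
|             return acc;
|         /* refuse to compile if this cannot be a tail call */
|         __attribute__((musttail)) return countdown(n - 1, acc + n);
|     }
|
|     int main(void) {
|         return countdown(10, 0) == 55 ? 0 : 1;
|     }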
| ltbarcly3 wrote:
| Very cool blog post, but the fastest phrase search would just use
| a suffix array, which would make any phrase search take double
| digit nanoseconds.
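|
| For readers unfamiliar with the idea, a minimal sketch (mine,
| with naive O(n^2 log n) construction; real systems use SA-IS or
| similar): sort all suffixes of the corpus, then any phrase
| lookup is a binary search for the phrase as a prefix.
|
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <string.h>
|
|     static const char *text;  /* corpus the suffixes point into */
|
|     static int cmp_suffix(const void *a, const void *b) {
|         return strcmp(text + *(const int *)a, text + *(const int *)b);
|     }
|
|     /* offset of one occurrence of phrase, or -1 if absent */
|     static int find_phrase(const int *sa, int n, const char *phrase) {
|         int lo = 0, hi = n, m = (int)strlen(phrase);
|         while (lo < hi) {
|             int mid = lo + (hi - lo) / 2;
|             if (strncmp(text + sa[mid], phrase, m) < 0) lo = mid + 1;
|             else hi = mid;
|         }
|         if (lo < n && strncmp(text + sa[lo], phrase, m) == 0)
|             return sa[lo];
|         return -1;
|     }
|
|     int main(void) {
|         text = "she sells sea shells by the sea shore";
|         int n = (int)strlen(text), *sa = malloc(n * sizeof *sa);
|         for (int i = 0; i < n; i++) sa[i] = i;
|         qsort(sa, n, sizeof *sa, cmp_suffix);
|         printf("%d\n", find_phrase(sa, n, "sea sh")); /* prints 10 */
|         free(sa);
|         return 0;
|     }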
| yorwba wrote:
| > Why are you merging up to one rare token at the beginning or at
| the end? Let's consider that someone searched for C_0 R_1 C_2
| C_3. If we don't do this merge, we would end up searching for
| C_0, R_1, C_2 C_3, and this is bad. As established, intersecting
| common tokens is a problem, so it's way better to search C_0 R_1,
| C_2 C_3. I learned this the hard way...
|
| But since R_1 C_2 C_3 is in the index as well, instead of
| searching for C_0 R_1, C_2 C_3 with a distance of 2, you can
| instead search for C_0 R_1, R_1 C_2 C_3 with a distance of 1
| (overlapping), which hopefully means that the lists to intersect
| are smaller.
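|
| In code terms, the overlapping check is just a two-pointer walk
| over the two sorted position lists inside a matching document:
| a hit is any start of "R_1 C_2 C_3" exactly one token after a
| start of "C_0 R_1". A small sketch of mine (positions made up):
|
|     #include <stddef.h>
|     #include <stdio.h>
|
|     static int phrase_hit(const int *p1, size_t n1,
|                           const int *p2, size_t n2) {
|         size_t i = 0, j = 0;
|         while (i < n1 && j < n2) {
|             if (p2[j] == p1[i] + 1) return 1;  /* overlap on R_1 */
|             if (p2[j] < p1[i] + 1) j++; else i++;
|         }
|         return 0;
|     }
|
|     int main(void) {
|         int a[] = {4, 17};  /* starts of "C_0 R_1" */
|         int b[] = {5, 30};  /* starts of "R_1 C_2 C_3" */
|         printf("%d\n", phrase_hit(a, 2, b, 2)); /* prints 1 */
|         return 0;
|     }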
___________________________________________________________________
(page generated 2025-01-27 23:02 UTC)