[HN Gopher] Using the most unhinged AVX-512 instruction to make ...
       ___________________________________________________________________
        
       Using the most unhinged AVX-512 instruction to make fastest phrase
       search algo
        
       Author : cmcollier
       Score  : 214 points
       Date   : 2025-01-23 21:38 UTC (4 days ago)
        
 (HTM) web link (gab-menezes.github.io)
 (TXT) w3m dump (gab-menezes.github.io)
        
       | nxobject wrote:
       | Spoiler if you don't want to read through the (wonderful but
       | many) paragraphs of exposition: the instruction is
       | `vp2intersectq k, zmm, zmm`.
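       | 
       | For reference, a minimal sketch of how the corresponding
       | intrinsic is typically invoked (not the post's code; assumes a
       | CPU with the extension and a compiler flag like
       | -mavx512vp2intersect):
       | 
       |     #include <immintrin.h>
       |     #include <stdio.h>
       |     
       |     // Intersect two sets of eight 64-bit IDs. The single
       |     // instruction fills TWO mask registers: k1 marks lanes of
       |     // a that match somewhere in b, k2 marks lanes of b that
       |     // match somewhere in a.
       |     int main(void) {
       |         __m512i a = _mm512_setr_epi64(1, 3, 5, 7, 9, 11, 13, 15);
       |         __m512i b = _mm512_setr_epi64(2, 3, 4, 7, 8, 15, 16, 17);
       |         __mmask8 k1, k2;
       |         _mm512_2intersect_epi64(a, b, &k1, &k2);
       |         printf("a: 0x%02x\n", (unsigned)k1); // -> 0x8a
       |         printf("b: 0x%02x\n", (unsigned)k2); // -> 0x2a
       |         return 0;
       |     }
       | 
       | Getting both result masks from the one instruction is the part
       | that the emulation alternatives mentioned elsewhere in the
       | thread can't do cheaply.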
        
         | bri3d wrote:
         | And, as noted in the article, that's an instruction which only
         | works on two desktop CPU architectures (Tiger Lake and Zen 5),
         | including one where it's arguably slower than not using it
         | (Tiger Lake).
         | 
         | Meaning... this entire effort was for something that's faster
         | on only a single kind of CPU (Zen 5).
         | 
         | This article is honestly one of the best I've read in a long
         | time. It's esoteric and the result is 99.5% pointless
         | objectively, but in reality it's incredibly useful and a
         | wonderful guide to low-level x86 optimization end to end. The
         | sections on cache alignment and uiCA + analysis notes are a
         | perfect illustration of "how it's done."
        
           | tpm wrote:
           | Presumably Zen 5 cores will also get used in Threadripper and
           | EPYC processors.
        
             | josephg wrote:
             | Yep. And the feature will probably be available on all AMD
             | CPUs manufactured from here on.
             | 
             | It might be an esoteric feature today. But if it becomes
             | a ubiquitous feature in a few years, it's nice to learn
             | about using it.
        
         | nine_k wrote:
         | Not just that, but the fact that Intel CPUs execute it 20-30
         | times slower than AMD Zen 5 CPUs.
         | 
         | Also, the fact that it's deprecated by Intel.
        
           | mycall wrote:
           | Now the question is whether Intel will revive it now that
           | Zen 5 has it.
        
       | iamnotagenius wrote:
       | Imo the most "unhinged" CPUs for AVX-512 are early batches of
       | Alder Lake, which is the only CPU family that has nearly full
       | coverage of all existing AVX-512 subsets.
        
         | fuhsnn wrote:
         | Do they cover anything Sapphire Rapids Xeons don't? I thought
         | they share the same arch (Golden Cove).
        
           | iamnotagenius wrote:
           | Yes, you are right; I meant "consumer-grade CPU".
        
           | suzumer wrote:
           | According to this Wikipedia article [1], the only feature
           | Sapphire Rapids doesn't support is VP2INTERSECT.
           | 
           | [1]:https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
        
             | nextaccountic wrote:
             | It seems that there are faster alternatives to it
             | 
             | https://arxiv.org/abs/2112.06342
             | 
             | https://www.reddit.com/r/asm/comments/110pld0/fasterthannat
             | i...
        
               | hogwarts2025 wrote:
               | Or Zen 5. :-p
        
               | janwas wrote:
               | Note that the article mentions using both outputs of the
               | instruction, whereas the emulation is only able to
               | compute one output efficiently.
        
         | Namidairo wrote:
         | It's a shame that Intel seemed to really not want people to use
         | it, given they started disabling the ability to use it in
         | future microcode, and fused it off in later parts.
        
           | Aurornis wrote:
           | > It's a shame that Intel seemed to really not want people to
           | use it
           | 
           | AVX-512 was never part of the specification for those CPUs.
           | It was never advertised as a feature or selling point. You
           | had to disable the E cores to enable AVX-512, assuming your
           | motherboard even supported it.
           | 
           | Alder Lake AVX-512 has reached mythical status, but I think
           | the number of people angry about it is far higher than the
           | number of people who ever could have taken advantage of it
           | and benefitted from it. For general purpose workloads, having
           | the E cores enabled (and therefore AVX-512 disabled) was
           | faster. You had to have an extremely specific workload that
           | didn't scale well with additional cores and also had hot
           | loops that benefitted from AVX-512, which was not very
           | common.
           | 
           | So you're right: They never wanted people to use it. It
           | wasn't advertised and wasn't usable without sacrificing all
           | of the E cores and doing a lot of manual configuration work.
           | I suspect they didn't want people using it because they never
           | validated it. AVX-512 mode increased the voltages, which
           | would impact things like failure rate and warranty returns.
           | They probably meant to turn it off but forgot in the first
           | versions.
        
             | Dylan16807 wrote:
             | The reason you had to disable the E cores was... also an
             | artificial barrier imposed by Intel. Enabling AVX-512 only
             | looks like a problem when inside that false dichotomy. You
             | can have both with a bit of scheduler awareness.
        
             | 867-5309 wrote:
             | https://www.reddit.com/r/rpcs3/comments/tqt1ko/clearing_up_
             | s...
        
             | adrian_b wrote:
             | They had to disable AVX-512 only because Microsoft was
             | too lazy to rewrite its thread scheduler to handle
             | heterogeneous CPU cores.
             | 
             | The Intel-AMD x86-64 architecture is full of horrible
             | things, starting with the System Management Mode added in
             | 1990, which were added by Intel only because Microsoft
             | has repeatedly refused to update Windows, expecting the
             | hardware vendors to do the work of keeping Windows
             | running on newer hardware instead, even when that causes
             | various disadvantages for the customers.
             | 
             | Moreover, even though Intel had not said that Alder Lake
             | would support AVX-512, they also had not said that the
             | P-cores of Alder Lake would not support it.
             | 
             | Therefore everybody expected that Intel would continue to
             | provide backward compatibility, as it always had before,
             | so that the P-cores of Alder Lake would continue to
             | support any instruction subset that had been supported by
             | Rocket Lake, Tiger Lake, Ice Lake, and Cannon Lake.
             | 
             | The failure to stay compatible with their previous
             | products came as a surprise to everybody.
        
               | p_l wrote:
               | Windows can work without SMM, especially NT - the
               | problem is that SMM was created for a world where the
               | majority used DOS, and the idea of using OS services
               | instead of every possible quirk of the IBM PC was
               | anathema to developers.
               | 
               | Thus, SMM, because there was no other way to hook
               | power management on a 386 laptop running "normal" DOS.
        
               | kccqzy wrote:
               | If Windows could work without SMM, is there a
               | historical reason why SMM didn't just die and become
               | disused after Windows became popular and nobody used
               | DOS any more? There are plenty of features in x86 that
               | are disused.
        
               | cesarb wrote:
               | > Thus, SMM, because there was no other way to hook
               | power management on a 386 laptop running "normal" DOS
               | 
               | In theory, there was: you could have a separate
               | microcontroller, accessed through some of the I/O ports,
               | doing the power management; it's mostly how it's done
               | nowadays, with the EC (Embedded Controller) on laptops
               | (and nowadays there's also the PSP or ME, which is a
               | separate processor core doing startup and power
               | management for the main CPU cores). But back then, it
               | would also be more expensive (a whole other chip) than
               | simply adding an extra mode to the single CPU core
               | (multiple cores back then usually required multiple CPU
               | chips).
        
               | kccqzy wrote:
               | > the P-cores of Alder Lake will continue to support any
               | instruction subset that had been supported by Rocket Lake
               | and Tiger Lake and Ice Lake and Cannon Lake
               | 
               | Wait. I thought the article says only Tiger Lake supports
               | the vp2intersect instruction. Is that not true then?
        
               | adrian_b wrote:
               | Tiger Lake is the only one with vp2intersect, but before
               | Alder Lake there had already been 3 generations of
               | consumer CPUs with AVX-512 support (Cannon Lake in
               | 2018/2019, Ice Lake in 2019/2020 and Tiger Lake + Rocket
               | Lake in 2020/2021).
               | 
               | So it was expected that any future Intel CPUs would
               | remain compatible. Removing an important instruction
               | subset had never happened before in Intel's history.
               | 
               | Only AMD has removed some instructions, when passing
               | from the 32-bit ISA to the 64-bit ISA, and most of
               | those were obsolete (except that removing interrupt on
               | overflow was bad, and it does not greatly simplify a
               | CPU core, since there are many other sources of
               | precise exceptions that must still be supported; the
               | only important effect of removing INTO is that many
               | instructions can be retired earlier than otherwise,
               | which reduces the risk of filling up the retirement
               | queue).
        
               | Symmetry wrote:
               | I don't know if I'd call Microsoft lazy. Are there any
               | existing operating systems that allow preemptive
               | scheduling across cores with different ISA subsets? I'd
               | sort of assume Microsoft Research has a proof of
               | concept for something like that, but putting it into a
               | production OS is a different kettle of fish.
        
             | ack_complete wrote:
             | The problem with the validation argument is that the
             | P-cores were advertising AVX-512 via CPUID with the E-cores
             | disabled. If the AVX-512 support was not validated and
             | not meant to be used, it would not have been a good idea
             | to set that CPUID bit, or even to allow the instructions
             | to execute without faulting. It's strange that it launched
             | with any AVX-512 support at all and there were rumors that
             | the decision to drop AVX-512 support officially was made at
             | the last minute.
             | 
             | As for the downsides of disabling the E-cores, there were
             | Alder Lake SKUs that were P-core only and had no E-cores.
             | 
             | Not all workloads are widely parallelizable and AVX-512 has
             | features that are also useful for highly serialized
             | workloads such as decompression, even at narrower than
             | 512-bit width. Part of the reason that AVX-512 has limited
             | usage is that Intel has set back widespread adoption of
             | AVX-512 by half a decade by dropping it again from their
             | consumer SKUs, with AVX10/256 only to return starting in
             | ~2026.
        
         | adgjlsfhk1 wrote:
         | What about Zen5?
        
         | irthomasthomas wrote:
         | I think I have two of these sitting in a box, one prototype
         | with avx512 and one retail without. Is it worth me breaking
         | these out for ML experiments and such?
        
       | jonstewart wrote:
       | The most unhinged AVX-512 instruction is GF2P8AFFINEQB.
        
         | pclmulqdq wrote:
         | What about GF2P8AFFINEINVQB?
        
           | jonstewart wrote:
           | potato, potato, tomato, tomato
        
           | LarsKrimi wrote:
           | It has a fixed polynomial, so it's not really that useful
           | for anything but AES.
           | 
           | The only case where I've had use for GF(2^8) inverses is in
           | FEC algorithms (Forney's algorithm), and then you need some
           | kind of weird polynomial. But all of those needs are rarely
           | in the hot path, and the FEC algos are way outdated.
        
             | pclmulqdq wrote:
             | I think the AFFINE and AFFINEINV instructions are
             | specifically for FEC and maybe compression algorithms. I
             | also think they smell like something requested by one of
             | the big customers of Intel (e.g. the government).
        
               | LarsKrimi wrote:
               | Hmm, of course erasure codes would always need to solve
               | these problems. Not sure what modern applications need
               | that in the x86 world.
               | 
               | I really think it's only AES, since that's the only
               | place I've seen that polynomial used. But of course
               | maybe there's an obscure tape backup FEC algo used
               | somewhere in datacenters?
        
               | Sesse__ wrote:
               | The forward affine matrix is useful for all sorts of bit
               | manipulation, e.g. something as simple as a bit reversal.
        
         | bri3d wrote:
         | There's a pretty good list of weird off-label uses for the
         | Galois Field instructions here:
         | https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1...
        
           | genewitch wrote:
           | I think I actually need that instruction and have a use
           | case for it. It does something with a matrix transpose, so
           | I might finally find a real-world, useful demonstration of
           | a matrix operation I can cite to people who don't know
           | what those mean.
        
         | mrandish wrote:
         | From my 1980s 8-bit CPU perspective, the instruction is
         | unhinged based solely on the number of letters. Compared to
         | LDA, STA, RTS, that's not an assembler mnemonic, it's a novel.
         | :-)
        
           | pclmulqdq wrote:
           | "Load accumulator" (LDA)
           | 
           | vs
           | 
           | "Galois Field 2^8 affine transform on quad binary words"
           | (GF2P8AFFINEQB)
           | 
           | The compression factor isn't quite the same on character
           | count, but it's still abbreviated. :)
        
             | mananaysiempre wrote:
             | Incidentally, how is it a GF(2^8) affine transform? As best
             | as I can tell, it's a GF(2)^8 affine transform, i.e. an
             | affine transform of vectors of bits with normal XOR
             | addition and AND multiplication, and the polynomial
             | defining GF(2^8) just does not enter anywhere. It does
             | enter into GF2P8AFFINEINVQB, but I'm having difficulties
             | finding a geometric description for that one at all.
        
               | pclmulqdq wrote:
               | I believe that the polynomial for GF2P8AFFINEQB is user-
               | defined. One argument is an 8x8 matrix in GF(2) and the
               | result is [A.x + b] in GF(2)^8 for each 8-bit section.
               | Don't quote me on this, but I believe that matrix
               | multiply in GF(2)^8 gets you a transform in GF(2^8).
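               | 
               | Going by the pseudocode in Intel's manual, the forward
               | form is a plain GF(2) matrix-vector product plus a
               | constant; the fixed reduction polynomial (the AES
               | field's 0x11B) only enters the INV variant. A scalar
               | model of the per-byte forward operation, as a sketch
               | (using GCC/Clang's __builtin_parity):
               | 
               |     #include <stdint.h>
               |     
               |     // One byte of GF2P8AFFINEQB: output bit i is the
               |     // parity of (matrix row AND x), XORed with bit i
               |     // of the immediate. The row for output bit i sits
               |     // in byte 7-i of the matrix qword A.
               |     uint8_t gf2p8affine_byte(uint64_t A, uint8_t x,
               |                              uint8_t b) {
               |         uint8_t out = 0;
               |         for (int i = 0; i < 8; i++) {
               |             uint8_t row = (uint8_t)(A >> (8 * (7 - i)));
               |             int bit = __builtin_parity(row & x);
               |             out |= (uint8_t)(bit << i);
               |         }
               |         return out ^ b;
               |     }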
        
         | inopinatus wrote:
         | Sometimes I read through the intrinsics guide just to play the
         | game of spotting instructions defined primarily because
         | certain cryptologic agencies asked for them.
        
         | camel-cdr wrote:
         | Here is Knuth introducing the MMIX instruction MXOR, which
         | Intel later defined on vector registers under the name
         | vgf2p8affineqb.
         | 
         | https://www.youtube.com/watch?v=r_pPF5npnio&t=3300 (55:00)
         | 
         | "This is an instruction that doesn't exist in any computer
         | right now, so why should I put it in a machine, if it's
         | supposed to be realistic? Well, it's because it's ahead of
         | time."
        
           | mschuster91 wrote:
           | MMIX? Now that's something I haven't heard in a long time...
        
       | nine_k wrote:
       | What a post. It should have taken a week just to write it, never
       | mind the amount of time it took to actually come up with all this
       | stuff and overcome all the obstacles mentioned. What a dedication
       | to improving the performance of phrase search.
        
         | LoganDark wrote:
         | I think you meant to write "it must have" rather than "it
         | should have"?
        
           | nine_k wrote:
           | Let's agree on "It likely has".
        
       | rkagerer wrote:
       | Is the first example under _The genius idea_ heading missing
       | entry #3 below?
       | 
       |     mary:
       |       docs:
       |         - 0:
       |             posns: [0, 8]
       |         - 1:
       |             posns: [2]
       |         - 3:
       |             posns: [1]
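       | 
       | For context: each token maps to the documents it appears in
       | plus its positions there, and a phrase hit is a shared
       | document whose positions line up at the right distance. A
       | minimal sketch of that position check (hypothetical names,
       | not the post's code):
       | 
       |     #include <stdint.h>
       |     #include <stddef.h>
       |     #include <stdio.h>
       |     
       |     // Sorted positions of one token within one document.
       |     typedef struct {
       |         const uint32_t *posns;
       |         size_t len;
       |     } posting_t;
       |     
       |     // Does some position p in `a` have p + dist present in
       |     // `b`? A classic merge over two sorted lists.
       |     int phrase_match(const posting_t *a, const posting_t *b,
       |                      uint32_t dist) {
       |         size_t i = 0, j = 0;
       |         while (i < a->len && j < b->len) {
       |             uint32_t want = a->posns[i] + dist;
       |             if (b->posns[j] == want) return 1;
       |             if (b->posns[j] < want) j++; else i++;
       |         }
       |         return 0;
       |     }
       |     
       |     int main(void) {
       |         // Doc 0: "mary" at 0 and 8, "had" at 1 -> adjacent.
       |         uint32_t mary[] = {0, 8}, had[] = {1};
       |         posting_t a = {mary, 2}, b = {had, 1};
       |         printf("%d\n", phrase_match(&a, &b, 1)); // prints 1
       |         return 0;
       |     }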
        
         | rstuart4133 wrote:
         | I thought it was missing, too. However, he does introduce it
         | with:
         | 
         | > The inverted index will look something like this:
         | 
         | He isn't wrong. It is indeed "something like".
        
       | nemoniac wrote:
       | Fascinating blog post. Having said that, it may seem like
       | nitpicking but I have to take issue with the point about
       | recursion, which is often far too easily blamed for inefficiency.
       | 
       | The blog post mentions it as one of the reasons for the
       | inefficiency of the conventional algorithm.
       | 
       | A glance at the algorithm shows that the recursion in question is
       | a tail call. This means that any overhead can be readily
       | eliminated using a technique known for nearly fifty years
       | already.
       | 
       | Steele, Guy Lewis (1977). "Debunking the 'Expensive Procedure
       | Call' Myth or, Procedure Call Implementations Considered Harmful
       | or, LAMBDA: The Ultimate GOTO". Proceedings of the 1977 Annual
       | Conference (ACM '77).
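       | 
       | For the curious, a sketch of the transformation Steele
       | describes: a self tail call needs nothing from the current
       | frame once it starts, so it can be compiled (or rewritten by
       | hand) into a plain loop with no stack growth. Hypothetical
       | example, not the post's code:
       | 
       |     #include <stddef.h>
       |     #include <stdint.h>
       |     
       |     // Tail-recursive form: the recursive call is the very
       |     // last action, so the current frame is dead once it starts.
       |     size_t count_ge_rec(const uint32_t *v, size_t n,
       |                         uint32_t x, size_t acc) {
       |         if (n == 0) return acc;
       |         return count_ge_rec(v + 1, n - 1, x,
       |                             acc + (v[0] >= x));
       |     }
       |     
       |     // What tail-call elimination effectively produces: the
       |     // call becomes parameter updates plus a jump to the top.
       |     size_t count_ge_loop(const uint32_t *v, size_t n,
       |                          uint32_t x, size_t acc) {
       |         for (; n != 0; v++, n--) acc += (v[0] >= x);
       |         return acc;
       |     }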
        
         | queuebert wrote:
         | Dumb question: does modern stack layout randomization affect
         | the efficiency of recursion? At first glance I would be
         | worried about cache misses.
        
           | bri3d wrote:
           | Not specifically address space layout randomization in the
           | way it's usually implemented; ASLR as applied in most modern
           | production OSes randomizes each stack's base address, but
           | each stack is still laid out and used in a normal way. There
           | are some research projects towards actual stack layout
           | randomization (involving stack rearrangement via static
           | analysis, randomly sized stack padding frames, and other
           | techniques) which would also definitely blow up cache, but
           | none that are mainstream in a production system as far as I
           | know.
           | 
           | However, for the naive case where the recursion is a full-
           | blown function call, without some kind of optimization,
           | security mitigations other than ASLR will significantly
           | affect the efficiency of recursion by adding function call
           | overhead (and possible cache side effects) - for example,
           | the stack cookie will still be verified, and control-flow
           | guard checks and the shadow/return stack will still be in
           | play, if present.
        
         | slashdev wrote:
         | A lot of modern programming languages do not do tail call
         | optimization, often citing keeping accurate stack history for
         | debugging as an excuse.
         | 
         | Regardless of how valid the excuse is, for such an obvious and
         | old optimization, it's very poorly supported.
        
           | cesarb wrote:
           | The main problem with tail call optimization is that it's
           | unreliable; small apparently unrelated changes elsewhere in
           | the function, a difference in the compiler command line
           | flags, or a different compiler version, could all make a tail
           | call become a non-tail call. Some languages have proposed
           | explicit markers to force a call to be a tail call (and
           | generate a compilation error if it can't), but I don't think
           | these proposals have been adopted yet.
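           | 
           | At the compiler level (rather than in the language
           | standard), clang does already ship such a marker: a
           | musttail attribute on a return statement turns "this did
           | not compile to a tail call" into a hard error instead of
           | silent stack growth. A clang-specific sketch:
           | 
           |     #include <stddef.h>
           |     #include <stdint.h>
           |     
           |     // With clang, the attribute below makes the build fail
           |     // unless this return is emitted as a genuine tail call,
           |     // so unrelated changes can't quietly reintroduce stack
           |     // growth.
           |     uint64_t sum_rec(const uint32_t *v, size_t n,
           |                      uint64_t acc) {
           |         if (n == 0) return acc;
           |         __attribute__((musttail))
           |         return sum_rec(v + 1, n - 1, acc + v[0]);
           |     }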
        
       | ltbarcly3 wrote:
       | Very cool blog post, but the fastest phrase search would just
       | use a suffix array, which would make any phrase search take
       | double-digit nanoseconds.
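       | 
       | For context, a toy sketch of that approach (illustrative
       | only; a real engine also needs document boundaries, a
       | linear-time construction such as SA-IS, and a far more
       | compact layout than one pointer per character):
       | 
       |     #include <stdio.h>
       |     #include <stdlib.h>
       |     #include <string.h>
       |     
       |     // Toy suffix-array phrase search: sort all suffixes once,
       |     // then any phrase lookup is a binary search for the block
       |     // of suffixes that start with it.
       |     static const char *text;
       |     
       |     static int cmp_suffix(const void *a, const void *b) {
       |         return strcmp(text + *(const int *)a,
       |                       text + *(const int *)b);
       |     }
       |     
       |     int main(void) {
       |         text = "mary had a little lamb mary had";
       |         int n = (int)strlen(text);
       |         int *sa = malloc(n * sizeof *sa);
       |         for (int i = 0; i < n; i++) sa[i] = i;
       |         // Toy O(n^2 log n) build; real code would use SA-IS.
       |         qsort(sa, n, sizeof *sa, cmp_suffix);
       | 
       |         const char *phrase = "mary had";
       |         size_t m = strlen(phrase);
       |         // Lower bound: first suffix not less than the phrase.
       |         int lo = 0, hi = n;
       |         while (lo < hi) {
       |             int mid = (lo + hi) / 2;
       |             if (strncmp(text + sa[mid], phrase, m) < 0)
       |                 lo = mid + 1;
       |             else
       |                 hi = mid;
       |         }
       |         // Every suffix in the contiguous block starting with
       |         // the phrase is one occurrence of it.
       |         for (int i = lo;
       |              i < n && strncmp(text + sa[i], phrase, m) == 0;
       |              i++)
       |             printf("occurrence at offset %d\n", sa[i]);
       |         free(sa);
       |         return 0;
       |     }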
        
       | yorwba wrote:
       | > Why are you merging up to one rare token at the beginning or at
       | the end? Let's consider that someone searched for C_0 R_1 C_2
       | C_3. If we don't do this merge, we would end up searching for
       | C_0, R_1, C_2 C_3, and this is bad. As established, intersecting
       | common tokens is a problem, so it's way better to search C_0 R_1,
       | C_2 C_3. I learned this the hard way...
       | 
       | But since R_1 C_2 C_3 is in the index as well, instead of
       | searching for C_0 R_1, C_2 C_3 with a distance of 2, you can
       | instead search for C_0 R_1, R_1 C_2 C_3 with a distance of 1
       | (overlapping), which hopefully means that the lists to intersect
       | are smaller.
        
       ___________________________________________________________________
       (page generated 2025-01-27 23:02 UTC)