[HN Gopher] Permuting Bits with GF2P8AFFINEQB
       ___________________________________________________________________
        
       Permuting Bits with GF2P8AFFINEQB
        
       Author : mfiguiere
       Score  : 38 points
       Date   : 2023-09-24 05:35 UTC (1 days ago)
        
 (HTM) web link (bitmath.blogspot.com)
 (TXT) w3m dump (bitmath.blogspot.com)
        
       | stabbles wrote:
       | This would be easier to follow with some diagrams...
        
         | ur-whale wrote:
         | Or a pointer to the documentation for GF2P8AFFINEQB
         | 
         | https://www.felixcloutier.com/x86/gf2p8affineqb
         | 
         | https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1...
        
       | dragontamer wrote:
       | For those unaware:
       | 
       | * A Galois Field is a mathematical structure where addition and
       | multiplication have been redefined so that some very useful
       | properties are retained. Galois Fields can exist with any prime
       | number or an "extension field" where that prime-number is
       | vectorized. In this case, the GF2 field (prime number 2) has been
       | extended to 8-bits (aka: a GF(2^8), aka 8-bit Galois Field).
       | 
       | * "Addition"'s new definition is simply XOR.
       | 
       | * "Multiply"'s new definition is bitshift and then add. (ie:
       | 0b10101010 x 0b00010010 == bitshift(x, 4) + bitshift(x, 1),
       | because the 4th and 1st bits are set to 1). And remember that
       | "addition" has been redefined to XOR in this math, so that +
       | means XOR.
       | 
       | * An Affine Transformation is A * x + B, where x is the original
       | value. As a "Galois Field Affine Transformation", A * x and + B
       | are all done in "Galois Field" terms.
       | 
       | * This is an AVX512 instruction, meaning there are 32-parallel
       | versions of this 8-bit computation happening in parallel across a
       | 512-bit vector.
       | 
       | Note: operation traditionally happens modulo a particular
       | GF(number) to create a field. The above operations are
       | "primitives" that can eventually create a field, but aren't
       | making a field just yet.
       | 
       | Perhaps the more accurate names for these operations is "GF-
       | addition" and "GF-pseudo-multiplication". In any case,
       | multiplication is any combination of the 8-bitshifts (bitshift0,
       | bitshift1, bitshift2...) and the 8x such results added together
       | (depending on the 1 or 0 on that bit). Meaning you can very
       | easily describe bitshift-and-xor operations to other
       | cryptographers who are operating in "GF-language".
       | 
       | Its actually really easy, though the terminology is a pain in the
       | ass.
       | 
       | ----------
       | 
       | Traditionally, this multiplication on Intel has been called
       | "carryless multiplication" (clmul) and "polynomial
       | multiplication" (pmull) on ARM. I preferred these names
       | personally.
       | 
       | But if Intel wants to call this the "gfp8affineqb", then so be
       | it, they're the ones naming this instruction not me. Its... a
       | reasonable name, I feel like a better name exists out there but I
       | can't think of one. But its certainly not a perfect name IMO.
       | Maybe "clp8affineqb" ("carryless 8-bit affine") would be my
       | preference?
        
         | gpderetta wrote:
         | So is this instruction basically a carry-less vectorized FMAD?
        
           | dragontamer wrote:
           | Seems like it.
           | 
           | I only learned of this operation from reading this blogpost.
           | "Permuting Bits" seems like a bloody obvious use of this
           | instruction, but honestly, anyone who is familiar with carry-
           | less multiplication will immediately see the widespread
           | capabilities of this instruction.
           | 
           | It really demonstrates to me that once again, Intel and their
           | design of AVX512 has some very smart people in there. Its a
           | good, flexible, operation. Albeit arcane and few people
           | probably see how to use this instruction well, but its really
           | exciting to see "such a good idea" from them.
           | 
           | GF(2^8) fields (aka: 8-bit GF operations) are used in all
           | sorts of CRC, Error-correction Reed Solomon, Elliptical
           | Curve, etc. etc. operations. But beyond just that, bitshifts
           | and xors innately have a huge number of applications outside
           | of cryptography or GF-stuffs. It makes sense to make a
           | dedicated instruction like this.
           | 
           | ------
           | 
           | EDIT: Upon rereading the end of the blogpost:
           | 
           | > I will probably keep using a SAT solver to solve the masks
           | 
           | Yeah, that's actually what's needed to generalize this.
           | 
           | Note: common operations like "if(a) b else c", or "min(a,
           | b)", or such can likely be specified in terms of AVX512
           | operations now.
           | 
           | Powerful, efficient, operations like GF2P8AFFINEQB, pext,
           | pdep, and pshufb are getting damn close to an FPGA-like
           | parallel processor.
           | 
           | I mean, pshufb is already an arbitrary 4-bit truth table /
           | 4-LUT if you think about it.
        
           | dzaima wrote:
           | Not quite - in "A * x + b", "A" is an 8x8 matrix, while "x"
           | and "b" are 8-bit vectors. (it's vectorized over "A" and "x",
           | but not "b", which is the immediate value)
        
             | dragontamer wrote:
             | https://www.felixcloutier.com/x86/gf2p8affineqb
             | 
             | VGF2P8AFFINEQB zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8
             | 
             | So there are 3x AVX512 registers + 1x 8-bit immediate.
             | 
             | zmm1 seems to be the output register, the 64x 8-bit
             | results.
             | 
             | zmm2 seems to be the "64x parallel x" in Ax + b.
             | 
             | zmm3 seems to be a 8x parallel A-matrix. I'm not sure how
             | these 8x A-matrices line up to the final result, but
             | perhaps 8-at-a-time is the logical conclusion. (outputs
             | values#0-7 use A[0] matrix. Output#8-15 uses A[1] matrix?)
             | 
             | imm8 seems to be "b" in the Ax + b formula, and is shared
             | with all operations.
             | 
             | --------
             | 
             | EDIT: I think I get it now. Edited this post with my latest
             | understanding.
        
               | dzaima wrote:
               | Yeah the edited version is correct; it matches up each
               | group of eight "x" values with one "A" matrix (such that,
               | in the bitwise representation of the arguments, there's
               | no data transfer happening across 64-bit blocks; this
               | ends up quite limiting if you want different matrices, so
               | most of the uses are ones where the matrix is the same
               | for all elements).
        
               | dragontamer wrote:
               | This is nuts.
               | 
               | https://www.intel.com/content/www/us/en/docs/intrinsics-
               | guid...
               | 
               | Intel's documentation is claiming 0.5 CPI, or a bandwidth
               | of 2x of these operations per clock tick.
               | 
               | I mean, forget about bit-shuffling or permuting. I feel
               | like with the right configuration, you can operate
               | 512-bits of CRC32 calculations in parallel, maybe in like
               | 3 or 4 instructions (or ~2 clock ticks) or something
               | incredibly fast.
               | 
               | > this ends up quite limiting if you want different
               | matrices, so most of the uses are ones where the matrix
               | is the same for all elements
               | 
               | Elliptical Curves is the obvious use case.
               | 
               | CRC is one immediate use where all the matricies are the
               | same (and linear in the GF-field), albeit the reduction
               | needs to be done "somehow" but that can be figured out in
               | a later operation... probably...
               | 
               | AES encryption is operated over 4x sets of GF(2^8). I
               | know that AES is already got its own instruction but a
               | full software implementation of the GF-operations behind
               | AES might be possible with this.
               | 
               | --------------
               | 
               | I dunno, there's a lot of things I see already as
               | immediately applicable where the "A" matrix is nearly
               | constant across all elements. And for those where A isn't
               | constant, pipelined operation is obvious.
               | 
               | 8x different "A" matrixies means that we can build a
               | "pipeline" where we can AVX512-shift the bytes down by 8
               | as well each clock tick, and build an 8x-deep pipeline
               | and apply those 8x A matricies in parallel while
               | consuming 64-bytes per clock-tick, while performing
               | 512-bit worth of operation per clock tick (no, 1024-bits
               | because Intel chips support 2x such operations per clock
               | tick).
               | 
               | Pipelining, full parallelization, etc. etc. So many
               | architectures are possible with this spec. IMO, it shows
               | that Intel knows what they're doing with these intrinsic,
               | and I think this is something now GPU can accomplish yet.
               | Truly an advantage to Intel chips only.
               | 
               | Very nice instruction. The practical benefits though
               | (Intel-only, exceptionally arcane with very few
               | programmers who can work with, etc. etc.) may make it a
               | dead instruction though. But I'm sure some wizard out
               | there is going to do something cool with this instruction
               | over the next 5 years. (It'd probably take weeks of
               | practice on some practice problem just to learn how to
               | use this instruction correctly, lol)
        
               | dzaima wrote:
               | 0.5 CPI is for the 128-bit or 256-bit versions; 512-bit
               | ones are 1 CPI. Same on Zen 4 according to uops.info - ht
               | tps://uops.info/table.html?search=vgf2p8affineqb%20zmm&cb
               | _...
        
               | dragontamer wrote:
               | That's still crazy to me that such a powerful operation
               | is done per clock tick.
               | 
               | I appreciate the uops.info table. I've never known about
               | that site, so I'll save that off now. Very nicely
               | presented latency/throughput stats.
               | 
               | -----------
               | 
               | EDIT: Looks like
               | https://www.felixcloutier.com/x86/gf2p8mulb also exists.
               | The 0b1'0001'1011 polynomial / x^8 + x^4+x^3+x+1
               | intrigues me, anyone have any idea why this particular
               | GF-polynomial was chosen?
               | 
               | EDIT2: Looks like its related to AES somehow
               | (https://crypto.stackexchange.com/questions/51848/why-
               | is-x8-x...). The CPU-chip already has AES-instructions
               | hard coded for modern TLS reasons (and AES-acceleration
               | is going to be in every modern chip). So I'm thinking
               | that these multiplication routines might be "sharing the
               | fundamental units" of the Intel CPU, and Intel is
               | wondering if anyone out there can make due with the
               | individual steps of Rinjadel/AES in other applications.
        
             | dragontamer wrote:
             | I appreciate the correction.
             | 
             | So this is perhaps more accurately denoted as:
             | 
             | * (A[0] * x + A[1] * x + ... A[7] * x + B).
             | 
             | If we normalize everything to 8-bit values. This is a far
             | more powerful operation than I initially described. Where *
             | is this pseudo-multiply (bitshift and xor) operation and +
             | is XOR.
        
             | [deleted]
        
         | H8crilA wrote:
         | How do you achieve actual multiplication? I remember there was
         | some polynomial that you had to divide by, and then every non-
         | zero element becomes invertible? Such pseudo-multiplication
         | doesn't seem to be invertible, as 0b00000010 multiplied by
         | anything will always end with a 0: 0b???????0.
        
           | dragontamer wrote:
           | You've got the gist but you're confusing a bunch of concepts
           | together.
           | 
           | Note: Remember that "x / y" division in a field is simply "x
           | * (1/y)", because (1/y) is always uniquely defined in a field
           | and "proper". So in practice, its actually a whole lot of
           | multiplies.
           | 
           | Note2: For the 8-bit domain to become a field, you must
           | perform a "modulo p", where p is prime (for simple fields),
           | or an irreducible polynomial (for an extension field). For us
           | computer-programmers, we're almost always operating on an
           | extension field (8-vector of field(2), aka a binary vector of
           | size 8-bits). (Though hardware designers may choose say,
           | GF(2^5), the 5-bit numbers, for USB1.0 packets for example)
           | 
           | So you need to perform the operation (x modulo p) and
           | normalize the result to return it to a field. There are
           | different fields for each chosen p, so the first step is to
           | choose a p. P is almost always chosen arbitrarily, but
           | there's benefits to different p's chosen. In any case, p
           | differs from case-to-case since they have minor benefits
           | (this p might be slightly faster at multiplication in
           | software, this other p might have an easier circuit layout
           | and is related to some simple LSFR, etc. etc.). So p is
           | arbitrary and changes from case-to-case.
           | 
           | After that, (x modulo p) is performed by (x / p), which
           | results in (quotient + remainder/p). In practice, one
           | iteration of this operation commonly results in a 9-bit
           | number or 10-bit number or 16-bit number. Anything 9 to 16
           | bits in length is obviously "not in GF(2^8)", so we
           | repeatedly perform the operation until the top-bits are zero
           | and we only have an 8-bit result.
           | 
           | (remainder/p) is simply (remainder * (1/p)), because this is
           | a field (all numbers 1/x are properly defined), so we can
           | "implement" this in practice through multiplications only.
           | 
           | --------------
           | 
           | And that's all CRC-32 and Reed Solomon Encoding is.
           | 
           | EDIT: I'm not sure if what I said originally was correct. I
           | decided to change the "modulo" algorithm to the LSFR style
           | which was simpler for me to discuss.
        
       | phiiiillll wrote:
       | Here's some context for this article:
       | 
       | https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1...
        
       | zodzedzi wrote:
       | For anyone else who was also seeing just garbled characters in
       | the title, and since the first few paragraphs of the linked
       | article don't explain what that is:
       | 
       | GF2P8AFFINEQB is one of the AVX-512 CPU instructions that
       | performs an affine transformation (essentially AX+b) on a Galois
       | field - finite number fields used in coding theory and
       | cryptography.
       | 
       | https://en.wikipedia.org/wiki/AVX-512#GFNI
        
         | spicymaki wrote:
         | Funny enough when I saw the gobbledygook, I immediately thought
         | to myself this looks like an AVX instruction.
        
       ___________________________________________________________________
       (page generated 2023-09-25 23:01 UTC)