[HN Gopher] Permuting Bits with GF2P8AFFINEQB
___________________________________________________________________
Permuting Bits with GF2P8AFFINEQB
Author : mfiguiere
Score : 38 points
Date : 2023-09-24 05:35 UTC (1 days ago)
(HTM) web link (bitmath.blogspot.com)
(TXT) w3m dump (bitmath.blogspot.com)
| stabbles wrote:
| This would be easier to follow with some diagrams...
| ur-whale wrote:
| Or a pointer to the documentation for GF2P8AFFINEQB
|
| https://www.felixcloutier.com/x86/gf2p8affineqb
|
| https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1...
| dragontamer wrote:
| For those unaware:
|
| * A Galois Field is a mathematical structure where addition and
| multiplication have been redefined so that some very useful
| properties are retained. Galois Fields can exist with any prime
| number or an "extension field" where that prime-number is
| vectorized. In this case, the GF2 field (prime number 2) has been
| extended to 8-bits (aka: a GF(2^8), aka 8-bit Galois Field).
|
| * "Addition"'s new definition is simply XOR.
|
| * "Multiply"'s new definition is bitshift and then add. (ie:
| 0b10101010 x 0b00010010 == bitshift(x, 4) + bitshift(x, 1),
| because the 4th and 1st bits are set to 1). And remember that
| "addition" has been redefined to XOR in this math, so that +
| means XOR.
|
| * An Affine Transformation is A * x + B, where x is the original
| value. As a "Galois Field Affine Transformation", A * x and + B
| are all done in "Galois Field" terms.
|
| * This is an AVX512 instruction, meaning there are 32-parallel
| versions of this 8-bit computation happening in parallel across a
| 512-bit vector.
|
| Note: operation traditionally happens modulo a particular
| GF(number) to create a field. The above operations are
| "primitives" that can eventually create a field, but aren't
| making a field just yet.
|
| Perhaps the more accurate names for these operations is "GF-
| addition" and "GF-pseudo-multiplication". In any case,
| multiplication is any combination of the 8-bitshifts (bitshift0,
| bitshift1, bitshift2...) and the 8x such results added together
| (depending on the 1 or 0 on that bit). Meaning you can very
| easily describe bitshift-and-xor operations to other
| cryptographers who are operating in "GF-language".
|
| Its actually really easy, though the terminology is a pain in the
| ass.
|
| ----------
|
| Traditionally, this multiplication on Intel has been called
| "carryless multiplication" (clmul) and "polynomial
| multiplication" (pmull) on ARM. I preferred these names
| personally.
|
| But if Intel wants to call this the "gfp8affineqb", then so be
| it, they're the ones naming this instruction not me. Its... a
| reasonable name, I feel like a better name exists out there but I
| can't think of one. But its certainly not a perfect name IMO.
| Maybe "clp8affineqb" ("carryless 8-bit affine") would be my
| preference?
| gpderetta wrote:
| So is this instruction basically a carry-less vectorized FMAD?
| dragontamer wrote:
| Seems like it.
|
| I only learned of this operation from reading this blogpost.
| "Permuting Bits" seems like a bloody obvious use of this
| instruction, but honestly, anyone who is familiar with carry-
| less multiplication will immediately see the widespread
| capabilities of this instruction.
|
| It really demonstrates to me that once again, Intel and their
| design of AVX512 has some very smart people in there. Its a
| good, flexible, operation. Albeit arcane and few people
| probably see how to use this instruction well, but its really
| exciting to see "such a good idea" from them.
|
| GF(2^8) fields (aka: 8-bit GF operations) are used in all
| sorts of CRC, Error-correction Reed Solomon, Elliptical
| Curve, etc. etc. operations. But beyond just that, bitshifts
| and xors innately have a huge number of applications outside
| of cryptography or GF-stuffs. It makes sense to make a
| dedicated instruction like this.
|
| ------
|
| EDIT: Upon rereading the end of the blogpost:
|
| > I will probably keep using a SAT solver to solve the masks
|
| Yeah, that's actually what's needed to generalize this.
|
| Note: common operations like "if(a) b else c", or "min(a,
| b)", or such can likely be specified in terms of AVX512
| operations now.
|
| Powerful, efficient, operations like GF2P8AFFINEQB, pext,
| pdep, and pshufb are getting damn close to an FPGA-like
| parallel processor.
|
| I mean, pshufb is already an arbitrary 4-bit truth table /
| 4-LUT if you think about it.
| dzaima wrote:
| Not quite - in "A * x + b", "A" is an 8x8 matrix, while "x"
| and "b" are 8-bit vectors. (it's vectorized over "A" and "x",
| but not "b", which is the immediate value)
| dragontamer wrote:
| https://www.felixcloutier.com/x86/gf2p8affineqb
|
| VGF2P8AFFINEQB zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8
|
| So there are 3x AVX512 registers + 1x 8-bit immediate.
|
| zmm1 seems to be the output register, the 64x 8-bit
| results.
|
| zmm2 seems to be the "64x parallel x" in Ax + b.
|
| zmm3 seems to be a 8x parallel A-matrix. I'm not sure how
| these 8x A-matrices line up to the final result, but
| perhaps 8-at-a-time is the logical conclusion. (outputs
| values#0-7 use A[0] matrix. Output#8-15 uses A[1] matrix?)
|
| imm8 seems to be "b" in the Ax + b formula, and is shared
| with all operations.
|
| --------
|
| EDIT: I think I get it now. Edited this post with my latest
| understanding.
| dzaima wrote:
| Yeah the edited version is correct; it matches up each
| group of eight "x" values with one "A" matrix (such that,
| in the bitwise representation of the arguments, there's
| no data transfer happening across 64-bit blocks; this
| ends up quite limiting if you want different matrices, so
| most of the uses are ones where the matrix is the same
| for all elements).
| dragontamer wrote:
| This is nuts.
|
| https://www.intel.com/content/www/us/en/docs/intrinsics-
| guid...
|
| Intel's documentation is claiming 0.5 CPI, or a bandwidth
| of 2x of these operations per clock tick.
|
| I mean, forget about bit-shuffling or permuting. I feel
| like with the right configuration, you can operate
| 512-bits of CRC32 calculations in parallel, maybe in like
| 3 or 4 instructions (or ~2 clock ticks) or something
| incredibly fast.
|
| > this ends up quite limiting if you want different
| matrices, so most of the uses are ones where the matrix
| is the same for all elements
|
| Elliptical Curves is the obvious use case.
|
| CRC is one immediate use where all the matricies are the
| same (and linear in the GF-field), albeit the reduction
| needs to be done "somehow" but that can be figured out in
| a later operation... probably...
|
| AES encryption is operated over 4x sets of GF(2^8). I
| know that AES is already got its own instruction but a
| full software implementation of the GF-operations behind
| AES might be possible with this.
|
| --------------
|
| I dunno, there's a lot of things I see already as
| immediately applicable where the "A" matrix is nearly
| constant across all elements. And for those where A isn't
| constant, pipelined operation is obvious.
|
| 8x different "A" matrixies means that we can build a
| "pipeline" where we can AVX512-shift the bytes down by 8
| as well each clock tick, and build an 8x-deep pipeline
| and apply those 8x A matricies in parallel while
| consuming 64-bytes per clock-tick, while performing
| 512-bit worth of operation per clock tick (no, 1024-bits
| because Intel chips support 2x such operations per clock
| tick).
|
| Pipelining, full parallelization, etc. etc. So many
| architectures are possible with this spec. IMO, it shows
| that Intel knows what they're doing with these intrinsic,
| and I think this is something now GPU can accomplish yet.
| Truly an advantage to Intel chips only.
|
| Very nice instruction. The practical benefits though
| (Intel-only, exceptionally arcane with very few
| programmers who can work with, etc. etc.) may make it a
| dead instruction though. But I'm sure some wizard out
| there is going to do something cool with this instruction
| over the next 5 years. (It'd probably take weeks of
| practice on some practice problem just to learn how to
| use this instruction correctly, lol)
| dzaima wrote:
| 0.5 CPI is for the 128-bit or 256-bit versions; 512-bit
| ones are 1 CPI. Same on Zen 4 according to uops.info - ht
| tps://uops.info/table.html?search=vgf2p8affineqb%20zmm&cb
| _...
| dragontamer wrote:
| That's still crazy to me that such a powerful operation
| is done per clock tick.
|
| I appreciate the uops.info table. I've never known about
| that site, so I'll save that off now. Very nicely
| presented latency/throughput stats.
|
| -----------
|
| EDIT: Looks like
| https://www.felixcloutier.com/x86/gf2p8mulb also exists.
| The 0b1'0001'1011 polynomial / x^8 + x^4+x^3+x+1
| intrigues me, anyone have any idea why this particular
| GF-polynomial was chosen?
|
| EDIT2: Looks like its related to AES somehow
| (https://crypto.stackexchange.com/questions/51848/why-
| is-x8-x...). The CPU-chip already has AES-instructions
| hard coded for modern TLS reasons (and AES-acceleration
| is going to be in every modern chip). So I'm thinking
| that these multiplication routines might be "sharing the
| fundamental units" of the Intel CPU, and Intel is
| wondering if anyone out there can make due with the
| individual steps of Rinjadel/AES in other applications.
| dragontamer wrote:
| I appreciate the correction.
|
| So this is perhaps more accurately denoted as:
|
| * (A[0] * x + A[1] * x + ... A[7] * x + B).
|
| If we normalize everything to 8-bit values. This is a far
| more powerful operation than I initially described. Where *
| is this pseudo-multiply (bitshift and xor) operation and +
| is XOR.
| [deleted]
| H8crilA wrote:
| How do you achieve actual multiplication? I remember there was
| some polynomial that you had to divide by, and then every non-
| zero element becomes invertible? Such pseudo-multiplication
| doesn't seem to be invertible, as 0b00000010 multiplied by
| anything will always end with a 0: 0b???????0.
| dragontamer wrote:
| You've got the gist but you're confusing a bunch of concepts
| together.
|
| Note: Remember that "x / y" division in a field is simply "x
| * (1/y)", because (1/y) is always uniquely defined in a field
| and "proper". So in practice, its actually a whole lot of
| multiplies.
|
| Note2: For the 8-bit domain to become a field, you must
| perform a "modulo p", where p is prime (for simple fields),
| or an irreducible polynomial (for an extension field). For us
| computer-programmers, we're almost always operating on an
| extension field (8-vector of field(2), aka a binary vector of
| size 8-bits). (Though hardware designers may choose say,
| GF(2^5), the 5-bit numbers, for USB1.0 packets for example)
|
| So you need to perform the operation (x modulo p) and
| normalize the result to return it to a field. There are
| different fields for each chosen p, so the first step is to
| choose a p. P is almost always chosen arbitrarily, but
| there's benefits to different p's chosen. In any case, p
| differs from case-to-case since they have minor benefits
| (this p might be slightly faster at multiplication in
| software, this other p might have an easier circuit layout
| and is related to some simple LSFR, etc. etc.). So p is
| arbitrary and changes from case-to-case.
|
| After that, (x modulo p) is performed by (x / p), which
| results in (quotient + remainder/p). In practice, one
| iteration of this operation commonly results in a 9-bit
| number or 10-bit number or 16-bit number. Anything 9 to 16
| bits in length is obviously "not in GF(2^8)", so we
| repeatedly perform the operation until the top-bits are zero
| and we only have an 8-bit result.
|
| (remainder/p) is simply (remainder * (1/p)), because this is
| a field (all numbers 1/x are properly defined), so we can
| "implement" this in practice through multiplications only.
|
| --------------
|
| And that's all CRC-32 and Reed Solomon Encoding is.
|
| EDIT: I'm not sure if what I said originally was correct. I
| decided to change the "modulo" algorithm to the LSFR style
| which was simpler for me to discuss.
| phiiiillll wrote:
| Here's some context for this article:
|
| https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1...
| zodzedzi wrote:
| For anyone else who was also seeing just garbled characters in
| the title, and since the first few paragraphs of the linked
| article don't explain what that is:
|
| GF2P8AFFINEQB is one of the AVX-512 CPU instructions that
| performs an affine transformation (essentially AX+b) on a Galois
| field - finite number fields used in coding theory and
| cryptography.
|
| https://en.wikipedia.org/wiki/AVX-512#GFNI
| spicymaki wrote:
| Funny enough when I saw the gobbledygook, I immediately thought
| to myself this looks like an AVX instruction.
___________________________________________________________________
(page generated 2023-09-25 23:01 UTC)