[HN Gopher] Dividing unsigned 8-bit numbers
       ___________________________________________________________________
        
       Dividing unsigned 8-bit numbers
        
       Author : mfiguiere
       Score  : 151 points
       Date   : 2024-12-21 19:25 UTC (1 days ago)
        
 (HTM) web link (0x80.pl)
 (TXT) w3m dump (0x80.pl)
        
       | abcd_f wrote:
       | Since it's 8bit by 8bit, a precomputed lookup table (64K in size)
       | will be another option.
        
         | mmastrac wrote:
         | Lookup tables aren't necessarily faster these days for a lot of
         | things when using low-level languages, but it would have been
         | interesting to see the comparison here.
        
           | adhoc32 wrote:
           | pretty sure a memory access is faster than the methods
           | presented in the article.
        
             | dist-epoch wrote:
             | Hitting L2 is more than 3-4 cycles
        
             | retrac wrote:
             | Access to main memory can be many many cycles; a short
             | routine already in cache may be able to recompute a value
             | more quickly than pulling it from main memory.
        
             | Retr0id wrote:
             | 64K is enough to fill L1 on many systems
        
             | PhilipRoman wrote:
             | Depends also heavily on the context. You pay for each cache
             | miss twice - once for the miss itself, and next time when
             | you access whatever was evicted during the first miss. This
             | is why LUTs often shine in microbenchmarks, but drag down
             | performance in real world scenarios when mixed with other
             | cache bound code.
        
             | ryao wrote:
             | An uncached random memory access is around 100 cycles.
        
               | Sesse__ wrote:
               | 100 cycles would be very low. Many systems have more than
               | 100 ns!
        
         | ashvardanian wrote:
         | 64 KB is a pretty significant budget for such a small
         | operation. I've had a variant that uses 768 bytes with some
         | extra logic, but will deprecate that kernel soon.
         | 
         | https://github.com/ashvardanian/StringZilla/blob/0d47be212c5...
        
           | sroussey wrote:
           | This seems like something that could be in hardware to allow
           | native execution of the instruction. Is that not the case
           | anywhere?
        
         | ack_complete wrote:
         | A lookup table in memory can only be accessed an element at a
         | time, so it gets bottlenecked in the execution units on the
         | load and address generation traffic. This algorithm uses 8-bit
         | integers, so the vectorized version is processing between 16-64
         | elements per operation depending on the vector width. It's even
         | worse if the 8-bit divide is integrated with other vector
         | operations as then the lookup table method also has
         | insert/extract overhead to exchange the values between the
         | vector and integer units.
         | 
         | A hybrid approach using small in-register lookup tables in the
         | vector unit (pshufb/tbl) can be lucrative, but are very limited
         | in table size.
        
           | mithametacs wrote:
           | I've never thought of a micro-lookup-table!
           | 
           | That's cool. What have you used those for?
        
             | ack_complete wrote:
             | Typically just like conventional lookup tables, where you
             | can get the table size down small enough. Indexed palette /
             | quantization coding is a case where this can often work.
             | It's pretty niche given the limitations, but if you can
             | make it work it's often a major speedup since you're able
             | to do 16/32/64 lookups in parallel.
        
             | glangdale wrote:
             | Not OP, but we use these tables all the time in Hyperscan
             | (for string and character class matching) and it's a long-
             | standing technique to use it for things like SIMD
             | population count (obsoleted now if you have the right
             | AVX-512 ISA, ofc).
        
             | anonymoushn wrote:
             | You can test whether a register full of bytes belong to a
             | class of bytes with the high bit unset and distinct lower
             | nibbles in 2 instructions (each with 1 cycle latency). For
             | example the characters that commonly occur in text and must
             | be escaped in json are "\"\\\\\r\n\t" (5 different bytes).
             | Their lower nibbles are:
             | 
             | \": 0010
             | 
             | \t: 1001
             | 
             | \n: 1010
             | 
             | \\\: 1100
             | 
             | \r: 1101
             | 
             | Since there are no duplicate lower nibbles, we can test for
             | membership in this set in 2 instructions. If we want to get
             | the value into a general-purpose register and branch on
             | whether it's 0 or not, that's 2 more instructions.
             | 
             | For larger sets or sets determined at runtime, see here:
             | http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html
        
       | AlotOfReading wrote:
       | I'm not certain it'll actually be faster, but you should be able
       | to merge the reciprocal, multiplication and rounding adjustment
       | into a single fma in log space via evil floating point bit hacks.
       | Then you'd just be paying the cost of converting to and from the
       | float representation of the integer.
        
         | ashvardanian wrote:
         | I'd also love to see such comparisons. In SimSIMD, if
         | AVX-512FP16 is available, I use FP16 for most I8/U8 operations,
         | but can't speak about division.
         | 
         | Overall, using F64 directly for I32 division is a well known
         | trick, and it should hold for smaller representations as well.
        
           | AlotOfReading wrote:
           | Don't have time to figure out the actual numbers myself, but
           | here's an example doing the same for various transcendental
           | functions: https://github.com/J-Montgomery/hacky_float_math/b
           | lob/623ee9...
           | 
           | a*(1.xxx/b) = a*(-1*1.xxx*log2(b)), which means you should be
           | able to do a*(fma(b, magic, constant)) with appropriate
           | conversions on either side. Should work in 32 bits for u8s.
        
           | AlotOfReading wrote:
           | Put in a couple hours to see how well it worked. The
           | approximately correct form is slightly faster than just doing
           | an FP division without proper rounding.
           | uint8_t magic_divide(uint8_t a, uint8_t b) {
           | float x = static_cast<float>(b);             int i =
           | 0x7eb504f3 - std::bit_cast<int>(x);             float
           | reciprocal = std::bit_cast<float>(i);             return
           | a*1.94285123f*reciprocal*(-x*reciprocal+1.43566f);         }
           | 
           | No errors, ~20 cycles latency on skylake by uiCA. You can
           | optimize it a bit further by approximating the newton raphson
           | step though:                   uint8_t
           | approx_magic_divide(uint8_t a, uint8_t b) {             float
           | x = static_cast<float>(b);             int i = 0x7eb504f3 -
           | std::bit_cast<int>(x);             float reciprocal =
           | std::bit_cast<float>(i);             reciprocal =
           | 1.385356f*reciprocal;             return a*reciprocal;
           | }
           | 
           | Slightly off when a is a multiple of b, but approximately
           | correct. 15.5 cycles on skylake because of the dependency
           | chain.
        
       | orlp wrote:
       | What's not mentioned is that in most cases you have a _constant_
       | divisor which lets you replace division by multiplication with
       | the reciprocal. The reciprocal can be rounded to a nearby dyadic
       | rational, letting you do the division with a right-shift.
       | 
       | For example, 8-bit division by 3 is equivalent to widening
       | multiplication by 171 followed by a right-shift of 9, as 171/2^9
       | = 0.333984375 which is close enough to 1/3 that the results match
       | exactly.
        
         | edflsafoiewq wrote:
         | A shift of 16 is enough for every 8-bit numerator, ie. x/a is
         | (u32(x)*b)>>16 for some b depending only on a. You could
         | precompute b for each a and store it a lookup table. The
         | largest b is b=65536 for a=1 and the smallest is b=258 for
         | a=255, so b fits in a u16 if stored with a 1-offset. Not sure
         | it's worth it unless you reuse the denominator many times
         | though.
        
         | thaumasiotes wrote:
         | > For example, 8-bit division by 3 is equivalent to widening
         | multiplication by 171 followed by a right-shift of 9, as
         | 171/2^9 = 0.333984375 which is close enough to 1/3 that the
         | results match exactly.
         | 
         | Is this related to the fact that 171 is the multiplicative
         | inverse of 3 (mod 256), or is that a coincidence?
        
           | orlp wrote:
           | It's not entirely a coincidence but also not a general result
           | that one should use the modular inverse as multiplier.
           | 
           | 171 * 3 = 2^9 + 1, which is not surprising as we know that
           | 171 * 3 = 1 (mod 2^8). So rearranged we have 171 / 2^9 = 1/3
           | + 1/(2^9*3) which shows it's a close approximation of 1/3.
        
           | zahlman wrote:
           | Sort of. (After all, a "reciprocal" for our purposes is just
           | a multiplicative inverse in the reals, so it makes sense that
           | it would be related to the multiplicative inverse in other
           | domains.)
           | 
           | 3 times 171 is 513. So to divide by 3, we could multiply by
           | 171 and then divide by 513. Dividing by 513... isn't any
           | easier, but 513 is close to 512, so we hope that dividing by
           | 512 (which is trivial - just a right-shift) gets us close
           | enough.
           | 
           | Suppose for a moment we try dividing 3 by 3 using this trick.
           | First we'll multiply by 171 to get 513. Consider that value
           | in binary:                   1000000001          ^^^^^^^^^
           | ~~~~~~~~
           | 
           | Dividing by 512 will shift away the ^ bits. For floor
           | division, we therefore want the ^ underlined value to be
           | close to zero. (That way, when we divide 255 (say) by 3, the
           | error won't be big enough to overflow into the result bits.)
           | 
           | The multiplicative inverse fact is equivalent to telling us
           | that the ~ underlined bits are _exactly_ 1. Conveniently,
           | that 's close to 0 - but we didn't account for all the ^
           | underlined bits. For example, the multiplicative inverse of 7
           | (mod 256) is 183, but 7 times 183 is 1281. That's close to
           | 1280, but that doesn't really help us - we could right-shift
           | by 8 but then we still have to divide by 5. If we just ignore
           | the problem and divide by 1024 (right-shift by 10), of course
           | we get a lot of wrong results. (Even 6 / 7 would give us 1
           | instead of 0.)
           | 
           | It also turns out that we'll need more bits for accurate
           | results in the general case. I _think_ it 's possible without
           | overflowing 16-bit numbers, but it definitely requires a bit
           | more trickery for problematic divisors like (from my testing)
           | 195. I thought I remembered the details here but proved
           | myself wrong at the Python REPL :(
        
             | orlp wrote:
             | Division by 195 is trivial. The answer is simply
             | uint8_t div195(uint8_t x) { return x >= 195; }
        
               | zahlman wrote:
               | Yes, but you can't incorporate that into an algorithm
               | that allows the divisor to vary and avoids branching. 195
               | is problematic _for the multiply-shift strategy_.
        
         | kevinventullo wrote:
         | Also, if you know ahead of time that it's exact division, there
         | is a similar approach that doesn't even need widening
         | multiplication!
        
           | orlp wrote:
           | Yes, if you know something is an exact multiple of n = r*2^k
           | where r is odd, you can divide out the multiple by right-
           | shifting k followed by modular multiplication by the modular
           | multiplicative inverse of r.
        
             | thaumasiotes wrote:
             | Theoretically, you could also take advantage of two's
             | complement arithmetic:                      00011011 (27)
             | x 01010101 (-0.333...)         ------------------------
             | 11110111 (-9)                  invert and add one:
             | 00001001 (9, like we wanted)
             | 
             | I'd be interested in extending this to non-exact multiples,
             | but if that's possible I don't know how.
        
           | Footkerchief wrote:
           | Can you provide an example with details? Thanks!
        
             | kevinventullo wrote:
             | In 8-bit arithmetic (i.e. mod 256), the multiplicative
             | inverse of 11 is 163. So, if you take some multiple of 11,
             | say 154, then you can compute 154/11 instead as 154*163.
             | 
             | Indeed,
             | 
             | 154*163 = 25102, and
             | 
             | 25102 = 14 (mod 256).
        
         | 15155 wrote:
         | These methods are especially useful in hardware/FPGA
         | implementations where it's infeasible to have a ton of fully
         | pipelined dividers.
        
           | andrepd wrote:
           | They are actually useful for optimising compilers too! Mul or
           | mul+shifts is often faster than div
        
         | uticus wrote:
         | Does this approach still support modulo?
        
           | dzaima wrote:
           | Worst-case, you just multiply the result by the divisor and
           | subtract that from the dividend.
        
         | titzer wrote:
         | The article might have been updated since you read it, but this
         | is in there:
         | 
         | > We try to vectorize the following C++ procedure. The
         | procedure cannot assume anything about dividends, especially if
         | they are all equal. Thus, it is not possible to employ division
         | by a constant.
        
         | ndesaulniers wrote:
         | Probably could teach LLVM these tricks. Last I checked,
         | division by double word was problematic.
        
       | foundry27 wrote:
       | I think the approximate reciprocal approach is interesting here.
       | The doc mentions multiplying the dividend by ~1.00025 in the math
       | to avoid FP error so you don't end up off-by-one after
       | truncation, but I think this hack is still incomplete! On some
       | inputs (like 255, or other unlucky divisors near powers of two),
       | you might get borderline rounding behaviour that flips a bit of
       | the final integer. It's easy to forget that single-precision
       | floats don't line up neatly with every 8bit integer ratio in real
       | code, and a single off-by-one can break pixel ops or feed subtle
       | bugs into a bigger pipeline.
       | 
       | I suspect a hybrid scheme like using approximate reciprocals for
       | most values but punting to scalar for unlucky ones could handle
       | these corner cases without killing performance. That'd be
       | interesting to benchmark
        
         | LegionMammal978 wrote:
         | There are only 65280 possible inputs, that's easily small
         | enough to test every value for correctness.
        
       | dataflow wrote:
       | > SIMD ISAs usually do not provide the integer division; the only
       | known exception is RISC-V Vector Extension
       | 
       | It's kind of funny to read "the only known exception is..." in
       | this context. What would an unknown exception be - an ISA that
       | accidentally implements this but that the author believes nobody
       | is aware of yet?
       | 
       | More seriously, I actually don't understand the intended meaning
       | here. I assume the author means "out of all the ISAs I know"?
       | What is that set of ISAs?
        
         | michaelhoffman wrote:
         | > I assume the author means "out of all the ISAs I know"?
         | 
         | Out of all the ISAs where they know whether it provides integer
         | division or not.
        
           | dataflow wrote:
           | Yeah but my point is that as a reader I'm trying to figure
           | out which ISAs actually don't provide this (vs. which ISAs
           | the author lacks knowledge of), and I still don't know what
           | those are. The sentence looks like it's supposed to tell me,
           | but it doesn't.
        
         | perching_aix wrote:
         | You could come up with an ISA that implements it and it
         | wouldn't be "known". Maybe that helps?
        
         | bee_rider wrote:
         | Practically, could the expression "only know exception" mean
         | anything other than "known by me?" I mean, it is clearly
         | possible for an exception to exist, on account of the existing
         | known exception, so they can't know that more exceptions don't
         | exist out there somewhere.
         | 
         | I dunno. I think it is a more meaningful statement if we know
         | more about the author; if we assume that they are very well
         | informed, I guess we would assume that the fact that they don't
         | know about something is really meaningful. In the case of a
         | blog post where most of us don't know the author, it is hard to
         | infer much. But at least it tells us why they decided to do the
         | thing.
        
           | zahlman wrote:
           | If a SIMD ISA exists, someone must know that it exists,
           | because definitionally we only apply the term "SIMD ISA" to
           | things that were consciously created to be such. So we could
           | simply check every such example. Saying "only known example"
           | is indeed silly.
           | 
           | But e.g. in mathematics, if we say that "almost every member
           | of set X has property Y; the only known exception is Z" then
           | there absolutely could be more exceptions, even if we pool
           | the knowledge of every mathematician. It isn't necessary that
           | X is finite, or even enumerable. It could be possible for
           | exceptions other than Z to exist even though every other
           | member of the set that we know about has the property. It
           | could be possible to prove that there are at most finitely
           | many exceptions in an infinite set, and only know of Z but
           | not be able to rule out the possibility of more exceptions
           | than that.
           | 
           | We don't even need to appeal to infinities. For example,
           | there are problems in discrete math where nobody has found
           | the exact answer (which necessarily is integer, by the
           | construction of the problem) but we can prove upper and lower
           | bounds. Suppose we find a problem where the known bounds are
           | very tight (but not exact) and the bounded value is positive.
           | Now, construct a set of integers ranging from 0 up to (proven
           | upper bound + 1) inclusive... you can probably see where this
           | is going.
           | 
           | The latter doesn't apply to SIMD ISAs, because we know all
           | the interesting (big hand-wave!) properties of all of them
           | rather precisely - since they're _designed_ to have those
           | properties.
        
         | camel-cdr wrote:
         | The Convex C1 [0] and for a newer example NEC SX Aurora [1]
         | both also support vector integer division.
         | 
         | [0]
         | https://bitsavers.org/pdf/convex/080-000120-000_CONVEX_Archi...
         | 
         | [1] https://ftp.libre-
         | soc.org/NEC_SX_Aurora_TSUBASA_VectorEngine...
        
         | dzaima wrote:
         | Some SIMD ISAs with integer division:
         | 
         | - Arm SVE, though only for 32-bit and 64-bit element types:
         | https://developer.arm.com/architectures/instruction-sets/int...
         | 
         | - loongarch64 LSX/LASX: https://jia.je/unofficial-loongarch-
         | intrinsics-guide/viewer/...
         | 
         | - MRISC32 (though that's somewhat obvious as basically
         | everything in it is shared between scalar and vector).
        
       | ryao wrote:
       | Why not use libdivide?
       | 
       | https://libdivide.com/
        
         | LegionMammal978 wrote:
         | libdivide is optimized for the case of a common divisor used
         | many times, not for a long array of distinct divisors.
        
       | mlochbaum wrote:
       | We implemented something like this in CBQN last year (mainly for
       | modulus, as floor division isn't a primitive). Commit is
       | https://github.com/dzaima/CBQN/commit/d333902, some proofs of
       | when and why it works at
       | https://mlochbaum.github.io/BQN/implementation/primitive/ari....
        
       | ack_complete wrote:
       | Think the SSE2 implementation could be tightened up by using the
       | same register for the dividend and the quotient, shifting the
       | quotient bits into the dividend as the dividend bits are shifted
       | out. This was common practice in software CPU division routines.
        
       | Cold_Miserable wrote:
       | Heh? Surely fast convert 8-bit int to 16-bit FP,rcp+mul/div is a
       | no-brainer? edit make that fast convert,rcp,fma (float 16
       | constant 1.0) and xor (same constant)
        
         | purplesyringa wrote:
         | I tried a similar approach with 32-bit FP before, and the
         | problem here is that fast conversion is only fast in the sense
         | of latency. Throughput-wise, it takes 2 uops instead of one, so
         | in the end, a plain float<->int conversion wins.
        
         | bremac wrote:
         | Unfortunately none of the hardware used for testing supports
         | FP16 arithmetic. Between Intel and AMD, the only platform that
         | supports AVX512-FP16 is currently Sapphire Rapids.
        
       | synthos wrote:
       | Newton Raphson could be used to calculate a reciprocal. (in a bit
       | width larger than 8). If starting from a good reciprocal
       | approximation, convergence to bit accurate reciprocal should not
       | take many iterations. Then multiply the reciprocal with the
       | numerator to perform the divide
        
       | dzaima wrote:
       | Here's a version of the vrcpps idea, doing whole vectors of 32
       | u8's, bypassing lane crossing and somewhat simplifying the
       | packing/unpacking, that's ~1.5x faster on Haswell:
       | https://godbolt.org/z/TExGahv3z
        
         | Cold_Miserable wrote:
         | This is ~9.6x faster than "scalar".
         | 
         | ASM_TestDiv proc ;rcx out, rdx A, r8 B mov
         | rax,05555555555555555H kmovq k1,rax vmovdqu8 zmm0,zmmword ptr
         | [rdx] vmovdqu8 zmm4,zmmword ptr [r8] vpbroadcastw zmm3,word ptr
         | [FLOAT16_F8] vmovdqu8 zmm2{k1},zmm0 ;lower 8-bit vmovdqu8
         | zmm16{k1},zmm4 ;lower 8-bit vpsrlw zmm1,zmm0,8 ;higher 8-bit
         | vpsrlw zmm5,zmm4,8 ;higher 8-bit vpord zmm1,zmm1,zmm3 vpord
         | zmm2,zmm2,zmm3 vpord zmm5,zmm5,zmm3 vpord zmm16,zmm16,zmm3
         | vsubph zmm1,zmm1,zmm3{rd-sae} ;fast conv 16FP vsubph
         | zmm2,zmm2,zmm3{rd-sae} vsubph zmm5,zmm5,zmm3{ru-sae} vsubph
         | zmm16,zmm16,zmm3{ru-sae} vrcpph zmm5,zmm5 vrcpph zmm16,zmm16
         | vfmadd213ph zmm1,zmm5,zmm3{rd-sae} vfmadd213ph
         | zmm2,zmm16,zmm3{rd-sae} vxorpd zmm1,zmm1,zmm3 vxorpd
         | zmm2,zmm2,zmm3 vpsllw zmm1,zmm1,8 vpord zmm1,zmm1,zmm2 vmovdqu8
         | zmmword ptr [rcx],zmm1 ;16 8-bit unsigned int ret
        
       | purplesyringa wrote:
       | It's odd that the only reason AVX-512 long division wins in the
       | end is that it's compared to AVX2 reciprocal. Would it be
       | possible to compare it to AVX-512 reciprocal computation?
        
       | jonhohle wrote:
       | I do a fair amount of decompiling and the list of constants is
       | awesome! I usually can make a reasonable guess or have intuition,
       | but frequently end up stepping through several integers to find
       | the right value. Having a handy lookup table will be great.
        
       | afstr wrote:
       | M I'm looking I'm I'm O ; r
        
       | xpe wrote:
       | The bit-division burns my eyes and makes a mockery of my puny
       | intellect, my liege. Mine fellow vassals possess skills most
       | frightful in their potency.
        
       | RaisingSpear wrote:
       | For CPUs that support AVX-512 VBMI, there's a faster reciprocal-
       | based approach:
       | https://avereniect.github.io/2023/04/29/uint8_division_using...
       | 
       | A VBMI2 example implementation can be found in the function named
       | divide_with_lookup_16b: https://godbolt.org/z/xdE1dx5Pj
        
       ___________________________________________________________________
       (page generated 2024-12-22 23:01 UTC)