[HN Gopher] Dividing unsigned 8-bit numbers
___________________________________________________________________
Dividing unsigned 8-bit numbers
Author : mfiguiere
Score : 151 points
Date : 2024-12-21 19:25 UTC (1 days ago)
(HTM) web link (0x80.pl)
(TXT) w3m dump (0x80.pl)
| abcd_f wrote:
| Since it's 8-bit by 8-bit, a precomputed lookup table (64K in
| size) would be another option.
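|
| A minimal sketch of that idea (one-time init; mapping division
| by zero to zero here is an arbitrary choice):
|
|       #include <cstdint>
|
|       // 64 KiB table: quot[(a << 8) | b] holds a / b
|       static uint8_t quot[256 * 256];
|
|       static void init_quot() {
|           for (int a = 0; a < 256; a++)
|               for (int b = 0; b < 256; b++)
|                   quot[(a << 8) | b] = b ? uint8_t(a / b) : 0;
|       }
|
|       static inline uint8_t div_u8(uint8_t a, uint8_t b) {
|           return quot[(a << 8) | b];   // one load instead of a divide
|       }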
| mmastrac wrote:
| Lookup tables aren't necessarily faster these days for a lot of
| things when using low-level languages, but it would have been
| interesting to see the comparison here.
| adhoc32 wrote:
| pretty sure a memory access is faster than the methods
| presented in the article.
| dist-epoch wrote:
| Hitting L2 is more than 3-4 cycles
| retrac wrote:
| Access to main memory can be many many cycles; a short
| routine already in cache may be able to recompute a value
| more quickly than pulling it from main memory.
| Retr0id wrote:
| 64K is enough to fill L1 on many systems
| PhilipRoman wrote:
| Depends also heavily on the context. You pay for each cache
| miss twice - once for the miss itself, and next time when
| you access whatever was evicted during the first miss. This
| is why LUTs often shine in microbenchmarks, but drag down
| performance in real world scenarios when mixed with other
| cache bound code.
| ryao wrote:
| An uncached random memory access is around 100 cycles.
| Sesse__ wrote:
| 100 cycles would be very low. Many systems have more than
| 100 ns!
| ashvardanian wrote:
| 64 KB is a pretty significant budget for such a small
| operation. I've had a variant that uses 768 bytes with some
| extra logic, but will deprecate that kernel soon.
|
| https://github.com/ashvardanian/StringZilla/blob/0d47be212c5...
| sroussey wrote:
| This seems like something that could be in hardware to allow
| native execution of the instruction. Is that not the case
| anywhere?
| ack_complete wrote:
| A lookup table in memory can only be accessed an element at a
| time, so it gets bottlenecked in the execution units on the
| load and address generation traffic. This algorithm uses 8-bit
| integers, so the vectorized version is processing between 16-64
| elements per operation depending on the vector width. It's even
| worse if the 8-bit divide is integrated with other vector
| operations as then the lookup table method also has
| insert/extract overhead to exchange the values between the
| vector and integer units.
|
| A hybrid approach using small in-register lookup tables in the
| vector unit (pshufb/tbl) can be lucrative, but is very limited
| in table size.
| mithametacs wrote:
| I've never thought of a micro-lookup-table!
|
| That's cool. What have you used those for?
| ack_complete wrote:
| Typically just like conventional lookup tables, where you
| can get the table size down small enough. Indexed palette /
| quantization coding is a case where this can often work.
| It's pretty niche given the limitations, but if you can
| make it work it's often a major speedup since you're able
| to do 16/32/64 lookups in parallel.
| glangdale wrote:
| Not OP, but we use these tables all the time in Hyperscan
| (for string and character class matching) and it's a long-
| standing technique to use it for things like SIMD
| population count (obsoleted now if you have the right
| AVX-512 ISA, ofc).
| anonymoushn wrote:
| You can test whether each byte in a register belongs to a
| class of bytes with the high bit unset and distinct lower
| nibbles in 2 instructions (each with 1 cycle latency). For
| example the characters that commonly occur in text and must
| be escaped in json are "\"\\\\\r\n\t" (5 different bytes).
| Their lower nibbles are:
|
| \": 0010
|
| \t: 1001
|
| \n: 1010
|
| \\\: 1100
|
| \r: 1101
|
| Since there are no duplicate lower nibbles, we can test for
| membership in this set in 2 instructions. If we want to get
| the value into a general-purpose register and branch on
| whether it's 0 or not, that's 2 more instructions.
|
| For larger sets or sets determined at runtime, see here:
| http://0x80.pl/notesen/2018-10-18-simd-byte-lookup.html
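|
| A sketch of the two-instruction test with SSE intrinsics (the
| table layout and the 0x80 filler byte are my own choices; the
| core is just pshufb + pcmpeqb):
|
|       #include <immintrin.h>
|
|       // Yields 0xFF in each lane whose byte is one of
|       // '"', '\\', '\r', '\n', '\t'. Works because those five
|       // bytes have the high bit clear and distinct low nibbles.
|       static inline __m128i needs_escape(__m128i bytes) {
|           // table[low nibble] = the set member with that nibble;
|           // unused slots hold 0x80, which never compares equal
|           // to a byte whose high bit is clear.
|           const __m128i table = _mm_setr_epi8(
|               (char)0x80, (char)0x80, '"',        (char)0x80,
|               (char)0x80, (char)0x80, (char)0x80, (char)0x80,
|               (char)0x80, '\t',       '\n',       (char)0x80,
|               '\\',       '\r',       (char)0x80, (char)0x80);
|           // bytes with the high bit set have their lane zeroed
|           __m128i hit = _mm_shuffle_epi8(table, bytes);
|           return _mm_cmpeq_epi8(hit, bytes);
|       }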
| AlotOfReading wrote:
| I'm not certain it'll actually be faster, but you should be able
| to merge the reciprocal, multiplication and rounding adjustment
| into a single fma in log space via evil floating point bit hacks.
| Then you'd just be paying the cost of converting to and from the
| float representation of the integer.
| ashvardanian wrote:
| I'd also love to see such comparisons. In SimSIMD, if
| AVX-512FP16 is available, I use FP16 for most I8/U8 operations,
| but can't speak about division.
|
| Overall, using F64 directly for I32 division is a well known
| trick, and it should hold for smaller representations as well.
| AlotOfReading wrote:
| Don't have time to figure out the actual numbers myself, but
| here's an example doing the same for various transcendental
| functions:
| https://github.com/J-Montgomery/hacky_float_math/blob/623ee9...
|
| a*(1.xxx/b) = a*(-1*1.xxx*log2(b)), which means you should be
| able to do a*(fma(b, magic, constant)) with appropriate
| conversions on either side. Should work in 32 bits for u8s.
| AlotOfReading wrote:
| Put in a couple hours to see how well it worked. The
| approximately correct form is slightly faster than just doing
| an FP division without proper rounding.
|
|       uint8_t magic_divide(uint8_t a, uint8_t b) {
|           float x = static_cast<float>(b);
|           int i = 0x7eb504f3 - std::bit_cast<int>(x);
|           float reciprocal = std::bit_cast<float>(i);
|           return a*1.94285123f*reciprocal*(-x*reciprocal+1.43566f);
|       }
|
| No errors, ~20 cycles latency on skylake by uiCA. You can
| optimize it a bit further by approximating the Newton-Raphson
| step though:
|
|       uint8_t approx_magic_divide(uint8_t a, uint8_t b) {
|           float x = static_cast<float>(b);
|           int i = 0x7eb504f3 - std::bit_cast<int>(x);
|           float reciprocal = std::bit_cast<float>(i);
|           reciprocal = 1.385356f*reciprocal;
|           return a*reciprocal;
|       }
|
| Slightly off when a is a multiple of b, but approximately
| correct. 15.5 cycles on skylake because of the dependency
| chain.
| orlp wrote:
| What's not mentioned is that in most cases you have a _constant_
| divisor which lets you replace division by multiplication with
| the reciprocal. The reciprocal can be rounded to a nearby dyadic
| rational, letting you do the division with a right-shift.
|
| For example, 8-bit division by 3 is equivalent to widening
| multiplication by 171 followed by a right-shift of 9, as 171/2^9
| = 0.333984375 which is close enough to 1/3 that the results match
| exactly.
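|
| As code, that concrete case could look like the following
| (a sketch, widening through uint16_t):
|
|       #include <cstdint>
|
|       // x/3 for any uint8_t x: 171/2^9 is close enough to 1/3
|       static inline uint8_t div3(uint8_t x) {
|           return uint8_t((uint16_t(x) * 171u) >> 9);
|       }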
| edflsafoiewq wrote:
| A shift of 16 is enough for every 8-bit numerator, i.e. x/a is
| (u32(x)*b)>>16 for some b depending only on a. You could
| precompute b for each a and store it in a lookup table. The
| largest b is b=65536 for a=1 and the smallest is b=258 for
| a=255, so b fits in a u16 if stored with a 1-offset. Not sure
| it's worth it unless you reuse the denominator many times
| though.
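|
| A sketch of that precomputation (storing the multiplier in a
| uint32_t table here rather than using the 1-offset u16 trick):
|
|       #include <cstdint>
|
|       static uint32_t mul[256];           // mul[a] = ceil(2^16 / a)
|
|       static void init_mul() {
|           for (uint32_t a = 1; a < 256; ++a)
|               mul[a] = (65536u + a - 1) / a;
|       }
|
|       // x / a == (x * mul[a]) >> 16 for all 8-bit x, a != 0
|       static inline uint8_t div_u8(uint8_t x, uint8_t a) {
|           return uint8_t((uint32_t(x) * mul[a]) >> 16);
|       }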
| thaumasiotes wrote:
| > For example, 8-bit division by 3 is equivalent to widening
| multiplication by 171 followed by a right-shift of 9, as
| 171/2^9 = 0.333984375 which is close enough to 1/3 that the
| results match exactly.
|
| Is this related to the fact that 171 is the multiplicative
| inverse of 3 (mod 256), or is that a coincidence?
| orlp wrote:
| It's not entirely a coincidence but also not a general result
| that one should use the modular inverse as multiplier.
|
| 171 * 3 = 2^9 + 1, which is not surprising as we know that
| 171 * 3 = 1 (mod 2^8). So rearranged we have 171 / 2^9 = 1/3
| + 1/(2^9*3) which shows it's a close approximation of 1/3.
| zahlman wrote:
| Sort of. (After all, a "reciprocal" for our purposes is just
| a multiplicative inverse in the reals, so it makes sense that
| it would be related to the multiplicative inverse in other
| domains.)
|
| 3 times 171 is 513. So to divide by 3, we could multiply by
| 171 and then divide by 513. Dividing by 513... isn't any
| easier, but 513 is close to 512, so we hope that dividing by
| 512 (which is trivial - just a right-shift) gets us close
| enough.
|
| Suppose for a moment we try dividing 3 by 3 using this trick.
| First we'll multiply by 171 to get 513. Consider that value
| in binary:
|
|       1000000001
|        ^^^^^^^^^
|         ~~~~~~~~
|
| Dividing by 512 will shift away the ^ bits. For floor
| division, we therefore want the ^ underlined value to be
| close to zero. (That way, when we divide 255 (say) by 3, the
| error won't be big enough to overflow into the result bits.)
|
| The multiplicative inverse fact is equivalent to telling us
| that the ~ underlined bits are _exactly_ 1. Conveniently,
| that's close to 0 - but we didn't account for all the ^
| underlined bits. For example, the multiplicative inverse of 7
| (mod 256) is 183, but 7 times 183 is 1281. That's close to
| 1280, but that doesn't really help us - we could right-shift
| by 8 but then we still have to divide by 5. If we just ignore
| the problem and divide by 1024 (right-shift by 10), of course
| we get a lot of wrong results. (Even 6 / 7 would give us 1
| instead of 0.)
|
| It also turns out that we'll need more bits for accurate
| results in the general case. I _think_ it's possible without
| overflowing 16-bit numbers, but it definitely requires a bit
| more trickery for problematic divisors like (from my testing)
| 195. I thought I remembered the details here but proved
| myself wrong at the Python REPL :(
| orlp wrote:
| Division by 195 is trivial. The answer is simply
|
|       uint8_t div195(uint8_t x) { return x >= 195; }
| zahlman wrote:
| Yes, but you can't incorporate that into an algorithm
| that allows the divisor to vary and avoids branching. 195
| is problematic _for the multiply-shift strategy_.
| kevinventullo wrote:
| Also, if you know ahead of time that it's exact division, there
| is a similar approach that doesn't even need widening
| multiplication!
| orlp wrote:
| Yes, if you know something is an exact multiple of n = r*2^k
| where r is odd, you can divide out the multiple by right-
| shifting k followed by modular multiplication by the modular
| multiplicative inverse of r.
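|
| For instance, exact division by 12 = 3*2^2 could be sketched as
| follows (only valid when x really is a multiple of 12):
|
|       #include <cstdint>
|
|       // 171 * 3 = 513 = 1 (mod 256), so 171 is the inverse of 3
|       static inline uint8_t exact_div12(uint8_t x) {
|           return uint8_t(uint8_t(x >> 2) * 171u); // >>2 removes 2^2,
|                                                   // *171 undoes the 3
|       }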
| thaumasiotes wrote:
| Theoretically, you could also take advantage of two's
| complement arithmetic:
|
|         00011011 (27)
|       x 01010101 (-0.333...)
|       ------------------------
|         11110111 (-9)
|
|       invert and add one:
|
|         00001001 (9, like we wanted)
|
| I'd be interested in extending this to non-exact multiples,
| but if that's possible I don't know how.
| Footkerchief wrote:
| Can you provide an example with details? Thanks!
| kevinventullo wrote:
| In 8-bit arithmetic (i.e. mod 256), the multiplicative
| inverse of 11 is 163. So, if you take some multiple of 11,
| say 154, then you can compute 154/11 instead as 154*163.
|
| Indeed,
|
| 154*163 = 25102, and
|
| 25102 = 14 (mod 256).
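|
| As code (a sketch; the truncation back to uint8_t is what
| performs the reduction mod 256):
|
|       #include <cstdint>
|
|       // exact division of a known multiple of 11
|       static inline uint8_t exact_div11(uint8_t x) {
|           return uint8_t(x * 163u);   // 163 = 11^-1 (mod 256)
|       }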
| 15155 wrote:
| These methods are especially useful in hardware/FPGA
| implementations where it's infeasible to have a ton of fully
| pipelined dividers.
| andrepd wrote:
| They are actually useful for optimising compilers too! Mul or
| mul+shifts is often faster than div
| uticus wrote:
| Does this approach still support modulo?
| dzaima wrote:
| Worst-case, you just multiply the result by the divisor and
| subtract that from the dividend.
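|
| Something like this sketch, with div_u8 standing in for whatever
| fast quotient routine is in use:
|
|       #include <cstdint>
|
|       uint8_t div_u8(uint8_t x, uint8_t d);  // some fast quotient routine
|
|       static inline uint8_t mod_u8(uint8_t x, uint8_t d) {
|           uint8_t q = div_u8(x, d);
|           return uint8_t(x - q * d);   // remainder from the quotient
|       }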
| titzer wrote:
| The article might have been updated since you read it, but this
| is in there:
|
| > We try to vectorize the following C++ procedure. The
| procedure cannot assume anything about dividends, especially if
| they are all equal. Thus, it is not possible to employ division
| by a constant.
| ndesaulniers wrote:
| Probably could teach LLVM these tricks. Last I checked,
| division by double word was problematic.
| foundry27 wrote:
| I think the approximate reciprocal approach is interesting here.
| The doc mentions multiplying the dividend by ~1.00025 in the math
| to avoid FP error so you don't end up off-by-one after
| truncation, but I think this hack is still incomplete! On some
| inputs (like 255, or other unlucky divisors near powers of two),
| you might get borderline rounding behaviour that flips a bit of
| the final integer. It's easy to forget that single-precision
| floats don't line up neatly with every 8-bit integer ratio in real
| code, and a single off-by-one can break pixel ops or feed subtle
| bugs into a bigger pipeline.
|
| I suspect a hybrid scheme like using approximate reciprocals for
| most values but punting to scalar for unlucky ones could handle
| these corner cases without killing performance. That'd be
| interesting to benchmark
| LegionMammal978 wrote:
| There are only 65280 possible inputs; that's easily small
| enough to test every value for correctness.
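|
| A sketch of that exhaustive check, with div_approx standing in
| for whichever variant is being validated:
|
|       #include <cstdint>
|       #include <cstdio>
|
|       uint8_t div_approx(uint8_t a, uint8_t b); // routine under test
|
|       int main() {
|           for (int a = 0; a < 256; ++a)
|               for (int b = 1; b < 256; ++b)    // 256 * 255 = 65280
|                   if (div_approx(uint8_t(a), uint8_t(b)) != a / b)
|                       std::printf("mismatch: %d / %d\n", a, b);
|       }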
| dataflow wrote:
| > SIMD ISAs usually do not provide the integer division; the only
| known exception is RISC-V Vector Extension
|
| It's kind of funny to read "the only known exception is..." in
| this context. What would an unknown exception be - an ISA that
| accidentally implements this but that the author believes nobody
| is aware of yet?
|
| More seriously, I actually don't understand the intended meaning
| here. I assume the author means "out of all the ISAs I know"?
| What is that set of ISAs?
| michaelhoffman wrote:
| > I assume the author means "out of all the ISAs I know"?
|
| Out of all the ISAs where they know whether it provides integer
| division or not.
| dataflow wrote:
| Yeah but my point is that as a reader I'm trying to figure
| out which ISAs actually don't provide this (vs. which ISAs
| the author lacks knowledge of), and I still don't know what
| those are. The sentence looks like it's supposed to tell me,
| but it doesn't.
| perching_aix wrote:
| You could come up with an ISA that implements it and it
| wouldn't be "known". Maybe that helps?
| bee_rider wrote:
| Practically, could the expression "only known exception" mean
| anything other than "known by me?" I mean, it is clearly
| possible for an exception to exist, on account of the existing
| known exception, so they can't know that more exceptions don't
| exist out there somewhere.
|
| I dunno. I think it is a more meaningful statement if we know
| more about the author; if we assume that they are very well
| informed, I guess we would assume that the fact that they don't
| know about something is really meaningful. In the case of a
| blog post where most of us don't know the author, it is hard to
| infer much. But at least it tells us why they decided to do the
| thing.
| zahlman wrote:
| If a SIMD ISA exists, someone must know that it exists,
| because definitionally we only apply the term "SIMD ISA" to
| things that were consciously created to be such. So we could
| simply check every such example. Saying "only known example"
| is indeed silly.
|
| But e.g. in mathematics, if we say that "almost every member
| of set X has property Y; the only known exception is Z" then
| there absolutely could be more exceptions, even if we pool
| the knowledge of every mathematician. It isn't necessary that
| X is finite, or even enumerable. It could be possible for
| exceptions other than Z to exist even though every other
| member of the set that we know about has the property. It
| could be possible to prove that there are at most finitely
| many exceptions in an infinite set, and only know of Z but
| not be able to rule out the possibility of more exceptions
| than that.
|
| We don't even need to appeal to infinities. For example,
| there are problems in discrete math where nobody has found
| the exact answer (which necessarily is integer, by the
| construction of the problem) but we can prove upper and lower
| bounds. Suppose we find a problem where the known bounds are
| very tight (but not exact) and the bounded value is positive.
| Now, construct a set of integers ranging from 0 up to (proven
| upper bound + 1) inclusive... you can probably see where this
| is going.
|
| The latter doesn't apply to SIMD ISAs, because we know all
| the interesting (big hand-wave!) properties of all of them
| rather precisely - since they're _designed_ to have those
| properties.
| camel-cdr wrote:
| The Convex C1 [0] and for a newer example NEC SX Aurora [1]
| both also support vector integer division.
|
| [0]
| https://bitsavers.org/pdf/convex/080-000120-000_CONVEX_Archi...
|
| [1] https://ftp.libre-soc.org/NEC_SX_Aurora_TSUBASA_VectorEngine...
| dzaima wrote:
| Some SIMD ISAs with integer division:
|
| - Arm SVE, though only for 32-bit and 64-bit element types:
| https://developer.arm.com/architectures/instruction-sets/int...
|
| - loongarch64 LSX/LASX:
| https://jia.je/unofficial-loongarch-intrinsics-guide/viewer/...
|
| - MRISC32 (though that's somewhat obvious as basically
| everything in it is shared between scalar and vector).
| ryao wrote:
| Why not use libdivide?
|
| https://libdivide.com/
| LegionMammal978 wrote:
| libdivide is optimized for the case of a common divisor used
| many times, not for a long array of distinct divisors.
| mlochbaum wrote:
| We implemented something like this in CBQN last year (mainly for
| modulus, as floor division isn't a primitive). Commit is
| https://github.com/dzaima/CBQN/commit/d333902, some proofs of
| when and why it works at
| https://mlochbaum.github.io/BQN/implementation/primitive/ari....
| ack_complete wrote:
| Think the SSE2 implementation could be tightened up by using the
| same register for the dividend and the quotient, shifting the
| quotient bits into the dividend as the dividend bits are shifted
| out. This was common practice in software CPU division routines.
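|
| A scalar sketch of that shared-register idea (restoring
| division; not what the article's SSE2 code does, just the
| classic pattern):
|
|       #include <cstdint>
|
|       static inline uint8_t shift_div(uint8_t a, uint8_t b) {
|           uint16_t rem = 0;
|           uint8_t  acc = a;   // dividend bits shift out the top,
|                               // quotient bits shift in at the bottom
|           for (int i = 0; i < 8; ++i) {
|               rem = uint16_t((rem << 1) | (acc >> 7));
|               acc = uint8_t(acc << 1);
|               if (rem >= b) { rem = uint16_t(rem - b); acc |= 1; }
|           }
|           return acc;
|       }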
| Cold_Miserable wrote:
| Heh? Surely fast convert 8-bit int to 16-bit FP, rcp + mul/div
| is a no-brainer? Edit: make that fast convert, rcp, fma (float16
| constant 1.0) and xor (same constant).
| purplesyringa wrote:
| I tried a similar approach with 32-bit FP before, and the
| problem here is that fast conversion is only fast in the sense
| of latency. Throughput-wise, it takes 2 uops instead of one, so
| in the end, a plain float<->int conversion wins.
| bremac wrote:
| Unfortunately none of the hardware used for testing supports
| FP16 arithmetic. Between Intel and AMD, the only platform that
| supports AVX512-FP16 is currently Sapphire Rapids.
| synthos wrote:
| Newton-Raphson could be used to calculate a reciprocal (in a bit
| width larger than 8). If starting from a good reciprocal
| approximation, convergence to a bit-accurate reciprocal should
| not take many iterations. Then multiply the reciprocal by the
| numerator to perform the divide.
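|
| A scalar sketch of that idea (initial estimate from rcpss, one
| Newton-Raphson step, plus the article-style ~1.00025 nudge on
| the dividend; not tuned or exhaustively verified):
|
|       #include <cstdint>
|       #include <immintrin.h>
|
|       static inline uint8_t nr_div(uint8_t a, uint8_t b) {
|           float bf = float(b);
|           // ~12-bit reciprocal estimate from hardware
|           float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(bf)));
|           r = r * (2.0f - bf * r);        // one Newton-Raphson step
|           return uint8_t(float(a) * 1.00025f * r);
|       }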
| dzaima wrote:
| Here's a version of the vrcpps idea, doing whole vectors of 32
| u8's, bypassing lane crossing and somewhat simplifying the
| packing/unpacking, that's ~1.5x faster on Haswell:
| https://godbolt.org/z/TExGahv3z
| Cold_Miserable wrote:
| This is ~9.6x faster than "scalar".
|
|       ASM_TestDiv proc                     ;rcx out, rdx A, r8 B
|           mov rax,05555555555555555H
|           kmovq k1,rax
|           vmovdqu8 zmm0,zmmword ptr [rdx]
|           vmovdqu8 zmm4,zmmword ptr [r8]
|           vpbroadcastw zmm3,word ptr [FLOAT16_F8]
|           vmovdqu8 zmm2{k1},zmm0           ;lower 8-bit
|           vmovdqu8 zmm16{k1},zmm4          ;lower 8-bit
|           vpsrlw zmm1,zmm0,8               ;higher 8-bit
|           vpsrlw zmm5,zmm4,8               ;higher 8-bit
|           vpord zmm1,zmm1,zmm3
|           vpord zmm2,zmm2,zmm3
|           vpord zmm5,zmm5,zmm3
|           vpord zmm16,zmm16,zmm3
|           vsubph zmm1,zmm1,zmm3{rd-sae}    ;fast conv 16FP
|           vsubph zmm2,zmm2,zmm3{rd-sae}
|           vsubph zmm5,zmm5,zmm3{ru-sae}
|           vsubph zmm16,zmm16,zmm3{ru-sae}
|           vrcpph zmm5,zmm5
|           vrcpph zmm16,zmm16
|           vfmadd213ph zmm1,zmm5,zmm3{rd-sae}
|           vfmadd213ph zmm2,zmm16,zmm3{rd-sae}
|           vxorpd zmm1,zmm1,zmm3
|           vxorpd zmm2,zmm2,zmm3
|           vpsllw zmm1,zmm1,8
|           vpord zmm1,zmm1,zmm2
|           vmovdqu8 zmmword ptr [rcx],zmm1  ;16 8-bit unsigned int
|           ret
| purplesyringa wrote:
| It's odd that the only reason AVX-512 long division wins in the
| end is that it's compared to AVX2 reciprocal. Would it be
| possible to compare it to AVX-512 reciprocal computation?
| jonhohle wrote:
| I do a fair amount of decompiling and the list of constants is
| awesome! I usually can make a reasonable guess or have intuition,
| but frequently end up stepping through several integers to find
| the right value. Having a handy lookup table will be great.
| afstr wrote:
| M I'm looking I'm I'm O ; r
| xpe wrote:
| The bit-division burns my eyes and makes a mockery of my puny
| intellect, my liege. Mine fellow vassals possess skills most
| frightful in their potency.
| RaisingSpear wrote:
| For CPUs that support AVX-512 VBMI, there's a faster reciprocal-
| based approach:
| https://avereniect.github.io/2023/04/29/uint8_division_using...
|
| A VBMI2 example implementation can be found in the function named
| divide_with_lookup_16b: https://godbolt.org/z/xdE1dx5Pj
___________________________________________________________________
(page generated 2024-12-22 23:01 UTC)