[HN Gopher] 8-bit number to binary string (2013)
___________________________________________________________________
8-bit number to binary string (2013)
Author : modinfo
Score : 47 points
Date : 2022-05-25 19:01 UTC (3 hours ago)
(HTM) web link (gynvael.coldwind.pl)
(TXT) w3m dump (gynvael.coldwind.pl)
| vardump wrote:
| Those constants are somewhat obfuscated.
|
| 3472328296227680304 == 0x3030303030303030
|
| 9241421688590303745 == 0x8040201008040201
|
| 72340172838076673 == 0x0101010101010101
|
| ... and suddenly things are a lot less mysterious.
|
| Similar tricks were used for poor man's SIMD in one 32-bit
| register before MMX, AltiVec, etc. You could, for example, do
| RGB 5-5-5 alpha blending with just _one_ multiply instead of 3!
| ("Cooked" / premultiplied alpha, of course.)
| AdamH12113 wrote:
| Thank you, that makes much more sense.
| glouwbug wrote:
| Isn't that poor man's SIMD an Abrash Black Book trick?
| vardump wrote:
| Perhaps, although a lot of people invented these tricks
| independently. I think some of the first ones have been known
| since the seventies.
| pitaj wrote:
| Wow, that's really clever. I never considered that the 8 ASCII
| bytes corresponding to the 8 bits of a u8 would fit in a u64.
| leni536 wrote:
| The distribution of bits can also be achieved with a PDEP
| instruction, I wonder if that would make this somewhat practical
| on a modern CPU.
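For reference, PDEP deposits the low bits of its source into the positions selected by a mask. On x86 the real instruction is reached via `_pdep_u64` from `<immintrin.h>` (BMI2 only); the loop below is just a portable software model for illustration, not how you would actually call it:

```c
#include <stdint.h>

/* Software model of PDEP: take bits of src from the bottom up and
   deposit them into the set-bit positions of mask, low to high. */
static uint64_t pdep64(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bb = 1; mask != 0; bb <<= 1) {
        if (src & bb)
            result |= mask & -mask;  /* isolate lowest set bit of mask */
        mask &= mask - 1;            /* clear lowest set bit */
    }
    return result;
}
```

With mask 0x0101010101010101, pdep64 puts bit i of the input into byte i, the same one-bit-per-byte distribution the multiply-and-mask achieves (though with the opposite byte order, so the resulting string would come out LSB-first).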
| vanderZwan wrote:
| In the previous discussion someone lamented they wished they
| could do this in JavaScript[0].
|
| Nowadays we have TextDecoder and BigInt64Array so I'm pretty sure
| we actually could do it[1][2].
|
| I doubt it would be faster than calling <number>.toString(2), but
| in principle we could port this.
|
| [0] https://news.ycombinator.com/item?id=10517332
|
| [1] https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder
|
| [2] https://developer.mozilla.org/en-
| US/docs/Web/JavaScript/Refe...
| Mr_Modulo wrote:
| <number>.toString(2) wouldn't be quite right because the output
| needs to be padded out with zeros.
| vanderZwan wrote:
| Oh, right, good catch! So <number>.toString(2).padStart(8,
| "0") then. I still expect it to be faster than this clever
| hack though.
|
| https://developer.mozilla.org/en-
| US/docs/Web/JavaScript/Refe...
| mmphosis wrote:
| x86_64, no loop needed:
|
|       movabsq $3472328296227680304, %rdx
|       movzbl  %dil, %eax
|       movabsq $-9205322385119247871, %rdi
|       imulq   %rdi, %rax
|       movabsq $72340172838076673, %rdi
|       shrq    $7, %rax
|       andq    %rdi, %rax
|       addq    %rdx, %rax
|       movq    %rax, (%rsi)
|
| 6502 with a loop:
|
|       LDX #$08
| loop  LSR $EB
|       LDA #$30
|       ADC #$00
|       DEX
|       STA result,X
|       BNE loop
| dragontamer wrote:
| To truly understand this stuff... you have to study SIMD.
|
| But wait, what SIMD instructions? No no no. The 64-bit register
| here is being used as a 64-way 1-bit SIMD.
|
| You can see that the addition and the "&" operations are just
| simple SIMD. Even the multiplication can be seen as 64x
| bitshifts plus 64x adds, all folded into a single instruction.
|
| In general, this is called SWAR (SIMD within a register).
|
| ---------
|
| The "popcount" trick is also SIMD/SWAR (bitshift and add the &
| 0xcccccccc / 0x33333333 / etc. etc.), and is in fact a
| "reduction" / "prefix sum" applied over 1-bit values in a 32-bit
| or 64-bit register applied in a 1-bit SIMD as well.
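The popcount reduction just mentioned, in one common 64-bit formulation (textbook constants, not code from the article):

```c
#include <stdint.h>

/* SWAR popcount: pairwise-sum 1-bit lanes into 2-bit lanes, then
   2-bit lanes into 4-bit lanes, then bytes -- a log-depth reduction
   over the 64 one-bit "lanes" of the register. */
static int popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);      /* 2-bit partial sums */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);          /* 4-bit partial sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;      /* per-byte sums      */
    return (int)((x * 0x0101010101010101ULL) >> 56); /* add the 8 bytes    */
}
```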
|
| ------------
|
| This achieves very fast results because it's innately parallel.
| A 64-bit processor, after all, can be seen as a 64-way parallel
| 1-bit processor. Learning parallelization techniques and
| applying them in this fashion leads to this kind of super-fast
| code.
|
| Furthermore: PEXT / PDEP from Intel can be seen as "bitwise 64x
| way gather" and "bitwise 64x way scatter" bit-instructions. The
| 64-bit register as a 64-way parallel computer is truly a
| marvelous abstraction.
| sfblah wrote:
| It's a clever solution. For converting unsigned chars to strings,
| though, a lookup table is going to be the fastest solution, since
| there are only 256 possibilities.
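A sketch of that lookup-table variant (hypothetical code, reusing the article's multiply trick just to fill the table once; assumes a little-endian host):

```c
#include <stdint.h>
#include <string.h>

/* One 8-byte entry per possible input byte: 256 * 8 = 2 KB. */
static uint64_t bin_table[256];

static void bin_table_init(void) {
    for (int c = 0; c < 256; c++)
        bin_table[c] = 0x3030303030303030ULL
            + ((((uint64_t)c * 0x8040201008040201ULL) >> 7)
               & 0x0101010101010101ULL);
}

/* Each conversion is then a single 8-byte load and store. */
static void u8_to_binary_lut(uint8_t c, char out[9]) {
    memcpy(out, &bin_table[c], 8);  /* byte order works out on little-endian */
    out[8] = '\0';
}
```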
| userbinator wrote:
| Possibly fastest in a microbenchmark, but the table will
| consume 2 KB of cache (at least).
| mortenlarsen wrote:
| Are you sure? A cache miss might be very expensive.
| dragontamer wrote:
| Lookup tables hit the load/store units; IIRC, modern Intel or
| AMD Zen chips can only do 2 loads per clock tick (correct me
| if I'm wrong).
|
| Modern bitshifts/add instructions can instead perform 3x per
| clock tick on Intel, and 4x per clock tick on AMD Zen.
|
| Multiplication is 5-cycles of latency, but at least 1-per-clock
| tick, maybe 2-per-clock-tick depending on architecture. The
| /128 is just a "bitshift-right 7".
|
| -------
|
| EDIT: So you need 8x lookups (at least 4 clock ticks' worth of
| resources, assuming the CPU can perform 2 lookups per clock
| tick) for your lookup table approach.
|
| But this "*(unsigned long long*)out = 3472328296227680304ULL +
| (((c * 9241421688590303745ULL) / 128) & 72340172838076673ULL)"
| operation only has 3x bit operations + 1x multiply, using
| 2 clock ticks' worth of resources.
|
| The multiply and lookups have multiple clock-ticks of latency,
| but throughput tends to be the more important metric rather
| than latency in high-speed CPU code in my experience. (CPUs are
| very smart about rearranging code out-of-order to become
| throughput bound rather than latency bound)
___________________________________________________________________
(page generated 2022-05-25 23:01 UTC)