[HN Gopher] Branchless UTF-8 Encoding
___________________________________________________________________
Branchless UTF-8 Encoding
Author : vortex_ape
Score : 65 points
Date : 2025-01-17 19:20 UTC (3 hours ago)
(HTM) web link (cceckman.com)
(TXT) w3m dump (cceckman.com)
| ThatGuyRaion wrote:
| So is this potentially performance improving?.
| PhilipRoman wrote:
| Last time I tested branchless UTF-8 algorithms, I came to the
| conclusion that they only perform [slightly] better for text
| consisting of foreign multibyte characters. Unless you expect
| lots of such inputs on the hot path, just go with traditional
| algorithms instead. Even in the worst case the difference isn't
| that big.
|
| Sometimes people fail to appreciate how insanely fast a
| predictable branch really is.
| not2b wrote:
| Usually people are interested in branchless implementations for
| cryptography applications, to avoid timing side channels
| (though you then have to make sure that the generated
| instructions don't have different timing for different input
| values), and will pay some time penalty if they have to.
| Arnavion wrote:
| >So on x86_64 processors, we have to branch to say "a 32-bit zero
| value has 32 leading zeros". Put differently, the "count leading
| zeros" intrinsic isn't necessarily a branchless instruction. This
| might look nicer on another architecture!
|
| Yes, RISC-V for example defines the instructions for counting
| leading / trailing zeros (clz, clzw, ctz, ctzw) such that an
| N-bit zero value has N of them.
|
| I don't know if I can show it on Rust Godbolt because none of the
| default RISC-V targets that Rust has support the Zbb extension,
| but I checked with a custom target that I use locally for my
| emulator, and `leading_zeros()` indeed compiles to just one `clz`
| without any further branches. Here's a C demonstration though:
| https://gcc.godbolt.org/z/cKx3ajsjh
| xeeeeeeeeeeenu wrote:
| > So on x86_64 processors, we have to branch to say "a 32-bit
| zero value has 32 leading zeros".
|
| Not if you're targeting x86-64-v3 or higher. Haswell (Intel) and
| Piledriver (AMD) introduced the LZCNT instruction that doesn't
| have this problem.
| sltkr wrote:
| You can also very trivially do (codepoint | 1).leading_zeros(),
| then you can also shave one byte off the LEN table. (This
| doesn't affect the result because LEN[32] == LEN[33] == 1).
| pklausler wrote:
| Easy to count leading zeroes in a branch-free manner without a
| hardware instruction using a conditional move and a de Bruijn
| sequence; see https://github.com/llvm/llvm-
| project/blob/main/flang/include... .
| lxgr wrote:
| > Compiler explorer confirms that, with optimizations enabled,
| this function is branchless.
|
| Only if you don't consider conditional move instructions
| branching/cheating :)
| comex wrote:
| Incidentally, this automatic branch-if-zero from LLVM is being
| improved.
|
| First of all, a recent LLVM patch apparently changes codegen to
| use CMOV instead of a branch:
|
| https://github.com/llvm/llvm-project/pull/102885
|
| Beyond that, Intel recently updated their manual to retroactively
| define the behavior of BSR/BSF on zero inputs: it leaves the
| destination register unmodified. This matches the AMD manual, and
| I suspect it matches the behavior of all existing x86-64
| processors (but that will need to be tested, I guess).
|
| If so, you don't need either a branch or CMOV. Just set a
| register to 32, then run BSR with the same register as
| destination. If the BSR input is nonzero, the 32 is overwritten
| with the trailing-zero count. If the BSR input is zero, then BSR
| leaves the register unmodified and you get 32.
|
| Since this behavior is now guaranteed for future x86-64
| processors, and assuming it's indeed compatible with all existing
| x86-64 processors (maybe even all x86 processors period?), LLVM
| will no longer need the old path regardless of what it's
| targeting.
|
| Note that if you're targeting a newer x86-64 version, LLVM will
| just emit TZCNT, which just does what you'd expect and returns 32
| if the input is zero (or 64 for a 64-bit TZCNT). But as the blog
| post demonstrates, many people still build for baseline x86_64.
|
| (Intel does document one discrepancy between processors: "On some
| older processors, use of a 32-bit operand size may clear the
| upper 32 bits of a 64-bit destination while leaving the lower 32
| bits unmodified.")
| orlp wrote:
| If you have access to the BMI2 instruction set I can do
| branchless UTF-8 encoding like in the article using only 9
| instructions and 73 bytes of lookup tables:
| branchless_utf8: mov rax, rdi lzcnt
| ecx, esi lea rdx, [rip + .L__unnamed_1]
| movzx ecx, byte ptr [rcx + rdx] lea rdx, [rip +
| example::DEP_AND_OR::h78cbe1dc7fe823a9] pdep esi,
| esi, dword ptr [rdx + 8*rcx] or esi, dword ptr
| [rdx + 8*rcx + 4] movbe dword ptr [rdi], esi
| mov qword ptr [rdi + 8], rcx ret
|
| The code: static DEP_AND_OR: [(u32, u32); 5] =
| [ (0, 0),
| (0b01111111_00000000_00000000_00000000,
| 0b00000000_00000000_00000000_00000000),
| (0b00011111_00111111_00000000_00000000,
| 0b11000000_10000000_00000000_00000000),
| (0b00001111_00111111_00111111_00000000,
| 0b11100000_10000000_10000000_00000000),
| (0b00000111_00111111_00111111_00111111,
| 0b11110000_10000000_10000000_10000000), ];
| const LEN: [u8; 33] = [ // 0-10 leading zeros: not
| valid. 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
| // 11-15 leading zeros: 4 bytes. 4, 4, 4, 4, 4,
| // 16-20 leading zeros: 3 bytes. 3, 3, 3, 3, 3,
| // 21-24 leading zeros: 2 bytes. 2, 2, 2, 2,
| // 25-32 leading zeros: 1 byte. 1, 1, 1, 1, 1, 1, 1,
| 1, ]; pub unsafe fn
| branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
| let leading_zeros = codepoint.leading_zeros() as usize;
| let bytes = LEN[leading_zeros] as usize; let (mask,
| or) = *DEP_AND_OR.get_unchecked(bytes); let ret =
| core::arch::x86_64::_pdep_u32(codepoint, mask) | or;
| (ret.swap_bytes().to_le_bytes(), bytes) }
| Validark wrote:
| I wrote this stub a while back
|
| https://gist.github.com/Validark/457b6db8aa00ded26a6681d4d25...
| Dwedit wrote:
| Wouldn't branchless UTF-8 encoding always write 3 bytes to RAM
| for every character (possibly to the same address)?
| ngoldbaum wrote:
| You could do two passes over the string, first get the total
| length in bytes, then fill it in codepoint by codepoint.
|
| You could also pessimistically over-allocate assuming four
| bytes per character and then resize afterwards.
|
| With the API in the linked blog post it's up to the user to
| decide how they want to use the output [u8;4] array.
| koala_man wrote:
| I'm surprised there are no UTF-8 specific decode instructions
| yet, the way ARM has "FJCVTZS - Floating-point Javascript Convert
| to Signed fixed-point, rounding toward Zero"
| HeliumHydride wrote:
| https://developer.arm.com/documentation/ddi0602/2024-12/SIMD...
| jsheard wrote:
| FJCVTZS isn't really as specific to Javascript as the name
| suggests, it actually copies the semantics of the equivalent
| x86 instruction, which JS took its semantics from. ARM
| obviously doesn't want to name-drop x86 so they call it a JS
| instruction instead. JS runtimes do use it but it's also useful
| for x86 emulation in general.
| RenThraysk wrote:
| Or-ing 1 onto codepoint before calling leading_zeroes() should
| get a decent compiler to remove the branch.
___________________________________________________________________
(page generated 2025-01-17 23:00 UTC)