[HN Gopher] Branchless UTF-8 Encoding
       ___________________________________________________________________
        
       Branchless UTF-8 Encoding
        
       Author : vortex_ape
       Score  : 65 points
       Date   : 2025-01-17 19:20 UTC (3 hours ago)
        
 (HTM) web link (cceckman.com)
 (TXT) w3m dump (cceckman.com)
        
       | ThatGuyRaion wrote:
       | So is this potentially performance improving?.
        
         | PhilipRoman wrote:
         | Last time I tested branchless UTF-8 algorithms, I came to the
         | conclusion that they only perform [slightly] better for text
         | consisting of foreign multibyte characters. Unless you expect
         | lots of such inputs on the hot path, just go with traditional
         | algorithms instead. Even in the worst case the difference isn't
         | that big.
         | 
         | Sometimes people fail to appreciate how insanely fast a
         | predictable branch really is.
        
         | not2b wrote:
         | Usually people are interested in branchless implementations for
         | cryptography applications, to avoid timing side channels
         | (though you then have to make sure that the generated
         | instructions don't have different timing for different input
         | values), and will pay some time penalty if they have to.
        
       | Arnavion wrote:
       | >So on x86_64 processors, we have to branch to say "a 32-bit zero
       | value has 32 leading zeros". Put differently, the "count leading
       | zeros" intrinsic isn't necessarily a branchless instruction. This
       | might look nicer on another architecture!
       | 
       | Yes, RISC-V for example defines the instructions for counting
       | leading / trailing zeros (clz, clzw, ctz, ctzw) such that an
       | N-bit zero value has N of them.
       | 
       | I don't know if I can show it on Rust Godbolt because none of the
       | default RISC-V targets that Rust has support the Zbb extension,
       | but I checked with a custom target that I use locally for my
       | emulator, and `leading_zeros()` indeed compiles to just one `clz`
       | without any further branches. Here's a C demonstration though:
       | https://gcc.godbolt.org/z/cKx3ajsjh
        
       | xeeeeeeeeeeenu wrote:
       | > So on x86_64 processors, we have to branch to say "a 32-bit
       | zero value has 32 leading zeros".
       | 
       | Not if you're targeting x86-64-v3 or higher. Haswell (Intel) and
       | Piledriver (AMD) introduced the LZCNT instruction that doesn't
       | have this problem.
        
         | sltkr wrote:
         | You can also very trivially do (codepoint | 1).leading_zeros(),
         | then you can also shave one byte off the LEN table. (This
         | doesn't affect the result because LEN[32] == LEN[33] == 1).
        
         | pklausler wrote:
         | Easy to count leading zeroes in a branch-free manner without a
         | hardware instruction using a conditional move and a de Bruijn
         | sequence; see https://github.com/llvm/llvm-
         | project/blob/main/flang/include... .
        
       | lxgr wrote:
       | > Compiler explorer confirms that, with optimizations enabled,
       | this function is branchless.
       | 
       | Only if you don't consider conditional move instructions
       | branching/cheating :)
        
       | comex wrote:
       | Incidentally, this automatic branch-if-zero from LLVM is being
       | improved.
       | 
       | First of all, a recent LLVM patch apparently changes codegen to
       | use CMOV instead of a branch:
       | 
       | https://github.com/llvm/llvm-project/pull/102885
       | 
       | Beyond that, Intel recently updated their manual to retroactively
       | define the behavior of BSR/BSF on zero inputs: it leaves the
       | destination register unmodified. This matches the AMD manual, and
       | I suspect it matches the behavior of all existing x86-64
       | processors (but that will need to be tested, I guess).
       | 
       | If so, you don't need either a branch or CMOV. Just set a
       | register to 32, then run BSR with the same register as
       | destination. If the BSR input is nonzero, the 32 is overwritten
       | with the trailing-zero count. If the BSR input is zero, then BSR
       | leaves the register unmodified and you get 32.
       | 
       | Since this behavior is now guaranteed for future x86-64
       | processors, and assuming it's indeed compatible with all existing
       | x86-64 processors (maybe even all x86 processors period?), LLVM
       | will no longer need the old path regardless of what it's
       | targeting.
       | 
       | Note that if you're targeting a newer x86-64 version, LLVM will
       | just emit TZCNT, which just does what you'd expect and returns 32
       | if the input is zero (or 64 for a 64-bit TZCNT). But as the blog
       | post demonstrates, many people still build for baseline x86_64.
       | 
       | (Intel does document one discrepancy between processors: "On some
       | older processors, use of a 32-bit operand size may clear the
       | upper 32 bits of a 64-bit destination while leaving the lower 32
       | bits unmodified.")
        
       | orlp wrote:
       | If you have access to the BMI2 instruction set I can do
       | branchless UTF-8 encoding like in the article using only 9
       | instructions and 73 bytes of lookup tables:
       | branchless_utf8:             mov     rax, rdi             lzcnt
       | ecx, esi             lea     rdx, [rip + .L__unnamed_1]
       | movzx   ecx, byte ptr [rcx + rdx]             lea     rdx, [rip +
       | example::DEP_AND_OR::h78cbe1dc7fe823a9]             pdep    esi,
       | esi, dword ptr [rdx + 8*rcx]             or      esi, dword ptr
       | [rdx + 8*rcx + 4]             movbe   dword ptr [rdi], esi
       | mov     qword ptr [rdi + 8], rcx             ret
       | 
       | The code:                   static DEP_AND_OR: [(u32, u32); 5] =
       | [             (0, 0),
       | (0b01111111_00000000_00000000_00000000,
       | 0b00000000_00000000_00000000_00000000),
       | (0b00011111_00111111_00000000_00000000,
       | 0b11000000_10000000_00000000_00000000),
       | (0b00001111_00111111_00111111_00000000,
       | 0b11100000_10000000_10000000_00000000),
       | (0b00000111_00111111_00111111_00111111,
       | 0b11110000_10000000_10000000_10000000),         ];
       | const LEN: [u8; 33] = [             // 0-10 leading zeros: not
       | valid.             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       | // 11-15 leading zeros: 4 bytes.             4, 4, 4, 4, 4,
       | // 16-20 leading zeros: 3 bytes.             3, 3, 3, 3, 3,
       | // 21-24 leading zeros: 2 bytes.             2, 2, 2, 2,
       | // 25-32 leading zeros: 1 byte.             1, 1, 1, 1, 1, 1, 1,
       | 1,         ];              pub unsafe fn
       | branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
       | let leading_zeros = codepoint.leading_zeros() as usize;
       | let bytes = LEN[leading_zeros] as usize;             let (mask,
       | or) = *DEP_AND_OR.get_unchecked(bytes);             let ret =
       | core::arch::x86_64::_pdep_u32(codepoint, mask) | or;
       | (ret.swap_bytes().to_le_bytes(), bytes)         }
        
       | Validark wrote:
       | I wrote this stub a while back
       | 
       | https://gist.github.com/Validark/457b6db8aa00ded26a6681d4d25...
        
       | Dwedit wrote:
       | Wouldn't branchless UTF-8 encoding always write 3 bytes to RAM
       | for every character (possibly to the same address)?
        
         | ngoldbaum wrote:
         | You could do two passes over the string, first get the total
         | length in bytes, then fill it in codepoint by codepoint.
         | 
         | You could also pessimistically over-allocate assuming four
         | bytes per character and then resize afterwards.
         | 
         | With the API in the linked blog post it's up to the user to
         | decide how they want to use the output [u8;4] array.
        
       | koala_man wrote:
       | I'm surprised there are no UTF-8 specific decode instructions
       | yet, the way ARM has "FJCVTZS - Floating-point Javascript Convert
       | to Signed fixed-point, rounding toward Zero"
        
         | HeliumHydride wrote:
         | https://developer.arm.com/documentation/ddi0602/2024-12/SIMD...
        
         | jsheard wrote:
         | FJCVTZS isn't really as specific to Javascript as the name
         | suggests, it actually copies the semantics of the equivalent
         | x86 instruction, which JS took its semantics from. ARM
         | obviously doesn't want to name-drop x86 so they call it a JS
         | instruction instead. JS runtimes do use it but it's also useful
         | for x86 emulation in general.
        
       | RenThraysk wrote:
       | Or-ing 1 onto codepoint before calling leading_zeroes() should
       | get a decent compiler to remove the branch.
        
       ___________________________________________________________________
       (page generated 2025-01-17 23:00 UTC)