[HN Gopher] Optimizing uint64_t Digit Counting: A Method that Be...
       ___________________________________________________________________
        
       Optimizing uint64_t Digit Counting: A Method that Beats Lemire's by
       up to 143%
        
       Author : realtimechris
       Score  : 61 points
       Date   : 2025-01-06 00:23 UTC (2 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | realtimechris wrote:
       | *Optimizing uint64_t Digit Counting: A New Method that Beats
       | Lemire's by Up to 27%*
       | 
       | In the quest to improve the performance of my high-speed JSON
       | library, JSONIFIER, I recently stumbled upon a breakthrough in
       | optimizing the calculation of digit counts for `uint64_t` values.
       | While Lemire's method of using the `lzcnt` instruction for
       | determining the number of digits in a 32-bit unsigned integer has
       | been widely regarded as efficient, I've developed a new method
       | that achieves even faster results for 64-bit unsigned integers
       | (i.e., `uint64_t`), with significant gains across different
       | compilers and platforms.
       | 
       | ### The Existing Method: Lemire's Approach
       | 
       | Lemire's method, known for its efficiency, calculates the number
       | of digits in a `uint32_t` by leveraging the `lzcnt` instruction,
       | which finds the index of the most significant bit set to 1. This
       | is combined with a static lookup table to map the result to the
       | corresponding number of digits.
       | 
       | Here's the code for Lemire's method:
       | 
        | ```cpp
        | JSONIFIER_INLINE int int_log2(uint32_t x) {
        |     return 31 - simd_internal::lzcnt(x | 1);
        | }
        | 
        | JSONIFIER_INLINE int fast_digit_count(uint32_t x) {
        |     static uint64_t table[] = { ... };
        |     return (x + table[int_log2(x)]) >> 32;
        | }
        | ```
       | 
       | While this approach works well for 32-bit integers, the need for
       | a faster and more efficient solution for `uint64_t` led me to
       | create an alternative method, which uses a more streamlined
       | approach without the overhead of a lookup table.
       | 
       | ### My New Method: RTC-64-Bit Digit Counting
       | 
       | I've designed a new approach for 64-bit unsigned integers that
       | leverages a similar logic but optimizes the process by storing
       | precomputed digit counts for specific ranges and applying simple
       | threshold checks. The result is faster execution with reduced
       | computational overhead.
       | 
       | Here's the code for the new method:
       | 
        | ```cpp
        | JSONIFIER_INLINE_VARIABLE uint8_t digitCounts[]{ ... };
        | 
        | JSONIFIER_INLINE_VARIABLE uint64_t digitCountThresholds[]{ ... };
        | 
        | JSONIFIER_INLINE uint64_t fastDigitCount(const uint64_t inputValue) {
        |     const uint64_t originalDigitCount{ digitCounts[simd_internal::lzcnt(inputValue)] };
        |     return originalDigitCount
        |         + static_cast<uint64_t>(inputValue > digitCountThresholds[originalDigitCount]);
        | }
        | ```
       | 
       | This method works by using a static array to hold the precomputed
       | digit counts and another array for threshold values that
       | determine the exact number of digits in a `uint64_t`. The key
       | optimization lies in the efficient use of a bit manipulation
       | technique and direct threshold checking to avoid unnecessary
       | computations.
       | 
        | ### [Benchmark Results](https://github.com/RealTimeChris/BenchmarkSuite/blob/digit-c...)
       | 
       | I ran performance benchmarks comparing my new RTC-64-bit method
       | with Lemire's approach and the traditional `log10` method across
       | various platforms and compilers. The results were consistently
       | impressive:
       | 
        | #### GCC/Ubuntu:
        | - *RTC-64-bit* outperforms *Lemire-32-bit* by *27.33%*.
        | - *Lemire-32-bit* beats *Log10-32-bit* by a massive *814.16%*.
        | 
        | #### Clang/Ubuntu:
        | - *RTC-64-bit* outperforms *Lemire-32-bit* by *143.34%*.
        | - *Lemire-32-bit* beats *Log10-32-bit* by *522.01%*.
        | 
        | #### MSVC/Windows:
        | - *RTC-64-bit* is *12.50%* faster than *Lemire-32-bit*.
        | - *Lemire-32-bit* beats *Log10-32-bit* by *515.90%*.
        | 
        | #### Clang/MacOS:
        | - *RTC-64-bit* is *25.37%* faster than *Lemire-32-bit*.
        | - *Lemire-32-bit* beats *Log10-32-bit* by *343.97%*.
       | 
       | ### Key Takeaways
       | 
       | The RTC-64-bit method not only delivers improved performance on
       | modern hardware but also significantly reduces the overhead of
       | traditional methods by eliminating the need for a large lookup
       | table. This is especially beneficial for high-performance
       | applications where every cycle counts, such as in JSON
       | serialization and parsing, where speed is critical.
        
         | nick__m wrote:
          | if you prefix your code with two spaces on every line it will be
          | more readable, ex:
          | 
          |       JSONIFIER_INLINE int fast_digit_count(uint32_t x) {
          |           static uint64_t table[] = { ... };
          |           return (x + table[int_log2(x)]) >> 32;
          |       }
        
         | orra wrote:
          | Very nice work. But could you please explain _why_ counting
          | digits quickly (or even slowly) is useful for a JSON
          | serializer? This is lower level than I'm used to working at.
          | Is this so you can alloc the right amount of memory, or
          | something else?
        
           | ComputerGuru wrote:
            | Yes. And for string formatting in general, such as sprintf
            | needing to calculate a buffer size. Also for purposes of
            | calculating length for string padding and alignment.
        
           | o11c wrote:
           | Honestly? Most of the time you can use an approximation. At
           | worst you allocate a single extra byte unless your integers
           | are hundreds of bits long (I worked this out once but I
           | forget where I left the details). Just multiply by the ratio
           | of logs or something like that.
        
             | Nevermark wrote:
             | Nice. The best kind of optimization: Go lazy, to go fast.
             | 
             | I often go slow, to go fast. But lazy is the supremum.
        
             | eapriv wrote:
             | Isn't "the ratio of logs" going to be slower?
        
               | o11c wrote:
               | I mean the constant logs of the bases; you get to hard-
               | code this in the algorithm.
               | 
               | One interesting coincidence is that log10(2) ~= log2(10)
               | - 3, thus the 83 ~= 78 below.
               | 
               | If you use an 8-bit shift, depending on which direction
               | you're going, you multiply the input length by 851 (which
               | is 3*256 + 83) or 78, then shift by 8. This starts being
               | wasteful at 23 bits, but is pretty cheap on 8-bit CPUs.
               | 
               | For a more accurate approximation (slackening at 289
               | bits), 217706 (3<<16 + 21098) or 19729, then >> 16
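The 16-bit flavor described above can be sketched as follows (helper names are mine, constants are the commenter's: 19729/2^16 ~= log10(2) for bits-to-digits, 217706/2^16 ~= log2(10) for digits-to-bits):

```cpp
#include <cstdint>

// Multiply-shift approximations between bit length and digit length.
inline uint32_t approx_digits_from_bits(uint32_t bits) {
    return (bits * 19729u) >> 16;    // ~= floor(bits * log10(2))
}

inline uint32_t approx_bits_from_digits(uint32_t digits) {
    return (digits * 217706u) >> 16; // ~= floor(digits * log2(10))
}
```

Allocating `approx_digits_from_bits(64) + 1 == 20` bytes covers any `uint64_t`, matching the exact bound discussed elsewhere in the thread.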
        
             | exyi wrote:
             | I'd assume that the serializer is writing directly into the
             | output buffer, so you'd have to shift everything left by
             | one character if you overallocate. With the added checks
             | for this, it might be faster to compute it precisely the
             | first time.
        
           | vbezhenar wrote:
            | Because JSON is not well suited as a fast machine exchange
            | format, and people try to make it work anyway.
           | 
           | If you ask me, JSON serializers should be simple and slow,
           | and if you need speed, change wire format for something
           | intrinsically fast.
           | 
           | I understand that this is not the approach that people will
           | take, most developers prefer simple interface with complex
           | implementation over complex interface with simple
           | implementation. But my opinion is that implementation
           | simplicity is what matters in the end.
        
             | touisteur wrote:
             | Another way of seeing this is: whatever way people are
             | using JSON, over a fleet of all the machines running the
             | parser (or serializer), reducing the execution time (a good
             | enough proxy to energy consumption) on the whole fleet is
             | worthwile. Especially if it has zero impact on the code
             | calling this parser/serializer (so, no second-order energy
             | spent because of the change breaking some API or contract).
        
           | GuB-42 wrote:
           | I think it is the case: first allocate the space, then write.
           | 
           | However, I am not sure how significant the gain is here.
           | Actually printing the number requires a division every digit,
           | maybe 2 or 3, which I believe is much slower than even naive
           | digit counting, and you will have to do it eventually.
           | 
           | Maybe it is worth it if there is parallelism involved: write
           | the JSON with blank spaces for the numbers, then have another
           | thread write the numbers. Writing the numbers is
           | computationally expensive because of the divisions, so if you
           | have a lot of large numbers, it may be worth dedicating some
           | cores to this task.
        
             | realtimechris wrote:
              | No, you only require division for digit counts greater than
              | or equal to 10:
              | https://github.com/RealTimeChris/Jsonifier/blob/dev/Include/...
        
           | realtimechris wrote:
            | So we can properly dispatch the correct length-based function
            | with minimal branching to detect which length to serialize for:
            | https://github.com/RealTimeChris/Jsonifier/blob/dev/Include/...
        
         | mabster wrote:
          | FYI, Hacker's Delight uses a slight variation of Lemire's. He
         | multiplies by 19/64 (so the divisor is a shift) and has the
         | highest integer at the end of his table (i.e. 2^64 - 1). He
         | also uses a shift instead of a conditional, so his code is
         | branchless.
         | 
         | Generally I prefer this kind of approach. But, I suspect your
         | digit counting is a very hot path so your tables will generally
         | be in cache. So your approach will likely win anyway.
        
       | dzaima wrote:
       | A latency improvement would be to have digitCountThresholds be
       | indexed by the lzcnt, instead of the result of the other LUT.
       | Increases the size of that lookup table from 160B to 512B though.
       | Funnily enough I've had this approach in a local copy of Ryu when
       | I was working on cutting it down for my purposes. Unfortunately
       | whatever ended up public had cut this out too.
       | 
       | edit: some chat messages of me at [0]. Some previous unrelated
       | discussion found at [1].
       | 
       | [0]:
       | https://app.element.io/#/room/#bqn:matrix.org/$KKIK86x0tygAf...
       | (arbitrarily capped at 18 digits because Ryu didn't need more)
       | 
       | [1]: https://github.com/ulfjack/ryu/issues/34
        
         | adgjlsfhk1 wrote:
         | you can cut back the table size at the cost of 1 cycle latency
         | by shifting the lzcnt right by 3. this works since 2^3<10
        
           | dzaima wrote:
           | That doesn't work - the numbers change every 3 or 4 entries,
           | i.e. ~log2(10), not 10. (and the precise point of change is
           | important too)
        
             | adgjlsfhk1 wrote:
              | Good point. The other option is to use
              | (1233*(64-lzcnt(x)))>>12 to bypass the first lookup
              | entirely (not sure if this is faster or not), which works
              | since 1233/2^12 is close enough to log_10(2).
        
               | dzaima wrote:
               | Multiplication typically takes 3 cycles, so a total of
               | 1+3+1=5 cycles of latency. A load from L1 takes 4-5
               | cycles on non-ancient x86 and Apple M1. So roughly on-
               | par, perhaps a bit faster, at the cost of eating some ILP
               | in the odd case that there's not still plenty left (also
               | better if not already in-cache, though only significantly
               | so if using that instead of the lzcnt directly as the
               | threshold table index).
        
       | 0xcoffee wrote:
        | C# Version:
        | 
        |       private static uint FastDigitCount(ulong value)
        |       {
        |           ReadOnlySpan<byte> digitCounts = [19, 19, 19, 19, 18, 18,
        |               18, 17, 17, 17, 16, 16, 16, 16, 15, 15, 15, 14, 14, 14,
        |               13, 13, 13, 13, 12, 12, 12, 11, 11, 11, 10, 10, 10, 10,
        |               9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4,
        |               4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1];
        |           ReadOnlySpan<ulong> digitCountThresholds = [0, 9, 99, 999,
        |               9999, 99999, 999999, 9999999, 99999999, 999999999,
        |               9999999999, 99999999999, 999999999999, 9999999999999,
        |               99999999999999, 999999999999999, 9999999999999999,
        |               99999999999999999, 999999999999999999,
        |               9999999999999999999];
        |           var leadingZeros = BitOperations.LeadingZeroCount(value);
        |           var originalDigitCount = digitCounts[leadingZeros];
        |           return originalDigitCount
        |               + (value > digitCountThresholds[originalDigitCount] ? 1u : 0u);
        |       }
       | 
       | Benchmark: https://stackoverflow.com/a/79337820/4503491
        
       | hairtuq wrote:
        | Here is a cute variant that needs neither lzcnt nor tables, but
        | only works up to 99,999:
        | 
        |       (((x + 393206) & (x + 524188)) ^ ((x + 916504) & (x + 514288))) >> 17
       | 
       | This is for integer log 10, but could be adapted for number of
       | digits. It needs a wrapper for 64 bit to invoke it multiple
       | times, but most numbers in a JSON are small, so it might even be
       | competitive; it needs only 4 cycles with enough instruction level
       | parallelism.
       | 
       | I gathered this idea from the output of a superoptimizer, it was
       | fun to figure out how it works. For spoilers, see [1].
       | 
        | [1] https://github.com/rust-lang/rust/blob/master/library/core/s...
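Wrapped in a function (name mine; constants exactly as given above), the trick can be checked directly:

```cpp
#include <cstdint>

// Integer log10 for 0 <= x <= 99999 using only adds, ANDs, an XOR and a
// shift. Each (x + C) term carries into the bits at and above bit 17
// exactly when x crosses a power of ten; the AND/XOR combination then
// assembles the log10 value in bits 17..19.
inline uint32_t ilog10_small(uint32_t x) {
    return (((x + 393206) & (x + 524188)) ^ ((x + 916504) & (x + 514288))) >> 17;
}
```

A digit count would then be `ilog10_small(x) + 1` for this range.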
        
         | Aardwolf wrote:
         | > that doesn't need lzcnt
         | 
         | Is it a big advantage to not need it, or can we safely assume
         | CPU's have fast instructions for this today?
        
           | touisteur wrote:
           | Sometimes you have a scalar instruction but not a vectorized
           | one, or it doesn't match the 'lanes' you want to operate
           | (ISTR weird holes in AVX where I could find a specific
           | instruction for 8, 32, 64-bits lanes but not for 16). Always
           | good to have an escape hatch there, especially a highly
           | pipelined (or pipeline-able) one.
        
         | Veedrac wrote:
         | It's not obvious to me that this is worth much over
         | (x > 9) + (x > 99) + (x > 999) + (x > 9999)
         | 
         | (x > CONST) does tend to need a pair of instructions to get the
         | status flag into a register, but those can also fuse with the
         | adds. Critical path latency is cmp-setae-sbb-add, or also four
         | cycles.
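As a compilable sketch of that comparison-sum expression (function name mine), computing the same integer log10 for the same 0..99,999 range:

```cpp
#include <cstdint>

// Integer log10 for x <= 99999: each comparison contributes 1 once x
// passes the next power of ten.
inline uint32_t ilog10_cmp(uint32_t x) {
    return (x > 9) + (x > 99) + (x > 999) + (x > 9999);
}
```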
        
       | billpg wrote:
       | If you were to ask me to write some code to convert an int64 into
       | ascii decimals, I'd start with a 20 character buffer (because
       | int64 will never need more), loop to populate the buffer with
       | ascii decimal digits from the lowest significant end, then once
       | the loop has finished and we know how many digits, copy the
       | buffer into its place.
       | 
        | I'd love to be told that this method in the article is actually
        | significantly more efficient than my possibly naive approach.
        
         | matheusmoreira wrote:
         | > I'd start with a 20 character buffer (because int64 will
         | never need more)
         | 
         | Just in case anyone is wondering, this is because the maximum
         | number of decimal digits for a given number of bits is given
          | by:
          | 
          |       digits(bits) = ceil(bits * log10(2))
          | 
          | This can be generalized to any number of bits:
          | 
          |       digits(32)   = 10
          |       digits(64)   = 20
          |       digits(128)  = 39
          |       digits(256)  = 78
          |       digits(512)  = 155
          |       digits(1024) = 309
        
       | Cold_Miserable wrote:
        |       mov   eax, 64
        |       lzcnt r8, rcx
        |       sub   eax, r8d
        |       imul  eax, 1233
        |       shr   eax, 12
        | 
        | Accurate to within 1.
        
       ___________________________________________________________________
       (page generated 2025-01-08 23:02 UTC)