[HN Gopher] Optimizing uint64_t Digit Counting: A Method that Beats Lemire's by up to 143%
___________________________________________________________________
Optimizing uint64_t Digit Counting: A Method that Beats Lemire's by
up to 143%
Author : realtimechris
Score : 61 points
Date : 2025-01-06 00:23 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| realtimechris wrote:
| *Optimizing uint64_t Digit Counting: A New Method that Beats
| Lemire's by Up to 27%*
|
| In the quest to improve the performance of my high-speed JSON
| library, JSONIFIER, I recently stumbled upon a breakthrough in
| optimizing the calculation of digit counts for `uint64_t` values.
| While Lemire's method of using the `lzcnt` instruction for
| determining the number of digits in a 32-bit unsigned integer has
| been widely regarded as efficient, I've developed a new method
| that achieves even faster results for 64-bit unsigned integers
| (i.e., `uint64_t`), with significant gains across different
| compilers and platforms.
|
| ### The Existing Method: Lemire's Approach
|
| Lemire's method, known for its efficiency, calculates the number
| of digits in a `uint32_t` by leveraging the `lzcnt` instruction,
| which counts the leading zero bits (so 31 minus the count gives
| the index of the most significant set bit). This is combined with
| a static lookup table to map the result to the corresponding
| number of digits.
|
| Here's the code for Lemire's method:
|
| ```cpp
| JSONIFIER_INLINE int int_log2(uint32_t x) {
|     return 31 - simd_internal::lzcnt(x | 1);
| }
|
| JSONIFIER_INLINE int fast_digit_count(uint32_t x) {
|     static uint64_t table[] = { ... };
|     return (x + table[int_log2(x)]) >> 32;
| }
| ```
|
| While this approach works well for 32-bit integers, the need for
| a faster and more efficient solution for `uint64_t` led me to
| create an alternative method, which uses a more streamlined
| approach without the overhead of a lookup table.
|
| ### My New Method: RTC-64-Bit Digit Counting
|
| I've designed a new approach for 64-bit unsigned integers that
| leverages similar logic but optimizes the process by storing
| precomputed digit counts for specific ranges and applying simple
| threshold checks. The result is faster execution with reduced
| computational overhead.
|
| Here's the code for the new method:
|
| ```cpp
| JSONIFIER_INLINE_VARIABLE uint8_t digitCounts[]{ ... };
|
| JSONIFIER_INLINE_VARIABLE uint64_t digitCountThresholds[]{ ... };
|
| JSONIFIER_INLINE uint64_t fastDigitCount(const uint64_t inputValue) {
|     const uint64_t originalDigitCount{
|         digitCounts[simd_internal::lzcnt(inputValue)] };
|     return originalDigitCount + static_cast<uint64_t>(
|         inputValue > digitCountThresholds[originalDigitCount]);
| }
| ```
|
| This method works by using one static array that holds the digit
| count implied by the number of leading zeros, and a second array
| of power-of-ten thresholds. Since a value of a given bit length
| can span at most two possible digit counts, a single table lookup
| on the `lzcnt` result plus one threshold comparison resolves the
| exact count, avoiding any division or per-digit work.
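|
| For readers who want something self-contained, here's a minimal
| sketch of the same two-table idea (not Jsonifier's exact code;
| the table values mirror the C# port quoted further down the
| thread):
|
| ```cpp
| #include <bit>
| #include <cstdint>
|
| // Digit count implied by the leading-zero count;
| // std::countl_zero(uint64_t{0}) == 64 maps to the final entry.
| constexpr uint8_t digitCounts[65] = {
|     19,19,19,19, 18,18,18, 17,17,17, 16,16,16,16, 15,15,15, 14,14,14,
|     13,13,13,13, 12,12,12, 11,11,11, 10,10,10,10, 9,9,9, 8,8,8,
|     7,7,7,7, 6,6,6, 5,5,5, 4,4,4,4, 3,3,3, 2,2,2, 1,1,1,1,1 };
|
| // digitCountThresholds[d] == 10^d - 1, the largest d-digit value.
| constexpr uint64_t digitCountThresholds[20] = {
|     0ull, 9ull, 99ull, 999ull, 9999ull, 99999ull, 999999ull,
|     9999999ull, 99999999ull, 999999999ull, 9999999999ull,
|     99999999999ull, 999999999999ull, 9999999999999ull,
|     99999999999999ull, 999999999999999ull, 9999999999999999ull,
|     99999999999999999ull, 999999999999999999ull,
|     9999999999999999999ull };
|
| uint64_t fastDigitCount(uint64_t v) {
|     const uint64_t base = digitCounts[std::countl_zero(v)];
|     // e.g. v = 1'000'000: lzcnt = 44, base = 6, and since
|     // 1'000'000 > 999'999 the comparison adds 1, giving 7.
|     return base + (v > digitCountThresholds[base]);
| }
| ```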
|
| ### [Benchmark Results](https://github.com/RealTimeChris/Benchmar
| kSuite/blob/digit-c...)
|
| I ran performance benchmarks comparing my new RTC-64-bit method
| with Lemire's approach and the traditional `log10` method across
| various platforms and compilers. The results were consistently
| impressive:
|
| #### GCC/Ubuntu:
|
| - *RTC-64-bit* outperforms *Lemire-32-bit* by *27.33%*.
| - *Lemire-32-bit* beats *Log10-32-bit* by a massive *814.16%*.
|
| #### Clang/Ubuntu:
|
| - *RTC-64-bit* outperforms *Lemire-32-bit* by *143.34%*.
| - *Lemire-32-bit* beats *Log10-32-bit* by *522.01%*.
|
| #### MSVC/Windows:
|
| - *RTC-64-bit* is *12.50%* faster than *Lemire-32-bit*.
| - *Lemire-32-bit* beats *Log10-32-bit* by *515.90%*.
|
| #### Clang/MacOS:
|
| - *RTC-64-bit* is *25.37%* faster than *Lemire-32-bit*.
| - *Lemire-32-bit* beats *Log10-32-bit* by *343.97%*.
|
| ### Key Takeaways
|
| The RTC-64-bit method not only delivers improved performance on
| modern hardware but also significantly reduces the overhead of
| traditional methods by eliminating the need for a large lookup
| table. This is especially beneficial in applications where every
| cycle counts, such as JSON serialization and parsing.
| nick__m wrote:
| if you prefix your code by two spaces on every line it will be
| more readable, ex:
|
|     JSONIFIER_INLINE int fast_digit_count(uint32_t x) {
|         static uint64_t table[] = { ... };
|         return (x + table[int_log2(x)]) >> 32;
|     }
| orra wrote:
| Very nice work. But could you please explain _why_ counting
| digits quickly (or even slowly) is useful for a JSON
| serializer? This is lower level than I'm used to working. Is
| this so you can alloc the right amount of memory, or something
| else?
| ComputerGuru wrote:
| Yes. And for string formatting in general, such as sprintf, to
| calculate buffer size. Also for purposes of calculating the
| length for string padding and alignment.
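|
| For instance (a hedged sketch; helper name is mine, and
| fastDigitCount is as in the snippet upthread): with the digit
| count in hand, a serializer can grow its buffer by exactly the
| bytes it needs and write the digits in place:
|
|     #include <cstdint>
|     #include <string>
|
|     void append_uint(std::string& out, uint64_t v) {
|         const size_t n = fastDigitCount(v);  // exact digit count
|         const size_t old = out.size();
|         out.resize(old + n);                 // one exact-size grow
|         char* p = &out[old] + n;
|         // write digits back-to-front into the reserved space
|         do { *--p = char('0' + v % 10); v /= 10; } while (v);
|     }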
| o11c wrote:
| Honestly? Most of the time you can use an approximation. At
| worst you allocate a single extra byte unless your integers
| are hundreds of bits long (I worked this out once but I
| forget where I left the details). Just multiply by the ratio
| of logs or something like that.
| Nevermark wrote:
| Nice. The best kind of optimization: Go lazy, to go fast.
|
| I often go slow, to go fast. But lazy is the supremum.
| eapriv wrote:
| Isn't "the ratio of logs" going to be slower?
| o11c wrote:
| I mean the constant logs of the bases; you get to hard-
| code this in the algorithm.
|
| One interesting coincidence is that log10(2) ~= log2(10)
| - 3, thus the 83 ~= 78 below.
|
| If you use an 8-bit shift, depending on which direction
| you're going, you multiply the input length by 851 (which
| is 3*256 + 83) or 78, then shift by 8. This starts being
| wasteful at 23 bits, but is pretty cheap on 8-bit CPUs.
|
| For a more accurate approximation (slackening at 289
| bits), 217706 (3<<16 + 21098) or 19729, then >> 16
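|
| Concretely, as a hedged sketch (function names are mine; the
| "+ 1" makes these upper bounds, which can overshoot by one byte
| as described):
|
|     #include <cstdint>
|
|     // digits <= bits * log10(2) + 1; 78/256 ~= 0.3047 sits just
|     // above log10(2) ~= 0.3010, so the bound can overallocate one
|     // byte (first at 23 bits).
|     uint32_t digits_upper_bound(uint32_t bits) {
|         return ((bits * 78u) >> 8) + 1;
|     }
|
|     // Reverse direction: bits sufficient for a digit count, using
|     // 851/256 = 3 + 83/256 ~= 3.3242, just above log2(10) ~= 3.3219.
|     uint32_t bits_upper_bound(uint32_t digits) {
|         return ((digits * 851u) >> 8) + 1;
|     }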
| exyi wrote:
| I'd assume that the serializer is writing directly into the
| output buffer, so you'd have to shift everything left by
| one character if you overallocate. With the added checks
| for this, it might be faster to compute it precisely the
| first time.
| vbezhenar wrote:
| Because JSON is not well suited as a fast machine exchange
| format, yet people keep trying to make it work anyway.
|
| If you ask me, JSON serializers should be simple and slow,
| and if you need speed, change wire format for something
| intrinsically fast.
|
| I understand that this is not the approach that people will
| take, most developers prefer simple interface with complex
| implementation over complex interface with simple
| implementation. But my opinion is that implementation
| simplicity is what matters in the end.
| touisteur wrote:
| Another way of seeing this is: whatever way people are
| using JSON, over a fleet of all the machines running the
| parser (or serializer), reducing the execution time (a good
| enough proxy for energy consumption) on the whole fleet is
| worthwhile. Especially if it has zero impact on the code
| calling this parser/serializer (so, no second-order energy
| spent because of the change breaking some API or contract).
| GuB-42 wrote:
| I think it is the case: first allocate the space, then write.
|
| However, I am not sure how significant the gain is here.
| Actually printing the number requires a division per digit (or
| per 2 or 3 digits with tricks), which I believe is much slower
| than even naive digit counting, and you will have to do it
| eventually.
|
| Maybe it is worth it if there is parallelism involved: write
| the JSON with blank spaces for the numbers, then have another
| thread write the numbers. Writing the numbers is
| computationally expensive because of the divisions, so if you
| have a lot of large numbers, it may be worth dedicating some
| cores to this task.
| realtimechris wrote:
| No, you only require division for digit-counts greater than
| or equal to 10: https://github.com/RealTimeChris/Jsonifier/
| blob/dev/Include/...
| realtimechris wrote:
| So we can properly dispatch the correct length-based function
| with minimal branching to detect which length to serialize
| for: https://github.com/RealTimeChris/Jsonifier/blob/dev/Incl
| ude/...
| mabster wrote:
| FYI, Hacker's Delight uses a slight variation of Lemire's. He
| multiplies by 19/64 (so the divisor is a shift) and has the
| highest integer at the end of his table (i.e. 2^64 - 1). He
| also uses a shift instead of a conditional, so his code is
| branchless.
|
| Generally I prefer this kind of approach. But, I suspect your
| digit counting is a very hot path so your tables will generally
| be in cache. So your approach will likely win anyway.
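|
| Something along these lines (my reconstruction, not the book's
| verbatim code): 19/64 ~= 0.297 is a lower approximation of
| log10(2) ~= 0.301, so the estimate never overshoots, and one
| branchless correction finishes the job.
|
|     #include <bit>
|     #include <cstdint>
|
|     // pow10m1[i] == 10^i - 1, capped with 2^64 - 1 at the end as
|     // in the book's table.
|     constexpr uint64_t pow10m1[21] = {
|         0ull, 9ull, 99ull, 999ull, 9999ull, 99999ull, 999999ull,
|         9999999ull, 99999999ull, 999999999ull, 9999999999ull,
|         99999999999ull, 999999999999ull, 9999999999999ull,
|         99999999999999ull, 999999999999999ull, 9999999999999999ull,
|         99999999999999999ull, 999999999999999999ull,
|         9999999999999999999ull, 18446744073709551615ull };
|
|     uint32_t hd_style_digit_count(uint64_t x) {
|         // 19/64 <= log10(2), so y <= floor(log10(x)).
|         uint32_t y = (19 * (63 - std::countl_zero(x | 1))) >> 6;
|         // The subtraction borrows exactly when x > 10^(y+1) - 1;
|         // shifting the borrow down adds 1 without a branch.
|         y += uint32_t((pow10m1[y + 1] - x) >> 63);
|         return y + 1;  // digits = floor(log10(x)) + 1
|     }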
| dzaima wrote:
| A latency improvement would be to have digitCountThresholds be
| indexed by the lzcnt, instead of the result of the other LUT.
| Increases the size of that lookup table from 160B to 512B though.
| Funnily enough I've had this approach in a local copy of Ryu when
| I was working on cutting it down for my purposes. Unfortunately
| whatever ended up public had cut this out too.
|
| edit: some chat messages of mine at [0]. Some previous unrelated
| discussion found at [1].
|
| [0]:
| https://app.element.io/#/room/#bqn:matrix.org/$KKIK86x0tygAf...
| (arbitrarily capped at 18 digits because Ryu didn't need more)
|
| [1]: https://github.com/ulfjack/ryu/issues/34
| adgjlsfhk1 wrote:
| you can cut back the table size at the cost of 1 cycle latency
| by shifting the lzcnt right by 3. This works since 2^3 < 10.
| dzaima wrote:
| That doesn't work - the numbers change every 3 or 4 entries,
| i.e. ~log2(10), not 10. (and the precise point of change is
| important too)
| adgjlsfhk1 wrote:
| Good point. The other option is to use
| (1233*(64-lzcnt(x)))>>12 to bypass the first lookup
| entirely (not sure if this is faster or not), which works
| since 1233/2^12 is close enough to log10(2).
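|
| For illustration, a sketch of that variant (not benchmarked;
| it reuses digitCountThresholds from upthread, and x|1 keeps the
| zero case at one digit):
|
|     #include <bit>
|     #include <cstdint>
|
|     uint64_t fastDigitCountMul(uint64_t v) {
|         const uint64_t w = v | 1;  // treat 0 as a 1-digit value
|         // 1233/2^12 ~= 0.30103 ~= log10(2)
|         const uint64_t base =
|             ((64 - std::countl_zero(w)) * 1233) >> 12;
|         return base + (w > digitCountThresholds[base]);
|     }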
| dzaima wrote:
| Multiplication typically takes 3 cycles, so a total of
| 1+3+1=5 cycles of latency. A load from L1 takes 4-5
| cycles on non-ancient x86 and Apple M1. So roughly on-
| par, perhaps a bit faster, at the cost of eating some ILP
| in the odd case that there's not still plenty left (also
| better if not already in-cache, though only significantly
| so if using that instead of the lzcnt directly as the
| threshold table index).
| 0xcoffee wrote:
| C# Version:
|
|     private static uint FastDigitCount(ulong value)
|     {
|         ReadOnlySpan<byte> digitCounts = [19, 19, 19, 19, 18, 18,
|             18, 17, 17, 17, 16, 16, 16, 16, 15, 15, 15, 14, 14, 14,
|             13, 13, 13, 13, 12, 12, 12, 11, 11, 11, 10, 10, 10, 10,
|             9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4,
|             4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1];
|         ReadOnlySpan<ulong> digitCountThresholds = [0, 9, 99, 999,
|             9999, 99999, 999999, 9999999, 99999999, 999999999,
|             9999999999, 99999999999, 999999999999, 9999999999999,
|             99999999999999, 999999999999999, 9999999999999999,
|             99999999999999999, 999999999999999999,
|             9999999999999999999];
|         var leadingZeros = BitOperations.LeadingZeroCount(value);
|         var originalDigitCount = digitCounts[leadingZeros];
|         return originalDigitCount +
|             (value > digitCountThresholds[originalDigitCount] ? 1u : 0u);
|     }
|
| Benchmark: https://stackoverflow.com/a/79337820/4503491
| hairtuq wrote:
| Here is a cute variant that doesn't need lzcnt or tables, but
| only works up to 99,999:
|
|     (((x + 393206) & (x + 524188)) ^ ((x + 916504) & (x + 514288))) >> 17
|
| This is for integer log 10, but could be adapted for number of
| digits. It needs a wrapper for 64 bit to invoke it multiple
| times, but most numbers in a JSON are small, so it might even be
| competitive; it needs only 4 cycles with enough instruction level
| parallelism.
|
| I gathered this idea from the output of a superoptimizer; it was
| fun to figure out how it works. For spoilers, see [1].
|
| [1] https://github.com/rust-
| lang/rust/blob/master/library/core/s...
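|
| A quick exhaustive check of the constants against a naive loop
| (treating ilog10(0) as 0, which is what the expression returns):
|
|     #include <cassert>
|     #include <cstdint>
|
|     uint32_t magic_ilog10(uint32_t x) {
|         return (((x + 393206) & (x + 524188)) ^
|                 ((x + 916504) & (x + 514288))) >> 17;
|     }
|
|     int main() {
|         uint32_t naive = 0, next = 10;
|         for (uint32_t x = 0; x <= 99999; ++x) {
|             if (x == next) { ++naive; next *= 10; }
|             assert(magic_ilog10(x) == naive);
|         }
|     }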
| Aardwolf wrote:
| > that doesn't need lzcnt
|
| Is it a big advantage to not need it, or can we safely assume
| CPUs have fast instructions for this today?
| touisteur wrote:
| Sometimes you have a scalar instruction but not a vectorized
| one, or it doesn't match the 'lanes' you want to operate on
| (ISTR weird holes in AVX where I could find a specific
| instruction for 8-, 32-, and 64-bit lanes but not for 16). Always
| good to have an escape hatch there, especially a highly
| pipelined (or pipeline-able) one.
| Veedrac wrote:
| It's not obvious to me that this is worth much over
|
|     (x > 9) + (x > 99) + (x > 999) + (x > 9999)
|
| (x > CONST) does tend to need a pair of instructions to get the
| status flag into a register, but those can also fuse with the
| adds. Critical path latency is cmp-setae-sbb-add, or also four
| cycles.
| billpg wrote:
| If you were to ask me to write some code to convert an int64 into
| ascii decimals, I'd start with a 20 character buffer (because
| int64 will never need more), loop to populate the buffer with
| ASCII decimal digits from the least significant end, then once
| the loop has finished and we know how many digits, copy the
| buffer into its place.
|
| I'd love to be told that the method in the article is actually
| significantly more efficient than my possibly naive approach.
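|
| For reference, a sketch of what I'm describing (names are mine):
|
|     #include <cstdint>
|     #include <cstring>
|
|     // Fill a 20-byte scratch buffer from the least significant
|     // digit, then copy the used tail into place; returns the
|     // digit count.
|     size_t write_u64(char* dst, uint64_t v) {
|         char tmp[20];  // a uint64_t never needs more than 20 digits
|         char* p = tmp + sizeof tmp;
|         do { *--p = char('0' + v % 10); v /= 10; } while (v);
|         const size_t n = size_t(tmp + sizeof tmp - p);
|         std::memcpy(dst, p, n);
|         return n;
|     }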
| matheusmoreira wrote:
| > I'd start with a 20 character buffer (because int64 will
| never need more)
|
| Just in case anyone is wondering, this is because the maximum
| number of decimal digits for a given number of bits is given by:
|
|     digits(bits) = ceil(bits * log10(2))
|
| This can be generalized to any number of bits:
|
|     digits(32)   = 10
|     digits(64)   = 20
|     digits(128)  = 39
|     digits(256)  = 78
|     digits(512)  = 155
|     digits(1024) = 309
| Cold_Miserable wrote:
|     mov   eax, 64
|     lzcnt r8, rcx
|     sub   eax, r8d
|     imul  eax, 1233
|     shr   eax, 12
|
| Accurate to within 1.
___________________________________________________________________
(page generated 2025-01-08 23:02 UTC)