https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/ Skip to content Daniel Lemire's blog Daniel Lemire is a computer science professor at the University of Quebec (TELUQ) in Montreal. His research is focused on software performance and data engineering. He is a techno-optimist and a free-speech advocate. Menu and widgets * My home page * My papers * My software Subscribe Join 12,500 subscribers: Email Address [ ] [ ] [Subscribe by email] You can also follow this blog on telegram. Search for: [ ] [Search] Support my work! I do not accept any advertisement. However, you can support the blog with donations through paypal. Please consider getting in touch if you are a supporter so that I can thank you. You can also support my work on GitHub. Recent Posts * Converting integers to decimal strings faster with AVX-512 * Writing out large arrays in Go: binary.Write is inefficient for large arrays * Enforcement by software * The Canadian Common CV and the captured academy * How many digits in a product? Recent Comments * Pika Kasinot on Enforcement by software * Dan on Don't underestimate the nerds * Daniel Lemire on Fastest way to compute the greatest common divisor * Daniel Lemire on Writing out large arrays in Go: binary.Write is inefficient for large arrays * mischa sandberg on Writing out large arrays in Go: binary.Write is inefficient for large arrays Pages * A short history of technology * About me * Book recommendations * Cognitive biases * Interviews and talks * My bets * My favorite articles * My favorite quotes * My readers * My sayings * Predictions * Recommended video games * Terms of use * Write good papers Archives Archives [Select Month ] Boring stuff * Log in * Entries feed * Comments feed * WordPress.org Converting integers to decimal strings faster with AVX-512 In most systems, integers are stored using a fixed binary representation. It is common to store integers using 32-bit or 64-bit words. You sometimes need to convert it to a string. For example, the integer 12345 might need to be converted to the five characters '12345'. In an earlier blog post, I presented the simpler problem of converting integers to fixed-digit strings, using exactly 16-characters with leading zeroes as needed. For example, the integer 12345 becomes the string '0000000000012345'. For this problem, the most practical approach might be a tree-based version with a small table. The core idea is to start from the integer, compute an integer representing the 8 most significant decimal digits, and another integer representing the least significant 8 decimal digit. Then we repeat, dividing the two eight-digit integers into two four-digit integers, and so forth until we get to two-digit integers in which case we use a small table to convert them to a decimal representation. The code in C++ might look as follows: void to_string_tree_table(uint64_t x, char *out) { static const char table[200] = { 0x30, 0x30, 0x30, 0x31, 0x30, 0x32, 0x30, 0x33, 0x30, 0x34, 0x30, 0x35, 0x30, 0x36, 0x30, 0x37, 0x30, 0x38, 0x30, 0x39, 0x31, 0x30, 0x31, 0x31, 0x31, 0x32, 0x31, 0x33, 0x31, 0x34, 0x31, 0x35, 0x31, 0x36, 0x31, 0x37, 0x31, 0x38, 0x31, 0x39, 0x32, 0x30, 0x32, 0x31, 0x32, 0x32, 0x32, 0x33, 0x32, 0x34, 0x32, 0x35, 0x32, 0x36, 0x32, 0x37, 0x32, 0x38, 0x32, 0x39, 0x33, 0x30, 0x33, 0x31, 0x33, 0x32, 0x33, 0x33, 0x33, 0x34, 0x33, 0x35, 0x33, 0x36, 0x33, 0x37, 0x33, 0x38, 0x33, 0x39, 0x34, 0x30, 0x34, 0x31, 0x34, 0x32, 0x34, 0x33, 0x34, 0x34, 0x34, 0x35, 0x34, 0x36, 0x34, 0x37, 0x34, 0x38, 0x34, 0x39, 0x35, 0x30, 0x35, 0x31, 0x35, 0x32, 0x35, 0x33, 0x35, 0x34, 0x35, 0x35, 0x35, 0x36, 0x35, 0x37, 0x35, 0x38, 0x35, 0x39, 0x36, 0x30, 0x36, 0x31, 0x36, 0x32, 0x36, 0x33, 0x36, 0x34, 0x36, 0x35, 0x36, 0x36, 0x36, 0x37, 0x36, 0x38, 0x36, 0x39, 0x37, 0x30, 0x37, 0x31, 0x37, 0x32, 0x37, 0x33, 0x37, 0x34, 0x37, 0x35, 0x37, 0x36, 0x37, 0x37, 0x37, 0x38, 0x37, 0x39, 0x38, 0x30, 0x38, 0x31, 0x38, 0x32, 0x38, 0x33, 0x38, 0x34, 0x38, 0x35, 0x38, 0x36, 0x38, 0x37, 0x38, 0x38, 0x38, 0x39, 0x39, 0x30, 0x39, 0x31, 0x39, 0x32, 0x39, 0x33, 0x39, 0x34, 0x39, 0x35, 0x39, 0x36, 0x39, 0x37, 0x39, 0x38, 0x39, 0x39, }; uint64_t top = x / 100000000; uint64_t bottom = x % 100000000; uint64_t toptop = top / 10000; uint64_t topbottom = top % 10000; uint64_t bottomtop = bottom / 10000; uint64_t bottombottom = bottom % 10000; uint64_t toptoptop = toptop / 100; uint64_t toptopbottom = toptop % 100; uint64_t topbottomtop = topbottom / 100; uint64_t topbottombottom = topbottom % 100; uint64_t bottomtoptop = bottomtop / 100; uint64_t bottomtopbottom = bottomtop % 100; uint64_t bottombottomtop = bottombottom / 100; uint64_t bottombottombottom = bottombottom % 100; // memcpy(out, &table[2 * toptoptop], 2); memcpy(out + 2, &table[2 * toptopbottom], 2); memcpy(out + 4, &table[2 * topbottomtop], 2); memcpy(out + 6, &table[2 * topbottombottom], 2); memcpy(out + 8, &table[2 * bottomtoptop], 2); memcpy(out + 10, &table[2 * bottomtopbottom], 2); memcpy(out + 12, &table[2 * bottombottomtop], 2); memcpy(out + 14, &table[2 * bottombottombottom], 2); } It compiles down to dozens of instructions. Could you do better without using a much larger table? It turns out that you can do much better if you have a recent Intel processor with the appropriate AVX-512 instructions (IFMA, VBMI), as demonstrated by an Internet user called InstLatX64. We rely on the observation that you can compute directly the quotient and the remainder of the division using a series of multiplications and shifts (Lemire et al. 2019). The code is a bit technical, but remarkably, it does not require a table. And it generates several times fewer instructions. For the sake of simplicity, I merely provide an implementation using Intel intrinsics. Importantly, you are not expected to follow through with the code, but you should notice that it is rather short. void to_string_avx512ifma(uint64_t n, char *out) { uint64_t n_15_08 = n / 100000000; uint64_t n_07_00 = n % 100000000; __m512i bcstq_h = _mm512_set1_epi64(n_15_08); __m512i bcstq_l = _mm512_set1_epi64(n_07_00); __m512i zmmzero = _mm512_castsi128_si512(_mm_cvtsi64_si128(0x1A1A400)); __m512i zmmTen = _mm512_set1_epi64(10); __m512i asciiZero = _mm512_set1_epi64('0'); __m512i ifma_const = _mm512_setr_epi64(0x00000000002af31dc, 0x0000000001ad7f29b, 0x0000000010c6f7a0c, 0x00000000a7c5ac472, 0x000000068db8bac72, 0x0000004189374bc6b, 0x0000028f5c28f5c29, 0x0000199999999999a); __m512i permb_const = _mm512_castsi128_si512(_mm_set_epi8(0x78, 0x70, 0x68, 0x60, 0x58, 0x50, 0x48, 0x40, 0x38, 0x30, 0x28, 0x20, 0x18, 0x10, 0x08, 0x00)); __m512i lowbits_h = _mm512_madd52lo_epu64(zmmzero, bcstq_h, ifma_const); __m512i lowbits_l = _mm512_madd52lo_epu64(zmmzero, bcstq_l, ifma_const); __m512i highbits_h = _mm512_madd52hi_epu64(asciiZero, zmmTen, lowbits_h); __m512i highbits_l = _mm512_madd52hi_epu64(asciiZero, zmmTen, lowbits_l); __m512i perm = _mm512_permutex2var_epi8(highbits_h, permb_const, highbits_l); __m128i digits_15_0 = _mm512_castsi512_si128(perm); _mm_storeu_si128((__m128i *)out, digits_15_0); } Remarkably, the AVX-512 is 3.5 times faster than the table-based approach: function time per conversion table 8.8 ns AVX-512 2.5 ns I use GCC 9 and an Intel Tiger Lake processor (3.30GHz). My benchmarking code is available. A downside of this nifty approach is that it is (obviously) non-portable. There are still few Intel processors supporting these nifty extensions, and it is currently limited to Intel: no AMD or ARM processor can do the same right now. However, the gain might be sufficient that it is worth the effort deploying it in some instances. Published by [2ca999] Daniel Lemire A computer science professor at the University of Quebec (TELUQ). View all posts by Daniel Lemire Posted on March 28, 2022March 28, 2022Author Daniel LemireCategories Leave a Reply Cancel reply Your email address will not be published. The comment form expects plain text. If you need to format your text, you can use HTML elements such strong, blockquote, cite, code and em. For formatting code as HTML automatically, I recommend tohtml.com. [ ] [ ] [ ] [ ] [ ] [ ] [ ] Comment * [ ] Name * [ ] Email * [ ] Website [ ] [ ] Save my name, email, and website in this browser for the next time I comment. Receive Email Notifications? [no, do not subscribe ] [instantly ] Or, you can subscribe without commenting. [Post Comment] [ ] [ ] [ ] [ ] [ ] [ ] [ ] D[ ] You may subscribe to this blog by email. Post navigation Previous Previous post: Writing out large arrays in Go: binary.Write is inefficient for large arrays Terms of use Proudly powered by WordPress