[HN Gopher] Parsing 8-bit integers quickly
___________________________________________________________________
Parsing 8-bit integers quickly
Author : usdogu
Score : 60 points
Date : 2023-11-28 20:19 UTC (2 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| nasretdinov wrote:
| I imagine that you are not allowed to allocate a constant array
| that would contain a mapping between ASCII values of integers and
| the actual ints :)? The're just 255 of them needed. Or woukd it
| be slower?
| jstanley wrote:
| How would you actually implement that?
|
| The naive approach where you just index with the ASCII values
| you have would have to contain 16 million elements since it
| needs to look at 3 bytes.
|
| Anything more space efficient than that would probably be
| slower than TFA.
| kazinator wrote:
| Here is a table-driven idea. #include
| <stdio.h> int bcd2dec[] = { [0x000] = 0,
| [0x001] = 1, [0x002] = 2, [0x003] = 3, [0x004] = 4,
| [0x005] = 5, [0x006] = 6, [0x007] = 7, [0x008] = 8, [0x009] =
| 9, [0x010] = 10, [0x011] = 11, [0x012] = 12, [0x013]
| = 13, [0x014] = 14, [0x015] = 15, [0x016] = 16,
| [0x017] = 17, [0x018] = 18, [0x019] = 19, [0x020] =
| 20, [0x021] = 21, [0x022] = 22, [0x023] = 23, [0x024] = 24,
| [0x025] = 25, [0x026] = 26, [0x027] = 27, [0x028] = 28,
| [0x029] = 29, [0x030] = 30, [0x031] = 31, [0x032] =
| 32, [0x033] = 33, [0x034] = 34, [0x035] = 35, [0x036]
| = 36, [0x037] = 37, [0x038] = 38, [0x039] = 39,
| [0x040] = 40, [0x041] = 41, [0x042] = 42, [0x043] = 43,
| [0x044] = 44, [0x045] = 45, [0x046] = 46, [0x047] =
| 47, [0x048] = 48, [0x049] = 49, [0x050] = 50, [0x051]
| = 51, [0x052] = 52, [0x053] = 53, [0x054] = 54,
| [0x055] = 55, [0x056] = 56, [0x057] = 57, [0x058] = 58,
| [0x059] = 59, [0x060] = 60, [0x061] = 61, [0x062] =
| 62, [0x063] = 63, [0x064] = 64, [0x065] = 65, [0x066]
| = 66, [0x067] = 67, [0x068] = 68, [0x069] = 69,
| [0x070] = 70, [0x071] = 71, [0x072] = 72, [0x073] = 73,
| [0x074] = 74, [0x075] = 75, [0x076] = 76, [0x077] =
| 77, [0x078] = 78, [0x079] = 79, [0x080] = 80, [0x081]
| = 81, [0x082] = 82, [0x083] = 83, [0x084] = 84,
| [0x085] = 85, [0x086] = 86, [0x087] = 87, [0x088] = 88,
| [0x089] = 89, [0x090] = 90, [0x091] = 91, [0x092] =
| 92, [0x093] = 93, [0x094] = 94, [0x095] = 95, [0x096]
| = 96, [0x097] = 97, [0x098] = 98, [0x099] = 99,
| [0x100] = 100, [0x101] = 101, [0x102] = 102, [0x103] = 103,
| [0x104] = 104, [0x105] = 105, [0x106] = 106, [0x107]
| = 107, [0x108] = 108, [0x109] = 109, [0x110] = 110,
| [0x111] = 111, [0x112] = 112, [0x113] = 113, [0x114] = 114,
| [0x115] = 115, [0x116] = 116, [0x117] = 117, [0x118] = 118,
| [0x119] = 119, [0x120] = 120, [0x121] = 121, [0x122]
| = 122, [0x123] = 123, [0x124] = 124, [0x125] = 125,
| [0x126] = 126, [0x127] = 127, [0x128] = 128, [0x129] = 129,
| [0x130] = 130, [0x131] = 131, [0x132] = 132, [0x133] = 133,
| [0x134] = 134, [0x135] = 135, [0x136] = 136, [0x137]
| = 137, [0x138] = 138, [0x139] = 139, [0x140] = 140,
| [0x141] = 141, [0x142] = 142, [0x143] = 143, [0x144] = 144,
| [0x145] = 145, [0x146] = 146, [0x147] = 147, [0x148] = 148,
| [0x149] = 149, [0x150] = 150, [0x151] = 151, [0x152]
| = 152, [0x153] = 153, [0x154] = 154, [0x155] = 155,
| [0x156] = 156, [0x157] = 157, [0x158] = 158, [0x159] = 159,
| [0x160] = 160, [0x161] = 161, [0x162] = 162, [0x163] = 163,
| [0x164] = 164, [0x165] = 165, [0x166] = 166, [0x167]
| = 167, [0x168] = 168, [0x169] = 169, [0x170] = 170,
| [0x171] = 171, [0x172] = 172, [0x173] = 173, [0x174] = 174,
| [0x175] = 175, [0x176] = 176, [0x177] = 177, [0x178] = 178,
| [0x179] = 179, [0x180] = 180, [0x181] = 181, [0x182]
| = 182, [0x183] = 183, [0x184] = 184, [0x185] = 185,
| [0x186] = 186, [0x187] = 187, [0x188] = 188, [0x189] = 189,
| [0x190] = 190, [0x191] = 191, [0x192] = 192, [0x193] = 193,
| [0x194] = 194, [0x195] = 195, [0x196] = 196, [0x197]
| = 197, [0x198] = 198, [0x199] = 199, [0x200] = 200,
| [0x201] = 201, [0x202] = 202, [0x203] = 203, [0x204] = 204,
| [0x205] = 205, [0x206] = 206, [0x207] = 207, [0x208] = 208,
| [0x209] = 209, [0x210] = 210, [0x211] = 211, [0x212]
| = 212, [0x213] = 213, [0x214] = 214, [0x215] = 215,
| [0x216] = 216, [0x217] = 217, [0x218] = 218, [0x219] = 219,
| [0x220] = 220, [0x221] = 221, [0x222] = 222, [0x223] = 223,
| [0x224] = 224, [0x225] = 225, [0x226] = 226, [0x227]
| = 227, [0x228] = 228, [0x229] = 229, [0x230] = 230,
| [0x231] = 231, [0x232] = 232, [0x233] = 233, [0x234] = 234,
| [0x235] = 235, [0x236] = 236, [0x237] = 237, [0x238] = 238,
| [0x239] = 239, [0x240] = 240, [0x241] = 241, [0x242]
| = 242, [0x243] = 243, [0x244] = 244, [0x245] = 245,
| [0x246] = 246, [0x247] = 247, [0x248] = 248, [0x249] = 249,
| [0x250] = 250, [0x251] = 251, [0x252] = 252, [0x253] = 253,
| [0x254] = 254, [0x255] = 255, }; /*
| * If the input is an empty string or a string of length four
| or more, * we return -1. Other than that, we assume we
| have digits. */ int parse_uint8_bcd(const char
| *str) { if (!str[0]) return -1;
| if (!str[1]) return str[0] & 0xF; if
| (!str[2]) return bcd2dec[(str[0] & 0xF) << 4 |
| (str[1] & 0xF)]; if (!str[3]) return
| bcd2dec[(str[0] & 0xF) << 8 | (str[1] & 0xF) << 4 | (str[2] &
| 0xF)]; return -1; } int main(int
| argc, char **argv) { if (argv[1])
| printf("value of %s is %d\n", argv[1],
| parse_uint8_bcd(argv[1])); return 0; }
| zamadatix wrote:
| For most CPUs you can do multiple lookups at once and have
| multiple in flight. This would allow you to use 256 bytes
| which should be able to stay in lower level cache once you
| start doing any meaningful number of conversions.
|
| So not necessarily enormous and dog slow... but the real test
| is put it with the rest of your real code and see what
| happens. If you come to a certain answer it may not stay the
| same for another real bit of code.
| nasretdinov wrote:
| Yeah you're right, I didn't consider that. However even if
| you allocate a 16 mln array you'll only need to access ~26
| different cache lines, so it might still be ok? That's
| assuming that my calculation is correct that numbers will be
| grouped in tens, e.g. 240-249 would be on the same cache line
| as their numeric values are different by just 10
| pxx wrote:
| That is likely to be extremely slow.
| https://news.ycombinator.com/item?id=37823805
|
| Also, the 8-bit lookup table only saves you a couple
| subtractions and conditional moves from the naive solution?
| That seems like one of the worst tradeoffs to make.
| gpvos wrote:
| You'd need a very sparse array of 2^(3*8) =~ 16 million
| entries, not 256, since you're going in the other direction. A
| hashtable might work.
| wholesomepotato wrote:
| Would use XOR 0x30303030lu, then OR with a value shifted left
| 12 bits, take bits 24..12 and lookup in 4k pre-computed lookup
| array.
| kazinator wrote:
| $ cat parse256.c #include <stdio.h> #include
| <stdint.h> #include <string.h> int
| parse_uint8_fastswar(const char *str, size_t len, uint8_t *num) {
| union { uint8_t as_str[4]; uint32_t as_int;
| } digits; memcpy(&digits.as_int, str, sizeof(digits));
| digits.as_int ^= 0x30303030lu; digits.as_int <<= (4 -
| (len & 0x3)) * 8; uint32_t y = ((UINT64_C(0x640a0100) *
| digits.as_int)>>32)&0xff; *num = (uint8_t)(y);
| return (digits.as_int & 0xf0f0f0f0) == 0 && y < 256 && len != 0
| && len < 4; } int main(int argc, char **argv)
| { if (argv[1]) { uint8_t val = 0; if
| (parse_uint8_fastswar(argv[1], strlen(argv[1]), &val)) {
| printf("value of %s is %d\n", argv[1], (int) val); }
| else { printf("%s is invalid, value stored %d\n",
| argv[1], (int) val); } } return 0;
| } $ ./parse256 123 123 is invalid, value stored 225
| $ uname -a Linux gcc1-power7.osuosl.org
| 3.10.0-862.14.4.el7.ppc64 #1 SMP Wed Sep 26 20:38:32 GMT 2018
| ppc64 ppc64 ppc64 GNU/Linux
|
| Oops!
| zimpenfish wrote:
| ~ $ ./a.out 123 value of 123 is 123 ~ $ uname
| -a Darwin CyanistesCyanus.local 23.2.0 Darwin Kernel
| Version 23.2.0: Thu Nov 9 06:28:34 PST 2023; root:xnu-
| 10002.60.71.505.1~3/RELEASE_ARM64_T8103 arm64 ~ $ gcc
| -v Apple clang version 15.0.0 (clang-1500.0.40.1)
| Target: arm64-apple-darwin23.2.0 Thread model: posix
| InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolch
| ains/XcodeDefault.xctoolchain/usr/bin
| bangonkeyboard wrote:
| These gotchas make keeping a big-endian machine around totally
| worth it.
| dataflow wrote:
| I'm guessing there isn't any way to simulate big endian on
| x86?
| jstanley wrote:
| I tried this out but it parsed the string "123\n" as 32.
|
| Also it parses "400" as 144, when the reference implementation
| considers it not-a-uint8, but I don't mind so much about that.
|
| EDIT: Ah, I think it assumes the string contains _only_ a uint8,
| rather than trying to parse a uint8 from the start of a string.
| So you need to zero out the "\n" separately, and then it works.
| Sharlin wrote:
| Like the article says, it uses SWAR and _requires_ a buffer of
| at least four bytes, with the trailing bytes zeroed. Your
| strings were both four bytes which saved you; had you tried
| with "42" or something, that would've been undefined behavior!
|
| Edit: No, wait, the `len` parameter is there to ensure that
| trailing bytes don't matter. But note that it must be the
| length of the number-containing prefix part, not the whole
| input string!
|
| > Also it parses "400" as 144
|
| Did you check the return value? If false (as it is in case of
| overflow), the result is meaningless.
| jstanley wrote:
| Yes I checked the return value. Example using the program
| from https://news.ycombinator.com/item?id=38451560
| $ ./parse256 400 value of 400 is 144
| firebaze wrote:
| Looking forward to "parsing bit sequences in roman literals
| quickly"
|
| https://hn.algolia.com/?q=lemire+parsing
| nvartolomei wrote:
| dlemire, you note that the read "overflows". Why can't you copy
| just `len` bytes? Does it slow too much because of the
| branch/more load/store operations?
| vinkelhake wrote:
| The SWAR algorithm accepts the 6 ASCII characters after '9'.
| It'll parse ":>" as 114. int res =
| parse_uint8_fastswar(":>\0", 2, &num);
|
| Returns true and num is 114.
| zokier wrote:
| The use of union here is bit confusing (I think its
| unnecessary?), although I don't imagine it making any difference
| in the generated code.
| mwkaufma wrote:
| That's the accepted method in c to do well-defined type-
| punning.
| vinkelhake wrote:
| But the `as_str` field is never used. The type punning (and
| all the wonderful endian issues) is already happening in the
| memcpy.
| mwkaufma wrote:
| Oh yeah, wheee, I'm not carefully reading. Maybe it's
| vestigial leftovers from a port from c to c++ (major impls
| handle it in C++ like in C but the standard insists it's
| UB).
| progmetaldev wrote:
| Agreed, as someone that isn't very familiar with C, but has
| used many C-like languages, the names "as_str" and "as_int"
| threw me off.
| amelius wrote:
| Fetching the data should be the bottleneck (by far), so why is
| the naive approach 2x slower than this smarter approach?
|
| Sounds like the CPU should be designed in a smarter way, not the
| code.
| IshKebab wrote:
| Presumably this also only works well if the data is 4-byte
| aligned.
___________________________________________________________________
(page generated 2023-11-28 23:00 UTC)