https://www.scattered-thoughts.net/writing/smolderingly-fast-btrees/

Smolderingly fast b-trees

Published 2024-10-06

(This is part of a series on the design of a language. See the list of posts here.)

Many 'scripting' languages use a hashmap for their default associative data-structure (javascript objects, python dicts, etc). Hashtables have a lot of annoying properties:

* Vulnerable to hash flooding.
* If protected against hash flooding by random seeds, the iteration order becomes non-deterministic, which is annoying for snapshot testing, reproducible builds, etc.
* May have to rehash on insert, which produces terrible worst-case latencies for large hashmaps.
* Repeatedly growing large allocations without fragmentation is difficult on wasm targets, because virtual memory tricks are not available and pages can't be unmapped.
* Vector instructions on wasm are limited and there are no AES instructions, which makes many hash functions much slower.

Ordered data-structures like b-trees don't have any of these disadvantages. They are typically slower than hashmaps, but I was surprised to find fairly wide variation in people's expectations of how much slower. So let's compare:

* rust's std::collections::HashMap with siphash
* rust's std::collections::BTreeMap
* zig's std.HashMap with siphash
* zig's std.HashMap with wyhash
* my own bptree with various compile-time options for different experiments

microbenchmarks are hard

Let's start with the dumbest possible benchmark: fill a map with uniformly distributed random integers, measure how many cycles it takes to look up all those integers, and take the mean over many iterations.

```zig
const before = rdtscp();
for (keys) |key| {
    const value_found = map.get(key);
    if (value_found == null) {
        panic("Value not found", .{});
    }
}
const after = rdtscp();
record("lookup_hit_all", map.count(), @divTrunc(after - before, map.count()));
```

lookup_hit_all / uniform u64 (cycles per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rust btree | 46 | 26 | 19 | 54 | 78 | 94 | 106 | 133 | 152 | 166 | 193 | 215 | 236 | 262 | 290 | 316 | 367 |
| rust hashmap (siphash) | 102 | 67 | 49 | 40 | 37 | 34 | 33 | 32 | 32 | 32 | 32 | 33 | 33 | 34 | 37 | 43 | 51 |
| zig b+tree | 46 | 26 | 37 | 56 | 64 | 87 | 100 | 110 | 123 | 143 | 165 | 180 | 197 | 220 | 235 | 269 | 294 |
| zig hashmap (siphash) | 99 | 84 | 64 | 69 | 70 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 69 | 68 | 69 | 73 | 77 |
| zig hashmap (wyhash) | 80 | 35 | 29 | 35 | 34 | 35 | 34 | 33 | 34 | 32 | 33 | 31 | 31 | 32 | 32 | 33 | 34 |

That makes btrees look pretty bad. Much worse, in fact, than the other benchmark that several people pointed me at, where at similar sizes btree lookups are only ~2x slower than hashmap lookups. But don't worry, we can reproduce that too:

```zig
for (keys) |key| {
    const before = rdtscp();
    const value_found = map.get(key);
    const after = rdtscp();
    record("lookup_hit_one", map.count(), after - before);
    if (value_found == null) {
        panic("Value not found", .{});
    }
}
```

lookup_hit_one / uniform u64 (cycles per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rust btree | 47 | 48 | 50 | 82 | 107 | 123 | 135 | 163 | 182 | 200 | 227 | 256 | 279 | 309 | 328 | 358 | 405 |
| rust hashmap (siphash) | 101 | 101 | 101 | 100 | 105 | 103 | 102 | 103 | 103 | 103 | 103 | 108 | 112 | 116 | 124 | 140 | 166 |
| zig b+tree | 45 | 45 | 59 | 72 | 85 | 107 | 126 | 135 | 148 | 170 | 188 | 203 | 223 | 246 | 264 | 292 | 319 |
| zig hashmap (siphash) | 106 | 108 | 116 | 117 | 120 | 121 | 123 | 124 | 124 | 124 | 123 | 124 | 127 | 130 | 132 | 137 | 147 |
| zig hashmap (wyhash) | 53 | 53 | 57 | 63 | 68 | 69 | 72 | 72 | 73 | 71 | 72 | 69 | 76 | 81 | 78 | 87 | 92 |

All we did differently was average over one lookup at a time instead of over many, but somehow that made the hashmaps 2-3x slower! Both of these benchmarks are pretty bad, so let's make better versions of both before trying to explain the differences.
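A note on the timing helper: the rdtscp() used in these benchmarks isn't shown in the post. A minimal sketch of what such a helper might look like in zig on x86_64, which is my assumption rather than the benchmark repo's exact code:

```zig
// A minimal sketch of an rdtscp() helper (my assumption, not necessarily the
// benchmark repo's exact code). rdtscp reads the timestamp counter and is
// partially serializing, which makes it a bit more robust for timing than
// plain rdtsc. It also writes the core id into ecx, which we just clobber.
fn rdtscp() u64 {
    var lo: u32 = undefined;
    var hi: u32 = undefined;
    asm volatile ("rdtscp"
        : [lo] "={eax}" (lo),
          [hi] "={edx}" (hi),
        :
        : "ecx"
    );
    return (@as(u64, hi) << 32) | @as(u64, lo);
}
```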
I only made these better versions for the zig data-structures, because I am lazy.

First, rather than looking the keys up in the same order we inserted them, we'll precompute a random sample (with replacement), using a simple xorshift PRNG (sketched further below):

```zig
const keys_hitting = try map.allocator.alloc(Key, @max(batch_size, count));
var permutation = XorShift64{};
for (keys_hitting) |*key| {
    const i = permutation.next() % count;
    key.* = keys[i];
}
```

Then rather than measuring a single lookup at a time, we'll measure a batch of 256 lookups, to amortize out the overhead of the rdtscp instruction and to pull the panics out of the measurement:

```zig
for (0..@divTrunc(keys_hitting.len, batch_size)) |batch| {
    const keys_batch = keys_hitting[batch * batch_size ..][0..batch_size];
    var values_found: [batch_size]?Key = undefined;

    const before = rdtscp();
    var key: Key = keys_batch[0];
    for (&values_found, 0..) |*value, i| {
        value.* = map.get(key);
        key = keys_batch[(i + 1) % keys_batch.len];
    }
    const after = rdtscp();
    report("lookup_hit_batch", map.count(), @divTrunc(after - before, batch_size));

    for (values_found) |value_found| {
        if (value_found == null) {
            panic("Value not found", .{});
        }
    }
}
```

The results look similar to the results for lookup_hit_all:

lookup_hit_batch / uniform u64 (cycles per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zig b+tree | 6 | 9 | 31 | 47 | 60 | 82 | 102 | 112 | 126 | 147 | 174 | 186 | 204 | 230 | 245 | 276 | 299 |
| zig hashmap (siphash) | 52 | 53 | 61 | 68 | 71 | 72 | 75 | 76 | 76 | 75 | 75 | 76 | 78 | 80 | 81 | 88 | 95 |
| zig hashmap (wyhash) | 29 | 29 | 31 | 35 | 38 | 38 | 40 | 41 | 42 | 40 | 40 | 42 | 41 | 39 | 42 | 46 | 43 |

Now we make one tiny tweak. Instead of iterating through the batch in order, we'll use the value we just looked up to pick the next key.

```zig
for (0..@divTrunc(keys_hitting.len, batch_size)) |batch| {
    const keys_batch = keys_hitting[batch * batch_size ..][0..batch_size];
    var values_found: [batch_size]?Key = undefined;

    const before = rdtscp();
    var key: Key = keys_batch[0];
    for (&values_found, 0..) |*value, i| {
        value.* = map.get(key);
        key = keys_batch[(i + value.*.?) % keys_batch.len]; // <-- we changed this line
    }
    const after = rdtscp();
    report("lookup_hit_chain", map.count(), @divTrunc(after - before, batch_size));

    for (values_found) |value_found| {
        if (value_found == null) {
            panic("Value not found", .{});
        }
    }
}
```

Now we have two benchmarks with the exact same batch size and the exact same instructions to execute. The only difference is that in lookup_hit_chain we've introduced a data dependency between iterations of the loop - we need to know the value before we know which key to look up next. This prevents successful speculative execution of the next loop iteration.

lookup_hit_chain / uniform u64 (cycles per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zig b+tree | 7 | 25 | 37 | 48 | 59 | 82 | 105 | 115 | 126 | 148 | 172 | 186 | 198 | 219 | 240 | 273 | 300 |
| zig hashmap (siphash) | 77 | 79 | 88 | 88 | 90 | 91 | 93 | 94 | 94 | 93 | 95 | 99 | 102 | 103 | 106 | 115 | 130 |
| zig hashmap (wyhash) | 35 | 37 | 44 | 48 | 52 | 52 | 55 | 57 | 55 | 54 | 57 | 63 | 64 | 63 | 70 | 76 | 88 |

The wyhash hashmap halves its lookup time in lookup_hit_batch vs lookup_hit_chain, but the btree doesn't benefit at all.

I don't understand the limits to speculative execution at all, but let's make some mildly informed guesses. At 2^16 keys the whole dataset is about 1mb, which fits comfortably in L2 but is much bigger than L1. Each hashmap lookup costs 1 or maybe 2 L2 cache lookups, each of which has ~20 cycle latency.
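(Here's the sketch of the XorShift64 generator promised above. This is just Marsaglia's xorshift64 with a nonzero default seed; the benchmark repo's actual implementation may differ.)

```zig
// A sketch of an XorShift64 generator (my assumption, not necessarily the
// benchmark repo's exact code): Marsaglia's xorshift64, which requires a
// nonzero seed.
const XorShift64 = struct {
    state: u64 = 0x9E3779B97F4A7C15,

    fn next(self: *XorShift64) u64 {
        var x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        return x;
    }
};
```

Back to those guesses.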
The wyhash fast path for u64 is only a handful of instructions with no branches: bench[0x1014710] <+0>: rorxq $0x20, %rdi, %rdx bench[0x1014716] <+6>: movabsq $-0x18fc812e5f4bd725, %rax ; imm = 0xE7037ED1A0B428DB bench[0x1014720] <+16>: xorq %rax, %rdx bench[0x1014723] <+19>: movabsq $0x1ff5c2923a788d2c, %rcx ; imm = 0x1FF5C2923A788D2C bench[0x101472d] <+29>: xorq %rdi, %rcx bench[0x1014730] <+32>: mulxq %rcx, %rcx, %rdx bench[0x1014735] <+37>: movabsq $-0x5f89e29b87429bd9, %rsi ; imm = 0xA0761D6478BD6427 bench[0x101473f] <+47>: xorq %rcx, %rsi bench[0x1014742] <+50>: xorq %rax, %rdx bench[0x1014745] <+53>: mulxq %rsi, %rcx, %rax bench[0x101474a] <+58>: xorq %rcx, %rax bench[0x101474d] <+61>: retq So while we're waiting for one cache lookup we can start hashing the next key (if we can predict what it is) and maybe even get started on the next cache lookup. The btree at 2^16 keys has 4 levels. There are typically 15-16 keys per node when inserting random keys, so we'll expect to do around 32 comparisons on average before finding our key. The body of that search loop looks like: bench[0x10121b0] <+16>: cmpq %rdx, (%rdi,%rax,8) bench[0x10121b4] <+20>: jae 0x10121c2 ; <+34> at bptree.zig bench[0x10121b6] <+22>: addq $0x1, %rax bench[0x10121ba] <+26>: cmpq %rax, %rsi bench[0x10121bd] <+29>: jne 0x10121b0 ; <+16> [inlined] bench.less_than_u64 + 9 at bench.zig:159:5 So 5 instructions, including two branches, per comparison. At least 160 instructions and 64 branches for the whole lookup. The 16 keys take up 2 cachelines, so we'll average 1.5 cache lookups for each linear key search, plus 1 cache lookup to hit the child/ values array. 10 cache lookups in total for the whole btree lookup. Each of those cache lookups depends on the previous lookup so it'll be hard to speculate correctly, but prefetching might help within a single node. If every cache lookup hit L2 in strict series we'd expect ~200 cycles, but probably some of the earlier nodes are in L1. Anyway, let's leave that rabbit hole alone for now. It's enough to notice that hashmaps benefit from speculative execution between multiple lookups and btrees don't. We can roughly think of lookup_hit_batch as measuring throughput and lookup_hit_chain as measuring latency. All the other benchmarks I've seen only measure one or the other, which starts to explain the disagreement over btree vs hashmap performance. string keys Btrees do a lot of key comparisons. If the keys are integers then they are stored in the btree node, so it doesn't matter too much. But for strings keys we potentially pay for a cache lookup for every comparison. lookup_hit_chain / random strings log2(# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 keys) zig 101 124 144 160 184 219 255 279 310 366 427 470 512 577 700 826 965 b+tree zig hashmap 145 147 154 159 162 163 165 167 168 173 181 186 196 211 247 287 312 siphash zig hashmap 49 50 59 65 68 69 72 74 75 81 88 93 103 119 154 188 219 wyhash Completely random strings is actually pretty unlikely and unfairly benefits the btree, which typically only has to compare the first character of each string to find that they are not equal. But real datasets (eg urls in a log) often have large common prefixes. 
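To mimic that, the next benchmark makes the first 16 characters of every key identical. A sketch of the kind of key generator I mean is below; the exact prefix, lengths, and use of std.Random are my invention for illustration, not code from the benchmark repo:

```zig
const std = @import("std");

// A sketch of generating 'random-ish' string keys: a constant 16-character
// prefix followed by random lowercase letters. The prefix, length, and use of
// std.Random here are my assumptions, not the benchmark repo's code.
fn randomIshKey(allocator: std.mem.Allocator, random: std.Random, len: usize) ![]u8 {
    const prefix = "https://example."; // 16 constant bytes
    std.debug.assert(len >= prefix.len);
    const key = try allocator.alloc(u8, len);
    @memcpy(key[0..prefix.len], prefix);
    for (key[prefix.len..]) |*char| {
        char.* = random.intRangeAtMost(u8, 'a', 'z');
    }
    return key;
}
```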
We can see the difference if we make the first 16 characters constant:

lookup_hit_chain / random-ish strings (cycles per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zig b+tree | 102 | 150 | 221 | 338 | 546 | 618 | 807 | 911 | 1077 | 1393 | 1507 | 1641 | 1812 | 2095 | 2628 | 2848 | 3302 |
| zig hashmap (siphash) | 145 | 147 | 153 | 159 | 162 | 163 | 165 | 167 | 168 | 174 | 182 | 187 | 197 | 210 | 245 | 282 | 313 |
| zig hashmap (wyhash) | 49 | 50 | 59 | 66 | 69 | 69 | 72 | 74 | 74 | 81 | 88 | 93 | 102 | 117 | 154 | 188 | 214 |

Hashmaps don't care, but the btree is really hurting. If we're only supporting strings we can optimize for this case by storing the length of the common prefix within each node, but it's hard to imagine how to generalize that to arbitrary types. What if the key is a tuple of strings? Or a small set?

wasm hashes

Let's try the same in wasm, where hash functions have less access to fast vector instructions. Wasm also doesn't have rdtscp, so times here are in nanoseconds rather than cycles.

lookup_hit_chain / uniform u64 / wasm (ns per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zig b+tree | 5 | 14 | 19 | 22 | 27 | 38 | 45 | 49 | 53 | 59 | 70 | 75 | 80 | 87 | 98 | 111 | 120 |
| zig hashmap (siphash) | 33 | 34 | 37 | 38 | 40 | 40 | 41 | 42 | 41 | 41 | 42 | 43 | 44 | 45 | 46 | 52 | 58 |
| zig hashmap (wyhash) | 29 | 30 | 33 | 35 | 36 | 36 | 37 | 38 | 38 | 37 | 38 | 40 | 40 | 42 | 43 | 47 | 51 |

lookup_hit_chain / random strings / wasm (ns per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zig b+tree | 64 | 77 | 88 | 97 | 117 | 141 | 154 | 167 | 192 | 216 | 269 | 287 | 294 | 308 | 369 | 436 | 500 |
| zig hashmap (siphash) | 65 | 64 | 68 | 70 | 73 | 71 | 72 | 73 | 73 | 75 | 78 | 81 | 84 | 88 | 102 | 118 | 135 |
| zig hashmap (wyhash) | 45 | 45 | 48 | 50 | 52 | 52 | 53 | 54 | 54 | 56 | 59 | 62 | 65 | 69 | 79 | 93 | 105 |

The overall ratios are fairly similar, although wyhash seems to have been penalized a little relative to siphash. Neither the hash functions nor the table scan generates vector instructions at all, even when comparing strings. And, uh, now that I think to look, they don't generate vector instructions on x86 either. So I guess that's a non-issue.

btree tuning

I implemented both btrees and b+trees. I didn't see much difference between them for insert/lookup, so I preferred the b+tree for the easier/faster implementation of scans and range queries.

The rust btreemap fixes the max key count per node at 11. For all the workloads I've tried, the sweet spot seems to be to fix the node size at 512 bytes, which is 31 keys for u64 and 20 keys for strings. Allowing leaves and branches to have different sizes didn't help.

I gained a small speedup by manually laying out the node as key_count, keys, values/children. Zig by default prefers to put key_count at the end of the struct to avoid padding, but we always read key_count first, so it's nice to get some keys on the same cacheline. Maintaining this optimization across different architectures was annoying though, so I rolled it back and it's not reflected in the tests above.

The rust btreemap switches on Ordering. I got a small boost from instead using less_than during the search and calling equal afterwards. For a lookup at 2^16 keys we'll expect to call less_than 32 times and equal once, so it's worth paying for the extra call to equal in exchange for tightening up the inner search loop.

I tried various binary searches.
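For comparison, a conventional binary search over a node's keys looks something like the following sketch, written against the same Key and less_than used elsewhere in this post (not code from the benchmark repo). It takes a hard-to-predict branch on every iteration.

```zig
// A conventional binary search, for contrast with the 'branchless' variant
// below (a sketch, not code from the benchmark repo). Returns the number of
// keys that are less than search_key.
fn searchBinaryStandard(keys: []const Key, search_key: Key) usize {
    var low: usize = 0;
    var high: usize = keys.len;
    while (low < high) {
        const mid = low + (high - low) / 2;
        if (less_than(keys[mid], search_key)) {
            // On random keys this branch is unpredictable, so we pay for a
            // misprediction roughly every other iteration.
            low = mid + 1;
        } else {
            high = mid;
        }
    }
    return low;
}
```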
The best was this 'branchless' variant:

```zig
fn searchBinary(keys: []Key, search_key: Key) usize {
    if (keys.len == 0) return 0;
    var offset: usize = 0;
    var length: usize = keys.len;
    while (length > 1) {
        const half = length / 2;
        const mid = offset + half;
        if (less_than(keys[mid], search_key)) {
            @branchHint(.unpredictable);
            offset = mid;
        }
        length -= half;
    }
    offset += @intFromBool(less_than(keys[offset], search_key));
    return offset;
}
```

The while (length > 1) loop requires a branch, but it's easily predictable. @branchHint(.unpredictable) causes llvm to emit a conditional move for offset = mid:

```
bench[0x101f5e0] <+0>:  movq %rsi, %rax
bench[0x101f5e3] <+3>:  testq %rsi, %rsi
bench[0x101f5e6] <+6>:  je 0x101f616 ; <+54> at bptree.zig
bench[0x101f5e8] <+8>:  xorl %ecx, %ecx
bench[0x101f5ea] <+10>: cmpq $0x1, %rax
bench[0x101f5ee] <+14>: je 0x101f60b ; <+43> [inlined] bench.less_than_u64 at bench.zig:185:5
bench[0x101f5f0] <+16>: movq %rax, %rsi
bench[0x101f5f3] <+19>: shrq %rsi
bench[0x101f5f6] <+22>: leaq (%rcx,%rsi), %r8
bench[0x101f5fa] <+26>: cmpq %rdx, (%rdi,%r8,8)
bench[0x101f5fe] <+30>: cmovbq %r8, %rcx
bench[0x101f602] <+34>: subq %rsi, %rax
bench[0x101f605] <+37>: cmpq $0x1, %rax
bench[0x101f609] <+41>: ja 0x101f5f0 ; <+16> at bptree.zig:334:37
bench[0x101f60b] <+43>: cmpq %rdx, (%rdi,%rcx,8)
bench[0x101f60f] <+47>: adcq $0x0, %rcx
bench[0x101f613] <+51>: movq %rcx, %rax
bench[0x101f616] <+54>: retq
```

Linear search was still faster in most benchmarks though, even with ludicrously large node sizes.

I also tried a btree variant I saw in some paper where leaves are left unsorted and keys are always inserted at the end of the leaf. This saves some memcopying in the common case, but having to sort the leaf before splitting negates the benefit.

outcome

All the benchmarks above are basically best-case scenarios, where we're doing a lot of lookups in a row. If we were just doing one lookup in the middle of some other work then the btree might not be in cache at all, and each of those 2.5 cache lookups per level would go all the way to main memory. That's catastrophic. An open-addressed hashmap, by contrast, will typically only hit 1 or 2 cachelines per lookup, regardless of size.

And while btrees avoid many of the performance edge cases of hashmaps, they also have some cliffs of their own to fall off as comparisons get more expensive and touch more memory, as we saw with the random-ish strings above.

I haven't measured space usage yet, but we can expect it to be worse for btrees. For random keys the typical node occupancy is 50%, minus per-node overhead like key_count, whereas I've been running the zig hashmaps at 80% occupancy. So we can guesstimate that the btrees will use >60% more memory. EDIT: Duh, hashmaps have low occupancy just after doubling, so their average occupancy is actually more like 57%, whereas the btree with random keys sits at 50% and with steady churn will approach 67%.

Space usage is really bad for small maps too. I'd need to add some extra tweaks to allow the btree root node to start small and grow, rather than paying the full 512 bytes for a map with only 1 key.

Overall, I'm unconvinced that it's worth exploring btrees further. I'll stick to hashmaps, and I'll either iterate in insertion order or require sorting entries before iterating. But if anyone else likes the look of this rabbit hole, I left some other ideas untouched:

* Consider the hash function part of the stable api for the compiler. Use an open-addressed hashtable that preserves hash order. Solve hash flooding without seeding the hash, eg by falling back to a tree.
* Hash-array mapped tries have similar O(log(n)) cacheline behaviour to btrees. But if we used open-addressed hashtables at the leaves we could keep the nesting depth pretty low.
* In-memory LSMs with incremental merging work pretty well in differential dataflow, but still have this O(log(n)) cacheline behaviour. But maybe with a hashmap as a secondary index, lookups might be reasonable, and we can consider slow inserts the price to pay for not rehashing.

miscellanea

All the experiments here are using:

* rustc 1.77.2
* zig 0.14.0-dev.1694+3b465ebec
* wasmtime-cli 20.0.2

Running on an intel i7-1260P with:

* efficiency cores disabled
* scaling governor set to performance
* turbo mode disabled
* aslr disabled

Usually I would also use cset to pin experiments to a shielded core, but it seems that recent systemd versions have broken that workflow and I haven't figured out a replacement yet.

The rest of the benchmark setup can be found in jamii/maps. I measured more than just lookups, but btrees have similar performance for every operation because every operation has to find a key first.

Cache latency numbers came from mlc and roughly match claims I found online.

Thanks to Dan Luu for help thinking through cache behaviour.