https://www.scattered-thoughts.net/writing/smolderingly-fast-btrees/

Smolderingly fast b-trees

Published 2024-10-06

(This is part of a series on the design of a language. See the list of posts here.)

Many 'scripting' languages use a hashmap for their default associative data-structure (javascript objects, python dicts, etc). Hashtables have a lot of annoying properties:

* Vulnerable to hash flooding.
* If protected against hash flooding by random seeds, the iteration order becomes non-deterministic, which is annoying for snapshot testing, reproducible builds, etc.
* May have to rehash on insert, which produces terrible worst-case latencies for large hashmaps.
* Repeatedly growing large allocations without fragmentation is difficult on wasm targets, because virtual memory tricks are not available and pages can't be unmapped.
* Vector instructions on wasm are limited and there are no AES instructions, which makes many hash functions much slower.

Ordered data-structures like b-trees don't have any of these disadvantages. They are typically slower than hashmaps, but I was surprised to find fairly wide variation in people's expectations of how much slower. So let's compare:

* rust's std::collections::HashMap with siphash
* rust's std::collections::BTreeMap
* zig's std.HashMap with siphash
* zig's std.HashMap with wyhash
* my own bptree with various compile-time options for different experiments

microbenchmarks are hard

Let's start with the dumbest possible benchmark: fill a map with uniformly distributed random integers, measure how many cycles it takes to look up all those integers, and take the mean over many iterations.

```zig
const before = rdtscp();
for (keys) |key| {
    const value_found = map.get(key);
    if (value_found == null) {
        panic("Value not found", .{});
    }
}
const after = rdtscp();
record("lookup_hit_all", map.count(), @divTrunc(after - before, map.count()));
```

lookup_hit_all / uniform u64 (cycles per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rust btree | 46 | 26 | 19 | 54 | 78 | 94 | 106 | 133 | 152 | 166 | 193 | 215 | 236 | 262 | 290 | 316 | 367 |
| rust hashmap (siphash) | 102 | 67 | 49 | 40 | 37 | 34 | 33 | 32 | 32 | 32 | 32 | 33 | 33 | 34 | 37 | 43 | 51 |
| zig b+tree | 46 | 26 | 37 | 56 | 64 | 87 | 100 | 110 | 123 | 143 | 165 | 180 | 197 | 220 | 235 | 269 | 294 |
| zig hashmap (siphash) | 99 | 84 | 64 | 69 | 70 | 68 | 68 | 68 | 68 | 68 | 68 | 68 | 69 | 68 | 69 | 73 | 77 |
| zig hashmap (wyhash) | 80 | 35 | 29 | 35 | 34 | 35 | 34 | 33 | 34 | 32 | 33 | 31 | 31 | 32 | 32 | 33 | 34 |

That makes btrees look pretty bad. Much worse, in fact, than the other benchmark that several people pointed me at, where at similar sizes btree lookups are only ~2x slower than hashmap lookups. But don't worry, we can reproduce that too:

```zig
for (keys) |key| {
    const before = rdtscp();
    const value_found = map.get(key);
    const after = rdtscp();
    record("lookup_hit_one", map.count(), after - before);
    if (value_found == null) {
        panic("Value not found", .{});
    }
}
```

lookup_hit_one / uniform u64 (cycles per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rust btree | 47 | 48 | 50 | 82 | 107 | 123 | 135 | 163 | 182 | 200 | 227 | 256 | 279 | 309 | 328 | 358 | 405 |
| rust hashmap (siphash) | 101 | 101 | 101 | 100 | 105 | 103 | 102 | 103 | 103 | 103 | 103 | 108 | 112 | 116 | 124 | 140 | 166 |
| zig b+tree | 45 | 45 | 59 | 72 | 85 | 107 | 126 | 135 | 148 | 170 | 188 | 203 | 223 | 246 | 264 | 292 | 319 |
| zig hashmap (siphash) | 106 | 108 | 116 | 117 | 120 | 121 | 123 | 124 | 124 | 124 | 123 | 124 | 127 | 130 | 132 | 137 | 147 |
| zig hashmap (wyhash) | 53 | 53 | 57 | 63 | 68 | 69 | 72 | 72 | 73 | 71 | 72 | 69 | 76 | 81 | 78 | 87 | 92 |

All we did differently was average over one lookup at a time instead of over many, but somehow that made the hashmaps 2-3x slower! Both of these benchmarks are pretty bad, so let's make better versions of both before trying to explain the differences.
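A note on the timing helper: the rdtscp() used in these benchmarks isn't shown in the post. A minimal sketch of what such a helper might look like in zig on x86_64, which is my assumption rather than the benchmark repo's exact code:

```zig
// A minimal sketch of an rdtscp() helper (my assumption, not necessarily the
// benchmark repo's exact code). rdtscp reads the timestamp counter and is
// partially serializing, which makes it a bit more robust for timing than
// plain rdtsc. It also writes the core id into ecx, which we just clobber.
fn rdtscp() u64 {
    var lo: u32 = undefined;
    var hi: u32 = undefined;
    asm volatile ("rdtscp"
        : [lo] "={eax}" (lo),
          [hi] "={edx}" (hi),
        :
        : "ecx"
    );
    return (@as(u64, hi) << 32) | @as(u64, lo);
}
```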
I only made these better versions for the zig data-structures, because I am lazy.

First, rather than looking the keys up in the same order we inserted them, we'll precompute a random sample (with replacement), using a simple xorshift PRNG (sketched further below):

```zig
const keys_hitting = try map.allocator.alloc(Key, @max(batch_size, count));
var permutation = XorShift64{};
for (keys_hitting) |*key| {
    const i = permutation.next() % count;
    key.* = keys[i];
}
```

Then rather than measuring a single lookup at a time, we'll measure a batch of 256 lookups, to amortize out the overhead of the rdtscp instruction and to pull the panics out of the measurement:

```zig
for (0..@divTrunc(keys_hitting.len, batch_size)) |batch| {
    const keys_batch = keys_hitting[batch * batch_size ..][0..batch_size];
    var values_found: [batch_size]?Key = undefined;

    const before = rdtscp();
    var key: Key = keys_batch[0];
    for (&values_found, 0..) |*value, i| {
        value.* = map.get(key);
        key = keys_batch[(i + 1) % keys_batch.len];
    }
    const after = rdtscp();
    report("lookup_hit_batch", map.count(), @divTrunc(after - before, batch_size));

    for (values_found) |value_found| {
        if (value_found == null) {
            panic("Value not found", .{});
        }
    }
}
```

The results look similar to the results for lookup_hit_all:

lookup_hit_batch / uniform u64 (cycles per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zig b+tree | 6 | 9 | 31 | 47 | 60 | 82 | 102 | 112 | 126 | 147 | 174 | 186 | 204 | 230 | 245 | 276 | 299 |
| zig hashmap (siphash) | 52 | 53 | 61 | 68 | 71 | 72 | 75 | 76 | 76 | 75 | 75 | 76 | 78 | 80 | 81 | 88 | 95 |
| zig hashmap (wyhash) | 29 | 29 | 31 | 35 | 38 | 38 | 40 | 41 | 42 | 40 | 40 | 42 | 41 | 39 | 42 | 46 | 43 |

Now we make one tiny tweak. Instead of iterating through the batch in order, we'll use the value we just looked up to pick the next key.

```zig
for (0..@divTrunc(keys_hitting.len, batch_size)) |batch| {
    const keys_batch = keys_hitting[batch * batch_size ..][0..batch_size];
    var values_found: [batch_size]?Key = undefined;

    const before = rdtscp();
    var key: Key = keys_batch[0];
    for (&values_found, 0..) |*value, i| {
        value.* = map.get(key);
        key = keys_batch[(i + value.*.?) % keys_batch.len]; // <-- we changed this line
    }
    const after = rdtscp();
    report("lookup_hit_chain", map.count(), @divTrunc(after - before, batch_size));

    for (values_found) |value_found| {
        if (value_found == null) {
            panic("Value not found", .{});
        }
    }
}
```

Now we have two benchmarks with the exact same batch size and the exact same instructions to execute. The only difference is that in lookup_hit_chain we've introduced a data dependency between iterations of the loop - we need to know the value before we know which key to look up next. This prevents successful speculative execution of the next loop iteration.

lookup_hit_chain / uniform u64 (cycles per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zig b+tree | 7 | 25 | 37 | 48 | 59 | 82 | 105 | 115 | 126 | 148 | 172 | 186 | 198 | 219 | 240 | 273 | 300 |
| zig hashmap (siphash) | 77 | 79 | 88 | 88 | 90 | 91 | 93 | 94 | 94 | 93 | 95 | 99 | 102 | 103 | 106 | 115 | 130 |
| zig hashmap (wyhash) | 35 | 37 | 44 | 48 | 52 | 52 | 55 | 57 | 55 | 54 | 57 | 63 | 64 | 63 | 70 | 76 | 88 |

The wyhash hashmap halves its lookup time in lookup_hit_batch vs lookup_hit_chain, but the btree doesn't benefit at all.

I don't understand the limits to speculative execution at all, but let's make some mildly informed guesses. At 2^16 keys the whole dataset is about 1mb, which fits comfortably in L2 but is much bigger than L1. Each hashmap lookup costs 1 or maybe 2 L2 cache lookups, each of which has ~20 cycle latency.
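(Here's the sketch of the XorShift64 generator promised above. This is just Marsaglia's xorshift64 with a nonzero default seed; the benchmark repo's actual implementation may differ.)

```zig
// A sketch of an XorShift64 generator (my assumption, not necessarily the
// benchmark repo's exact code): Marsaglia's xorshift64, which requires a
// nonzero seed.
const XorShift64 = struct {
    state: u64 = 0x9E3779B97F4A7C15,

    fn next(self: *XorShift64) u64 {
        var x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        return x;
    }
};
```

Back to those guesses.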
The wyhash fast path for u64 is only a handful of instructions with no branches: bench[0x1014710] <+0>: rorxq $0x20, %rdi, %rdx bench[0x1014716] <+6>: movabsq $-0x18fc812e5f4bd725, %rax ; imm = 0xE7037ED1A0B428DB bench[0x1014720] <+16>: xorq %rax, %rdx bench[0x1014723] <+19>: movabsq $0x1ff5c2923a788d2c, %rcx ; imm = 0x1FF5C2923A788D2C bench[0x101472d] <+29>: xorq %rdi, %rcx bench[0x1014730] <+32>: mulxq %rcx, %rcx, %rdx bench[0x1014735] <+37>: movabsq $-0x5f89e29b87429bd9, %rsi ; imm = 0xA0761D6478BD6427 bench[0x101473f] <+47>: xorq %rcx, %rsi bench[0x1014742] <+50>: xorq %rax, %rdx bench[0x1014745] <+53>: mulxq %rsi, %rcx, %rax bench[0x101474a] <+58>: xorq %rcx, %rax bench[0x101474d] <+61>: retq So while we're waiting for one cache lookup we can start hashing the next key (if we can predict what it is) and maybe even get started on the next cache lookup. The btree at 2^16 keys has 4 levels. There are typically 15-16 keys per node when inserting random keys, so we'll expect to do around 32 comparisons on average before finding our key. The body of that search loop looks like: bench[0x10121b0] <+16>: cmpq %rdx, (%rdi,%rax,8) bench[0x10121b4] <+20>: jae 0x10121c2 ; <+34> at bptree.zig bench[0x10121b6] <+22>: addq $0x1, %rax bench[0x10121ba] <+26>: cmpq %rax, %rsi bench[0x10121bd] <+29>: jne 0x10121b0 ; <+16> [inlined] bench.less_than_u64 + 9 at bench.zig:159:5 So 5 instructions, including two branches, per comparison. At least 160 instructions and 64 branches for the whole lookup. The 16 keys take up 2 cachelines, so we'll average 1.5 cache lookups for each linear key search, plus 1 cache lookup to hit the child/ values array. 10 cache lookups in total for the whole btree lookup. Each of those cache lookups depends on the previous lookup so it'll be hard to speculate correctly, but prefetching might help within a single node. If every cache lookup hit L2 in strict series we'd expect ~200 cycles, but probably some of the earlier nodes are in L1. Anyway, let's leave that rabbit hole alone for now. It's enough to notice that hashmaps benefit from speculative execution between multiple lookups and btrees don't. We can roughly think of lookup_hit_batch as measuring throughput and lookup_hit_chain as measuring latency. All the other benchmarks I've seen only measure one or the other, which starts to explain the disagreement over btree vs hashmap performance. string keys Btrees do a lot of key comparisons. If the keys are integers then they are stored in the btree node, so it doesn't matter too much. But for strings keys we potentially pay for a cache lookup for every comparison. lookup_hit_chain / random strings log2(# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 keys) zig 101 124 144 160 184 219 255 279 310 366 427 470 512 577 700 826 965 b+tree zig hashmap 145 147 154 159 162 163 165 167 168 173 181 186 196 211 247 287 312 siphash zig hashmap 49 50 59 65 68 69 72 74 75 81 88 93 103 119 154 188 219 wyhash Completely random strings is actually pretty unlikely and unfairly benefits the btree, which typically only has to compare the first character of each string to find that they are not equal. But real datasets (eg urls in a log) often have large common prefixes. 
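To mimic that, the next benchmark makes the first 16 characters of every key identical. A sketch of the kind of key generator I mean is below; the exact prefix, lengths, and use of std.Random are my invention for illustration, not code from the benchmark repo:

```zig
const std = @import("std");

// A sketch of generating 'random-ish' string keys: a constant 16-character
// prefix followed by random lowercase letters. The prefix, length, and use of
// std.Random here are my assumptions, not the benchmark repo's code.
fn randomIshKey(allocator: std.mem.Allocator, random: std.Random, len: usize) ![]u8 {
    const prefix = "https://example."; // 16 constant bytes
    std.debug.assert(len >= prefix.len);
    const key = try allocator.alloc(u8, len);
    @memcpy(key[0..prefix.len], prefix);
    for (key[prefix.len..]) |*char| {
        char.* = random.intRangeAtMost(u8, 'a', 'z');
    }
    return key;
}
```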
We can see the difference if we make the first 16 characters constant:

lookup_hit_chain / random-ish strings (cycles per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zig b+tree | 102 | 150 | 221 | 338 | 546 | 618 | 807 | 911 | 1077 | 1393 | 1507 | 1641 | 1812 | 2095 | 2628 | 2848 | 3302 |
| zig hashmap (siphash) | 145 | 147 | 153 | 159 | 162 | 163 | 165 | 167 | 168 | 174 | 182 | 187 | 197 | 210 | 245 | 282 | 313 |
| zig hashmap (wyhash) | 49 | 50 | 59 | 66 | 69 | 69 | 72 | 74 | 74 | 81 | 88 | 93 | 102 | 117 | 154 | 188 | 214 |

Hashmaps don't care, but the btree is really hurting. If we're only supporting strings we can optimize for this case by storing the length of the common prefix within each node, but it's hard to imagine how to generalize that to arbitrary types. What if the key is a tuple of strings? Or a small set?

wasm hashes

Let's try the same in wasm, where hash functions have less access to fast vector instructions. Wasm also doesn't have rdtscp, so times here are in nanoseconds rather than cycles.

lookup_hit_chain / uniform u64 / wasm (ns per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zig b+tree | 5 | 14 | 19 | 22 | 27 | 38 | 45 | 49 | 53 | 59 | 70 | 75 | 80 | 87 | 98 | 111 | 120 |
| zig hashmap (siphash) | 33 | 34 | 37 | 38 | 40 | 40 | 41 | 42 | 41 | 41 | 42 | 43 | 44 | 45 | 46 | 52 | 58 |
| zig hashmap (wyhash) | 29 | 30 | 33 | 35 | 36 | 36 | 37 | 38 | 38 | 37 | 38 | 40 | 40 | 42 | 43 | 47 | 51 |

lookup_hit_chain / random strings / wasm (ns per lookup)

| log2(#keys) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zig b+tree | 64 | 77 | 88 | 97 | 117 | 141 | 154 | 167 | 192 | 216 | 269 | 287 | 294 | 308 | 369 | 436 | 500 |
| zig hashmap (siphash) | 65 | 64 | 68 | 70 | 73 | 71 | 72 | 73 | 73 | 75 | 78 | 81 | 84 | 88 | 102 | 118 | 135 |
| zig hashmap (wyhash) | 45 | 45 | 48 | 50 | 52 | 52 | 53 | 54 | 54 | 56 | 59 | 62 | 65 | 69 | 79 | 93 | 105 |

The overall ratios are fairly similar, although wyhash seems to have been penalized a little relative to siphash. Neither the hash functions nor the table scan generates vector instructions at all, even when comparing strings. And, uh, now that I think to look, they don't generate vector instructions on x86 either. So I guess that's a non-issue.

btree tuning

I implemented both btrees and b+trees. I didn't see much difference between them for insert/lookup, so I preferred the b+tree for the easier/faster implementation of scans and range queries.

The rust btreemap fixes the max key count per node at 11. For all the workloads I've tried, the sweet spot seems to be to fix the node size at 512 bytes, which is 31 keys for u64 and 20 keys for strings. Allowing leaves and branches to have different sizes didn't help.

I gained a small speedup by manually laying out the node as key_count, keys, values/children. Zig by default prefers to put key_count at the end of the struct to avoid padding, but we always read key_count first, so it's nice to get some keys on the same cacheline. Maintaining this optimization across different architectures was annoying though, so I rolled it back and it's not reflected in the tests above.

The rust btreemap switches on Ordering. I got a small boost from instead using less_than during the search and calling equal afterwards. For a lookup at 2^16 keys we'll expect to call less_than 32 times and equal once, so it's worth paying for the extra call to equal in exchange for tightening up the inner search loop.

I tried various binary searches.
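For comparison, a conventional binary search over a node's keys looks something like the following sketch, written against the same Key and less_than used elsewhere in this post (not code from the benchmark repo). It takes a hard-to-predict branch on every iteration.

```zig
// A conventional binary search, for contrast with the 'branchless' variant
// below (a sketch, not code from the benchmark repo). Returns the number of
// keys that are less than search_key.
fn searchBinaryStandard(keys: []const Key, search_key: Key) usize {
    var low: usize = 0;
    var high: usize = keys.len;
    while (low < high) {
        const mid = low + (high - low) / 2;
        if (less_than(keys[mid], search_key)) {
            // On random keys this branch is unpredictable, so we pay for a
            // misprediction roughly every other iteration.
            low = mid + 1;
        } else {
            high = mid;
        }
    }
    return low;
}
```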
The best was this 'branchless' variant:

```zig
fn searchBinary(keys: []Key, search_key: Key) usize {
    if (keys.len == 0) return 0;
    var offset: usize = 0;
    var length: usize = keys.len;
    while (length > 1) {
        const half = length / 2;
        const mid = offset + half;
        if (less_than(keys[mid], search_key)) {
            @branchHint(.unpredictable);
            offset = mid;
        }
        length -= half;
    }
    offset += @intFromBool(less_than(keys[offset], search_key));
    return offset;
}
```

The while (length > 1) loop requires a branch, but it's easily predictable. @branchHint(.unpredictable) causes llvm to emit a conditional move for offset = mid:

```
bench[0x101f5e0] <+0>:  movq %rsi, %rax
bench[0x101f5e3] <+3>:  testq %rsi, %rsi
bench[0x101f5e6] <+6>:  je 0x101f616 ; <+54> at bptree.zig
bench[0x101f5e8] <+8>:  xorl %ecx, %ecx
bench[0x101f5ea] <+10>: cmpq $0x1, %rax
bench[0x101f5ee] <+14>: je 0x101f60b ; <+43> [inlined] bench.less_than_u64 at bench.zig:185:5
bench[0x101f5f0] <+16>: movq %rax, %rsi
bench[0x101f5f3] <+19>: shrq %rsi
bench[0x101f5f6] <+22>: leaq (%rcx,%rsi), %r8
bench[0x101f5fa] <+26>: cmpq %rdx, (%rdi,%r8,8)
bench[0x101f5fe] <+30>: cmovbq %r8, %rcx
bench[0x101f602] <+34>: subq %rsi, %rax
bench[0x101f605] <+37>: cmpq $0x1, %rax
bench[0x101f609] <+41>: ja 0x101f5f0 ; <+16> at bptree.zig:334:37
bench[0x101f60b] <+43>: cmpq %rdx, (%rdi,%rcx,8)
bench[0x101f60f] <+47>: adcq $0x0, %rcx
bench[0x101f613] <+51>: movq %rcx, %rax
bench[0x101f616] <+54>: retq
```

Linear search was still faster in most benchmarks though, even with ludicrously large node sizes.

I also tried a btree variant I saw in some paper where leaves are left unsorted and keys are always inserted at the end of the leaf. This saves some memcopying in the common case, but having to sort the leaf before splitting negates the benefit.

outcome

All the benchmarks above are basically best-case scenarios, where we're doing a lot of lookups in a row. If we were just doing one lookup in the middle of some other work then the btree might not be in cache at all, and each of those 2.5 cache lookups per level would go all the way to main memory. That's catastrophic. An open-addressed hashmap, by contrast, will typically only hit 1 or 2 cachelines per lookup, regardless of size.

And while btrees avoid many of the performance edge cases of hashmaps, they also have some cliffs of their own to fall off as comparisons get more expensive and touch more memory, as we saw with the random-ish strings above.

I haven't measured space usage yet, but we can expect it to be worse for btrees. For random keys the typical node occupancy is 50%, minus per-node overhead like key_count, whereas I've been running the zig hashmaps at 80% occupancy. So we can guesstimate that the btrees will use >60% more memory. EDIT: Duh, hashmaps have low occupancy just after doubling, so their average occupancy is actually more like 57%, whereas the btree with random keys sits at 50% and with steady churn will approach 67%.

Space usage is really bad for small maps too. I'd need to add some extra tweaks to allow the btree root node to start small and grow, rather than paying the full 512 bytes for a map with only 1 key.

Overall, I'm unconvinced that it's worth exploring btrees further. I'll stick to hashmaps, and I'll either iterate in insertion order or require sorting entries before iterating. But if anyone else likes the look of this rabbit hole, I left some other ideas untouched:

* Consider the hash function part of the stable api for the compiler. Use an open-addressed hashtable that preserves hash order. Solve hash flooding without seeding the hash, eg by falling back to a tree.
* Hash-array mapped tries have similar O(log(n)) cacheline behaviour to btrees. But if we used open-addressed hashtables at the leaves we could keep the nesting depth pretty low.
* In-memory LSMs with incremental merging work pretty well in differential dataflow, but still have this O(log(n)) cacheline behaviour. But maybe with a hashmap as a secondary index, lookups might be reasonable, and we can consider slow inserts the price to pay for not rehashing.

miscellanea

All the experiments here are using:

* rustc 1.77.2
* zig 0.14.0-dev.1694+3b465ebec
* wasmtime-cli 20.0.2

Running on an intel i7-1260P with:

* efficiency cores disabled
* scaling governor set to performance
* turbo mode disabled
* aslr disabled

Usually I would also use cset to pin experiments to a shielded core, but it seems that recent systemd versions have broken that workflow and I haven't figured out a replacement yet.

The rest of the benchmark setup can be found in jamii/maps. I measured more than just lookups, but btrees have similar performance for every operation because every operation has to find a key first.

Cache latency numbers came from mlc and roughly match claims I found online.

Thanks to Dan Luu for help thinking through cache behaviour.