[HN Gopher] Fast B-Trees
       ___________________________________________________________________
        
       Fast B-Trees
        
       Author : surprisetalk
       Score  : 223 points
       Date   : 2024-10-07 01:42 UTC (21 hours ago)
        
 (HTM) web link (www.scattered-thoughts.net)
 (TXT) w3m dump (www.scattered-thoughts.net)
        
       | tekknolagi wrote:
        | I thought a lot of the b(+)tree advantage was in bigger-
        | than-RAM workloads for large databases, and these benchmarks
        | seem relatively small in comparison.
        
         | foota wrote:
         | B-Trees are good for in memory data too because they have
         | fairly good cache behavior.
        
         | robertclaus wrote:
         | Ya, I would be curious to see how this performs on out-of-cache
         | data on an SSD and actual hard drive. On the other hand, the
         | findings are definitely still relevant since RAM has gotten
         | fairly cheap and most applications probably fit in it just
         | fine.
         | 
          | Regarding databases - btrees also have a natural sort
          | order, which hash tables don't. This means a btree as your
          | main data structure helps with sort, range, or list
          | operations in a way hash tables can't. That being said,
          | even traditional databases obviously still use hash tables
          | extensively (e.g. hash joins).
        
           | scotty79 wrote:
            | In Rust, thanks to this, you can have a BTreeSet of
            | BTreeSets.
        
             | scotty79 wrote:
             | I think my comment was way too short.
             | 
              | What I meant was that I needed a set type in Rust. At
              | first I tried to use HashSet, but I needed its
              | elements to also be sets. I couldn't, because HashSet
              | doesn't itself implement Hash; I'd have needed to
              | implement it myself.
              | 
              | With BTreeSet, however, the requirement on the
              | elements is only that they have a natural ordering
              | (implement the Ord trait), and BTreeSet itself also
              | implements Ord, so I could have a BTreeSet of
              | BTreeSets without implementing anything extra - thanks
              | to the fact that B-trees have a natural ordering.
              | 
              | So I used BTreeSet instead.
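              | 
              | A minimal sketch of what this looks like (my own
              | illustration, not from the original comment):
              | 
              |     use std::collections::BTreeSet;
              |     
              |     fn main() {
              |         // BTreeSet<T> implements Ord whenever T: Ord,
              |         // so sets nest without any extra impls.
              |         let mut outer: BTreeSet<BTreeSet<i32>> =
              |             BTreeSet::new();
              |         outer.insert(BTreeSet::from([1, 2, 3]));
              |         outer.insert(BTreeSet::from([4, 5]));
              |         assert!(outer.contains(&BTreeSet::from([4, 5])));
              |         // A HashSet<HashSet<i32>> would not compile
              |         // here, because HashSet<T> does not implement
              |         // Hash.
              |     }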
        
         | xxs wrote:
          | b-trees are just 'better' binary trees: they have fewer
          | indirections (nodes).
        
         | crest wrote:
          | As long as your puny little working set (2^16 small keys)
          | fits into L2 cache and is perfectly covered by the L1
          | dTLB, you won't see the cost of touching random pages in a
          | big hash table larger than the last-level TLB coverage and
          | on-chip caches. There won't be any TLB stalls waiting for
          | the page walkers, and you won't miss the lost spatial
          | locality in the key space preserved by B(+)trees if
          | everything is in L2 cache. At the very least it proves
          | that hash tables can be a good fit for point queries of
          | datasets too large for linear searching or sorting +
          | binary searches, but not yet large enough to exhaust CPU
          | cache capacity.
        
         | marginalia_nu wrote:
          | You can line them up with disk blocks or with CPU cache
          | lines; the effect is relatively similar.
        
       | lsb wrote:
       | Clojure, for example, uses Hash Array Mapped Tries as its
       | associative data structure, and those work well
        
       | BeeOnRope wrote:
        | To answer a question implied in the article, per-lookup
        | timing with rdtscp hurts the hash more than the btree for
        | the same reason the hash is hurt by the data-dependent
        | chaining: rdtscp is an execution barrier which prevents
        | successive lookups from overlapping. rdtsc (no p) isn't, and
        | would probably produce quite different timings.
       | 
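        | A sketch of the two timing approaches (x86_64 only,
        | illustrative; the function names are mine, not the
        | article's):
        | 
        |     use std::arch::x86_64::{__rdtscp, _rdtsc};
        |     
        |     // rdtscp waits until all prior instructions have
        |     // executed, so back-to-back timed lookups can't overlap.
        |     unsafe fn timed_serialized(lookup: impl FnOnce()) -> u64 {
        |         let mut aux = 0u32;
        |         let start = __rdtscp(&mut aux);
        |         lookup();
        |         __rdtscp(&mut aux) - start
        |     }
        |     
        |     // rdtsc imposes no such barrier: successive lookups can
        |     // overlap, flattering out-of-order-friendly code.
        |     unsafe fn timed_overlapped(lookup: impl FnOnce()) -> u64 {
        |         let start = _rdtsc();
        |         lookup();
        |         _rdtsc() - start
        |     }
        | 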
        | That the btree doesn't benefit from overlapping adjacent
        | lookups/inserts is interesting.
       | 
        | I suppose it is because btree access (here) involves data-
        | dependent branches, and so with random access you'll get
        | about L mispredicts per lookup in an L-deep tree, so
        | adjacent lookups are separated by at least one mispredict:
        | adjacent lookups can overlap execution, but the overlap is
        | useless since everything beyond the next mispredict is on
        | the bad path.
       | 
        | That's probably at least true for the small-map regime. For
        | the larger maps, the next iteration is actually very useful,
        | even if on a mispredicted path, because the data accesses
        | are at the right locations, so it serves to bring in all the
        | nodes for the next iteration. This matters a lot outside of
        | L2. At 5 instructions per comparison and 32-element nodes,
        | however, there are just so many instructions in the window
        | for one lookup that it's hard to make it to the next
        | iteration.
       | 
        | So b-trees benefit a lot from a tight linear search (e.g. 2
        | instructions per check, macro-fused to 1 op), or a branch-
        | free linear search, or, far better than those for big nodes,
        | a vectorized branch-free search.
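        | 
        | A hedged sketch of the branch-free variant (my own
        | illustration; a vectorized version would replace this with
        | SIMD compares plus a movemask/popcount):
        | 
        |     // For a sorted node, the slot index is simply the
        |     // number of keys less than the needle. No data-
        |     // dependent branches at all.
        |     fn node_search_branchfree(keys: &[u64], needle: u64) -> usize {
        |         keys.iter().map(|&k| (k < needle) as usize).sum()
        |     }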
        
         | dwattttt wrote:
         | > the overlapping is useless since everything beyond the next
         | mispredict is useless as it is on the bad path
         | 
         | Is this a consequence of Spectre et al mitigations?
        
           | BeeOnRope wrote:
           | No, just a consequence of how mispredicts work: all execution
           | after a mispredict is thrown away: though some traces remain
           | in the cache, which can be very important for performance
           | (and also, of course, Spectre).
        
             | dwattttt wrote:
             | That's the part I was curious about; whether there would've
             | been a helpful cache impact, if not for modern Spectre
             | prevention.
        
               | starspangled wrote:
                | Spectre mitigations don't change that. Some of them
                | do require adding speculation barriers, or otherwise
                | turning off branch prediction, in cases where
                | unprivileged state can be used to steer mispredicted
                | privileged branches into gadgets which create a side
                | channel to privileged data via speculative state.
                | 
                | But in general execution (i.e., no privilege-domain
                | crossings), this mechanism is not required.
               | 
                | The performance effects of executing mispredicted
                | branches (called something like "wrong-path
                | execution" or "mispredicted-path ..." in the
                | literature) are interesting and have been studied. I
                | don't know what the state of the art is, although
                | I've seen results showing both speedups and
                | slowdowns (as you would expect with any cache / BP
                | kind of topic :P).
        
               | BeeOnRope wrote:
               | > Spectre mitigations don't change that, ...
               | 
               | Yes, exactly. To the first order I think Spectre didn't
               | really change the performance of existing userspace-only
               | code. What slowed down was system calls, kernel code and
               | some things which were recompiled or otherwise adjusted
                | to mitigate some aspects of Spectre. There might be
                | a rare exception, e.g., IIRC `lfence` slowed down on
                | AMD in order to make it more useful as a speculation
                | barrier, but this is hardly an instruction that saw
                | much use before.
               | 
               | > I don't know what the state of the art is, although
               | I've seen results showing both speedups and slowdowns
               | 
               | Yeah. This seems like a pretty cut and dry case where
               | you'd get a speedup from wrong-path misses, since the
               | independent next search will be correctly predicted from
               | the start and access exactly the right nodes, so it
               | serves as highly accurate prefetching: it only gets
               | thrown out because of a mispredict at the end of the
               | _prior_ search.
               | 
               | Something like the misses within a single binary search
               | are more ambiguous: for random input the accuracy drops
               | off like 0.5^n as you predict n levels deep, but that
               | still adds up to ~double MLP compared to not speculating,
               | so in a microbenchmark it tends to look good. In the real
               | world with 1 lookup mixed in with a lot of other code,
               | the many cache lines brought in on the bad path may be
               | overall worse than inserting a speculation barrier
               | yourself.
               | 
               | That's the cool part: we can choose whether we want
               | speculation or not if we know up front if it's harmful.
        
             | jnordwick wrote:
              | Can you be more specific about "all execution after a
              | mispredict is thrown away"? Are you saying even non-
              | dependent instructions? I thought the CPU just ignored
              | the registers resulting from the wrong instruction
              | path and started calculating the correct path, and
              | that the way the register file worked allowed it to be
              | much better than just junking everything.
              | 
              | A little less verbosely: is non-dependent ILP work
              | also tossed on a mispredict, even though its results
              | wouldn't change?
        
       | ur-whale wrote:
        | Unless I'm missing something, the title of the article
        | doesn't really match its conclusion.
        
         | dpatterbee wrote:
         | The title of the article is "Smolderingly fast b-trees".
          | Smoldering is (sorta) an antonym of blazing: blazingly
          | fast means very fast, so smolderingly fast would mean not
          | very fast.
        
       | aidenn0 wrote:
       | It's been a while since I last tried things, but I found crit-bit
       | trees[1] to be much faster than b-trees. Hash array-mapped tries
       | are also good if you don't need the things that trees give you
       | (e.g. in-order traversal, get all values in a certain range).
       | 
       | 1: https://cr.yp.to/critbit.html
        
         | fweimer wrote:
         | B-trees are supposed to address the bad cache behavior of
         | binary trees because they are generally much shallower. Crit-
         | bit trees as originally described do not have this property.
        
         | shpongled wrote:
         | Something like a radix trie can give you the in-order traversal
         | and range query aspect, while still keeping some of the nice
         | properties of HAMTs. In practice though (for my domain,
         | scientific floating point data), I have found that b-trees and
         | simple sorted arrays with binary search (sadly) outperform
         | radix tries and cleverer solutions.
        
           | aidenn0 wrote:
            | For a very small number of keys, an array with linear
            | search can win out.
        
       | josephg wrote:
       | I'd be curious to see how performance would change from storing
       | b-tree entries in a semi-sorted array, and applying various other
       | optimizations from here:
       | 
       | https://en.algorithmica.org/hpc/data-structures/b-tree/
       | 
        | The aggregate performance improvements Sergey Slotin gets
        | from applying various "tricks" are insane.
        
         | vlovich123 wrote:
          | Notably, I believe his data structures tend to ignore
          | string keys because they're less amenable to SIMD. It
          | would be interesting to see if his ideas about layout
          | still show improvements for strings.
        
           | anonymoushn wrote:
           | You can do strings. You probably want to store a bunch of
           | string prefixes inline though, so that when the prefixes are
           | not equal, you can use essentially the same code that he uses
           | for integers.
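            | 
            | A hedged sketch of the idea (my own illustration): keep
            | an 8-byte prefix inline with each key so that most
            | comparisons are a single integer compare, falling back
            | to the full string only when prefixes tie:
            | 
            |     use std::cmp::Ordering;
            |     
            |     // Big-endian so integer order matches byte order.
            |     fn prefix8(s: &str) -> u64 {
            |         let b = s.as_bytes();
            |         let mut p = [0u8; 8];
            |         let n = b.len().min(8);
            |         p[..n].copy_from_slice(&b[..n]);
            |         u64::from_be_bytes(p)
            |     }
            |     
            |     fn cmp_keys(a: &str, b: &str) -> Ordering {
            |         match prefix8(a).cmp(&prefix8(b)) {
            |             // Rare slow path: prefixes tie, compare fully.
            |             Ordering::Equal => a.cmp(b),
            |             ord => ord,
            |         }
            |     }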
        
         | rebanevapustus wrote:
          | That's how it's done in the Rust stdlib alternative
          | https://github.com/brurucy/indexset
          | 
          | Faster reads, slower inserts, but then you get the
          | capability of indexing by position in (almost) O(1). In
          | regular B-Trees this can only happen in O(n).
        
       | nialv7 wrote:
        | I feel I missed the point of this article. I thought the
        | author was trying to prove that b-trees aren't that bad
        | compared to hashmaps. But taking 2-3x longer looks pretty
        | bad.
        | 
        | If I need predictable ordering (but not actually sorted
        | keys) I will use something like indexmap, not a b-tree.
        
         | magicalhippo wrote:
          | The point seems to be that the author found very
          | different estimates of just how much worse b-trees would
          | be. As the author notes, hashmaps have some less
          | desirable properties as well. So the author ran some
          | benchmarks to find out, ending with the following
          | conclusion:
         | 
          |  _Overall, I'm unconvinced that it's worth exploring
          | btrees further. I'll stick to hashmaps and I'll either
          | iterate in insertion order or I'll require sorting entries
          | before iterating._
        
       | ww520 wrote:
        | Adaptive radix trees are pretty good as well, with support
        | for in-order listing and range queries. They can beat
        | b-trees and come in close behind hashmaps.
        
         | vlowther wrote:
          | These days I reach for adaptive radix trees over b-trees
          | when I have keys and don't need arbitrary sort orderings.
          | They are just that much more CPU- and memory-efficient.
        
       | kibo_money wrote:
        | Very interesting! You mentioned memory usage at the end;
        | BTreeMaps actually do better than HashMaps most of the
        | time, at least in Rust.
        | 
        | Here's a good breakdown:
        | https://ntietz.com/blog/rust-hashmap-overhead/
        
         | pclmulqdq wrote:
         | Btrees don't waste much memory, while hash tables have to have
         | excess capacity if you want them to go fast.
        
           | marginalia_nu wrote:
            | That's true for on-disk b-trees, which typically have
            | large node sizes (typically 4KB), but in-memory btrees
            | tend to aim for CPU cache lines (typically a small
            | multiple of 64B), and thus do actually waste a fair
            | amount of memory with their comparatively low branching
            | factor, and thus relatively large number of branch nodes
            | compared to their number of leaves.
        
             | pclmulqdq wrote:
             | The waste of abseil's b-tree, for example, is small per
             | value: https://abseil.io/about/design/btree
             | 
             | The efficiency compared to hash tables very much does carry
             | over to the small b-trees used for in-memory data.
        
       | orlp wrote:
       | Why was Rust's hashmap only tested with SipHash? It's known to be
       | pretty bad for performance.
       | 
       | I'm biased as the author of course, but try adding a benchmark
       | with the Rust hasher + foldhash as well:
       | https://github.com/orlp/foldhash.
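        | 
        | A minimal sketch of swapping in the hasher (assuming
        | foldhash's fast::RandomState, a BuildHasher; check the
        | crate docs):
        | 
        |     use std::collections::HashMap;
        |     
        |     fn main() {
        |         let mut map: HashMap<u64, u64, foldhash::fast::RandomState> =
        |             HashMap::with_hasher(foldhash::fast::RandomState::default());
        |         map.insert(1, 2);
        |         assert_eq!(map.get(&1), Some(&2));
        |     }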
        
         | espadrine wrote:
         | They are looking for a data structure that is robust against
         | hash flooding attacks like
         | https://www.cve.org/CVERecord?id=CVE-2011-4815
         | 
         | You mention that foldhash does not claim to provide HashDoS
         | resistance against interactive attackers, so perhaps that
         | disqualifies it.
         | 
         | If anything, given this requirement, comparing with wyhash, as
         | they do in the article, is misleading.
        
           | vlovich123 wrote:
           | Xxh3 would have this property and would be drastically
           | faster. Siphash is just a bad default choice imho.
        
             | orlp wrote:
             | XXH3 does not have this property, no more than foldhash
             | does.
        
           | orlp wrote:
           | > You mention that foldhash does not claim to provide HashDoS
           | resistance against interactive attackers, so perhaps that
           | disqualifies it.
           | 
           | The linked CVE is not an interactive attack either FYI, so
           | foldhash would be sufficient to protect against that. When I
           | say an "interactive attacker" I mean one that analyzes hash
           | outcomes (either directly or indirectly through things like
           | timing attacks and iteration order) to try and reverse
           | engineer the hidden internal state.
           | 
           | > If anything, given this requirement, comparing with wyhash,
           | as they do in the article, is misleading.
           | 
           | That is correct. There is no reason to believe wyhash is
           | secure against interactive attackers.
        
       | georgehill wrote:
       | related: https://github.com/cloudflare/trie-hard
        
       | helltone wrote:
       | Possibly off topic, but I was wondering: what are the most
       | comprehensive data structure benchmarks out there?
        
       | cmrdporcupine wrote:
       | Appreciate the attention to detail on the microbenchmarks.
       | 
       | Skimming through, need to read in more detail later, but what I
       | would love to see is a real world comparison against just linear
        | search in a vector - either of associated pairs, or two
        | vectors (one for keys, one for values, with matching
        | offsets).
       | 
       | My hunch is that people in general are more often than not
       | reaching for hash-tables (and sometimes trees) for the _API
       | convenience_ of an associative structure -- but that on modern
       | architectures with decent cache sizes and for small(ish) data
       | sets they can be outperformed by a simple O(N) lookup.
       | 
       | For example, it would be an interesting experiment to take
       | something like the Python runtime (or the JDK etc) and replace
       | its dictionary type with vectors -- at least for small
       | dictionaries -- and then run through some common applications and
       | frameworks and see what impact this has.
        
         | SPascareli13 wrote:
          | I tested this very casually some time ago with Go maps,
          | and up to around a hundred items, linear search on an
          | array was faster than a map lookup. Considering that many
          | times when we use maps for convenience they have fewer
          | than a hundred items, this could be useful.
         | 
         | Unfortunately I don't have the results (or the test code)
         | anymore, but it shouldn't be hard to do again (casually at
         | least).
        
         | tialaramex wrote:
         | I expect this experiment provides a small perf win for very
         | small N, but that's swamped by the price of deciding whether to
         | try this and in many applications it's also noise compared to
         | the perf win from using hash tables for larger N.
         | 
          | A hash table with a very cheap hash (remember, in C++ the
          | out-of-the-box hash for integers is usually the identity
          | function) will be cheaper for quite modest N because it's
          | mostly just doing less work. I could believe N >= 4, for
          | example.
        
           | cmrdporcupine wrote:
            | I think N >= 4 is woefully pessimistic. Contiguous,
            | cached memory... it's just so much faster on modern
            | machines.
           | 
           | That and all the potential for branch prediction misses in a
           | hashtable or tree vs a simple array lookup...
           | 
           | Linear scan be stupid fast.
        
             | tialaramex wrote:
              | Linear scan is very fast, but it's not competing
              | against a worse scan; it's competing against _not
              | scanning_, which is what the hash table is going to
              | do.
             | 
             | At N=4, say we're mapping one 64-bit integer to another for
             | some reason, so that's 16 bytes per entry, four entries, 64
             | bytes, conveniently sized linear scan table.
             | 
              | The linear scan reads each of the four keys in turn;
              | if one matches, that's a hit; if none do, we missed.
              | Four 64-bit reads, four 64-bit compares & branches.
             | 
              | A Swiss table big enough for N=4 has eight slots (128
              | bytes) plus 16 bytes of metadata (total 144 bytes).
              | Half of the slots are empty.
             | 
             | We do a single arithmetic operation on the key, picking the
             | only 16 bytes of metadata which exist, then we load that
             | data, and a single SIMD instruction matches either zero or
             | one entry, and if it wasn't zero we load and compare that
             | entry.
             | 
             | In practical systems you'll get more leeway because Swiss
             | tables want a real hash, your keys probably aren't suitable
             | so maybe there's a dozen or more ALU operations to hash a
             | key - which is a substantial head start for a linear scan.
             | But hey we're talking small amounts of data, who even cares
             | about hash quality?
             | 
              | But do measure - I too am interested, at least in a
              | real shoot-out between an actual linear scan and a
              | decent hash table. If "same API as a hash table but
              | implementation as linear scan" makes sense for some
              | sizes, maybe it's worth writing that collection type,
              | as sketched below.
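              | 
              | A hedged sketch of that collection type (my own
              | illustration of "hash-map API, linear-scan
              | implementation"):
              | 
              |     // O(n) lookups, but over contiguous memory with
              |     // no hashing and trivially predictable branches.
              |     struct VecMap<K: PartialEq, V> {
              |         entries: Vec<(K, V)>,
              |     }
              |     
              |     impl<K: PartialEq, V> VecMap<K, V> {
              |         fn new() -> Self {
              |             Self { entries: Vec::new() }
              |         }
              |     
              |         fn get(&self, key: &K) -> Option<&V> {
              |             self.entries
              |                 .iter()
              |                 .find(|(k, _)| k == key)
              |                 .map(|(_, v)| v)
              |         }
              |     
              |         fn insert(&mut self, key: K, value: V) -> Option<V> {
              |             match self.entries.iter_mut().find(|(k, _)| k == key) {
              |                 Some((_, v)) => Some(std::mem::replace(v, value)),
              |                 None => {
              |                     self.entries.push((key, value));
              |                     None
              |                 }
              |             }
              |         }
              |     }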
        
       | pjdesno wrote:
       | It would be interesting to compare the Python sortedcontainers
       | algorithm - I've been using a C++ version of it that works quite
       | well.
       | 
        | Note also that nodes in B-trees (and other splitting-based
        | data structures) have a mean load factor more like 75%: 50%
        | is the minimum load for non-root nodes (right after
        | splitting) and 100% is the max (right before splitting).
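        | 
        | (A rough sanity check, assuming node occupancy is uniformly
        | distributed between those bounds: expected load factor =
        | (0.5 + 1.0) / 2 = 0.75.)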
        
       | BeeOnRope wrote:
       | Nice article!
       | 
       | Very cool to see both the "independent" and "serially dependent"
       | cases addressed. Microbenchmarks still have lots of ways of
       | giving the wrong answer, but looking at both these cases exposes
       | one of the big variables which cause that.
       | 
       | In my experience looking at container performance you often pass
       | through two distinct regimes (in a microbenchmark!):
       | 
        | Small regime: for small containers, instruction count,
        | instruction dependencies and IPC (including the effect of
        | branch misses) dominate.
       | 
        | In this regime the fastest container in a "throughput"
        | sense will often be the one with the fewest micro-operations
        | (including those executed on the wrong path). Fewer
        | operations help both in raw speed and in overlapping
        | multiple independent lookups within the instruction window.
        | Any type of per-lookup misprediction is hugely destructive
        | to performance. For random keys, this often favors hash
        | tables because they can be written to have << 1 mispredict
        | per lookup.
       | 
        | In this small regime the fastest container in a latency
        | sense is the one with the shortest critical path from input
        | to output, again considering mispredicts. The latencies of
        | non-memory instructions will be very important in this
        | critical path, and again mispredicts are very destructive
        | since they usually add directly to the critical path (not
        | always!). There are lots of tricks to keeping the critical
        | path short, including hashes with somewhat higher operation
        | counts but shorter critical paths (e.g., a short merge),
        | linear searches which have > 1 independent stream, etc. If
        | the keys are predictable, hash containers can look bad
        | because they tend to have a long data dependency from the
        | hash through the lookup to the output. Tree-like containers
        | tend to replace those with control dependencies, so the
        | data-dependent critical path can be very short! With random
        | keys, hashes win again because mispredicts are so
        | destructive.
       | 
       | Then in the large regime, a lot of the same themes repeat but
       | instead of applying to "all instructions", it's mostly about
       | memory access. I.e., the winning throughput containers are the
       | ones that can get the highest useful MLP, and the winning latency
       | containers are the ones with the shortest critical path of data-
       | dependent memory accesses, mostly ignoring everything else.
        | Instructions still matter because MLP is often limited by
        | how many accesses you can stuff into the processor's out-
        | of-order execution window, and the size of that structure
        | is (roughly speaking) counted in instructions. Software
        | prefetching can help a ton with stuffing the right memory
        | accesses into that window.
       | 
       | For random keys and "usually hit", hashes again tend to win in
       | this regime, because they can usually get down to 1 miss per
       | lookup, and that's math the other structures just can't overcome.
        | For non-random keys, the field is wide open; it depends
        | heavily on the key distribution. For lookups which often
        | miss, there are plenty of ways to break the 1-miss-per-
        | lookup barrier too.
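        | 
        | A hedged sketch of the software-prefetching point (my own
        | illustration, x86_64 only): keep several misses in flight
        | by prefetching a few iterations ahead of the access stream.
        | 
        |     use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        |     
        |     // Sum values at random indices, prefetching AHEAD
        |     // iterations early so multiple cache misses overlap.
        |     fn gather_sum(values: &[u64], idxs: &[usize]) -> u64 {
        |         const AHEAD: usize = 8;
        |         let mut sum = 0u64;
        |         for i in 0..idxs.len() {
        |             if i + AHEAD < idxs.len() {
        |                 let p = &values[idxs[i + AHEAD]] as *const u64;
        |                 unsafe { _mm_prefetch::<_MM_HINT_T0>(p as *const i8) };
        |             }
        |             sum = sum.wrapping_add(values[idxs[i]]);
        |         }
        |         sum
        |     }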
        
       | waynecochran wrote:
        | If you can keep all the data in core memory, why not just
        | use your favorite balanced binary search tree (BST), like a
        | red-black tree? B-trees only help performance, at least in
        | the common understanding, when all the data cannot fit in
        | core memory (which is why databases use them).
        
         | ismailmaj wrote:
          | For cache locality: if the layout is wider instead of
          | deeper, you can avoid many cache misses.
        
           | winwang wrote:
            | Yep -- and more generally, sequential access > random
            | access, just a bit less so in memory than on SSD, to
            | say nothing of spinning disks.
        
       | scythmic_waves wrote:
        | I appreciate write-ups of failed experiments like this.
        | They're sort of like null results in science, but for
        | engineering. And they can save others from needlessly
        | walking down the same path.
        | 
        | If everyone only wrote about their successes, we'd all have
        | to independently rediscover failures behind closed doors.
        
       | winwang wrote:
        | It would be really interesting to see a similar study but
        | with skip lists! I'd imagine they would be slower than
        | hashmaps for many of the exact same reasons outlined in the
        | article, but the numbers would be cool.
        
       ___________________________________________________________________
       (page generated 2024-10-07 23:01 UTC)