[HN Gopher] Fast B-Trees
___________________________________________________________________
Fast B-Trees
Author : surprisetalk
Score : 223 points
Date : 2024-10-07 01:42 UTC (21 hours ago)
(HTM) web link (www.scattered-thoughts.net)
(TXT) w3m dump (www.scattered-thoughts.net)
| tekknolagi wrote:
| I thought a lot of the b(+)tree advantage was in bigger-than-RAM
| workloads for large databases, and these benchmarks seem
| relatively small in comparison.
| foota wrote:
| B-Trees are good for in-memory data too, because they have
| fairly good cache behavior.
| robertclaus wrote:
| Ya, I would be curious to see how this performs on out-of-
| cache data on an SSD or an actual hard drive. On the other
| hand, the findings are definitely still relevant, since RAM
| has gotten fairly cheap and most applications probably fit in
| it just fine.
|
| Regarding databases - btrees also have a natural sort order,
| which hash tables don't. This means a btree as your main data
| structure helps with sort, range, or list operations in a way
| hash tables can't. That being said, even traditional databases
| obviously still use hash tables extensively (e.g. hash joins).
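|
| For example, a minimal Rust sketch of the kind of range scan a
| hash table can't offer (illustrative only):
|
|     use std::collections::BTreeMap;
|
|     fn main() {
|         let mut m = BTreeMap::new();
|         for (k, v) in [(1, "a"), (5, "b"), (9, "c")] {
|             m.insert(k, v);
|         }
|         // Ordered iteration and range queries fall out of the
|         // B-tree's sort order; HashMap has no equivalent of
|         // `range`.
|         for (k, v) in m.range(2..=9) {
|             println!("{k}: {v}");
|         }
|     }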
| scotty79 wrote:
| In Rust, thanks to that, you can have a BTreeSet of BTreeSets.
| scotty79 wrote:
| I think my comment was way too short.
|
| What I meant was that I needed a set type in Rust. At first
| I tried to use HashSet, but I needed its elements to also
| be sets. I couldn't, because HashSet doesn't itself implement
| Hash; I'd have to implement that myself.
|
| However, with BTreeSet the requirement on the elements is only
| that they have a natural ordering (implement the Ord trait),
| and BTreeSet itself also implements Ord, so I could have a
| BTreeSet of BTreeSets without implementing anything additional.
| That's thanks to the fact that B-trees have a natural ordering.
|
| So I used BTreeSet instead.
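|
| A minimal sketch of the pattern (illustrative):
|
|     use std::collections::BTreeSet;
|
|     fn main() {
|         // BTreeSet<T> implements Ord whenever T: Ord, so sets
|         // nest directly; HashSet<HashSet<T>> would not compile
|         // because HashSet does not implement Hash.
|         let mut set_of_sets: BTreeSet<BTreeSet<u32>> =
|             BTreeSet::new();
|         set_of_sets.insert(BTreeSet::from([1, 2, 3]));
|         set_of_sets.insert(BTreeSet::from([4, 5]));
|         println!("{set_of_sets:?}");
|     }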
| xxs wrote:
| b-trees are just 'better' binary trees, as they have fewer
| indirections (nodes)
| crest wrote:
| As long as your puny little working set (2^16 small keys) fits
| into L2 cache and is perfectly covered by the L1 dTLB, you
| won't see the cost of touching random pages in a big hash table
| larger than the last-level TLB coverage and on-chip caches.
| There won't be any TLB stalls waiting for the page walkers, and
| you won't miss the lost spatial locality in the key space
| preserved by B(+)trees if everything is in L2 cache. At the
| very least it proves that hash tables can be a good fit for
| point queries of datasets too large for linear searching or
| sorting + binary searches, but not yet large enough to exhaust
| CPU cache capacity.
| marginalia_nu wrote:
| You can line them up with disk blocks or with CPU cache lines;
| the effect is relatively similar.
| lsb wrote:
| Clojure, for example, uses Hash Array Mapped Tries as its
| associative data structure, and those work well
| BeeOnRope wrote:
| To answer a question implied in the article, per-lookup timing
| with rdtscp hurts the hash more than the btree for the same
| reason the hash is hurt by the data-dependent chaining: rdtscp
| is an execution barrier which prevents successive lookups from
| overlapping. rdtsc (no p) isn't, and would probably produce
| quite different timings.
|
| That the btree doesn't benefit from overlapping adjacent
| lookups/inserts is interesting.
|
| I suppose it is because btree access (here) involves data-
| dependent branches, so with random access you'll get about L
| mispredicts per lookup in an L-deep tree, and adjacent lookups
| are separated by at least one mispredict: adjacent lookups can
| overlap execution, but the overlap is useless since everything
| beyond the next mispredict is on the bad path.
|
| That's probably at least true for the small-map regime. For
| larger maps, the next iteration is actually very useful, even
| if on a mispredicted path, because the data accesses are at the
| right locations, so it serves to bring in all the nodes for the
| next iteration. This matters a lot outside of L2. At 5
| instructions per comparison and 32-element nodes, however,
| there are just so many instructions in the window for 1 lookup
| that it's hard to make it to the next iteration.
|
| So b-trees benefit a lot from a tight linear search (e.g. 2
| instructions per check, macro-fused to 1 op), or a branch-free
| linear search, or, far better than those for big nodes, a
| vectorized branch-free search.
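|
| A branch-free linear search is a few lines in Rust (a sketch
| of the idea, not the article's implementation):
|
|     // Count how many keys precede the needle instead of
|     // branching on each comparison; the bool-to-usize cast
|     // typically compiles to branchless arithmetic rather
|     // than a conditional jump.
|     fn node_search(keys: &[u64], needle: u64) -> usize {
|         keys.iter().map(|&k| (k < needle) as usize).sum()
|     }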
| dwattttt wrote:
| > the overlapping is useless since everything beyond the next
| mispredict is useless as it is on the bad path
|
| Is this a consequence of Spectre et al mitigations?
| BeeOnRope wrote:
| No, just a consequence of how mispredicts work: all execution
| after a mispredict is thrown away, though some traces remain
| in the cache, which can be very important for performance
| (and also, of course, for Spectre).
| dwattttt wrote:
| That's the part I was curious about: whether there would've
| been a helpful cache impact, if not for modern Spectre
| prevention.
| starspangled wrote:
| Spectre mitigations don't change that. Some of them do
| require adding speculation barriers or otherwise turning
| off branch prediction in cases where unprivileged state can
| be used to steer mispredicted privileged branches into
| gadgets which create a side channel to privileged data
| through speculative state.
|
| But in general execution (i.e., no privilege-domain
| crossings), this mechanism is not required.
|
| The performance effects of executing mispredicted branches
| (called something like "wrong-path execution" or
| "mispredicted-path ..." in the literature) are interesting
| and have been studied. I don't know what the state of the
| art is, although I've seen results showing both speedups
| and slowdowns (as you would expect with any cache / BP
| kind of topic :P).
| BeeOnRope wrote:
| > Spectre mitigations don't change that, ...
|
| Yes, exactly. To first order, I think Spectre didn't
| really change the performance of existing userspace-only
| code. What slowed down were system calls, kernel code, and
| some things which were recompiled or otherwise adjusted
| to mitigate some aspects of Spectre. There might be a
| rare exception: e.g., IIRC `lfence` slowed down on AMD in
| order to make it more useful as a speculation barrier,
| but this is hardly an instruction that saw much use
| before.
|
| > I don't know what the state of the art is, although
| I've seen results showing both speedups and slowdowns
|
| Yeah. This seems like a pretty cut-and-dried case where
| you'd get a speedup from wrong-path misses, since the
| independent next search will be correctly predicted from
| the start and will access exactly the right nodes, so it
| serves as highly accurate prefetching: it only gets
| thrown out because of a mispredict at the end of the
| _prior_ search.
|
| Something like the misses within a single binary search
| are more ambiguous: for random input the accuracy drops
| off like 0.5^n as you predict n levels deep, but that
| still adds up to ~double the MLP compared to not
| speculating, so in a microbenchmark it tends to look
| good. In the real world, with 1 lookup mixed in with a
| lot of other code, the many cache lines brought in on the
| bad path may be overall worse than inserting a
| speculation barrier yourself.
|
| That's the cool part: we can choose whether we want
| speculation or not, if we know up front whether it's
| harmful.
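|
| On x86_64, "inserting a speculation barrier yourself" can
| be a one-liner (a sketch; whether it's a win is entirely
| workload-dependent):
|
|     #[cfg(target_arch = "x86_64")]
|     fn speculation_barrier() {
|         // lfence waits for all prior instructions to
|         // complete before later ones issue, cutting off
|         // wrong-path memory accesses.
|         unsafe { std::arch::x86_64::_mm_lfence() };
|     }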
| jnordwick wrote:
| Can you be more specific about "all execution after a
| mispredict is thrown away"? Are you saying even non-
| dependent instructions? I thought the CPU just ignored the
| registers resulting from the wrong instruction path and
| started calculating the correct path, and that the way the
| register file worked allowed it to be much better than
| just junking everything.
|
| A little less verbosely: is non-dependent ILP work also
| tossed on a mispredict, even though that won't change?
| ur-whale wrote:
| Unless I'm missing something, title of the article doesn't really
| correlate with its conclusion.
| dpatterbee wrote:
| The title of the article is "Smolderingly fast b-trees".
| Smoldering is (sorta) an antonym of blazing: blazingly fast
| means very fast, so smolderingly fast would mean not very
| fast.
| aidenn0 wrote:
| It's been a while since I last tried things, but I found crit-bit
| trees[1] to be much faster than b-trees. Hash array-mapped tries
| are also good if you don't need the things that trees give you
| (e.g. in-order traversal, get all values in a certain range).
|
| 1: https://cr.yp.to/critbit.html
| fweimer wrote:
| B-trees are supposed to address the bad cache behavior of
| binary trees because they are generally much shallower. Crit-
| bit trees as originally described do not have this property.
| shpongled wrote:
| Something like a radix trie can give you the in-order traversal
| and range query aspect, while still keeping some of the nice
| properties of HAMTs. In practice though (for my domain,
| scientific floating point data), I have found that b-trees and
| simple sorted arrays with binary search (sadly) outperform
| radix tries and cleverer solutions.
| aidenn0 wrote:
| For a very small number of keys, an array with linear search
| can win out.
| josephg wrote:
| I'd be curious to see how performance would change from storing
| b-tree entries in a semi-sorted array, and applying various other
| optimizations from here:
|
| https://en.algorithmica.org/hpc/data-structures/b-tree/
|
| The aggregate performance improvement Sergey Slotin gets from
| applying various "tricks" is insane.
| vlovich123 wrote:
| Notably, I believe his data structures tend to ignore string
| keys because they're less amenable to SIMD. It would be
| interesting to see if his ideas about layout still show
| improvements for strings.
| anonymoushn wrote:
| You can do strings. You probably want to store a bunch of
| string prefixes inline, though, so that when the prefixes are
| not equal you can use essentially the same code that he uses
| for integers.
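|
| A sketch of that idea in Rust (hypothetical layout: zero-
| padded 8-byte prefixes stored inline next to each entry):
|
|     use std::cmp::Ordering;
|
|     // Compare the inline prefixes first (a cheap integer-
|     // style compare); only touch the full strings on a
|     // prefix tie. Assumes prefixes are the first 8 bytes,
|     // zero-padded.
|     fn cmp_keys(a_pre: [u8; 8], a: &str,
|                 b_pre: [u8; 8], b: &str) -> Ordering {
|         match a_pre.cmp(&b_pre) {
|             Ordering::Equal => a.cmp(b),
|             ord => ord,
|         }
|     }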
| rebanevapustus wrote:
| That's how it's done in the Rust stdlib alternative
| https://github.com/brurucy/indexset
|
| Faster reads, slower inserts, but then you get the ability to
| index by position in (almost) O(1). In regular B-trees this
| can only be done in O(n).
| nialv7 wrote:
| I feel like I missed the point of this article. I thought the
| author was trying to prove that b-trees aren't that bad
| compared to hashmaps, but taking 2-3x longer looks pretty bad.
|
| If I need predictable ordering (but not actually sorted keys)
| I will use something like indexmap, not a b-tree.
| magicalhippo wrote:
| The point seems to be that the author found very different
| estimates of just how much worse b-trees would be. As the
| author notes, hashmaps have some less desirable properties as
| well. So the author ran some benchmarks to find out, and ended
| with the following conclusion:
|
| _Overall, I'm unconvinced that it's worth exploring btrees
| further. I'll stick to hashmaps and I'll either iterate in
| insertion order or I'll require sorting entries before
| iterating._
| ww520 wrote:
| The adaptive radix tree is pretty good as well, with support
| for in-order listing and range queries. It can beat a b-tree
| and comes in close behind a hashmap.
| vlowther wrote:
| These days I reach for adaptive radix trees over b-trees when
| I have keys and don't need arbitrary sort orderings. They are
| just that much more CPU- and memory-efficient.
| kibo_money wrote:
| Very interesting! You mentioned memory usage at the end;
| BTreeMaps are actually better than HashMaps most of the time,
| at least in Rust.
|
| Here's a good breakdown:
| https://ntietz.com/blog/rust-hashmap-overhead/
| pclmulqdq wrote:
| Btrees don't waste much memory, while hash tables need excess
| capacity if you want them to go fast.
| marginalia_nu wrote:
| That's true for on-disk b-trees, which typically have large
| node sizes (typically 4KB), but in-memory btrees tend to aim
| for CPU cache lines (typically a small multiple of 32B), and
| thus do actually waste a fair amount of memory: their
| comparatively low branching factor means a relatively large
| number of branch nodes compared to the number of leaves.
| pclmulqdq wrote:
| The waste of abseil's b-tree, for example, is small per
| value: https://abseil.io/about/design/btree
|
| The efficiency compared to hash tables very much does carry
| over to the small b-trees used for in-memory data.
| orlp wrote:
| Why was Rust's hashmap only tested with SipHash? It's known to be
| pretty bad for performance.
|
| I'm biased as the author of course, but try adding a benchmark
| with the Rust hasher + foldhash as well:
| https://github.com/orlp/foldhash.
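|
| Swapping the hasher is a one-line change (a sketch, assuming
| foldhash's published RandomState API and a foldhash dependency
| in Cargo.toml):
|
|     use std::collections::HashMap;
|
|     fn main() {
|         // Same table, different hasher: foldhash's state
|         // replaces the default SipHash RandomState.
|         let mut map: HashMap<u64, u64,
|                              foldhash::fast::RandomState> =
|             HashMap::default();
|         map.insert(1, 2);
|         assert_eq!(map.get(&1), Some(&2));
|     }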
| espadrine wrote:
| They are looking for a data structure that is robust against
| hash flooding attacks like
| https://www.cve.org/CVERecord?id=CVE-2011-4815
|
| You mention that foldhash does not claim to provide HashDoS
| resistance against interactive attackers, so perhaps that
| disqualifies it.
|
| If anything, given this requirement, comparing with wyhash, as
| they do in the article, is misleading.
| vlovich123 wrote:
| XXH3 would have this property and would be drastically
| faster. SipHash is just a bad default choice imho.
| orlp wrote:
| XXH3 does not have this property, no more than foldhash
| does.
| orlp wrote:
| > You mention that foldhash does not claim to provide HashDoS
| resistance against interactive attackers, so perhaps that
| disqualifies it.
|
| The linked CVE is not an interactive attack either FYI, so
| foldhash would be sufficient to protect against that. When I
| say an "interactive attacker" I mean one that analyzes hash
| outcomes (either directly or indirectly through things like
| timing attacks and iteration order) to try and reverse
| engineer the hidden internal state.
|
| > If anything, given this requirement, comparing with wyhash,
| as they do in the article, is misleading.
|
| That is correct. There is no reason to believe wyhash is
| secure against interactive attackers.
| georgehill wrote:
| related: https://github.com/cloudflare/trie-hard
| helltone wrote:
| Possibly off topic, but I was wondering: what are the most
| comprehensive data structure benchmarks out there?
| cmrdporcupine wrote:
| Appreciate the attention to detail on the microbenchmarks.
|
| I'm only skimming through and need to read in more detail
| later, but what I would love to see is a real-world comparison
| against just linear search in a vector: either of associated
| pairs, or two vectors (one for keys, one for values, with
| matching offsets).
|
| My hunch is that people are more often than not reaching for
| hash tables (and sometimes trees) for the _API convenience_ of
| an associative structure, but that on modern architectures
| with decent cache sizes, and for small(ish) data sets, they
| can be outperformed by a simple O(N) lookup.
|
| For example, it would be an interesting experiment to take
| something like the Python runtime (or the JDK etc) and replace
| its dictionary type with vectors -- at least for small
| dictionaries -- and then run through some common applications and
| frameworks and see what impact this has.
| SPascareli13 wrote:
| I tested this very casually some time ago with Go maps, and up
| to around one hundred items the linear search over an array
| was faster than the map lookup. Considering that maps we use
| for convenience often hold fewer than a hundred items, this
| could be useful.
|
| Unfortunately I don't have the results (or the test code)
| anymore, but it shouldn't be hard to do again (casually at
| least).
| tialaramex wrote:
| I expect this experiment provides a small perf win for very
| small N, but that's swamped by the price of deciding whether
| to try this, and in many applications it's also noise compared
| to the perf win from using hash tables for larger N.
|
| A hash table with a very cheap hash (remember, in C++ the out-
| of-the-box hash for integers is usually the identity function)
| will be cheaper for quite modest N, because it's mostly just
| doing less work. I could believe N >= 4, for example.
| cmrdporcupine wrote:
| I think N >= 4 is woefully pessimistic. Contiguous, cached
| memory... it's just so much faster on modern machines.
|
| That, and all the potential for branch-prediction misses in a
| hashtable or tree vs a simple array lookup...
|
| Linear scan be stupid fast.
| tialaramex wrote:
| Linear scan is very fast, but it's not competing against a
| worse scan; it's competing against _not scanning_, which is
| what the hash table is going to do.
|
| At N=4, say we're mapping one 64-bit integer to another for
| some reason, so that's 16 bytes per entry, four entries, 64
| bytes, conveniently sized linear scan table.
|
| The linear scan reads each of the four keys in turn; if one
| matches, that's a hit, and if none do, we missed. Four 64-bit
| reads, four 64-bit compares & branches.
|
| A Swiss table big enough for N=4 has eight slots (128
| bytes) plus 16 bytes of metadata (144 bytes total). Half
| of the slots are empty.
|
| We do a single arithmetic operation on the key, picking the
| only 16 bytes of metadata which exist, then we load that
| metadata, and a single SIMD instruction matches either zero
| or one entry; if it wasn't zero, we load and compare that
| entry.
|
| In practical systems you'll get more leeway because Swiss
| tables want a real hash, and your keys probably aren't
| suitable, so maybe there's a dozen or more ALU operations to
| hash a key - which is a substantial head start for a linear
| scan. But hey, we're talking small amounts of data, who even
| cares about hash quality?
|
| But do measure - I too am interested, at least in a real
| shoot-out between an actual linear scan and a decent hash
| table. If "same API as a hash table but implemented as a
| linear scan" makes sense for some sizes, maybe it's worth
| writing that collection type.
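|
| That collection type is only a few lines to sketch
| (illustrative, names hypothetical):
|
|     // A map with a HashMap-like API backed by a flat Vec and
|     // a linear scan; contiguous memory keeps small-n lookups
|     // cheap.
|     struct VecMap<K, V> {
|         entries: Vec<(K, V)>,
|     }
|
|     impl<K: PartialEq, V> VecMap<K, V> {
|         fn new() -> Self {
|             VecMap { entries: Vec::new() }
|         }
|
|         fn get(&self, key: &K) -> Option<&V> {
|             self.entries
|                 .iter()
|                 .find(|(k, _)| k == key)
|                 .map(|(_, v)| v)
|         }
|
|         fn insert(&mut self, key: K, value: V) -> Option<V> {
|             for (k, v) in self.entries.iter_mut() {
|                 if k == &key {
|                     return Some(std::mem::replace(v, value));
|                 }
|             }
|             self.entries.push((key, value));
|             None
|         }
|     }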
| pjdesno wrote:
| It would be interesting to compare the Python sortedcontainers
| algorithm - I've been using a C++ version of it that works
| quite well.
|
| Note also that nodes in B-trees (and other splitting-based
| data structures) have a mean load factor more like 75%: 50% is
| the minimum load for non-root nodes, right after splitting,
| and 100% is the max, right before splitting.
| BeeOnRope wrote:
| Nice article!
|
| Very cool to see both the "independent" and "serially dependent"
| cases addressed. Microbenchmarks still have lots of ways of
| giving the wrong answer, but looking at both these cases exposes
| one of the big variables which cause that.
|
| In my experience looking at container performance you often pass
| through two distinct regimes (in a microbenchmark!):
|
| Small regime: for small containers, instruction count,
| instruction dependencies, and IPC (including the effect of
| branch misses) dominate.
|
| In this regime the fastest container in a "throughput" sense
| will often be the one with the fewest micro-operations
| (including those executed on the wrong path). Fewer operations
| help both in raw speed and in overlapping more independent
| lookups within the instruction window. Any type of per-lookup
| misprediction is hugely destructive to performance. For random
| keys, this often favors hash tables because they can be
| written to have << 1 mispredict per lookup.
|
| In this small regime the fastest container in a latency sense
| is the one with the shortest critical path from input to
| output, again considering mispredicts. Non-memory instruction
| latency will be very important in this critical path, and
| again mispredicts are very destructive since they usually add
| directly to the critical path (not always!). There are lots of
| tricks to keeping the critical path short, including hashes
| with somewhat higher operation counts but shorter critical
| paths (e.g., a short merge), linear searches which have > 1
| independent stream, etc. If the keys are predictable, hash
| containers can look bad because they tend to have a long data
| dependency from the hash through the lookup to the output.
| Tree-like containers tend to replace those data dependencies
| with control dependencies, so the data-dependent critical path
| can be very short! With random keys, hashes win again because
| mispredicts are so destructive.
|
| Then in the large regime a lot of the same themes repeat, but
| instead of applying to "all instructions" it's mostly about
| memory access. I.e., the winning throughput containers are the
| ones that can get the highest useful MLP (memory-level
| parallelism), and the winning latency containers are the ones
| with the shortest critical path of data-dependent memory
| accesses, mostly ignoring everything else. Instructions still
| matter because MLP is often limited by how many accesses you
| can stuff into the processor's out-of-order execution window,
| and the size of that structure is (roughly speaking) counted
| in instructions. Software prefetching can help a ton with
| stuffing the right memory accesses into that window.
|
| For random keys and "usually hit" workloads, hashes again tend
| to win in this regime, because they can usually get down to 1
| miss per lookup, and that's math the other structures just
| can't overcome. For non-random keys the field is wide open; it
| depends heavily on the key distribution. For lookups which
| often miss, there are plenty of ways to break the 1-miss-per-
| lookup barrier too.
| waynecochran wrote:
| If you can keep all the data in core memory, why not just use
| your favorite balanced binary search tree (BST), like a red-
| black tree? B-trees only help performance, at least in the
| common understanding, when all the data cannot fit in core
| memory (which is why databases use them).
| ismailmaj wrote:
| For cache locality: if the layout is wide instead of deep, you
| avoid many cache misses. For example, over a million keys a
| 32-way B-tree is about 4 levels deep versus about 20 for a
| binary tree, so a lookup touches roughly 4 nodes instead of
| 20.
| winwang wrote:
| Yep -- and more generally, sequential access > random access,
| just a bit less so in memory than on SSD, not even mentioning
| spinning disk.
| scythmic_waves wrote:
| I appreciate write-ups of failed experiments like this.
| They're sorta like null results in science, but for
| engineering. And they can keep others from needlessly walking
| down the same path.
|
| If everyone only wrote about their successes, we'd all have to
| independently rediscover failures behind closed doors.
| winwang wrote:
| It would be really interesting to see a similar study but with
| skip lists! I'd imagine they'd be slower than hashmaps for
| many of the exact same reasons outlined in the article, but
| the numbers would be cool.
___________________________________________________________________
(page generated 2024-10-07 23:01 UTC)