[HN Gopher] B-tree Path Hints
___________________________________________________________________
B-tree Path Hints
Author : eatonphil
Score : 230 points
Date : 2021-07-30 14:46 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ccleve wrote:
| It's a cool idea, but you have to keep in mind that binary
| searches are not optimal for b-tree nodes unless the node is
| really big. If you have a smaller node, < 32 or 64 elements, a
| simple linear search is faster. The reason has something to do
| with processor cache, branch mis-prediction, and pipelining.
|
| A linear search can also be optimized to use SIMD, which is a big
| win.
|
| If you really want to use a path hint, you could just say "start
| my linear search at the point where I left off last time", and if
| the first element is too big, wrap around to the beginning. But
| that might not buy you much.
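|
| A minimal sketch in Go (the language of the linked library) of
| the two node-search strategies; names and sizes here are mine,
| not the repo's:
|
|     package main
|
|     import (
|         "fmt"
|         "sort"
|     )
|
|     // linearSearch scans a small sorted slice front to back;
|     // for nodes of a few dozen keys the branch-predictable,
|     // cache-friendly scan often wins.
|     func linearSearch(keys []int, target int) int {
|         for i, k := range keys {
|             if k >= target {
|                 return i
|             }
|         }
|         return len(keys)
|     }
|
|     // binarySearch is the classic O(log n) alternative, which
|     // only pays off once the node is fairly large.
|     func binarySearch(keys []int, target int) int {
|         return sort.Search(len(keys), func(i int) bool {
|             return keys[i] >= target
|         })
|     }
|
|     func main() {
|         keys := []int{2, 5, 9, 14, 21, 30} // one small node
|         fmt.Println(linearSearch(keys, 14)) // 3
|         fmt.Println(binarySearch(keys, 14)) // 3
|     }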
| ardit33 wrote:
| I agree... the whole point of a b-tree is to subdivide the
| dataset into smaller, manageable sets. A binary search won't be
| much help on those slices if the tree is constructed properly.
| jldugger wrote:
| As I understand it, the idea is that a path hint is a
| suggestion on which node to scan first, not a suggestion on
| how to binary search an individual node.
| jsnell wrote:
| No, it'll still do a scan all the way from the root node
| down to the leaf nodes. The path is simply saying which
| element to start the binary search from, in each of the
| nodes.
| jleahy wrote:
| The something is data dependencies.
| nextaccountic wrote:
| Isn't the binary search here taken as the whole tree, spanning
| multiple nodes? Each time we follow a pointer we are discarding
| a huge swath of the tree.
|
| In a single node, a linear search is probably better, yes.
| scottlamb wrote:
| > It's a cool idea, but you have to keep in mind that binary
| searches are not optimal for b-tree nodes unless the node is
| really big. If you have a smaller node, < 32 or 64 elements, a
| simple linear search is faster. The reason has something to do
| with processor cache, branch mis-prediction, and pipelining.
|
| I've heard that--and iirc Rust's BTreeMap type uses linear
| search--but does this depend on the type of key? Easy for me to
| see how that might be true for a BTreeMap<u64, V> but not a
| BTreeMap<String, V>. In the latter case, there's a pointer to
| chase for each comparison (and more/variable data to process;
| SIMD over multiple keys doesn't seem likely).
| ccleve wrote:
| Yes, true. If your comparisons involve pointer-chasing or
| string comparison or anything complex, then it's not clear
| which strategy is faster. I always write the simplest
| possible code first, which is usually a linear search,
| benchmark it, and then try to beat the benchmark if it's
| warranted.
| ntonozzi wrote:
| A similar idea is finger search
| (https://en.wikipedia.org/wiki/Finger_search), where you can turn
| a traversal of a tree from O(log(number of items in the tree))
| into O(log(number of items between the finger and the key)).
| uyt wrote:
| I was going to link the same thing except I would say he is
| talking about exactly that without knowing the academic name
| for it.
| namibj wrote:
| A related thing: exponential search. It's the sorted-array
| equivalent.
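|
| A hedged Go sketch of that array version (galloping search
| from a finger index f; assumes target >= keys[f]):
|
|     package sketch
|
|     import "sort"
|
|     // gallop doubles its step from the finger until the
|     // window brackets the target, then binary-searches inside
|     // it; cost is O(log d) in the distance d from the finger.
|     func gallop(keys []int, target, f int) int {
|         lo, hi, step := f, f+1, 1
|         for hi < len(keys) && keys[hi] < target {
|             lo = hi
|             hi += step
|             step *= 2
|         }
|         if hi > len(keys) {
|             hi = len(keys)
|         }
|         return lo + sort.SearchInts(keys[lo:hi], target)
|     }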
| mabbo wrote:
| I love little optimizations like this.
|
| What I'm reading is that the node in the b-tree remembers the
| most recently found index during search, and if the next search
| happens to be for the same value or one very close, the binary
| search goes much faster. This only helps in situations where
| searches correlated in time tend to have nearby key values, but
| that's probably very common.
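|
| Roughly, in Go (a sketch of the idea with the caller holding
| the remembered index; this is not the library's actual code):
|
|     package sketch
|
|     // hintedIndex finds target in one node's sorted keys,
|     // probing the remembered index first and narrowing the
|     // binary search around it, then updates *hint for next
|     // time.
|     func hintedIndex(keys []int, target int, hint *int) int {
|         lo, hi := 0, len(keys)
|         if h := *hint; h < len(keys) {
|             switch {
|             case keys[h] == target:
|                 return h // perfect hint: one comparison
|             case keys[h] < target:
|                 lo = h + 1 // only search right of the hint
|             default:
|                 hi = h // only search left of the hint
|             }
|         }
|         for lo < hi {
|             mid := (lo + hi) / 2
|             if keys[mid] < target {
|                 lo = mid + 1
|             } else {
|                 hi = mid
|             }
|         }
|         *hint = lo // remember where we landed
|         return lo
|     }
|
| In the actual library the hint stores one such index per level,
| and the same trick is applied at every node on the way down.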
| anonymouse008 wrote:
| I'm not qualified to truly understand this, but when Lex first
| interviewed Jim Keller, Jim basically said regarding processor
| design -- 'yeah, if you just guess the same result as last
| time, you get a 50% speed increase.'
|
| First Interview (where that poorly paraphrased quote resides):
| https://www.youtube.com/watch?v=Nb2tebYAaOA&t=13s
|
| Second Interview: https://www.youtube.com/watch?v=G4hL5Om4IJ4
| sitkack wrote:
| The Weather Prediction Algorithm: the weather will be the same
| as yesterday. It's only wrong on transitions, and very useful
| when you have runs of the same state. After that you implement
| a Markov chain [1].
|
| [1] https://en.wikipedia.org/wiki/Markov_chain
| Symmetry wrote:
| I'm sort of surprised that the gains are so high. I'd tend to
| expect the cost of a pointer dereference in a large data structure
| to be bigger than the comparisons even with the attendant branch
| mispredictions.
|
| EDIT: Oh, duh, if the path hints are useful then cache locality
| means that the data you're looking for is already nearby. If the
| hint works the cost of the lookup will actually be dominated by
| the comparisons.
| iratewizard wrote:
| I wonder how this would compare to a B-Tree where you've
| explicitly loaded large branches of the tree into cpu cache and
| perform multiple comparisons in a single cpu cycle.
| warmfuzzykitten wrote:
| The "path hint" sounds a lot like a cursor. Since a lot of b-tree
| traversal is ordered, knowing the last position in the tree is
| often valuable information.
| bob1029 wrote:
| I think you would find the splay tree provides similar
| properties, potentially in an even more elegant way.
|
| For applications I have used it for, this technique results in
| the hottest data eventually being condensed into a smaller number
| of contiguous storage blocks.
|
| The power of this technique is that it amortizes the optimization
| of the tree across _all_ accesses (reads), not just inserts &
| deletions (or other special-purpose re-balancing operations).
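|
| For anyone who hasn't met splay trees, a simplified
| move-to-root sketch in Go; real splay trees use zig-zig and
| zig-zag double rotations to get the amortized O(log n) bound,
| this only shows how hot keys migrate toward the root:
|
|     package sketch
|
|     type splayNode struct {
|         key         int
|         left, right *splayNode
|     }
|
|     // access descends toward key, then rotates the node it
|     // found up one level per stack frame on the way back, so
|     // the accessed key ends up at (or near) the root.
|     func access(t *splayNode, key int) *splayNode {
|         if t == nil {
|             return nil
|         }
|         if key < t.key && t.left != nil {
|             t.left = access(t.left, key)
|             t = rotateRight(t)
|         } else if key > t.key && t.right != nil {
|             t.right = access(t.right, key)
|             t = rotateLeft(t)
|         }
|         return t // the node nearest key is now the root
|     }
|
|     func rotateRight(t *splayNode) *splayNode {
|         l := t.left
|         t.left, l.right = l.right, t
|         return l
|     }
|
|     func rotateLeft(t *splayNode) *splayNode {
|         r := t.right
|         t.right, r.left = r.left, t
|         return r
|     }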
| foobiekr wrote:
| I love splay trees, skew heaps, etc. data structures that get
| very little love.
|
| But I've actually tried to use splay trees in production code.
| The performance benefits versus a simple Patricia tree seemed
| totally absent in actual practice even for things like very
| frequent adjacent queries. The structure is neat but paying
| attention to memory layout and cache line size is far, far more
| important.
| zinclozenge wrote:
| One of the things that surprised me was that linear search on
| optimally laid out data was frequently faster than binary
| search with frequent pointer chasing.
| bob1029 wrote:
| I use splay trees for their indirect utility. That is, the
| splay tree is a trivial way to calculate minimal, incremental
| updates to an append-only log. With most (all?) other b-tree
| structures, you would need to rewrite an unsustainable (or
| otherwise unpredictable) amount of tree data to the append-
| only log each time it is modified. The magic is that the
| splay operation redefines the root node every time you access
| the tree, so you just have to keep track of all the nodes you
| visited in each transaction batch and then write that sub-
| tree to the end of the log.
|
| >paying attention to memory layout and cache line size is
| far, far more important.
|
| Totally agree. For scenarios where I know I am going to have
| fewer than ~100k of something (assume an array of structs), I
| typically just do a linear scan over memory. If it's a bunch
| of objects in memory or a cold access to storage, the trees
| become far more practical.
| jsnell wrote:
| I don't think this is about hot-data optimization as such
| though, but about very short-term local clustering of keys.
|
| I.e. it's not that a:b:c and a:b:d are globally accessed more
| often than x:y:z, but that after a:b:c is accessed it's more
| likely that the next access is a:b:d than x:y:z.
| eternalban wrote:
| The tree structure is invariant across the balancing/search
| algorithms. Maybe you can do splay on loads and then
| rebalance. Q is whether this ends up being more costly over
| time. Now I wonder if there can be a hybrid splay with
| distinct sub-roots.
| crdrost wrote:
| It's also that they are different data structures used
| historically for different things, no?
|
| It'd be interesting to see a high-fanout splay tree, not even
| 100% sure how it'd work. Do you splay just the one node up to
| the root and have to split/combine each splay? Or do you only
| splay the first node in a "chunk" up? Or the whole "chunk"?
| Is it better to do a B+ type of thing rather than a B-tree
| type of thing?
| bob1029 wrote:
| In theory, this could provide dramatic performance uplift
| if the splay is done along lines of an aggregate value
| function over the node contents, and the contents of the
| node are selected such that their aggregate value function
| is well-defined & ordered relative to other nodes. I've got
| an implementation I might try this with. There are a ton of
| directions you could take this in depending on specifically
| what you are trying to optimize for.
|
| Looking at this from a mechanical sympathy perspective - If
| you are unable to batch tree operations for whatever
| reason, increasing fanout is a good intermediate answer
| since you can make your storage system do more effective
| work per I/O. You can't ever read or write in units smaller
| than 1 storage block, so maybe shoving more logical items
| into each block is where you need to look.
|
| Practically, this is probably not feasible for a wide range
| of use cases. GUID/UUID keys (non-linear) would break this
| scheme in my mind.
| kolbe wrote:
| I love this. In general, I've found you can do amazing things to
| speed up common algorithms by tailoring them to your use case. It
| usually feels hacky, but I've found it to be very effective.
| kache_ wrote:
| I love b trees :)
| infogulch wrote:
| Another B-tree-related post I appreciated recently: Evolution of
| tree data structures for indexing -
| https://news.ycombinator.com/item?id=27963753
|
| I'd be interested in more posts on LSM trees.
| kazinator wrote:
| An interesting idea would be to imitate filesystems and provide
| an actual "current working node" concept to the application. The
| current working node is a hint which contains a pointer to a
| BTree node. This could be the leaf node where some item was
| previously found, or one node higher above the leaf level.
|
| When the search for an item fails from the current working node,
| it falls back on earnestly searching from the root.
|
| If a node is split or merged by insertions or deletions, then if
| any current working node pointers to it exist, they get
| invalidated or updated somehow. (The B-Tree object could carry a
| small dynamic set of current working node items which it can
| update when the tree mutates.)
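|
| A rough Go sketch of that, with a toy B+-style node; all names
| here are made up, and the split/merge invalidation is only a
| comment:
|
|     package sketch
|
|     import "sort"
|
|     type node struct {
|         keys     []int
|         children []*node // nil in leaves
|     }
|
|     type tree struct {
|         root *node
|         cwn  *node // current working node: last leaf visited
|     }
|
|     // get tries the cached leaf first and only falls back to
|     // an earnest root-to-leaf search on a miss. Any split or
|     // merge must clear or update t.cwn, or it goes stale.
|     func (t *tree) get(key int) bool {
|         if n := t.cwn; n != nil && len(n.keys) > 0 &&
|             n.keys[0] <= key && key <= n.keys[len(n.keys)-1] {
|             i := sort.SearchInts(n.keys, key)
|             return i < len(n.keys) && n.keys[i] == key
|         }
|         n := t.root
|         for n.children != nil {
|             // first separator greater than key picks the child
|             i := sort.Search(len(n.keys), func(j int) bool {
|                 return n.keys[j] > key
|             })
|             n = n.children[i]
|         }
|         t.cwn = n // remember the leaf for the next lookup
|         i := sort.SearchInts(n.keys, key)
|         return i < len(n.keys) && n.keys[i] == key
|     }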
| [deleted]
| jeremyis wrote:
| I really enjoyed this post and hope I get more like it in the
| future:
|
| - It gave a high level, concise background overview of the base
| concept (B-trees) complete with diagrams
|
| - It talked from first principles about how things worked without
| really any jargon
|
| - It explained simply a clever optimization
|
| Thanks for posting!
| srcreigh wrote:
| I'm not sure this is useful for a traditional database. There are
| lots of costs that are orders of magnitude more important.
|
| This is a CPU optimization; in a DB, however, the major costs
| are 1) pulling blocks from disk and 2) transferring results
| over a network.
|
| Even if you assume nodes are cached in memory, a B-Tree node for
| a database probably fits in L1 cache. SQLite page size is 4096
| bytes by default vs 8KB+ L1 cache size. Shouldn't you pay 100x
| the cost just to load the next node from memory, if you have a
| large tree? In theory that's <1% faster overall.
|
| I'm curious whether the 3x improvement is overall, or just from
| the binary search (according to the benchmark it is overall 2x
| faster - very curious).
|
| Also curious about the size of the tree in benchmarks and the spread
| factor (ie how many children each B-Tree node has). I couldn't
| find this info from the repo. This could affect the % of time you
| can find the next node already in a cache.
| srcreigh wrote:
| I found the answers - the tree uses 256 children by default,
| implying ~16kb node size on 64 bit architecture. The size of
| the tree in the benchmark is 1M nodes, so ~16GB tree. Pretty
| hefty tree! :)
|
| It's the sequential insert/get benchmarks that benefit the
| most from path hints. You'd only have 1 hot path down the tree
| in this setup (the rightmost path). Is it possible the CPU is
| caching multiple non-contiguous blocks (ie the hot path)? That
| would explain why memory access isn't as much of a
| factor as I originally thought.
|
| In other words - this technique may be useful in DBs! Disk and
| memory are both heavily cached in the CPU, so CPU optimizations
| can help for sequential/repeated local accesses.
| tidwall wrote:
| It's 1M items, not nodes. Each item is a Go interface{},
| about 16 bytes each. The benchmarks actually show that the
| tree is around 36MB.
| sroussey wrote:
| Back in the 1990s, I did something similar with SQL and it worked
| wonders.
| vvern wrote:
| It feels to me like this is just a kludge added to deal with the lack
| of a stateful iterator on top of the tree. In this use case, the
| author indicates that it is about exploiting locality in a
| request path. Imagine that the tree offered an iterator that
| maintained a stack of its path and that thing had a seek method.
| You can optimize that thing to only go back up to the root if it
| needs to. Stateful iterators make for a nice pairing with btrees.
| These other hints are papering over the lack of that abstraction
| as far as I can tell.
| anderskaseorg wrote:
| With a stateful iterator, you might need to worry about your
| cached pointers being invalidated by mutations from other
| callers.
| Ar-Curunir wrote:
| Not to sound like a member of the Rust Evangelism
| StrikeForce, but Rust is able to offer stateful iterators
| over its `BTreeMap` exactly by preventing mutation during
| iteration.
| oscardssmith wrote:
| Doesn't that mean that iteration requires a write lock?
| That sounds bad for lots of applications.
| Ar-Curunir wrote:
| The write-lock is checked entirely at compile time, so there
| is no performance overhead.
| jvanderbot wrote:
| In the mid 80s to 90s, there was a lot of interest around "query
| responsive" data structures. There still is, I'm sure, but that
| was what I studied and I've moved on since then.
|
| Of these, the Fibonacci and pairing heaps were the most
| well-known products, I reckon.
|
| They essentially move "Active" parts of the data structure
| upwards near the top of the tree. The use case is different
| (being heaps), but I wouldn't be surprised if that line of R&D
| has continued, or if there's a huge gap in performance that can
| be reclaimed with "intelligent" guesses like this one.
|
| I'm rambling and conjecturing, but maybe someday branch
| prediction will get a huge boost to its already tangible smarts
| and we'll see another explosion in practical CPU throughput,
| combined with a little help from algs and data structures, of
| course.
| gopalv wrote:
| > we'll see another explosion in practical CPU throughput,
| combined with a little help from algs and data structures
|
| The learned index[1] is basically a variation of this sort of
| "hint" which relies on getting the right answer most of the
| time, but from learning the distribution function as a model
| instead of relying on the programmer's hand-written heuristics.
|
| (from that paper)
|
| > The key idea is that a model can learn the sort order or
| structure of lookup keys and use this signal to effectively
| predict the position or existence of records
|
| I haven't seen anything mix this sort of work with an Eytzinger
| traversal, but there's always diminishing returns for these.
|
| [1] - https://arxiv.org/abs/1712.01208
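|
| The flavor of it in Go, with linear interpolation standing in
| for the trained model (real learned indexes fit piecewise
| models with explicit error bounds; assumes sorted, distinct
| keys):
|
|     package sketch
|
|     import "sort"
|
|     // learnedSearch predicts a position from the key, then
|     // widens a window around the guess until it brackets the
|     // key, so the cost tracks the model's error, not log n.
|     func learnedSearch(keys []uint64, key uint64) int {
|         n := len(keys)
|         if n == 0 || key <= keys[0] {
|             return 0
|         }
|         guess := n - 1
|         if span := float64(keys[n-1] - keys[0]); span > 0 {
|             guess = int(float64(key-keys[0]) / span * float64(n-1))
|         }
|         if guess >= n {
|             guess = n - 1
|         }
|         // widen left until keys[lo] <= key (or lo hits 0)
|         lo, step := guess, 1
|         for lo > 0 && keys[lo] > key {
|             lo -= step
|             step *= 2
|         }
|         if lo < 0 {
|             lo = 0
|         }
|         // widen right until keys[hi-1] >= key (or hi hits n)
|         hi, step := guess+1, 1
|         for hi < n && keys[hi-1] < key {
|             hi += step
|             step *= 2
|         }
|         if hi > n {
|             hi = n
|         }
|         return lo + sort.Search(hi-lo, func(i int) bool {
|             return keys[lo+i] >= key
|         })
|     }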
| nmwnmw wrote:
| My personal favorite of these is the Splay Tree[1]. I've never
| used it in production, but it's really simple to implement and
| reason about. Though my main experience was trying (and
| failing) to prove amortized bounds on the operations.
|
| [1] https://en.wikipedia.org/wiki/Splay_tree
| scottlamb wrote:
| Does that mean any access to these data structures needs an
| exclusive lock? One advantage to the "path hints" approach
| described in the article is that the path hints can be stored
| separately from the btree itself. The author writes:
|
| > For single-threaded programs, it's possible to use one shared
| path hint per B-tree for the life of the program.
|
| > For multi-threaded programs, I find it best to use one path
| hint per B-tree, per thread.
|
| > For server-client programs, one path hint per B-tree, per
| client should suffice.
|
| Seems likely multiple threads/clients would have different
| access patterns anyway, so the hints might actually be better
| this way in addition to being more concurrent.
|
| edit: also, this approach feels a lot lighter-weight to me than
| mutating the tree: I can imagine the mutations involving
| frequent allocation/deallocation where this hint is just a
| vector that grows to the maximum depth and stays there.
| bradleyjg wrote:
| Had the same thought. Making every access RW seems like a
| cure worse than the disease in a multithreaded situation.
| nomy99 wrote:
| I think OP is using the wrong data structure, and his fix is
| bringing him closer to the ideal solution. You need a hash
| function that returns a search range to direct the binary search
| on the B-tree. This lets you add logic on how you want data to
| be searched in the future. In the base case it will work as the
| B-tree hint.
| vecter wrote:
| In case others are unclear since it doesn't seem to be explicitly
| stated in the docs, you update the path hint to be the path of
| the most recently found item. I think this is implied when he
| says "My path may be wrong, and if so please provide me with the
| correct path so I get it right the next time.", but you can see
| where `hint` is set in the implementation also.
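|
| Usage looks something like this, if I'm reading the Go package
| right (method names from the repo's README; treat this as a
| sketch and check the actual API):
|
|     package main
|
|     import (
|         "fmt"
|
|         "github.com/tidwall/btree"
|     )
|
|     func main() {
|         tr := btree.New(func(a, b interface{}) bool {
|             return a.(int) < b.(int)
|         })
|         for i := 0; i < 1000; i++ {
|             tr.Set(i)
|         }
|         // one hint per thread/client; each *Hint call both
|         // uses it and rewrites it with the path it took
|         var hint btree.PathHint
|         for i := 500; i < 510; i++ { // clustered keys win most
|             if v := tr.GetHint(i, &hint); v != nil {
|                 fmt.Println(v)
|             }
|         }
|     }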
| ot wrote:
| The C++ STL has something similar: insert operations accept an
| iterator as a hint:
|
| https://en.cppreference.com/w/cpp/container/map/insert
|
| In common implementations iterators are just node pointers, which
| is enough because nodes have parent pointers, so you can recover
| the whole path.
|
| In extreme cases like ordered insertion this effectively shaves
| off a log(n) factor.
| uyt wrote:
| I really like that the hint is optional.
|
| You don't always want the hint because if you're jumping _d_
| steps ahead, you actually need to touch ~2 log(d) nodes: log(d)
| to walk up and log(d) to walk down. Since 2 log(d) = log(n)
| exactly when d = sqrt(n), anytime you're moving more than
| d=sqrt(n) steps away you're better off just walking down from
| the root instead.
|
| For example, if n=1000000 and the hot section is wider than
| d=1000, you're better off going from the root all the time. Of
| course if you access items that are directly adjacent you
| should still use in-order iteration for those (but I see that
| as a ++ operator, not a separate lookup).
| willvarfar wrote:
| Wow! Obvious in hindsight!
|
| Do any databases use this or similar tricks?
| jlokier wrote:
| Yes. Even better, a good database will avoid traversing the
| blocks and path when the key is close enough to the previous
| lookup's hint, making all but the first lookup in a nearly-
| consecutive sequence take O(1) instead of O(log N), where N is
| the size of the tree, and runs of n nearly-consecutive
| operations take O(n + log N) instead of O(n log N).
| bartlomieju wrote:
| BuntDB [0] from @tidwall uses this package as a backing data
| structure, and BuntDB is in turn used by Tile38 [1].
|
| [0] https://github.com/tidwall/buntdb [1]
| https://github.com/tidwall/tile38
___________________________________________________________________
(page generated 2021-07-30 23:00 UTC)