[HN Gopher] Efficiently Searching In-Memory Sorted Arrays: Reven...
___________________________________________________________________
Efficiently Searching In-Memory Sorted Arrays: Revenge of
Interpolation Search? [pdf]
Author : greghn
Score : 54 points
Date : 2023-06-03 16:12 UTC (6 hours ago)
(HTM) web link (pages.cs.wisc.edu)
(TXT) w3m dump (pages.cs.wisc.edu)
| scrubs wrote:
| Broadly related: https://news.ycombinator.com/item?id=25899286
| PGM Indexes: Learned indexes that match B-tree performance with
| 83x less space
|
| I might suggest, slightly pushing back on this article's PDF,
| that binary search is not the de-facto in-memory search
| algorithm. Cache-conscious btrees, various optimizations of
| tries, and radix trees are better. Eytzinger arrays are a good
| BST option if data is pre-sorted and/or only rarely needs
| re-sorting. Which brings us back to btrees: keeping data sorted
| is usually 51% of the problem. Search is the other 49%.
|
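A minimal sketch of the Eytzinger idea mentioned above (textbook layout only, not a tuned implementation): the sorted array is permuted into breadth-first order of an implicit complete BST, so each probe's children sit at predictable, prefetchable indices.

```python
def eytzinger(sorted_arr):
    """Permute a sorted array into BFS (Eytzinger) order.

    Index 0 is unused so the children of node k are 2k and 2k+1."""
    out = [None] * (len(sorted_arr) + 1)
    it = iter(sorted_arr)

    def fill(k):
        # In-order traversal of the implicit tree consumes the
        # sorted elements in order, writing them into BFS positions.
        if k < len(out):
            fill(2 * k)
            out[k] = next(it)
            fill(2 * k + 1)

    fill(1)
    return out


def eytzinger_search(tree, key):
    """Return the Eytzinger index of the first element >= key,
    or 0 if every element is smaller."""
    k = 1
    while k < len(tree):
        # The arithmetic step below is what compiles branch-free in C.
        k = 2 * k + (tree[k] < key)
    # Recover the answer: shift out the trailing 1-bits plus one bit
    # (the Python analogue of k >>= ffs(~k)).
    k >>= (~k & (k + 1)).bit_length()
    return k
```

The search does no early exit, so every lookup runs the same number of iterations, which is part of why this layout plays well with prediction and prefetching.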
| OK, if you don't need range search, which needs sorted keys, then
| hashmaps are darn hard to beat.
|
| The PDF author however makes a strong point: if FLOPS are
| growing way faster than memory bandwidth, the obvious thing to
| do is spend a little more code looking into memory smartly. And
| the PDF even gives algorithms to do it. So that's why my push
| back on binary search being the de-facto algorithm is only a
| small one. The paper goes way further.
| pvansandt wrote:
| I (author) agree that, certainly at the higher end of element
| counts, it would be a very strange decision not to use an
| index. One of my favorite related papers is RadixSpline by
| Kipf, which shows an index-based approach.
|
| However, the inner loop of an index is usually some other kind
| of search algorithm. Graefe mentions in B-Tree Techniques that
| interpolation search is sometimes applicable to the inner nodes
| of btrees. One of the SIP evaluations used SIP as part of the
| inner loop in LevelDB, which doesn't displace the use of an LSM
| for managing the larger data set.
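For readers who haven't seen it, a minimal sketch of textbook interpolation search (not the paper's SIP/TIP, which layer guards and approximations on top of this idea): probe where the key *should* be under a uniform-distribution model instead of always bisecting.

```python
def interpolation_search(a, key):
    """Classic interpolation search on a sorted list of numbers.
    Returns an index of key, or -1 if absent."""
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= key <= a[hi]:
        if a[hi] == a[lo]:            # flat range: avoid division by zero
            break
        # Linear interpolation between the two endpoints.
        mid = lo + (key - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        if a[mid] == key:
            return mid
        if a[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return lo if lo <= hi and a[lo] == key else -1
```

On uniform data this converges in O(log log n) probes, but each probe costs a division and a likely cache miss, which is exactly the trade-off the thread is discussing.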
|
| If you don't need a lot of range queries, Eytzinger or cache-
| line-optimized btrees do get a lot more interesting, but the
| iteration logic gets a lot worse. SIP returns the element if
| it's present and a sentinel if absent, which is an odd choice
| since the index is usually what's needed. No clue why I did it
| that way, but hopefully you'll have some sympathy for
| revisiting your old code and wondering who wrote it. :)
|
| I do think that optimizing search algorithms is still
| worthwhile in an indexed setting, for two reasons. 1. The size
| of a node within an index like a btree depends on the
| efficiency with which that node can be searched, so if you can
| decrease the cost of searching a given node, that might
| motivate larger node sizes. I know that a common heuristic is
| to use a multiple of the page size, but that is based on the
| assumption that the whole node will be used as part of the
| search. 2. Pushing algorithmic decisions to just-in-time can
| cost more than the time saved by choosing a better algorithm:
| you lose icache efficiency, stall on the decision, etc. It's
| still good to verify, as best you can, that your index is
| actually buying you improved performance for its additional
| complexity and update overhead.
| scrubs wrote:
| Your paper is a valid add, and I'm gonna read/digest it. My
| point above, while true, is more in the margins. Better to
| keep the focus on the engineering know-how in the PDF. Thank
| you for submitting it.
| marginalia_nu wrote:
| Commenting so I find this later.
| ComputerGuru wrote:
| Tip: you can favorite the submission then un-favorite it after
| you've read it.
| alwaystired wrote:
| New leetcode problem just dropped
| lucgagan wrote:
| Wow, this sounds like a really fascinating study! Binary Search
| has long been the go-to method for sorted, in-memory data search
| due to its effectiveness and simplicity. However, given the
| limitations of Binary Search in certain scenarios, it's exciting
| to see new algorithms being proposed that could potentially
| perform better under specific circumstances.
|
| It's intriguing that Interpolation Search, despite its
| historically lackluster performance, is being given a second life
| through SIP and TIP. I love the fact that they're leveraging
| hardware trends and advanced calculations to optimize search
| performance.
|
| The fact that SIP shows up to 4x faster performance on uniformly
| distributed data and TIP is 2-3x faster on non-uniformly
| distributed data is quite promising. It seems these approaches
| could be really useful in practice, given that real-world data
| often follows non-uniform distributions.
|
| The introduction of a meta-algorithm to dynamically select the
| optimal search algorithm based on the data distribution is a
| clever touch. It helps to enhance the overall search performance
| by not sticking to a one-size-fits-all approach.
|
| I'd love to read more about these techniques and their practical
| implementations. Do they have any benchmarks or plans to test
| these new methods in large-scale, real-world systems?
| switchbak wrote:
| Serious question: did you use an LLM tool to craft some or all
| of this comment?
| finexplained wrote:
| The five paragraph format will forever be suspicious.
| adamisom wrote:
| I'd be willing to bet money at 5-1 that they didn't. Since
| that's impossible to prove, I guess the bet would have to be
| on running an LLM a lot of times and getting anything close
| to it.
| pvansandt wrote:
| This was an interesting journey. My original target was
| L1-sized arrays with constant-sized error, but that limits its
| applicability to the target audience. My initial explorations
| with spline-based indexes were at this target data size, so
| they weren't competitive.
|
| SIP is very sensitive to quality of implementation. I did my
| best to steel man binary search, but if either algorithm is
| not implemented carefully, you can easily get inverted
| results. I tried several different axes of variation in the
| implementation of binary search, and probably spent more time
| optimizing binary search than SIP or TIP; binary search is
| very sensitive to cache misses and how those interact with
| branch prediction and conflict misses. Quality of the linear
| search is also critical and relies heavily on array padding.
| SIMD was actually a major pessimization: for SIP, it's only
| worth using if the linear search is < 100 elements or so.
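A toy illustration of the padding point above: if the array is padded past its end with a sentinel at least as large as any key, the hot loop of the linear search can drop its bounds check entirely. This is a hypothetical Python sketch of the idea, not the paper's implementation (where the padding would be preallocated, not appended per call).

```python
def guarded_linear_search(a, start, key, pad_value):
    """Scan forward from `start` for `key`, relying on a sentinel.

    `pad_value` stands in for the array padding: it must compare
    >= any key, so the loop needs only one condition."""
    padded = a + [pad_value]      # in C this padding already exists
    i = start
    while padded[i] < key:        # no `i < n` test inside the loop
        i += 1
    return i if i < len(a) and a[i] == key else -1
```

The single-condition loop is the point: it is short, predictable, and easy for the compiler to keep tight, which matters when interpolation has already landed close to the target.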
|
| I spent a lot of effort trying to model the performance of
| SIP, predicting from statistics of the data how it would
| perform. You can model the cost of each interpolation as a
| random-access cache miss, and the cost of a linear search as a
| function of its distance, but linear search has decreasing
| marginal cost with distance before it reaches steady state,
| which is exactly the regime where SIP uses it. So that
| basically led nowhere, but it does point out that an L2 norm
| is the wrong metric for identifying whether interpolation
| would be useful on your data.
|
| I think the bigger insight of TIP is its use of efficient
| approximations for hyperbolic interpolation. There's a lot of
| research around polynomial interpolation, but I think that
| hyperbolas are also worth examining. I don't expect TIP to be
| used in a database engine, and while data might commonly be
| Zipf-distributed, I'm not sure people are putting that kind of
| data into an array and doing lookups on it where the values at
| the bottom of the array have a lot of duplicates.
|
| SIP, I think, is more about increasing spatial cache locality
| and making the most of out-of-order processors. It is more a
| variation on interpolation-sequential search, so it might be
| more practically used in a SIP-2 or SIP-3 variation with a
| hard-coded, unrolled number of iterations. (SIP-1 would just
| be interpolation-sequential.) The downside of an interpolation
| method is that pairs of sequential interpolations don't
| surround the target, which leaves the search unbounded. I
| don't find hybrid approaches very interesting because I don't
| expect an implementation to be able to pipeline useful work
| while keeping the time per iteration sufficiently fast. For
| perspective, on billion-element arrays, the typical number of
| interpolations was around 3. We wouldn't expect the number of
| interpolations to reach 4 until a quintillion elements, which
| suggests that the gap in efficiency between log N and
| log log N grows slowly enough that algorithmic gains have a
| nearby, practical upper bound.
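A hypothetical sketch of what such a "SIP-k" variation might look like: a fixed number of clamped interpolation probes, then a sequential finish. The details here (the clamping, the fallback scan) are guesses at the shape being described, not the paper's code.

```python
def sip_k(a, key, k=3):
    """'SIP-k' style lookup: k interpolation probes, then a
    sequential scan from the narrowed range. Returns an index of
    key, or -1 if absent."""
    lo, hi = 0, len(a) - 1
    for _ in range(k):              # fixed, unrollable probe count
        if hi < 0 or a[hi] == a[lo]:
            break
        mid = lo + (key - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        mid = max(lo, min(mid, hi))  # interpolation can overshoot
        if a[mid] < key:
            lo = mid
        else:
            hi = mid
    # Interpolation-sequential finish: scan forward from lo.
    i = lo
    while i < len(a) and a[i] < key:
        i += 1
    return i if i < len(a) and a[i] == key else -1
```

Because the probe count is a compile-time constant, the interpolation steps can be fully unrolled, which is the practical appeal over a data-dependent loop.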
|
| One of my favorite related papers is RadixSpline by Kipf et
| al. The amount of additional memory across a number of
| scenarios would be tiny, you could write a very efficient
| inner loop, and it would have been basically tailor-fit to
| work perfectly on the FB dataset (one of the real-world
| distributions).
| mlochbaum wrote:
| The paper draws the analogy to continuous root-finding
| ("Interpolation search is the regula falsi method"), nice. One
| thing I've been wondering lately is whether the ITP method[0],
| which is actually more recent than this paper, is a good fit for
| sorted searching. It's an elegant method that guarantees a
| maximum number of queries, for example one more than binary
| search, while taking advantage of interpolation for better-
| behaved distributions. I wrote a short section about this at
| [1], which is likely to link to the OP soon too.
|
| Since the author's here: the "Optimized binary search" algorithm
| looks good, in that it has an obvious branchless implementation.
| But it'll depend on how the code's written and compiled, and
| clang in particular will turn it into a branch without -mllvm
| --x86-cmov-converter=0. Making it branchless was the intention,
| right? Was the output code checked to make sure it uses cmov?
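For readers following along, the branchless shape being discussed looks roughly like this (Python is used only to show the structure; in C, the marked arithmetic update is what the compiler may or may not lower to a cmov):

```python
def branchless_lower_bound(a, key):
    """Binary search that narrows [base, base + n) with an
    arithmetic update instead of an if/else on the comparison."""
    base, n = 0, len(a)
    while n > 1:
        half = n // 2
        # Conditional move written as arithmetic: no branch on the
        # data, so there is nothing for the predictor to miss.
        base += half * (a[base + half - 1] < key)
        n -= half
    # Note: the result is always a valid index in [0, n); the caller
    # still checks a[base] >= key to distinguish "not found".
    return base
```

The data-independent control flow is the whole point: the loop trip count depends only on the array length, never on the key, which is also what makes the clang cmov-conversion question above matter.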
|
| [0] https://en.wikipedia.org/wiki/ITP_Method
|
| [1]
| https://mlochbaum.github.io/BQN/implementation/primitive/sor...
| [deleted]
| pvansandt wrote:
| Big fan of your presentation on binary search that amortised
| multiple concurrent searches.
|
| I did several variations on binary search and always compared
| against the best one. [0] From what I remember, some variations
| were compiled to a branchless implementation and some were
| compiled to implementations that used a branch. I had many
| passes of analysis by pasting into godbolt with identical
| compilers and flags. Power-of-2 binary search did better on
| small arrays iirc, but is the first to hit conflict misses.
| For larger arrays, I believe the branchy search algorithms
| start to do better since half their branches get predicted
| correctly.
|
| I haven't previously looked at ITP, and I would need to study
| it further to get a clearer idea of what it's aiming for. Some
| hybrid methods, unlike ITP, require additional memory
| accesses, but I don't think this one does, just compute. On
| the other hand, since at a billion elements you're looking at
| roughly 3 interpolations or 30 bisections, there's a pretty
| narrow window of opportunity here, and in the batch case there
| are indices to contend with. I'm on the BQN discord if you
| want a different form of communication.
|
| [0] https://github.com/UWHustle/Efficiently-Searching-In-
| Memory-...
| mlochbaum wrote:
| Sounds good on the basic binary search. I jumped through the
| paper a little too quickly and missed the source code link;
| thanks for the godbolt confirmation as well.
|
| Yeah, the big deal about ITP is not really the interpolation
| but the graceful fallback to binary search. With of course
| that nasty division overhead. I'll have to study your paper
| to see which ideas there could be used.
|
| Glad you liked the talk! And a small world, huh? I've been
| wanting to get an implementation of the vector binary search
| into CBQN for a while, so that an open source version of it
| exists at least. Always some other thing to do. Multiple
| searches are really a different world. Elijah Stone pointed
| out recently[0] that with searches that all have the same
| iteration count (sadly, not interpolation search!), several
| of them can be run at once just by copying the loop body.
| That was new to me at least. And for searches that don't fit
| in cache it's possible to partition the sought values, which
| does the accesses in a perfectly cache-friendly order for
| some extra compute. That's described in the page I linked
| before.
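The "copying the loop body" trick can be sketched like so: since a branch-free binary search runs the same number of iterations regardless of key, several lookups can share one loop. (A toy Python rendering; the real benefit is memory-level parallelism across the independent loads in compiled code.)

```python
def interleaved_lower_bounds(a, keys):
    """Run one lower-bound search per key, interleaved in a single
    loop so the independent probes can overlap their cache misses."""
    bases = [0] * len(keys)
    n = len(a)
    while n > 1:
        half = n // 2
        # Each search does one step per outer iteration; in C this
        # inner loop would be the unrolled, copied loop body.
        for j, key in enumerate(keys):
            bases[j] += half * (a[bases[j] + half - 1] < key)
        n -= half
    return bases
```

This only works because the iteration count is shared by all searches; as the comment notes, interpolation search has a data-dependent probe count, so it doesn't interleave this way.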
|
| [0] https://news.ycombinator.com/item?id=33648924
___________________________________________________________________
(page generated 2023-06-03 23:00 UTC)