[HN Gopher] FANN: Vector Search in 200 Lines of Rust
___________________________________________________________________
FANN: Vector Search in 200 Lines of Rust
Author : ingve
Score : 148 points
Date : 2023-06-15 10:04 UTC (12 hours ago)
(HTM) web link (fennel.ai)
(TXT) w3m dump (fennel.ai)
| bagavi wrote:
| https://www.researchgate.net/profile/Vivek-Kumar-Bagaria-2/p...
|
| This idea improves the scaling from linear in dimension to
| logarithmic, with very little loss, using ideas from multi-armed
| bandits.
| kristjansson wrote:
| The actual arXiv link for the curious:
| https://arxiv.org/abs/1805.08321
| nikhilgarg28 wrote:
| One of the authors of the post here. Happy to take any questions!
| linhvn wrote:
| Wondering how much details you have considered on using
| Johnson-Lindenstrauss transform for dimensionality reduction on
| ANN search?
|
| e.g. https://www.cs.princeton.edu/~chazelle/pubs/FJLT-
| sicomp09.pd...
| nikhilgarg28 wrote:
| We hadn't considered that at all, will definitely check it
| out and explore further. Thanks!
| tipsytoad wrote:
| I couldn't find anywhere in the article, how long does it take to
| add all 10,000 embeddings to the index? That seems quite
| important for many applications
| zX41ZdbW wrote:
| It should take less than one second on CPU.
|
| For example, see my comment above:
| https://news.ycombinator.com/item?id=36341754
| nikhilgarg28 wrote:
| We did some benchmarking and found that an index of 1M
| vectors using 15 trees can be built in ~115 seconds (8
| CPU, 8 GB RAM). https://github.com/fennel-ai/fann/pull/8.
|
| Given this, around a second to build an index of 10K items
| sounds about right.
| sfpotter wrote:
| The dot product isn't a distance
| [deleted]
| tipsytoad wrote:
| The dot product is equivalent to cosine similarity for normalized
| vectors.
| wongarsu wrote:
| Given normalized vectors, it's also proportional to the
| square of the Euclidian distance. Meaning for ranking (and
| thus k-nearest-neighbors) it doesn't matter if you use
| Euclidian distance or cosine similarity, you get the same
| result.
| sfpotter wrote:
| It isn't equivalent to it. See my other comment.
| tipsytoad wrote:
| Dude.
|
| Cos(x, y) = x . y / ||x|| * ||y||
|
| If x and y are normalized:
|
| Cos(x, y) = x . y
| mathisfun123 wrote:
| My favorite kind of pedantry is pedantry that's naive - on like
| day 1 of topology 101 you learn that the "dot" product is an
| inner product and _every_ inner product naturally induces a
| metric so _every_ inner product space is a metric space (with
| the induced metric).
| cochne wrote:
| It's not pedantry. Just because an inner product induced a
| metric space does it make it the same thing as that metric.
| The performance of similarity searches can depend very much
| on whether you use a dot product or a distance.
| sfpotter wrote:
| A distance is a function d(x, y) where d(x, y) = 0 if x = y,
| if x = 0, or if y = 0 (among other properties). If x = y and
| neither x = 0 nor y = 0, and if we let d(x, y) = x*y, then
| d(x, y) != 0 in general, so the dot product is certainly not
| a distance. Your comment isn't relevant, but thanks for
| calling me a pedant. ;-)
| mathisfun123 wrote:
| `is != induces`
| marginalia_nu wrote:
| It is for unit vectors.
| sfpotter wrote:
| The dot product of two unit vectors is the cosine of the
| angle between them. Since cos^-1 (u*v) is the angle between
| them, or the geodesic distance between u and v regarded as
| points on the sphere, which _is_ a distance. But the dot
| product is not.
|
| If u = v, then u*v = 1, but for two points not equal to zero,
| the distance between them should always be zero. Remember
| that u*v = 0 if u and v are perpendicular. If neither u nor v
| are 0, then for them to be perpendicular, they must certainly
| not be equal. The dot product does not behave like a
| distance.
| Solvency wrote:
| My math is weak. I thought dot product measures alignment.
| Help me grok what "distance" means here?
| mecsred wrote:
| Unit vectors have a normalized length, so the only
| difference between two unit vectors is the "heading". The
| angle between them can be interpreted as the the
| "difference" or distance, between them.
|
| Since you can compose any vector in the space out of unit
| vectors, you can extend the concept. See above comment
| inner product -> metric space.
| marginalia_nu wrote:
| Hmm, I think I got my math wrong too.
|
| What is a proper distance function is (1 - |a*b|). That's
| proportional to the distance between those points on the
| plane spanned along the circumference of the
| circle/sphere/hypersphere as it's projected onto that
| plane.
| carterschonwald wrote:
| This random hyper plane algorithm is actually a pretty elegant
| algorithm. It has the following amazing property: as your
| dimension increases, the quality of the random hyper planes
| actually gets better!
| irwt wrote:
| What exactly is elegant about this algorithm? The approach
| seems computationally quite inefficient to me.
| zX41ZdbW wrote:
| It is inefficient to build (unless on GPU), but good for
| search. Moreover, the search naturally benefits from data
| locality after ordering - this allows using MergeTree tables
| in ClickHouse. See
| https://news.ycombinator.com/item?id=36341754
|
| And it is around 10 lines of code in SQL:
| https://presentations.clickhouse.com/meetup74/ai/#35
| gavinray wrote:
| That presentation was pretty cool, thanks for posting.
| irwt wrote:
| That's the thing though, it's very easy to perform
| approximate nearest neighbor search in an efficient way.
| There are plenty of simple solutions for that. The real
| challenge is to design an algorithm that can scale & that
| remains robust even when there are additions and deletions
| to the set.
| yujian wrote:
| Nice, this is a cool version of ANN search. I like that at the
| end there is commentary on what's needed for production as well -
| things like parallelization, RAM considerations, and the
| consideration for balancing the trees. It's really the production
| level considerations that would steer you toward a vector
| database like Milvus/Zilliz
| phillipcarter wrote:
| Our top_k search in our own product is in Go and it's ~25 lines
| of code. Super simple stuff, does the job of funding the top_k in
| 1ms, no index, no framework, nothing. We benefit from the corpus
| we embed being relatively static and we can just store them in
| redis, so it's not a hard problem to solve. Anyways, often you
| don't need some turnkey solution to this stuff.
| zainhoda wrote:
| Same -- I just asked ChatGPT to generate Go code to do cosine
| similarity, and I was surprised that the code it generated
| worked; the _worst_ performance I get is like 10ms.
| phillipcarter wrote:
| Yep. If you parallelize with goroutines it'll be faster, but
| we're talking about shaving a few ms. The effectiveness of a
| for loop crunching floats in a single thread is not to be
| underestimated.
| huac wrote:
| I have gotten 10x speedups with SIMD on modern hardware -
| goes from fast to blazing. https://github.com/stillmatic/go
| llum/blob/main/vectorstore.g...
|
| FWIW - Weaviate's pure Go implementation of cosine
| similarity is probably as fast as you can possibly get in
| pure go, without SIMD -- https://github.com/weaviate/weavia
| te/blob/2f44dbee52196656bd...
| phillipcarter wrote:
| Yep, we'd go with what they did, except OpenAI embedding
| vectors are already normalized, so you can just do the
| dot product.
| nikhilgarg28 wrote:
| What is the size and dimensionality of your corpus?
| phillipcarter wrote:
| Size can be up to ~140MB of vectors, so not exactly enormous.
| But it's nowhere near enough to stress a single node running
| a for loop on the data and doing a dot product, then sorting
| by relevancy.
| Solvency wrote:
| "Build a hyperplane (analog of a "line" in higher dimensions)
| that passes through C and is perpendicular to the line segment
| connecting A and B."
|
| Isn't that just a fancy way of saying a line from O (a tuple of
| zeroes) to C to infinity?
| medium_spicy wrote:
| No, this is only true if that hyperplane contains the origin;
| imagine an infinite number of hyperplanes that contain A and B;
| there are an infinite number of such planes for dimensions
| higher than 2. Now imagine the same, but connecting O and C;
| most of those AB hyperplanes are not orthogonal to those OC
| hyperplanes, it's only coincidence if they are, though for
| dimensions higher than 2 you can always find a point C that
| happens to lie along the orthogonal line from the AB plane to
| the origin.
| dgacmu wrote:
| No, it's more restricted than that. Think of 2d planes in 3
| space: There are an infinite number of planes that you can
| construct on a line from A to C (they're all rotations of a
| plane extending out from that line).
| Solvency wrote:
| Okay I'm trying to map this visually in my brain. So, in 3
| space, say I extrude a random plane out from a line segment.
| I keep going, now I imagine I've extruded every possible
| plane approaching infinity. This eventually forms... a
| cylinder, no? So a hyperplane is what, a cylinder? Or is it
| shorthand for "every possible plane..."
| wongarsu wrote:
| In an n-dimensional space, an infinite n-1-dimensional
| thing is a hyperplane. Or put more simply: it's something
| that divides the space in two. In 1 dimension that's a
| point, in 2 dimensions it's a line, in 3 dimensions it's a
| plane (hence the name), for more dimensions we just call it
| a hyperplane.
|
| Now there are various ways to define which hyperplane you
| speak about. The easiest to reason about is to give a
| couple points that are on the hyperplane (1 point defines a
| point, a line can be defined by any two points on that
| line, a plane by three points on that plane etc.). A
| slightly weirder but equally useful way is to say "it's
| orthogonal to this line, and intersects this point". That's
| easy to reason about in three dimensions, but apparently
| works in higher dimensions as well.
| DougBTX wrote:
| No, rotate by 90 degrees. The plane is normal to the line,
| so the line doesn't lie in the plane.
| [deleted]
| maccaw wrote:
| This is great! See also Voy, A WASM vector similarity search
| written in Rust:
|
| https://github.com/tantaraio/voy
| ashvardanian wrote:
| Always pleasure to see short implementations, but I'd still take
| HNSW over such approach. Its core implementation can also be
| quite compact. In our case with USearch [1] its about 2k LOC,
| even with a custom implementation of priority queues and custom
| numeric types like uint40_t for indexes beyond 4B points.
|
| Obviously, together with bindings for Rust, Python, JavaScript,
| Java, Objective-C, Swift, and Wolfram it gets slightly larger,
| but on the bright side, you can pass JIT-compiled functions and
| store/search elements of different dimensionality, which allows
| much broader applicability [2].
|
| [1]: https://github.com/unum-cloud/usearch [2]:
| https://ashvardanian.com/posts/abusing-vector-search
| moab wrote:
| > I'd still take HNSW over such approach
|
| Why? Sorry, but your comment is not very insightful without
| more explanation, and the rest of your comment is just an
| advertisement for your (admittedly impressive looking) work.
| ashvardanian wrote:
| Because (A) it allows similarity search between auxiliary
| objects, not just equidimensional vectors, and (B) can be
| implemented in a similarly short fashion, while providing
| great performance.
| huac wrote:
| I'd be curious how this stacks up on ANN-benchmarks (https://ann-
| benchmarks.com/).
|
| FWIW exhaustive search is still probably good enough for most use
| cases. IMO the exhaustive search should use a heap
| https://github.com/fennel-ai/fann/blob/main/src/main.rs#L12 as
| you're only looking for top-k, it reduces time and memory
| complexity greatly. On a relatively unoptimized Golang
| implementation (though much beefier hardware) I get ~100ms to
| process 1M vectors of 300 dim. Still quite a bit slower than
| approximate, of course, but in absolute terms probably good
| enough for most use cases, especially because many use cases
| don't have 1M vectors.
| zX41ZdbW wrote:
| I've recently presented the approach with exhaustive search and
| heap, then iteratively adding simple indices, using ClickHouse.
| It works pretty well despite the overall usage simplicity.
|
| Video: https://www.youtube.com/watch?v=hGRNcftpqAk
|
| Presentation: https://presentations.clickhouse.com/meetup74/ai/
|
| PS. It should work at the same speed as "bruteforce-blas" in
| the ANN benchmarks.
| huac wrote:
| nice, if I'm reading this right, you do random projection as
| a first stage filter, then exact search over the filtered
| subset? that's clever, did you test precision/recall at all?
| Lichtso wrote:
| Isn't this kd-trees with arbitrarily oriented (not axis-aligned)
| hyperplanes?
| a_e_k wrote:
| It's a binary space partition (BSP) [0]. A kd-tree is a special
| case of a BSP with axis-aligned hyperplanes.
|
| [0] https://en.wikipedia.org/wiki/Binary_space_partitioning
| esafak wrote:
| The hyperplanes are randomized to ensure the hyperplanes are
| probabilistically independent and to deal with high
| dimensionality, so the data are less likely to be clustered in
| the same bucket.
|
| https://en.wikipedia.org/wiki/Random_projection
| cormacrelf wrote:
| The code for the Hyperplane type is unfortunately omitted. It's
| the most important code in the whole thing.
| un_ess wrote:
| isn't it in https://github.com/fennel-
| ai/fann/blob/main/src/search.rs#L1... ?
| cormacrelf wrote:
| Yes, but it should be in the article if everything else is.
| nikhilgarg28 wrote:
| Apologies, that got omitted by mistake. I've now included
| the post to include that piece of code as well.
| [deleted]
| esafak wrote:
| How did they manage to write this article without once mentioning
| the pioneering LSH algorithm from 1999?
|
| https://en.wikipedia.org/wiki/Locality-sensitive_hashing
| nighthawk454 wrote:
| > The approach we use here is based on a family of algorithms
| called "Locality Sensitive Hashing" used in the popular library
| annoy. This article's goal isn't to introduce a new fancy
| algorithm/library but to describe how vector search works using
| real code snippets.
|
| Right at the top...
| esafak wrote:
| The article was edited in response to this thread:
| https://news.ycombinator.com/item?id=36344870
| nighthawk454 wrote:
| My bad! That's some fast updating
| nikhilgarg28 wrote:
| Yes, it wasn't initially there mostly due to an oversight
| but we later edited the article.
| nighthawk454 wrote:
| Oh ok, cool you're checking feedback and updating. Might
| put in an 'updated at' or something :P. But thanks,
| enjoyed the post!
| fhaltmayer wrote:
| How is this different than ANNOY?
| esafak wrote:
| It is cited, but buried half-way into the article. There is a
| link in it that leads to
| https://erikbern.com/2015/10/01/nearest-neighbors-and-vector...
| nikhilgarg28 wrote:
| The article isn't meant to present a new library competing with
| Annoy or other state of the art approaches, but instead,
| describe how vector search works using short code snippets.
|
| I've now updated the article to clearly mention this and also
| cite annoy.
___________________________________________________________________
(page generated 2023-06-15 23:01 UTC)