[HN Gopher] FANN: Vector Search in 200 Lines of Rust
___________________________________________________________________
FANN: Vector Search in 200 Lines of Rust
Author : ingve
Score : 148 points
Date : 2023-06-15 10:04 UTC (12 hours ago)
(HTM) web link (fennel.ai)
(TXT) w3m dump (fennel.ai)
| bagavi wrote:
| https://www.researchgate.net/profile/Vivek-Kumar-Bagaria-2/p...
|
| This idea improves the scaling from linear in dimension to
| logarithmic, with very little loss, using ideas from multi-armed
| bandits.
| kristjansson wrote:
| The actual arXiv link for the curious:
| https://arxiv.org/abs/1805.08321
| nikhilgarg28 wrote:
| One of the authors of the post here. Happy to take any questions!
| linhvn wrote:
| Wondering how much details you have considered on using
| Johnson-Lindenstrauss transform for dimensionality reduction on
| ANN search?
|
| e.g. https://www.cs.princeton.edu/~chazelle/pubs/FJLT-
| sicomp09.pd...
| nikhilgarg28 wrote:
| We hadn't considered that at all, will definitely check it
| out and explore further. Thanks!
| tipsytoad wrote:
| I couldn't find anywhere in the article, how long does it take to
| add all 10,000 embeddings to the index? That seems quite
| important for many applications
| zX41ZdbW wrote:
| It should take less than one second on CPU.
|
| For example, see my comment above:
| https://news.ycombinator.com/item?id=36341754
| nikhilgarg28 wrote:
| We did some benchmarking and found that an index of 1M
| vectors using 15 trees can be built in ~115 seconds (8
| CPU, 8 GB RAM). https://github.com/fennel-ai/fann/pull/8.
|
| Given this, around a second to build an index of 10K items
| sounds about right.
| sfpotter wrote:
| The dot product isn't a distance
| [deleted]
| tipsytoad wrote:
| The dot product is equivalent to cosine similarity for normalized
| vectors.
| wongarsu wrote:
| Given normalized vectors, it's also proportional to the
| square of the Euclidian distance. Meaning for ranking (and
| thus k-nearest-neighbors) it doesn't matter if you use
| Euclidian distance or cosine similarity, you get the same
| result.
| sfpotter wrote:
| It isn't equivalent to it. See my other comment.
| tipsytoad wrote:
| Dude.
|
| Cos(x, y) = x . y / ||x|| * ||y||
|
| If x and y are normalized:
|
| Cos(x, y) = x . y
| mathisfun123 wrote:
| My favorite kind of pedantry is pedantry that's naive - on like
| day 1 of topology 101 you learn that the "dot" product is an
| inner product and _every_ inner product naturally induces a
| metric so _every_ inner product space is a metric space (with
| the induced metric).
| cochne wrote:
| It's not pedantry. Just because an inner product induced a
| metric space does it make it the same thing as that metric.
| The performance of similarity searches can depend very much
| on whether you use a dot product or a distance.
| sfpotter wrote:
| A distance is a function d(x, y) where d(x, y) = 0 if x = y,
| if x = 0, or if y = 0 (among other properties). If x = y and
| neither x = 0 nor y = 0, and if we let d(x, y) = x*y, then
| d(x, y) != 0 in general, so the dot product is certainly not
| a distance. Your comment isn't relevant, but thanks for
| calling me a pedant. ;-)
| mathisfun123 wrote:
| `is != induces`
| marginalia_nu wrote:
| It is for unit vectors.
| sfpotter wrote:
| The dot product of two unit vectors is the cosine of the
| angle between them. Since cos^-1 (u*v) is the angle between
| them, or the geodesic distance between u and v regarded as
| points on the sphere, which _is_ a distance. But the dot
| product is not.
|
| If u = v, then u*v = 1, but for two points not equal to zero,
| the distance between them should always be zero. Remember
| that u*v = 0 if u and v are perpendicular. If neither u nor v
| are 0, then for them to be perpendicular, they must certainly
| not be equal. The dot product does not behave like a
| distance.
| Solvency wrote:
| My math is weak. I thought dot product measures alignment.
| Help me grok what "distance" means here?
| mecsred wrote:
| Unit vectors have a normalized length, so the only
| difference between two unit vectors is the "heading". The
| angle between them can be interpreted as the the
| "difference" or distance, between them.
|
| Since you can compose any vector in the space out of unit
| vectors, you can extend the concept. See above comment
| inner product -> metric space.
| marginalia_nu wrote:
| Hmm, I think I got my math wrong too.
|
| What is a proper distance function is (1 - |a*b|). That's
| proportional to the distance between those points on the
| plane spanned along the circumference of the
| circle/sphere/hypersphere as it's projected onto that
| plane.
| carterschonwald wrote:
| This random hyper plane algorithm is actually a pretty elegant
| algorithm. It has the following amazing property: as your
| dimension increases, the quality of the random hyper planes
| actually gets better!
| irwt wrote:
| What exactly is elegant about this algorithm? The approach
| seems computationally quite inefficient to me.
| zX41ZdbW wrote:
| It is inefficient to build (unless on GPU), but good for
| search. Moreover, the search naturally benefits from data
| locality after ordering - this allows using MergeTree tables
| in ClickHouse. See
| https://news.ycombinator.com/item?id=36341754
|
| And it is around 10 lines of code in SQL:
| https://presentations.clickhouse.com/meetup74/ai/#35
| gavinray wrote:
| That presentation was pretty cool, thanks for posting.
| irwt wrote:
| That's the thing though, it's very easy to perform
| approximate nearest neighbor search in an efficient way.
| There are plenty of simple solutions for that. The real
| challenge is to design an algorithm that can scale & that
| remains robust even when there are additions and deletions
| to the set.
| yujian wrote:
| Nice, this is a cool version of ANN search. I like that at the
| end there is commentary on what's needed for production as well -
| things like parallelization, RAM considerations, and the
| consideration for balancing the trees. It's really the production
| level considerations that would steer you toward a vector
| database like Milvus/Zilliz
| phillipcarter wrote:
| Our top_k search in our own product is in Go and it's ~25 lines
| of code. Super simple stuff, does the job of funding the top_k in
| 1ms, no index, no framework, nothing. We benefit from the corpus
| we embed being relatively static and we can just store them in
| redis, so it's not a hard problem to solve. Anyways, often you
| don't need some turnkey solution to this stuff.
| zainhoda wrote:
| Same -- I just asked ChatGPT to generate Go code to do cosine
| similarity, and I was surprised that the code it generated
| worked; the _worst_ performance I get is like 10ms.
| phillipcarter wrote:
| Yep. If you parallelize with goroutines it'll be faster, but
| we're talking about shaving a few ms. The effectiveness of a
| for loop crunching floats in a single thread is not to be
| underestimated.
| huac wrote:
| I have gotten 10x speedups with SIMD on modern hardware -
| goes from fast to blazing. https://github.com/stillmatic/go
| llum/blob/main/vectorstore.g...
|
| FWIW - Weaviate's pure Go implementation of cosine
| similarity is probably as fast as you can possibly get in
| pure go, without SIMD -- https://github.com/weaviate/weavia
| te/blob/2f44dbee52196656bd...
| phillipcarter wrote:
| Yep, we'd go with what they did, except OpenAI embedding
| vectors are already normalized, so you can just do the
| dot product.
| nikhilgarg28 wrote:
| What is the size and dimensionality of your corpus?
| phillipcarter wrote:
| Size can be up to ~140MB of vectors, so not exactly enormous.
| But it's nowhere near enough to stress a single node running
| a for loop on the data and doing a dot product, then sorting
| by relevancy.
| Solvency wrote:
| "Build a hyperplane (analog of a "line" in higher dimensions)
| that passes through C and is perpendicular to the line segment
| connecting A and B."
|
| Isn't that just a fancy way of saying a line from O (a tuple of
| zeroes) to C to infinity?
| medium_spicy wrote:
| No, this is only true if that hyperplane contains the origin;
| imagine an infinite number of hyperplanes that contain A and B;
| there are an infinite number of such planes for dimensions
| higher than 2. Now imagine the same, but connecting O and C;
| most of those AB hyperplanes are not orthogonal to those OC
| hyperplanes, it's only coincidence if they are, though for
| dimensions higher than 2 you can always find a point C that
| happens to lie along the orthogonal line from the AB plane to
| the origin.
| dgacmu wrote:
| No, it's more restricted than that. Think of 2d planes in 3
| space: There are an infinite number of planes that you can
| construct on a line from A to C (they're all rotations of a
| plane extending out from that line).
| Solvency wrote:
| Okay I'm trying to map this visually in my brain. So, in 3
| space, say I extrude a random plane out from a line segment.
| I keep going, now I imagine I've extruded every possible
| plane approaching infinity. This eventually forms... a
| cylinder, no? So a hyperplane is what, a cylinder? Or is it
| shorthand for "every possible plane..."
| wongarsu wrote:
| In an n-dimensional space, an infinite n-1-dimensional
| thing is a hyperplane. Or put more simply: it's something
| that divides the space in two. In 1 dimension that's a
| point, in 2 dimensions it's a line, in 3 dimensions it's a
| plane (hence the name), for more dimensions we just call it
| a hyperplane.
|
| Now there are various ways to define which hyperplane you
| speak about. The easiest to reason about is to give a
| couple points that are on the hyperplane (1 point defines a
| point, a line can be defined by any two points on that
| line, a plane by three points on that plane etc.). A
| slightly weirder but equally useful way is to say "it's
| orthogonal to this line, and intersects this point". That's
| easy to reason about in three dimensions, but apparently
| works in higher dimensions as well.
| DougBTX wrote:
| No, rotate by 90 degrees. The plane is normal to the line,
| so the line doesn't lie in the plane.
| [deleted]
| maccaw wrote:
| This is great! See also Voy, A WASM vector similarity search
| written in Rust:
|
| https://github.com/tantaraio/voy
| ashvardanian wrote:
| Always pleasure to see short implementations, but I'd still take
| HNSW over such approach. Its core implementation can also be
| quite compact. In our case with USearch [1] its about 2k LOC,
| even with a custom implementation of priority queues and custom
| numeric types like uint40_t for indexes beyond 4B points.
|
| Obviously, together with bindings for Rust, Python, JavaScript,
| Java, Objective-C, Swift, and Wolfram it gets slightly larger,
| but on the bright side, you can pass JIT-compiled functions and
| store/search elements of different dimensionality, which allows
| much broader applicability [2].
|
| [1]: https://github.com/unum-cloud/usearch [2]:
| https://ashvardanian.com/posts/abusing-vector-search
| moab wrote:
| > I'd still take HNSW over such approach
|
| Why? Sorry, but your comment is not very insightful without
| more explanation, and the rest of your comment is just an
| advertisement for your (admittedly impressive looking) work.
| ashvardanian wrote:
| Because (A) it allows similarity search between auxiliary
| objects, not just equidimensional vectors, and (B) can be
| implemented in a similarly short fashion, while providing
| great performance.
| huac wrote:
| I'd be curious how this stacks up on ANN-benchmarks (https://ann-
| benchmarks.com/).
|
| FWIW exhaustive search is still probably good enough for most use
| cases. IMO the exhaustive search should use a heap
| https://github.com/fennel-ai/fann/blob/main/src/main.rs#L12 as
| you're only looking for top-k, it reduces time and memory
| complexity greatly. On a relatively unoptimized Golang
| implementation (though much beefier hardware) I get ~100ms to
| process 1M vectors of 300 dim. Still quite a bit slower than
| approximate, of course, but in absolute terms probably good
| enough for most use cases, especially because many use cases
| don't have 1M vectors.
| zX41ZdbW wrote:
| I've recently presented the approach with exhaustive search and
| heap, then iteratively adding simple indices, using ClickHouse.
| It works pretty well despite the overall usage simplicity.
|
| Video: https://www.youtube.com/watch?v=hGRNcftpqAk
|
| Presentation: https://presentations.clickhouse.com/meetup74/ai/
|
| PS. It should work at the same speed as "bruteforce-blas" in
| the ANN benchmarks.
| huac wrote:
| nice, if I'm reading this right, you do random projection as
| a first stage filter, then exact search over the filtered
| subset? that's clever, did you test precision/recall at all?
| Lichtso wrote:
| Isn't this kd-trees with arbitrarily oriented (not axis-aligned)
| hyperplanes?
| a_e_k wrote:
| It's a binary space partition (BSP) [0]. A kd-tree is a special
| case of a BSP with axis-aligned hyperplanes.
|
| [0] https://en.wikipedia.org/wiki/Binary_space_partitioning
| esafak wrote:
| The hyperplanes are randomized to ensure the hyperplanes are
| probabilistically independent and to deal with high
| dimensionality, so the data are less likely to be clustered in
| the same bucket.
|
| https://en.wikipedia.org/wiki/Random_projection
| cormacrelf wrote:
| The code for the Hyperplane type is unfortunately omitted. It's
| the most important code in the whole thing.
| un_ess wrote:
| isn't it in https://github.com/fennel-
| ai/fann/blob/main/src/search.rs#L1... ?
| cormacrelf wrote:
| Yes, but it should be in the article if everything else is.
| nikhilgarg28 wrote:
| Apologies, that got omitted by mistake. I've now included
| the post to include that piece of code as well.
| [deleted]
| esafak wrote:
| How did they manage to write this article without once mentioning
| the pioneering LSH algorithm from 1999?
|
| https://en.wikipedia.org/wiki/Locality-sensitive_hashing
| nighthawk454 wrote:
| > The approach we use here is based on a family of algorithms
| called "Locality Sensitive Hashing" used in the popular library
| annoy. This article's goal isn't to introduce a new fancy
| algorithm/library but to describe how vector search works using
| real code snippets.
|
| Right at the top...
| esafak wrote:
| The article was edited in response to this thread:
| https://news.ycombinator.com/item?id=36344870
| nighthawk454 wrote:
| My bad! That's some fast updating
| nikhilgarg28 wrote:
| Yes, it wasn't initially there mostly due to an oversight
| but we later edited the article.
| nighthawk454 wrote:
| Oh ok, cool you're checking feedback and updating. Might
| put in an 'updated at' or something :P. But thanks,
| enjoyed the post!
| fhaltmayer wrote:
| How is this different than ANNOY?
| esafak wrote:
| It is cited, but buried half-way into the article. There is a
| link in it that leads to
| https://erikbern.com/2015/10/01/nearest-neighbors-and-vector...
| nikhilgarg28 wrote:
| The article isn't meant to present a new library competing with
| Annoy or other state of the art approaches, but instead,
| describe how vector search works using short code snippets.
|
| I've now updated the article to clearly mention this and also
| cite annoy.
___________________________________________________________________
(page generated 2023-06-15 23:01 UTC)