[HN Gopher] Approximate Nearest Neighbor Oh Yeah (Annoy)
       ___________________________________________________________________
        
       Approximate Nearest Neighbor Oh Yeah (Annoy)
        
       Author : fzliu
       Score  : 96 points
       Date   : 2023-10-30 17:06 UTC (5 hours ago)
        
 (HTM) web link (zilliz.com)
 (TXT) w3m dump (zilliz.com)
        
       | PaulHoule wrote:
        | Funny, I got interested in nearest-neighbor search for this
        | kind of vector in the early 2000s, and there were a lot of
        | papers on the subject that were depressing (they all required
        | a tradeoff that wasn't really attractive) compared to the
        | literature on low-dimensional indexes. Back then I had 250,000
        | or so vectors and a full scan wasn't that bad.
       | 
        | In the 2010s I worked on a dense vector search engine for
        | patents where we had maybe 10 million vectors and still used a
        | scan. I mean, you couldn't ask for a more memory-hierarchy-
        | and SIMD-friendly algorithm.
       | 
        | Today I am looking at 1M (larger) vectors and the full scan is
        | still possible, but I am using FAISS because it is a bird in
        | the hand and I decided I can live with the tradeoff.
        
         | kristjansson wrote:
         | I mean FAISS has IndexFlatL2 if you have it in hand and _want_
         | the full scan.
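          | 
          | e.g. a minimal exact-search setup (random data here purely
          | for illustration):
          | 
          |     import faiss
          |     import numpy as np
          | 
          |     d = 128
          |     xb = np.random.rand(10000, d).astype("float32")  # base
          |     xq = np.random.rand(5, d).astype("float32")     # queries
          | 
          |     index = faiss.IndexFlatL2(d)  # exact L2, no training
          |     index.add(xb)
          |     D, I = index.search(xq, 10)   # top-10 dists + indices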
        
       | smokracek wrote:
       | Does anyone have any recommendations for a decent crash course on
        | using vector DBs in conjunction with LLMs? I want to do some
       | experimentation with getting a model to comment on the similarity
       | of data vectors etc. and I don't really know where to start.
        
         | minimaxir wrote:
         | If you want to experiment with vector stores, you can do that
          | locally with something like faiss, which has good multiplatform
         | support and sufficient tutorials:
         | https://github.com/facebookresearch/faiss
         | 
          | Doing full retrieval-augmented generation (RAG) and getting
          | LLMs to interpret the results has more steps, but you get a
          | lot of flexibility, and despite what AI influencers say,
          | there's no standard best practice. When you query a vector
          | DB, you get the most similar texts back (or index integers
          | in the case of faiss); you then feed those results to an LLM
          | like a normal prompt, which can be optimized with prompt
          | engineering.
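          | 
          | A minimal sketch of that flow (assuming faiss plus
          | sentence-transformers for embeddings; `llm(prompt)` below is
          | a stand-in for whatever LLM client you use):
          | 
          |     import faiss
          |     from sentence_transformers import SentenceTransformer
          | 
          |     model = SentenceTransformer("all-MiniLM-L6-v2")
          |     docs = ["first doc ...", "second doc ...", "third ..."]
          |     index = faiss.IndexFlatL2(384)  # MiniLM is 384-dim
          |     index.add(model.encode(docs))
          | 
          |     question = "your question here"
          |     _, idxs = index.search(model.encode([question]), 2)
          |     context = "\n".join(docs[i] for i in idxs[0])
          |     prompt = f"Context:\n{context}\n\nAnswer: {question}"
          |     # response = llm(prompt)  # any LLM client works here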
         | 
          | The codifier of the RAG workflow is LangChain, but their
          | demo is substantially more complex and harder to use than
          | even a homegrown implementation:
         | https://minimaxir.com/2023/07/langchain-problem/
        
           | marcyb5st wrote:
            | Also, if what you look up has no semantic meaning, like
            | part numbers, you might be better off with an inverted
            | index in addition to ANN lookups. Especially if the
            | embedding model
           | has been trained on a dataset that is not similar to what you
           | use it for. That's a common situation right now with
           | embedding models based on LLMs.
        
         | pokpokpok wrote:
          | I recommend pgvector; it's simple and well featured, with
          | good example code. Once you have a dataset of vectors loaded
          | in, the next step is RAG (retrieval-augmented generation).
        
         | malaya_zemlya wrote:
          | deeplearning.ai has a short course on the topic:
         | https://www.deeplearning.ai/short-courses/large-language-mod...
        
         | alumic wrote:
         | You might also check out this previous thread on the subject.
         | It offers some pretty fascinating discussions:
         | 
         | https://news.ycombinator.com/item?id=35826929
        
         | bobvanluijt wrote:
         | (I'm affiliated with Weaviate) You might want to check out this
          | getting started guide. It takes a couple of minutes, and
          | you're good to go:
          | https://weaviate.io/developers/weaviate/quickstart
        
       | dmezzetti wrote:
       | If you want an easy way to evaluate Faiss, Hnswlib and Annoy
       | vector backends, check out txtai -
       | https://github.com/neuml/txtai. txtai also supports NumPy and
       | PyTorch vector storage.
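        | 
        | A rough sketch of comparing backends (config keys here assume
        | txtai's documented options; `texts` is your corpus):
        | 
        |     from txtai.embeddings import Embeddings
        | 
        |     texts = ["first document", "second document"]
        |     for backend in ("faiss", "hnsw", "annoy"):
        |         emb = Embeddings({"path": "sentence-transformers/"
        |                                   "all-MiniLM-L6-v2",
        |                           "backend": backend})
        |         emb.index((i, t, None) for i, t in enumerate(texts))
        |         print(backend, emb.search("first", 1))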
       | 
       | Disclaimer: I am the author of txtai
        
       | daturkel wrote:
       | Annoy came out of Spotify, and they just announced their
       | successor library Voyager [1] last week [2].
       | 
       | [1]: https://github.com/spotify/voyager [2]:
       | https://engineering.atspotify.com/2023/10/introducing-voyage...
        
       | danbruc wrote:
        | That sounds pretty much like binary space partitioning [1],
        | which is probably best known for being used in Doom to
        | accelerate the rendering. Also, if you only search the leaf
        | subspace that your query point lies in, then you can in
        | principle be arbitrarily far off from the true nearest
        | neighbor.
       | 
       | [1] https://en.wikipedia.org/wiki/Binary_space_partitioning
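        | 
        | A tiny 2-D illustration of that failure mode (the split plane
        | here is x = 0; pure NumPy, just for demonstration):
        | 
        |     import numpy as np
        | 
        |     points = np.array([[-5.0, 0.0],   # left leaf, far away
        |                        [0.01, 0.0]])  # right leaf, true NN
        |     q = np.array([-0.01, 0.0])        # falls in the left leaf
        | 
        |     leaf = points[points[:, 0] < 0]   # leaf-only candidates
        |     d = np.linalg.norm(leaf - q, axis=1)
        |     print(leaf[d.argmin()])  # [-5. 0.], yet [0.01 0.] is
        |                              # ~250x closer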
        
       | naillo wrote:
        | This is so overcomplicated. You don't need an actual tree;
        | that's just a mental model. All you need to do is generate N
        | random hyperplanes (each defined via its normal, i.e. generate
        | N normalized vectors in the dimension of your choosing). Then
        | you generate a hash by going normal by normal and checking
        | whether your query vector `q` is on the left or right side of
        | each hyperplane via the sign of the dot product, i.e. `hash =
        | [sign(dot(q, n1)), sign(dot(q, n2)), ...]`.
       | 
        | So given a query you compute that hash, look up all the
        | vectors with the same hash (using your favorite hash-to-array
        | data structure, i.e. a dict), and refine the search further
        | within that bucket. It's like 10 lines of code.
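        | 
        | A sketch of that in NumPy (`data` is your (n, d) vector set,
        | `q` the query, `d` the dimension; all placeholders):
        | 
        |     import numpy as np
        |     from collections import defaultdict
        | 
        |     rng = np.random.default_rng(0)
        |     normals = rng.normal(size=(16, d))  # 16 random hyperplanes
        | 
        |     def lsh_hash(v):
        |         return tuple(normals @ v >= 0)  # one sign bit per plane
        | 
        |     buckets = defaultdict(list)
        |     for i, v in enumerate(data):
        |         buckets[lsh_hash(v)].append(i)
        | 
        |     candidates = buckets[lsh_hash(q)]   # refine within bucket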
        
         | rozim wrote:
         | Sounds like you're describing LSH (locality sensitive hashing).
        
         | bigbillheck wrote:
         | > All you need to do
         | 
         | It seems to me that your approach is making some assumptions
         | about how the data is distributed that aren't necessarily
         | guaranteed to hold in practice, and a tree has a chance of
         | being robust to at least some of those. For example, whatever
         | random hyperplane you choose has to be informed by your data.
          | Suppose you make a particularly bad choice for one of those
          | hyperplanes: with your approach you'll be stuck with that for
         | all time and have a bit in your hash that's biased to heck, but
         | in the tree approach you only suffer when you hash your way
         | down to the lousy node, and all other paths will be unaffected.
        
         | erikbern wrote:
          | Annoy author here. What you're describing is Locality
          | Sensitive Hashing (LSH), which I spent a lot of time trying
          | back in 2009-2012 but never got working. It has some elegant
         | theoretical properties, but empirically it always has terrible
         | performance. The reason I don't think it works well is that
         | data often lies near a lower-dimensionality manifold that may
         | have some sort of shape. LSH would "waste" most splits (because
         | it doesn't understand the data distribution) but using trees
         | (and finding splits such that the point set is partitioned
         | well) ends up "discovering" the data distribution better.
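          | 
          | (For contrast, a simplified sketch of such a data-adaptive
          | split: per the Annoy docs, each split hyperplane is chosen
          | equidistant from two points sampled from the data itself.)
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          | 
          |     def split(points):
          |         # sampling from the data is what lets the tree
          |         # "discover" the manifold the points lie on
          |         i, j = rng.choice(len(points), 2, replace=False)
          |         a, b = points[i], points[j]
          |         normal = a - b
          |         offset = normal @ (a + b) / 2  # midpoint plane
          |         mask = points @ normal > offset
          |         return points[mask], points[~mask]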
         | 
          | (but HNSW is generally much better than this tree
          | partitioning scheme)
        
         | kleene_op wrote:
         | It really isn't. That's like saying bisection is overly
         | complicated.
        
       ___________________________________________________________________
       (page generated 2023-10-30 23:00 UTC)