[HN Gopher] Approximate Nearest Neighbor Oh Yeah (Annoy)
___________________________________________________________________
Approximate Nearest Neighbor Oh Yeah (Annoy)
Author : fzliu
Score : 96 points
Date : 2023-10-30 17:06 UTC (5 hours ago)
(HTM) web link (zilliz.com)
(TXT) w3m dump (zilliz.com)
| PaulHoule wrote:
| Funny, I got interested in nearest-neighbor search for this kind
| of vector in the early 2000s, and there were a lot of papers on
| the subject that were depressing (they made tradeoffs that
| weren't really attractive) compared to the literature on low-
| dimensional indexes. Back then I had 250,000 or so vectors and a
| full scan wasn't that bad.
|
| In the 2010s I worked on a dense vector search engine for patents
| and we had maybe 10 million vectors and still used a scan; I
| mean, you couldn't ask for a more memory-hierarchy- and SIMD-
| friendly algorithm.
|
| Today I am looking at 1M (larger) vectors and the full scan is
| still possible, but I am using FAISS because it is a bird in the
| hand and I decided I can live with the tradeoff.
| kristjansson wrote:
| I mean FAISS has IndexFlatL2 if you have it in hand and _want_
| the full scan.
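A minimal sketch of that full-scan approach with IndexFlatL2, assuming
faiss-cpu is installed and using random 768-dimensional float32 vectors
as stand-ins for real data:

    import numpy as np
    import faiss

    d = 768                                            # vector dimensionality (assumed)
    xb = np.random.rand(100_000, d).astype("float32")  # database vectors (stand-ins)
    xq = np.random.rand(5, d).astype("float32")        # query vectors (stand-ins)

    index = faiss.IndexFlatL2(d)    # exact, brute-force L2 index: no training, no approximation
    index.add(xb)                   # just stores the vectors
    D, I = index.search(xq, 10)     # distances and row indices of the 10 nearest neighbors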
| smokracek wrote:
| Does anyone have any recommendations for a decent crash course on
| using vector DBs in conjunction with LLMs? I wanna do some
| experimentation with getting a model to comment on the similarity
| of data vectors etc. and I don't really know where to start.
| minimaxir wrote:
| If you want to experiment with vector stores, you can do that
| locally with something like faiss, which has good multiplatform
| support and sufficient tutorials:
| https://github.com/facebookresearch/faiss
|
| Doing full retrieval-augmented generation (RAG) and getting
| LLMs to interpret the results has more steps, but you get a lot
| of flexibility, and despite what AI influencers say there's no
| standard best practice. When you query a vector DB you get the
| most similar texts back (or index integers in the case of
| faiss); you then feed those results to an LLM as part of a
| normal prompt, which can be optimized with prompt engineering.
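One hedged sketch of that retrieve-then-prompt loop, using faiss for
retrieval; the texts, the dimensionality, and the embed() function are
placeholders, and the assembled prompt would go to whatever LLM client
you use:

    import numpy as np
    import faiss

    texts = ["first source document...", "second source document...",
             "third source document..."]
    d = 384                                      # embedding dimensionality (assumed)

    def embed(items):
        # placeholder embedder; swap in a real embedding model here
        return np.random.rand(len(items), d).astype("float32")

    index = faiss.IndexFlatL2(d)
    index.add(embed(texts))

    def build_prompt(question, k=2):
        _, ids = index.search(embed([question]), k)      # faiss returns row indices
        context = "\n\n".join(texts[i] for i in ids[0])  # map indices back to the texts
        return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

    print(build_prompt("What does the second document say?"))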
|
| The codifier for the RAG workflow is LangChain, but their demo
| is substantially more complex and harder to use than even a
| homegrown implementation:
| https://minimaxir.com/2023/07/langchain-problem/
| marcyb5st wrote:
| Also, if what you look up has no semantic meaning, like part
| numbers, you might be better off with an inverted index in
| addition to ANN lookups, especially if the embedding model
| was trained on a dataset that is not similar to what you use
| it for. That's a common situation right now with embedding
| models based on LLMs.
| pokpokpok wrote:
| I recommend pgvector; it's simple and well featured, with good
| example code. Once you have a dataset of vectors loaded in, the
| next step is called RAG / retrieval-augmented generation.
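For illustration, a rough pgvector query from Python, assuming psycopg2
and a hypothetical items table with an embedding vector column already
populated:

    import psycopg2

    conn = psycopg2.connect("dbname=example")   # hypothetical connection string
    cur = conn.cursor()
    # <-> is pgvector's L2 distance operator; the query embedding is passed
    # as a string literal and cast to the vector type
    cur.execute(
        "SELECT id, content FROM items ORDER BY embedding <-> %s::vector LIMIT 5",
        ("[0.1, 0.2, 0.3]",),
    )
    rows = cur.fetchall()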
| malaya_zemlya wrote:
| deeplearning.ai has a short course on the topic
| https://www.deeplearning.ai/short-courses/large-language-mod...
| alumic wrote:
| You might also check out this previous thread on the subject.
| It offers some pretty fascinating discussions:
|
| https://news.ycombinator.com/item?id=35826929
| bobvanluijt wrote:
| (I'm affiliated with Weaviate) You might want to check out this
| getting started guide. It takes a couple of minutes, and you're
| good to go: https://weaviate.io/developers/weaviate/quickstart
| dmezzetti wrote:
| If you want an easy way to evaluate Faiss, Hnswlib and Annoy
| vector backends, check out txtai -
| https://github.com/neuml/txtai. txtai also supports NumPy and
| PyTorch vector storage.
|
| Disclaimer: I am the author of txtai
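A rough usage sketch; the model name and the exact Embeddings API calls
are assumptions here, so treat this as illustrative rather than a copy
of the library's documented interface:

    from txtai.embeddings import Embeddings

    data = ["first text", "second text", "third text"]

    # assumed API: index (id, text, tags) tuples, then search by natural-language query
    embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})
    embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
    print(embeddings.search("second", 1))   # [(id, score)] of the best match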
| daturkel wrote:
| Annoy came out of Spotify, and they just announced their
| successor library Voyager [1] last week [2].
|
| [1]: https://github.com/spotify/voyager [2]:
| https://engineering.atspotify.com/2023/10/introducing-voyage...
| danbruc wrote:
| That sounds pretty much like binary space partitioning [1], which
| is probably best known for being used in Doom to accelerate the
| rendering. Also, if you only search the leaf subspace that your
| query point lies in, then you can in principle be arbitrarily far
| off from the true nearest neighbor.
|
| [1] https://en.wikipedia.org/wiki/Binary_space_partitioning
| naillo wrote:
| This is so overcomplicated. You don't need an actual tree; that's
| just a mental model. All you need to do is generate N random
| hyperplanes (each defined via its normal, i.e. generate N
| normalized vectors in the dimension of your choosing). Then you
| generate a hash by going normal by normal and checking whether
| your query vector `q` is on the left or right side of each
| hyperplane via the sign of the dot product, i.e. `hash =
| [sign(dot(q, n1)), sign(dot(q, n2)), ...]`.
|
| So given a query you compute that hash and fetch all the vectors
| with the same hash (using your favorite hash-to-array data
| structure, i.e. a dict), and you can further refine the search
| within that bucket. It's like 10 rows of code.
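A rough numpy sketch of that hashing scheme (dimensions, plane count,
and data are arbitrary stand-ins):

    import numpy as np
    from collections import defaultdict

    d, n_planes = 128, 16
    rng = np.random.default_rng(0)
    normals = rng.standard_normal((n_planes, d))   # N random hyperplane normals

    def hash_vec(v):
        # one bit per hyperplane: which side of the plane the vector falls on
        return tuple((normals @ v > 0).astype(int))

    vectors = rng.standard_normal((10_000, d))
    buckets = defaultdict(list)                    # hash -> ids of vectors in that bucket
    for i, v in enumerate(vectors):
        buckets[hash_vec(v)].append(i)

    q = rng.standard_normal(d)
    candidates = buckets[hash_vec(q)]              # refine with exact distances within this bucket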
| rozim wrote:
| Sounds like you're describing LSH (locality sensitive hashing).
| bigbillheck wrote:
| > All you need to do
|
| It seems to me that your approach is making some assumptions
| about how the data is distributed that aren't necessarily
| guaranteed to hold in practice, and a tree has a chance of
| being robust to at least some of those. For example, whatever
| random hyperplane you choose has to be informed by your data.
| Suppose you make a particularly bad choice for one of those
| hyperplanes, with your approach you'll be stuck with that for
| all time and have a bit in your hash that's biased to heck, but
| in the tree approach you only suffer when you hash your way
| down to the lousy node, and all other paths will be unaffected.
| erikbern wrote:
| Annoy author here. What you're describing is Locality Sensitive
| Hashing (LSH) which is something I spent a lot of time trying
| back in 2009-2012 but never got it working. It has some elegant
| theoretical properties, but empirically it always has terrible
| performance. The reason I don't think it works well is that
| data often lies near a lower-dimensionality manifold that may
| have some sort of shape. LSH would "waste" most splits (because
| it doesn't understand the data distribution) but using trees
| (and finding splits such that the point set is partitioned
| well) ends up "discovering" the data distribution better.
|
| (but HNSW is generally much better than this tree partitioning
| scheme)
| kleene_op wrote:
| It really isn't. That's like saying bisection is overly
| complicated.
___________________________________________________________________
(page generated 2023-10-30 23:00 UTC)