[HN Gopher] Solr's Dense Vector Search for indexing and searching dense numerical vectors
___________________________________________________________________
Solr's Dense Vector Search for indexing and searching dense
numerical vectors
Author : kordlessagain
Score : 84 points
Date : 2022-09-05 15:24 UTC (7 hours ago)
(HTM) web link (solr.apache.org)
(TXT) w3m dump (solr.apache.org)
| lovelearning wrote:
| A much-awaited enhancement. Saves the trouble of having to deploy
| a separate vector DB like Milvus.
|
| I don't like the query syntax, though. Maybe a more developer-
| friendly indexing+query flow is possible: vectorize fields and
| queries transparently using a lib like DL4J running in the same
| JVM. That could further simplify both app development and
| deployment.
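|
| (For reference, a minimal sketch of that syntax from Python -
| the core name, field name, and vector here are illustrative,
| assuming a Solr 9 schema with a DenseVectorField:)
|
|     import requests
|
|     # Assumes a core "mycore" whose schema defines a DenseVectorField
|     # named "vector"; the vector length must match vectorDimension.
|     qvec = [0.12, 0.34, 0.56, 0.78]  # toy 4-d vector for brevity
|     params = {
|         "q": "{!knn f=vector topK=10}" + str(qvec),
|         "fl": "id,score",
|     }
|     resp = requests.get("http://localhost:8983/solr/mycore/select",
|                         params=params)
|     print(resp.json()["response"]["docs"])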
| lmeyerov wrote:
| Can this do something like a 100M+ index on a single node?
|
| It seems like all the VC-funded OSS options are targeting more
| like 1M rows per server, which doesn't really make sense for most
| of our use cases.
| QuadmasterXLII wrote:
| Question: I have ~10,000 128 element query vectors, and want to
| find the nearest neighbor (cosine similarity) for each of them in
| a dataset of ~1,000,000 target vectors. I can do this using brute
| force search on a GPU in a few minutes, which is fast but still a
| serious bottleneck for me. Is this an appropriate size of dataset
| and task for acceleration with some sort of vector database or
| algorithm more intelligent than brute force search?
| ianbutler wrote:
| Use an approximate method like FAISS and then do cosine
| similarity on the results of that.
|
| The short answer is that most of these databases use some type
| of precomputation to make approximate nearest neighbor search
| faster. HNSW[0], FAISS[1], ScaNN[2], etc. are all methods of
| doing approximate nearest neighbors, but they use different
| techniques to speed up that approximation. For your use case it
| will likely result in a speedup.
|
| [0] https://www.pinecone.io/learn/hnsw/
| [1] https://engineering.fb.com/2017/03/29/data-infrastructure/fa...
| [2] https://ai.googleblog.com/2020/07/announcing-scann-efficient...
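|
| As a rough illustration, a minimal FAISS sketch for this shape
| of problem (the index parameters here are illustrative, not
| tuned):
|
|     import faiss
|     import numpy as np
|
|     d = 128
|     targets = np.random.rand(1_000_000, d).astype("float32")
|     queries = np.random.rand(10_000, d).astype("float32")
|
|     # Normalize in place so inner product == cosine similarity
|     faiss.normalize_L2(targets)
|     faiss.normalize_L2(queries)
|
|     # HNSW graph over inner product; 32 is the graph connectivity (M)
|     index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
|     index.add(targets)  # building on 1M vectors takes a while
|
|     scores, ids = index.search(queries, 1)  # approximate top-1 per query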
| ParanoidShroom wrote:
| What about Annoy? https://github.com/spotify/annoy I used this
| in the past. It will probably have its limitations, but it
| worked great for me.
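|
| For reference, a tiny sketch of the Annoy flow (the tree count
| is illustrative; more trees give better recall at build-time
| cost):
|
|     import random
|     from annoy import AnnoyIndex
|
|     f = 128
|     index = AnnoyIndex(f, "angular")  # angular ~ cosine on unit vectors
|     for i in range(1000):
|         index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
|     index.build(10)  # 10 trees
|     ids = index.get_nns_by_vector([random.gauss(0, 1) for _ in range(f)], 5)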
| cschmidt wrote:
| You can fit 1M vectors in the free tier of www.pinecone.io if
| you want to experiment. I'm not sure how fast having that many
| query vectors would be. (I'm a happy Pinecone customer, but
| only use a single query vector.)
| cschmidt wrote:
| Huh, at one point you could have multiple queries, but it
| looks like that is deprecated now.
|
| https://www.pinecone.io/docs/api/operation/query/
|
| So maybe it wouldn't work for your use case.
| tarr11 wrote:
| Solr uses Lucene's approximate nearest neighbor (ANN)
| implementation.
|
| This site has some nice information on how ANN performs for
| vector search.
|
| http://ann-benchmarks.com/
| generall wrote:
| There is a more relevant benchmark comparing vector search
| engines end-to-end, not just the algorithms:
| https://qdrant.tech/benchmarks/
| QuadmasterXLII wrote:
| Thanks!
| cschmidt wrote:
| I hesitate to mention this, because you probably know it and
| are doing it this way. But another poster mentioned "and then
| do cosine similarity". In this case, you'll want to preprocess
| and normalize each row of both matrices to have unit norm. Then
| cosine similarity is simply a matrix multiply between the two
| matrices (one transposed), plus a pass over the results to find
| the top-k per query using a max-queue (priority queue) data
| structure.
| kordlessagain wrote:
| Could this matrix be compressed to binary form for storage in
| a binary index?
| [deleted]
| cschmidt wrote:
| That wouldn't really help. Let me explain in a bit more
| detail. The result depends on the query matrix, which will
| be different for each set of queries. We have a query
| matrix Q of dimension 10,000x128, and another vector
| matrix A that is 1,000,000x128. We preprocess both Q
| and A so each row has unit norm:
|
|     Q[i,:] /= norm(Q[i,:])
|     A[k,:] /= norm(A[k,:])
|
| With that preprocessing, the cosine similarity of a given
| row i of Q and row k of A is:
|
|     cossim(i,k) = dot(Q[i,:], A[k,:])
|
| If you multiply Q x A.T, i.e. (10,000 x 128) x (128 x 1M),
| you get a result matrix (10,000 x 1M) with all the cosine
| similarity values for each combination of query and vector.
|
| If you make a pass across each row of that matrix with a
| priority queue of size n, you can find the top-n cosine
| similarity values in O(1,000,000 x log n) time per query.
|
| Now you could store the resulting matrix, but Q is going to
| change for each call, and we really only care about the
| top-n values for each query, so storing it wouldn't really
| accomplish anything.
|
| Edited: fixed lots of typos
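|
| A minimal numpy sketch of the above (chunking Q because the
| full 10,000 x 1M score matrix would be ~40 GB in float32;
| argpartition stands in for the priority-queue pass):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     Q = rng.standard_normal((10_000, 128)).astype(np.float32)
|     A = rng.standard_normal((1_000_000, 128)).astype(np.float32)
|
|     # Normalize rows so a dot product is a cosine similarity
|     Q /= np.linalg.norm(Q, axis=1, keepdims=True)
|     A /= np.linalg.norm(A, axis=1, keepdims=True)
|
|     n = 10  # top-n per query
|     top = np.empty((len(Q), n), dtype=np.int64)
|     for s in range(0, len(Q), 256):    # ~1 GB of scores per chunk
|         sims = Q[s:s + 256] @ A.T      # (chunk, 1M) cosine scores
|         part = np.argpartition(-sims, n, axis=1)[:, :n]  # unsorted top-n
|         order = np.argsort(-np.take_along_axis(sims, part, axis=1), axis=1)
|         top[s:s + 256] = np.take_along_axis(part, order, axis=1)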
| kordlessagain wrote:
| I asked GPT-3 about it, using an array of vectors of
| fragments of this page, weighted by relevance to the query
| (using np.dot(v1, v2)). This is used to build the prompt
| for submission to the OpenAI APIs. I'm interested in
| storing these vectors in a very fast DB for memories.
|
| pastel-mature-herring~> Could this matrix be compressed
| to binary form for storage in a binary index?
|
| angelic-quokka|> It is possible to compress the matrix to
| binary form for storage in a binary index, but this would
| likely decrease the accuracy of the cosine similarity
| values.
| fzliu wrote:
| There's no one answer to this, but I'd say that anything past
| 10k vectors would benefit greatly from a vector database. A
| vector DB will abstract away the building of a vector index
| along with other core database features such as caching,
| failover, replication, horizontal scaling, etc. Milvus
| (https://milvus.io) is open-source and always my go-to choice
| for this (disclaimer: I'm a part of the Milvus community). An
| added bonus of Milvus is that it supports GPU-accelerated
| indexing and search in addition to batched queries and inserts.
|
| All of this assumes you're okay with a bit of imprecision -
| vector search with modern indexes is inherently probabilistic:
| your recall may not be 100%, but it will be close. Using a
| flat indexing strategy is still an option, but you lose a lot
| of the speedup that comes with a vector database.
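|
| One way to put a number on that imprecision is recall@k against
| exact brute-force results - a sketch, with toy ids standing in
| for real query results:
|
|     import numpy as np
|
|     def recall_at_k(exact_ids, ann_ids):
|         # Fraction of true top-k neighbors the ANN index also returned
|         hits = sum(len(set(e) & set(a))
|                    for e, a in zip(exact_ids, ann_ids))
|         return hits / exact_ids.size
|
|     exact = np.array([[1, 2, 3], [4, 5, 6]])   # ground truth
|     approx = np.array([[1, 2, 9], [4, 5, 6]])  # ANN results
|     print(recall_at_k(exact, approx))          # 5 of 6 -> 0.8333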
| thirdtrigger wrote:
| Agreed with fzliu, you can also use https://weaviate.io
| (disclaimer, I'm affiliated with Weaviate). You might also like
| this article which describes why one might want to use a vector
| search engine: https://db-engines.com/en/blog_post/87
| QuadmasterXLII wrote:
| I'll look into weaviate.
| [deleted]
| binarymax wrote:
| Dense vector search in Solr is a welcome addition, but getting
| started requires a lot of pieces that aren't included.
|
| So I made this a couple of months ago to make it super easy to
| get started with this tech. If you have a sitemap, you can start
| the docker compose setup and index your website with one command.
|
| https://github.com/maxdotio/neural-solr
|
| Enjoy!
| kordlessagain wrote:
| Thanks for this. Very useful. Any interest in adding a crawler?
| https://github.com/kordless/grub-2.0
| andre-z wrote:
| It was to be expected after the recent ES releases. However,
| dedicated vector search engines offer better performance and
| more advanced features. Qdrant https://github.com/qdrant/qdrant
| is written in Rust. Fast, stable, and super easy to deploy.
| (Disclaimer: affiliated with the project.)
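|
| For a sense of the client flow, a sketch (collection name and
| sizes are illustrative, and the qdrant-client API may differ
| slightly between versions):
|
|     from qdrant_client import QdrantClient
|     from qdrant_client.http.models import (Distance, VectorParams,
|                                            PointStruct)
|
|     client = QdrantClient(host="localhost", port=6333)
|     client.recreate_collection(
|         collection_name="docs",
|         vectors_config=VectorParams(size=128, distance=Distance.COSINE),
|     )
|     client.upsert(
|         collection_name="docs",
|         points=[PointStruct(id=1, vector=[0.1] * 128,
|                             payload={"tag": "demo"})],
|     )
|     hits = client.search(collection_name="docs",
|                          query_vector=[0.1] * 128, limit=5)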
| bratao wrote:
| Shameless plug from someone not related to the project. Try
| https://vespa.ai - fully open-source, very mature hybrid search
| with dense and approximate vector search. A breeze to deploy and
| maintain compared to ES and Solr. If I could name a single
| secret ingredient for my startup, it would be Vespa.
| forrest2 wrote:
| Vespa looks pretty compelling; indexing looks like a dream.
|
| I'd recommend basically anything else over a customized ES /
| Solr cluster. They're some of the least fun clusters to manage.
| Great for simple use cases / anything you see in a tutorial; the
| moment you walk off the beaten path with them, best of luck.
|
| Just an anecdote.
| binarymax wrote:
| Solr has its quirks for sure, but I've seen multi-terabyte
| indices running with great relevance and performance. I would
| call it a mechanic's search engine: it is very powerful, but
| you need to get your hands dirty.
| mountainriver wrote:
| Vespa seemed like a total mess compared to Milvus when I picked
| them up.
| peterstjohn wrote:
| Two big reasons for Vespa over Milvus 1.x:
|
| * Filtering
|
| * String-based IDs
|
| (A caveat: I haven't used Milvus 2.x recently, which does fix
| these issues but brings in a bunch of other dependencies, like
| Kafka or Pulsar.)
| lmeyerov wrote:
| Can Vespa index 100M+ vectors on a regular-RAM CPU server? Any
| faster with a GPU (T4 / A10)?
| kofejnik wrote:
| omg so cool, thank you!
| stoicjumbotron wrote:
| Different from Solr, I know, but thoughts on Lunr?
| https://github.com/olivernn/lunr.js
| kordlessagain wrote:
| Whoosh is cool too:
| https://whoosh.readthedocs.io/en/latest/intro.html
| dsign wrote:
| After Apple's attempt to use "neural search" to spy on its
| customers, the term has been left with a bad rep.
| visarga wrote:
| It doesn't have a bad reputation; it's cosine similarity done
| faster by approximation, something that is part of many ML
| papers and systems these days.
___________________________________________________________________
(page generated 2022-09-05 23:01 UTC)