[HN Gopher] Billion-Scale Approximate Nearest Neighbor Search [pdf]
___________________________________________________________________
Billion-Scale Approximate Nearest Neighbor Search [pdf]
Author : tim_sw
Score : 105 points
Date : 2023-05-04 12:44 UTC (2 days ago)
(HTM) web link (wangzwhu.github.io)
(TXT) w3m dump (wangzwhu.github.io)
| victor106 wrote:
| Being in software development for the past 15 years, I can't
| understand any of this. What are some good resources to get up to
| speed on this stuff?
| WinLychee wrote:
| This series of articles from Pinecone seems accessible
| https://www.pinecone.io/learn/vector-indexes/ . Stepping back a
| moment, the problem "I want to find the top-k most similar
| items to a given item from a collection" comes up in a variety
| of domains, and a common approach is nearest-neighbor search.
| Depending on the number of dimensions and the number of items,
| you may need tricks to speed up the search. In recent years
| we've seen an increase in scale and invention of some new
| tricks to speed this up.
|
| In a bit more detail the idea is to convert your data into
| vectors embedded in some d-dimensional space and compute the
| distance between a query vector and all other vectors in the
| space. This is an O(dN) computation. ANN covers a number of
| techniques that give a speedup by reducing the number of
| required comparisons.
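|
| A minimal numpy sketch of that brute-force O(dN) baseline (the
| array names are just illustrative):
|
|   import numpy as np
|
|   # toy data: N database vectors and one query, all d-dimensional
|   N, d = 10_000, 128
|   db = np.random.randn(N, d).astype(np.float32)
|   query = np.random.randn(d).astype(np.float32)
|
|   # O(dN): squared Euclidean distance from the query to every vector
|   dists = np.sum((db - query) ** 2, axis=1)
|
|   # indices of the top-k closest items, sorted by distance
|   k = 10
|   topk = np.argpartition(dists, k)[:k]
|   topk = topk[np.argsort(dists[topk])]
|
| ANN techniques avoid touching all N vectors for every query.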
| shoo wrote:
| see also - the section titled "Distance function" from
| https://en.wikipedia.org/wiki/Curse_of_dimensionality
|
| > One way to illustrate the "vastness" of high-dimensional
| Euclidean space is to compare the proportion of an inscribed
| hypersphere with radius r and dimension d, to that of a
| hypercube with edges of length 2 r . [...] As the dimension d
| of the space increases, the hypersphere becomes an
| insignificant volume relative to that of the hypercube
|
| E.g. in two dimensions the ratio of the area of a circle of
| radius r=1 to the area of a square of side length 2r is pi / 4,
| approx 0.785. As the number of dimensions becomes large, this
| ratio drops toward zero.
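|
| A small sketch of how fast that ratio collapses, using the
| standard formula for the volume of a d-ball:
|
|   from math import pi, gamma
|
|   def sphere_to_cube_ratio(d, r=1.0):
|       # volume of a d-ball of radius r: pi^(d/2) * r^d / gamma(d/2 + 1)
|       ball = pi ** (d / 2) * r ** d / gamma(d / 2 + 1)
|       cube = (2 * r) ** d
|       return ball / cube
|
|   for d in (2, 3, 10, 20, 100):
|       print(d, sphere_to_cube_ratio(d))
|   # d=2 gives ~0.785 (pi/4); by d=20 the ratio is already ~2.5e-8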
|
| There's also some discussion of how distances measured from
| reference points become less useful as the number of dimensions
| of the space increases -- all distances becoming similar --
| although
|
| > research has shown this to only hold in the artificial
| scenario when the one-dimensional distributions are independent
| and identically distributed.
| esafak wrote:
| It crops up in machine learning.
|
| https://en.wikipedia.org/wiki/Nearest_neighbor_search
| tysam_and wrote:
| High-dimensional nearest neighbor search scales very poorly
| with the number of dimensions and/or examples IIRC, so
| approximate methods are best for time/value tradeoffs.
|
| Oftentimes the applications don't require a perfect match
| anyways, just something 'close enough'.
|
| Because things live in high dimensions, and they're often
| related to each other, there end up being vast 'sparse' gaps
| where little data actually lives.
|
| That is what lets approximate methods be fast, and gives a good
| tradeoff of information-per-FLOP compared to the full exact
| search.
| convexstrictly wrote:
| Video: https://www.youtube.com/watch?v=iI8e3kU11eU
| swaraj wrote:
| Is this different from pinecone?
| beoberha wrote:
| From what I understand, Pinecone is a wrapper around a lot of
| this. It utilizes these libraries under the hood (or their own
| implementations) to store and perform search on your vectors
| for you. All in a nice managed PaaS offering.
| ianbutler wrote:
| Nice this will get you acquainted with many of the methods used
| currently, as my small addition to the list of things you can
| learn. https://ai.googleblog.com/2020/07/announcing-scann-
| efficient... https://github.com/google-research/google-
| research/tree/mast...
|
| Google released SCaNN a few years ago and back then it wasn't
| implemented in many tools like Milvus etc. Maybe it is now; it's
| been a few years since I needed a scalable vector db. I believe
| it's still state of the art, and they use a novel approach to
| quantization, which was the key intuition.
|
| I didn't see it included in the PDF from a quick scan (hah!)
| bravura wrote:
| I remember SCaNN. Are there easy to use implementations
| available now?
|
| You also seem unsure.
|
| Personally, I love a cheat-sheet like the one on Slide 7 of the
| presentation. It's a game-tree for which ANN _implementation_ to
| use, given your problem setting and the assumption you don't
| want to do groundbreaking ANN research; rather, you have an ANN
| problem you need to solve on the path to your actual problem.
| Maybe that's why SCaNN wasn't cited?
|
| A quick google reveals that SCaNN is available on pypi from
| Google: https://pypi.org/project/scann/ Do any users have any
| feedback or experiences to share?
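|
| For anyone who wants to kick the tires, a rough sketch based on
| the example in the ScaNN README (the parameter values here are
| illustrative, not recommendations):
|
|   import numpy as np
|   import scann
|
|   db = np.random.randn(100_000, 128).astype(np.float32)
|   db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize for MIPS
|
|   searcher = (scann.scann_ops_pybind.builder(db, 10, "dot_product")
|               .tree(num_leaves=1000, num_leaves_to_search=100,
|                     training_sample_size=50_000)
|               .score_ah(2, anisotropic_quantization_threshold=0.2)
|               .reorder(100)
|               .build())
|
|   queries = np.random.randn(5, 128).astype(np.float32)
|   neighbors, distances = searcher.search_batched(queries)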
|
| [edit: GPT4, comparing hnwslib and SCaNN: "Note that SCaNN
| recently ceased to be supported by Google Research, and there
| won't be any further improvements, bug fixes, or documentation
| updates from them. However, you can still use and get support
| from the community. This recent change in support could shift
| preference towards hnswlib for future applications."]
| ianbutler wrote:
| GPT4 is only trained on data up to 2021, and the library has
| commits as recent as two months ago. So I think that may be a
| hallucination. A quick google also reveals nothing like that.
|
| It's been a while but I also remember HNSW having undesirable
| characteristics for large scale vector search (compared to
| other methods), so I'm not sure there.
|
| I did do a quick search and saw there are methods that make
| it more suitable, but again it's been 3 years now since I've
| used anything in this area.
|
| Edit: I got curious, https://ann-benchmarks.com/
|
| So I was misremembering a bit; HNSW is pretty good overall
| depending on your implementation!
| sgt101 wrote:
| > I remember SCaNN. Are there easy to use implementations
| available now?
|
| Yes, on GCP in Vertex AI.
|
| GPT4 is talking utter horseshit.
|
| https://cloud.google.com/vertex-ai/docs/matching-
| engine/ann-...
| jhj wrote:
| SCaNN (the paper) is roughly two different things:
|
| 1. a SIMD-optimized form of product quantization (PQ), where the
| code-to-distance lookup can be performed in SIMD registers
|
| 2. anisotropic quantization to bias the database towards
| returning better maximum inner product search (MIPS)
| candidates, versus usual quantization (such as PQ) that aims
| to minimize compressed vector reconstruction error. In MIPS
| it is much more likely that the query data set comes from a
| completely different distribution than the database vectors.
|
| If your application needs L2 lookup rather than MIPS (which
| does not admit a metric), then only the PQ part is relevant.
| For cosine similarity, you can get that by normalizing all
| vectors to the surface of a hypersphere, in which case it has
| the same ordering as L2 (see "L2 normalized Euclidean distance"
| on https://en.wikipedia.org/wiki/Cosine_similarity ).
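|
| A quick numpy check of that normalization trick: for unit
| vectors ||q - x||^2 = 2 - 2 * <q, x>, so ascending L2 distance
| and descending cosine similarity give the same ordering:
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   db = rng.normal(size=(1000, 64)).astype(np.float32)
|   q = rng.normal(size=64).astype(np.float32)
|
|   # project everything onto the unit hypersphere
|   db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
|   q_n = q / np.linalg.norm(q)
|
|   d_l2 = np.sum((db_n - q_n) ** 2, axis=1)
|   cos = db_n @ q_n
|   assert np.allclose(d_l2, 2 - 2 * cos, atol=1e-5)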
|
| SCaNN is implemented in Faiss CPU. On the GPU the fast PQ
| part is less relevant due to the larger register set size
| and the throughput-optimized (rather than latency-optimized)
| nature of the hardware, but the GPU version is more geared
| towards batch lookup in any case.
|
| https://github.com/facebookresearch/faiss/wiki/Fast-
| accumula...
|
| https://github.com/facebookresearch/faiss/wiki/Indexing-1M-v.
| ..
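|
| A minimal Faiss sketch of the fast-scan PQ path, using the
| factory strings described in the wiki pages above (the sizes and
| parameters are illustrative only):
|
|   import numpy as np
|   import faiss
|
|   d = 128
|   xb = np.random.randn(100_000, d).astype(np.float32)
|   xq = np.random.randn(100, d).astype(np.float32)
|
|   # IVF with 1024 lists + 4-bit PQ codes laid out for SIMD fast scan
|   index = faiss.index_factory(d, "IVF1024,PQ64x4fs")
|   index.train(xb)
|   index.add(xb)
|
|   index.nprobe = 16             # IVF lists visited per query
|   D, I = index.search(xq, 100)  # approximate top-100 per query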
|
| We have not found the anisotropic quantization part to be
| that useful for large datasets, but results may vary. Graph-
| based ANN techniques tend to be better than IVF-based
| strategies in many cases for smaller datasets.
|
| (I'm the author of GPU Faiss)
| chuckcode wrote:
| Thanks for the details! A few follow-up questions:
|
| - I've seen neural nets using int8 for matrix
| multiplication to reduce memory size [1]. Do you think
| something similar could be useful in the ANN space?
|
| - Do you know of any studies using Faiss looking at
| speed/cost tradeoffs of RAM vs flash vs Disk for storage?
|
| - Are there recommended ways to update Faiss index with
| streaming data, e.g. updating the vectors continuously?
|
| Seems like more and more use cases for Faiss as neural nets
| become more and more core to workflows. Would like to try
| and figure out the configurations that are optimized to
| minimize carbon usage in addition to latency and recall
| metrics.
|
| [1] https://arxiv.org/abs/2208.07339
|
| (edit for formatting)
| jhj wrote:
| Regarding reduced precision, depending upon what you are
| trying to do, I think it doesn't work quite as well in
| similarity search as it does for, say, neural networks.
|
| If you are concerned about recall of the true nearest
| neighbor (k=1): in many datasets I've seen (especially
| large ones), the float32 distances from a query vector
| to its candidate nearest neighbors may only differ by
| some thousands of ULPs when performing brute force
| search. Done in float16, the true nearest neighbor
| would score the same as (or perhaps even behind, due to
| rounding error) other proximate vectors. If you are
| performing approximate lookup and you have the luxury
| of reranking (you store the compressed / approximate
| index for lookup, but return a larger candidate set
| like k=100 or k=1000 and refine the results based on
| true distances computed from the uncompressed vectors
| via brute force, so you have to keep all the original
| vectors around), then this problem can go away.
|
| If however you are looking at recall@100 (is the true
| nearest neighbor reported within the top k=100) or set
| intersection (of the k=100 approximate nearest neighbors,
| how much overlap is there with the true set of the 100
| nearest neighbors), then this doesn't matter as much.
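|
| Concretely, the two metrics for a single query might be
| computed like this (a small sketch, not Faiss API):
|
|   def recall_metrics(exact_ids, approx_ids):
|       # exact_ids: the true k nearest neighbors from brute force,
|       #            with exact_ids[0] being the true 1-NN
|       # approx_ids: the ANN index's reported top-k
|       k = len(exact_ids)
|       recall_at_k = float(exact_ids[0] in approx_ids)
|       overlap = len(set(exact_ids) & set(approx_ids)) / k
|       return recall_at_k, overlap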
|
| Certainly a lot of the options in the Faiss library are
| geared towards compression and quantization anyways
| (e.g., storing a billion high-dimensional vector dataset
| in 8/16 GB of memory) so this is a tradeoff as with
| everything else.
|
| In Faiss there are the scalar quantizer indexes which do
| store vectors in int4/int8 etc., for which int8 GEMM
| (accumulating in int32) would be great, but using it
| would require that the query vector itself be quantized
| to int4/int8 as well. This is the difference between
| "asymmetric distance computation" (ADC), where you
| compute the distance between int4-encoded vectors (or
| product-quantized vectors, etc.) and a float32 query
| vector by reconstructing the database vectors in
| floating point and comparing in floating point, and
| symmetric distance computation, where you have to
| convert the query vector to int4, say, and compare in
| the quantized regime. ADC tends to work a lot better
| than symmetric computation, which is why we don't use
| pure int8 GEMM, but maybe in many applications (NN
| inference, say, instead of image database search) the
| non-ADC comparison would be ok.
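|
| A rough sketch of the reranking setup described above (a
| compressed IVFPQ index for candidate generation, with the
| original float32 vectors kept around for exact refinement; the
| sizes are arbitrary):
|
|   import numpy as np
|   import faiss
|
|   d, nlist = 64, 256
|   xb = np.random.randn(100_000, d).astype(np.float32)  # kept for refine
|   xq = np.random.randn(10, d).astype(np.float32)
|
|   quantizer = faiss.IndexFlatL2(d)
|   index = faiss.IndexIVFPQ(quantizer, d, nlist, 8, 8)  # 8x8-bit PQ codes
|   index.train(xb)
|   index.add(xb)
|   index.nprobe = 16
|
|   # 1) approximate lookup with a generous candidate set
|   _, cand = index.search(xq, 1000)
|
|   # 2) refine with exact distances to just those candidates
|   for qi, ids in enumerate(cand):
|       ids = ids[ids >= 0]  # drop any padding entries
|       exact = np.sum((xb[ids] - xq[qi]) ** 2, axis=1)
|       top10 = ids[np.argsort(exact)[:10]]
|
| (I believe the IndexRefineFlat wrapper does this refinement step
| for you, but the manual version shows what is going on.)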
| fzliu wrote:
| SCaNN is coming to Milvus very soon (we've seen some demand
| from the community). Stay tuned.
| eternalban wrote:
| both are 2020.
| eachro wrote:
| I see HN pieces on SIMD optimizations for numerical computing
| every so often. Are these the sort of optimizations that a
| hobbyist with, say, a normal laptop (either a MacBook Air or a
| consumer-grade ThinkPad) can actually end up tinkering with as
| well? Or am I going to have to rent some AWS machine and have it
| run on some specific architecture to be able to mess around with?
| boulos wrote:
| Absolutely. Most development work for these things is just fine
| on a laptop.
|
| Each processor family has its own supported SIMD instructions
| though. For a long time, SSE / AVX / AVX2 were the only game in
| town on x86. AVX-512, introduced in Skylake, was _primarily_
| available only on server parts, was a mixed bag, and didn't
| have support on AMD machines until just recently (with Genoa /
| Zen4).
|
| A modern Mac laptop with an M-series chip will have ARM NEON
| vector instructions, just like on an iPhone or similar. There
| is a new-ish ARM vector instruction set called Scalable Vector
| Extensions (SVE) that Apple doesn't support, but Graviton 3
| does. But generally improvements on your laptop _often_
| translate onto the server side, and you can definitely test
| correctness.
|
| There are a few least-common denominator SIMD libraries out
| there, but realistically for many problems, you'll get the most
| mileage from writing some intrinsics yourself for your target
| platform.
| nwoli wrote:
| Basically all CPUs have SIMD nowadays. You can even make use of
| it in wasm/the web. (Though currently not supported on iOS.)
| Just compile with 'emcc -msimd128 main.c'
| lmeyerov wrote:
| Yes, regular computers have had SIMD for decades. Ex: Intel had
| SSE and now AVX. Fancier servers will have the same, except
| maybe wider and with some other tricks.
|
| If you go down the vectorization path, unless you are a crufty
| CPU DB vendor, I'd skip ahead to GPUs for the same work. The
| ecosystem is built to do the same thing but faster and easier.
| CPU makers are slowly changing to look more like GPUs, so just
| skip ahead...
| sebasv_ wrote:
| If you're asking how widely available SIMD is, it has been
| common in consumer hardware for two decades. To issue SIMD
| instructions manually you will need a compiled language that
| supports it, like Rust or C. But the compiler can actually
| apply it for you as an optimization (auto-vectorization).
| kzrdude wrote:
| This is a Rust crate doc, but it's a nice place to browse for
| different CPU features and what they unlock:
| https://docs.rs/target-features/latest/target_features/docs/...
|
| If you go to the x86 CPU list it will tell you what features
| are enabled if you optimize for a particular one
| https://docs.rs/target-features/latest/target_features/docs/...
|
| On Linux you can run lscpu to check what CPU features are
| detected.
| perfmode wrote:
| I use optimized SIMD instructions on iOS to do on-device linear
| algebra for the cooking app I work on as a hobby.
|
| https://theflavor.app/
| manx wrote:
| I only see a landing page with missing images (on mobile).
| How do I use it?
| nwoli wrote:
| Spotify uses random hyperplanes to produce a hash where each bit
| is the sign of the dot product of the query with one of the
| hyperplanes (i.e. "is it above or below" the hyperplane). Works
| really well.
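|
| A minimal sketch of that signed-random-projection idea (the
| variable names are just illustrative):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   d, n_bits = 128, 32
|   planes = rng.normal(size=(n_bits, d))   # one random hyperplane per bit
|
|   def hash_vec(v):
|       # bit i = 1 if v lies on the positive side of hyperplane i
|       return (planes @ v > 0).astype(np.uint8)
|
|   db = rng.normal(size=(10_000, d))
|   db_hashes = (db @ planes.T > 0).astype(np.uint8)
|
|   q = rng.normal(size=d)
|   # similar vectors tend to fall on the same side of most hyperplanes,
|   # so small Hamming distance roughly tracks high cosine similarity
|   hamming = np.count_nonzero(db_hashes != hash_vec(q), axis=1)
|   candidates = np.argsort(hamming)[:100]   # shortlist to rerank exactly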
| AndrewKemendo wrote:
| Is there a blog post or something on this?
___________________________________________________________________
(page generated 2023-05-06 23:00 UTC)