[HN Gopher] Why HNSW is not the answer and disk-based alternativ...
___________________________________________________________________
Why HNSW is not the answer and disk-based alternatives might be
more practical
Author : kevlened
Score : 99 points
Date : 2024-12-23 18:24 UTC (4 hours ago)
(HTM) web link (blog.pgvecto.rs)
(TXT) w3m dump (blog.pgvecto.rs)
| rekoros wrote:
| Yep, I believe it
|
| HNSW has been great for relatively small datasets (a website with
| 30K pages), but it's a dangerous stretch for anything bigger
| aabhay wrote:
| What systems would you recommend for larger deployments?
| rekoros wrote:
| I ended up breaking up/sharding HNSW across multiple tables,
| but I'm dealing with many distinct datasets, each one just
| small enough to make HNSW effective in terms of index
| build/rebuild performance.
|
| The article suggests IVF for larger datasets - this is the
| direction I'd certainly explore, but I've not personally had
| to deal with it. HNSW sharding/partitioning might actually
| work even for a very large - sharded/partitioned - dataset,
| where each query is a parallelized map/reduce operation.
| nostrebored wrote:
| What are you using for HNSW? Is the implementation
| handwritten? I've seen people using it well over XXm at
| full precision vectors with real time updates
| bluecoconut wrote:
| I don't quite understand this - by 30k pages, is this the
| number of entries in your index? Did you mean 30M?
|
| At the <100k scale I just do the full compute / inner product
| directly, and I don't mess with vector stores or added
| complexity. No ANN algo needed -- they'll all be slower than
| actual exact kNN. (10k * 768 dims * 4 bytes = ~30MB; a scan
| over it plus a sort runs in ~100us or faster.) Frankly, at the
| 5k scale I've even sent the vectors to the client and done the
| search client-side in JS.
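|
| A minimal sketch of that exact scan with numpy (sizes and data
| here are illustrative):
|
| import numpy as np
|
| # 10k corpus vectors, 768 dims, float32 -> ~30 MB, per the math above
| corpus = np.random.rand(10_000, 768).astype(np.float32)
| corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
|
| def exact_knn(query, k=10):
|     # One matrix-vector product scans the whole corpus; on
|     # normalized vectors, inner product == cosine similarity.
|     scores = corpus @ query
|     # argpartition avoids a full sort: O(n) to find the top-k
|     top = np.argpartition(scores, -k)[-k:]
|     return top[np.argsort(scores[top])[::-1]]
|
| q = np.random.rand(768).astype(np.float32)
| q /= np.linalg.norm(q)
| print(exact_knn(q, k=5))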
|
| Often, I find I use an ANN algo/index to get my nearest 10k,
| then do final re-ranking with more expensive
| algorithms/compute in that reduced space.
|
| The original HNSW paper was testing/benchmarking at the 5M-15M
| scales. That's where it shines compared to alternatives.
|
| When pushing toward the 1B scale (I have an instance at 200M),
| the memory consumption does become a frustration (100GB of RAM
| usage), and you need to vertically scale the nodes that use
| the index. But it's still very fast and good. I wouldn't call
| it "dangerous", just "expensive".
|
| Interestingly though, I found that the usearch package worked
| great and let me split and offload indexes into separate files
| on disk. That greatly lowered RAM usage, and latency is still
| quite good on average, though with some spikes (e.g. a
| nearest-10k query can sometimes take ~1-3 seconds on the 200M
| dataset).
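|
| If I'm remembering the usearch Python API correctly (worth
| double-checking against their docs), the memory-mapped setup
| looks roughly like this; ndim/metric are illustrative:
|
| import numpy as np
| from usearch.index import Index
|
| # Build once and save a shard to disk
| index = Index(ndim=768, metric='cos', dtype='f32')
| vectors = np.random.rand(100_000, 768).astype(np.float32)
| index.add(np.arange(len(vectors)), vectors)
| index.save('shard-00.usearch')
|
| # Serve by memory-mapping: view() pages the index in on
| # demand, which is where the RAM savings (and the occasional
| # latency spikes on cold pages) come from.
| served = Index(ndim=768, metric='cos', dtype='f32')
| served.view('shard-00.usearch')
| matches = served.search(vectors[0], 10)
| print(matches.keys, matches.distances)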
| PaulHoule wrote:
| I've been doing vector search since 2002 or so and it's amazing
| how far you can get keeping vectors in RAM and using primitive
| search algorithms, enough that I'm afraid the market for vector
| databases is 1% of what VCs think it is. (E.g. full scan was
| good enough for a search engine of international patents and
| non-patent literature.)
| zackangelo wrote:
| This is so true. A plain old exhaustive SIMD-optimized
| similarity search will do just fine in many cases and not have
| any of the approximation tradeoffs of HNSW.
| PhilippGille wrote:
| In chromem-go [1] I'm searching through 100,000 vectors in
| 40ms on a mid-range laptop CPU, even without SIMD. It's quick
| enough for many use cases.
|
| [1] https://github.com/philippgille/chromem-go
| PaulHoule wrote:
| It would be very hard to find a problem that has better
| mechanical sympathy than full-scan similarity search. Even if
| some other algorithm's operation count were 1/10th on paper,
| it might not be faster if the prefetcher and branch predictor
| aren't running at their best.
|
| People who want to start with RAG _should not_ start with a
| vector search engine but instead download
|
| https://sbert.net/
|
| and start messing around w/ Jupyter notebooks and maybe
| FAISS. People who think I'm going to upload vectors to an
| expensive cloud service over a slow ADSL connection are
| delusional.
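|
| For anyone starting that way, a minimal sketch of the sbert
| workflow (the model name is just the usual small default):
|
| from sentence_transformers import SentenceTransformer
| import numpy as np
|
| model = SentenceTransformer('all-MiniLM-L6-v2')
|
| docs = ["HNSW is an in-memory graph index.",
|         "IVF partitions vectors around centroids.",
|         "Exhaustive scan is exact but linear in corpus size."]
| doc_emb = model.encode(docs, normalize_embeddings=True)
|
| # With normalized embeddings, cosine similarity is a dot product
| query_emb = model.encode("how does IVF work?",
|                          normalize_embeddings=True)
| scores = doc_emb @ query_emb
| print(docs[int(np.argmax(scores))])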
| rockwotj wrote:
| Swap out FAISS for usearch and you get all the awesome SIMD
| acceleration (via dynamic dispatch) plus optional
| compression. Not affiliated, but really cool tech.
|
| https://github.com/unum-cloud/usearch
| zackangelo wrote:
| fwiw faiss, although a bit unwieldy, has an optimized
| full scan search built into it as well
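|
| Right -- IndexFlatIP/IndexFlatL2 are the exact brute-force
| paths. A minimal sketch (sizes illustrative):
|
| import numpy as np
| import faiss
|
| d = 768
| xb = np.random.rand(100_000, d).astype(np.float32)
| faiss.normalize_L2(xb)            # so inner product == cosine
|
| index = faiss.IndexFlatIP(d)      # exact, exhaustive scan
| index.add(xb)
|
| xq = np.random.rand(1, d).astype(np.float32)
| faiss.normalize_L2(xq)
| D, I = index.search(xq, 10)       # top-10 scores and ids
| print(I[0], D[0])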
| abhgh wrote:
| I strongly advocate this! If you're starting off in this
| space, _check_ if this barebones implementation isn't all
| you need. You can't beat the accuracy of a full search; in
| theory, you're trading off scalability, but _validate_
| whether you need the scale at which this tradeoff begins to
| show. And yes, sbert is great, and it gives you options to
| choose [1] between accuracy (MPNet) and speed (MiniLM).
| There are also multi-lingual options. And remember, you can
| also fine-tune MPNet with SetFit. And there are always new
| and interesting embeddings being released, so remember to
| re-assess the fit of embeddings against what you're using
| once in a while, e.g., LLM2vec or ModernBERT. A good idea
| would be to keep checking MTEB [2].
|
| [1] https://www.sbert.net/docs/sentence_transformer/pretr
| ained_m...
|
| [2] https://huggingface.co/spaces/mteb/leaderboard
| LunaSea wrote:
| VCs are terrible at technical due diligence, so the money will
| keep pouring in.
|
| However, CTOs are also terrible at assessing their technical
| needs, and vector databases will have customers the same way
| web-scale databases did.
| montebicyclelo wrote:
| This 100%. Vector DBs have been heavily pushed by vector DB
| companies and cloud providers. Despite companies often having
| mere MBs of documents to search through for their use cases --
| amounts that trivially fit into RAM and can be searched via
| dot product in milliseconds -- managers and engineers less
| familiar with the space think people are doing it wrong if
| they don't use a vector DB. So vector DBs end up getting used
| when a simpler in-memory, non-approximate solution would
| actually be fine (but less monetizable).
| binarymax wrote:
| HNSW became the de facto default specifically because you don't
| need to precalculate the index and it updates as writes come in.
|
| This is a great article, and all the points are there, but the
| truth is that most teams running million-scale vectors don't
| want the operational complexity of re-quantizing offline on
| some schedule. They'll gladly pay for RAM instead of managing
| some IVFPQ calculation.
|
| However, if there were a solution that "just worked" --
| handling PQ for shards in near real time as updates arrive,
| with sane filtering on top -- that would be really nice.
| nostrebored wrote:
| Yes -- real-time updates, filtering, and multi-vector support
| make most of these on-device, in-memory approaches untenable.
| If you really are just doing a similarity search against a
| fixed set of things, you often know the queries ahead of time
| and can just build a lookup table.
| bddicken wrote:
| At PlanetScale, we have a really nice solution for this in
| MySQL using SPANN + SPFresh. The way we've implemented it
| allows for pre- and post-filtering, full compatibility with
| WHERE-clause filtering, and the same kind of ACID compliance
| you'd expect from a relational database. You can read about
| it here:
|
| https://planetscale.com/blog/announcing-planetscale-vectors-...
| generall wrote:
| IVF, unfortunately, is barely compatible with filtered search.
| It has to rely on post-filtering, retrieving more and more
| candidates when the result set is not big enough. If the query
| is correlated with the filter in some way, this approach
| quickly degrades to brute force -- roughly the loop sketched
| below.
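|
| (Illustrative pseudocode of that adaptive post-filtering loop,
| not any particular engine's implementation; index.search and
| index.size are hypothetical:)
|
| def filtered_search(index, query, predicate, k):
|     # Over-fetch, filter, retry with a bigger candidate pool.
|     # If few neighbors of the query pass the filter,
|     # n_candidates keeps doubling until this is effectively
|     # a full scan.
|     n_candidates = k
|     while True:
|         hits = index.search(query, n_candidates)
|         survivors = [h for h in hits if predicate(h)]
|         if len(survivors) >= k or n_candidates >= index.size:
|             return survivors[:k]
|         n_candidates *= 2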
|
| I'm surprised that the article doesn't mention the filtering
| use case at all.
| intalentive wrote:
| SIMD-accelerated search over binary vectors is very fast and you
| can fit a lot in memory. Then rerank over f32 from disk.
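|
| A rough numpy sketch of that two-stage setup (the 1-bit
| quantization threshold and sizes are illustrative):
|
| import numpy as np
|
| dim, n = 768, 100_000
| vectors_f32 = np.random.rand(n, dim).astype(np.float32)
|
| # 1 bit per dimension, packed 8 dims/byte: 768 dims -> 96
| # bytes per vector, so even 100M vectors fit in ~9.6 GB of RAM.
| packed = np.packbits(vectors_f32 > 0.5, axis=1)
|
| def search(query_f32, k=10, shortlist=200):
|     q = np.packbits(query_f32 > 0.5)
|     # Hamming distance = popcount(xor); numpy 2.x also has
|     # np.bitwise_count for a faster popcount.
|     dists = np.unpackbits(packed ^ q, axis=1).sum(axis=1)
|     cand = np.argpartition(dists, shortlist)[:shortlist]
|     # Rerank the shortlist at full precision (these are the
|     # reads that would come from disk).
|     scores = vectors_f32[cand] @ query_f32
|     return cand[np.argsort(scores)[::-1][:k]]
|
| print(search(vectors_f32[0]))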
|
| I tried a few alternatives and found that SQLite + usearch
| extension wins for my use cases (< 1 million records), as
| measured by latency and simplicity. I believe this approach can
| scale up to hundreds of millions of records.
| tveita wrote:
| > For example, the typical distance computation complexity
| between two D-dimensional vectors is O(D^2), but compressing
| floats into bits reduces this by a factor of 1024 (32x32).
|
| This isn't right, cosine distance between two vectors is
| definitely O(D)...
|
| Of course replacing float multiplications with xor+popcount is
| a nice speedup in computation, but assuming you're
| memory-bandwidth limited, the speedup should be linear in the
| compression factor (32x, not 1024x).
| ayende wrote:
| The problem with IVF is that you need to find the right
| centroids, and that doesn't work well if your data grows and
| mutates over time.
|
| Splitting a centroid is a pretty complex issue.
|
| So is handling new data that clusters in one area. For
| example, let's assume that you hold StackOverflow questions &
| answers. Now you get a massive amount of additional data (>
| 25% of the existing dataset) that talks about Rust.
|
| You either need to re-calculate the centroids globally, or find a
| good way to split.
|
| Posting lists are easy to use, but if they become unbalanced,
| it gets really bad.
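|
| A toy sketch of the local-split option (this glosses over the
| hard parts, like rebalancing vectors near the old boundary;
| names and thresholds are made up):
|
| import numpy as np
| from sklearn.cluster import KMeans
|
| def maybe_split(cid, centroids, posting_lists, max_size=10_000):
|     # When one posting list grows too large (the Rust flood),
|     # split just that centroid with 2-means instead of
|     # re-clustering the whole dataset.
|     vectors = posting_lists[cid]
|     if len(vectors) <= max_size:
|         return centroids, posting_lists
|     km = KMeans(n_clusters=2, n_init=10).fit(vectors)
|     # Replace the old centroid with one child, append the other
|     centroids[cid] = km.cluster_centers_[0]
|     centroids = np.vstack([centroids, km.cluster_centers_[1]])
|     posting_lists[cid] = vectors[km.labels_ == 0]
|     posting_lists.append(vectors[km.labels_ == 1])
|     return centroids, posting_lists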
| nostrebored wrote:
| Nit: the drawback of "not working well in disk-based systems"
| isn't a drawback unless you're already using disk-based
| systems.
|
| The difference in recall is also significant -- what you
| really get with HNSW is a system built to give a good ratio of
| cost to approximation quality. These IVFPQ-based systems are
| ones I've seen people rip out and replace when the use case is
| high value.
|
| I really don't understand the push to make pg do everything. It
| wasn't designed for search, and trying to shove these features
| into the platform feels like some misguided cost optimization
| that puts all of your data infrastructure on the same critical
| path.
| tw04 wrote:
| > Its reliance on frequent random access patterns makes it
| incompatible with the sequential access nature of disk storage.
|
| So use an SSD? Are people seriously still trying to use
| spinning disks for database workloads in 2024?
| JoeAltmaier wrote:
| Data farms are all about cost per GB. Spinning media is a
| fraction of the cost per gigabyte.
| jandrewrogers wrote:
| An SSD does not solve the problem of page-fault chasing; it
| just makes it slightly less bad. This is fundamentally a
| software architecture problem.
|
| It is solved with latency-hiding I/O schedulers, which don't
| rely on cache locality for throughput -- roughly the idea
| sketched below.
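|
| (A toy illustration of the shape of the idea, not any engine's
| actual scheduler; read_page is a hypothetical function that
| fetches one node's neighborhood from disk:)
|
| from concurrent.futures import ThreadPoolExecutor
|
| def traverse_serial(frontier, read_page):
|     # One dependent read at a time: total time is roughly
|     # disk latency * len(frontier).
|     return [read_page(node) for node in frontier]
|
| def traverse_batched(frontier, read_page, workers=32):
|     # Issue the whole frontier of reads at once; the OS and
|     # device overlap them, hiding most of the per-read latency.
|     with ThreadPoolExecutor(workers) as pool:
|         return list(pool.map(read_page, frontier))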
___________________________________________________________________
(page generated 2024-12-23 23:00 UTC)