[HN Gopher] Why HNSW is not the answer and disk-based alternativ...
       ___________________________________________________________________
        
       Why HNSW is not the answer and disk-based alternatives might be
       more practical
        
       Author : kevlened
       Score  : 99 points
       Date   : 2024-12-23 18:24 UTC (4 hours ago)
        
 (HTM) web link (blog.pgvecto.rs)
 (TXT) w3m dump (blog.pgvecto.rs)
        
       | rekoros wrote:
       | Yep, I believe it
       | 
       | HNSW has been great for relatively small datasets (a website with
       | 30K pages), but it's a dangerous stretch for anything bigger
        
         | aabhay wrote:
         | What systems would you recommend for larger deployments?
        
           | rekoros wrote:
           | I ended up breaking up/sharding HNSW across multiple tables,
           | but I'm dealing with many distinct datasets, each one just
           | small enough to make HNSW effective in terms of index
           | build/rebuild performance.
           | 
           | The article suggests IVF for larger datasets - this is the
           | direction I'd certainly explore, but I've not personally had
           | to deal with it. HNSW sharding/partitioning might actually
           | work even for a very large - sharded/partitioned - dataset,
           | where each query is a parallelized map/reduce operation.
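            | 
            | A rough sketch of that map/reduce query with hnswlib
            | (illustrative only; assumes labels are globally unique
            | across shards):
            | 
            |   import heapq
            |   import hnswlib  # each shard is its own hnswlib.Index
            |   
            |   def query_shards(shards, vec, k):
            |       # "map": query every shard's HNSW index
            |       # independently (in practice, in parallel)
            |       partial = []
            |       for shard in shards:
            |           labels, dists = shard.knn_query(vec, k=k)
            |           partial.extend(zip(dists[0], labels[0]))
            |       # "reduce": merge per-shard results into a
            |       # global top-k by distance
            |       return heapq.nsmallest(k, partial)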
        
             | nostrebored wrote:
              | What are you using for HNSW? Is the implementation
              | handwritten? I've seen it used well past XXm full-
              | precision vectors with real-time updates.
        
         | bluecoconut wrote:
         | I don't quite understand this - by 30k pages, is this the
         | number of entries in your index? Did you mean 30M?
         | 
          | At the <100k scale I just do the full computation / inner
          | product directly, and don't mess with vector stores or
          | added complexity. No ANN algo needed -- they'll all be
          | slower than just doing exact kNN. (10k x 768 x 4 bytes =
          | ~30MB; a scan over it plus a sort takes on the order of
          | ~100us or faster.) Frankly, at the 5k scale I've even
          | sent the vectors to the client and done the search
          | client-side in JS.
         | 
          | Often I find I use an ANN algo/index to get my nearest
          | 10k, then do the final re-ranking with more expensive
          | algorithms/compute in that reduced space.
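          | 
          | Concretely, exact search at that scale is a few lines of
          | numpy (sizes made up to match the above):
          | 
          |   import numpy as np
          |   
          |   # 100k x 768 float32 embeddings (~300MB), unit-
          |   # normalized so inner product == cosine similarity
          |   vecs = np.random.randn(100_000, 768).astype(np.float32)
          |   vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
          |   
          |   def exact_knn(q, k=10):
          |       scores = vecs @ q                      # one matvec
          |       top = np.argpartition(-scores, k)[:k]  # O(n) top-k
          |       return top[np.argsort(-scores[top])]   # sort k hits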
         | 
         | The original HNSW paper was testing/benchmarking at the 5M-15M
         | scales. That's where it shines compared to alternatives.
         | 
          | When pushing to the 1B scale (I have an instance at
          | 200M), the memory consumption does become a frustration
          | (100GB of RAM usage), forcing you to vertically scale the
          | nodes that use the index. But it's still very fast and
          | good. I wouldn't call it "dangerous", just "expensive".
         | 
          | Interestingly though, I found that the usearch package
          | worked great and let me split and offload indexes into
          | separate files on disk. That greatly lowered RAM usage,
          | and latency is still quite good on average, though with
          | some spikes (e.g. a nearest-10k query can sometimes take
          | ~1-3 seconds on the 200M dataset).
        
       | PaulHoule wrote:
        | I've been doing vector search since 2002 or so, and it's
        | amazing how far you can get keeping vectors in RAM and
        | using primitive search algorithms, enough that I'm afraid
        | the market for vector databases is 1% of what VCs think it
        | is. (e.g. full scan was good enough for a search engine
        | over international patents and non-patent literature)
        
         | zackangelo wrote:
         | This is so true. A plain old exhaustive SIMD-optimized
         | similarity search will do just fine in many cases and not have
         | any of the approximation tradeoffs of HNSW.
        
           | PhilippGille wrote:
           | In chromem-go [1] I'm searching through 100,000 vectors in
           | 40ms on a mid-range laptop CPU, even without SIMD. It's quick
           | enough for many use cases.
           | 
           | [1] https://github.com/philippgille/chromem-go
        
             | PaulHoule wrote:
              | It would be very hard to find a problem with better
              | mechanical sympathy than full-scan similarity search.
              | Even if some other algorithm's operation count were
              | 1/10th of it on paper, it might not be faster if the
              | prefetcher and branch predictor aren't running at
              | their best.
             | 
             | People who want to start with RAG _should not_ start with a
             | vector search engine but instead download
             | 
             | https://sbert.net/
             | 
             | and start messing around w/ Jupyter notebooks and maybe
             | FAISS. People who think I'm going to upload vectors to an
             | expensive cloud service over a slow ADSL connection are
             | delusional.
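              | 
              | To be concrete, the whole starter notebook is roughly
              | this (model choice and corpus are placeholders):
              | 
              |   import faiss
              |   from sentence_transformers import SentenceTransformer
              |   
              |   model = SentenceTransformer("all-MiniLM-L6-v2")
              |   docs = ["some document", "another document"]
              |   emb = model.encode(docs, normalize_embeddings=True)
              |   
              |   # exact inner-product search, no ANN index at all
              |   index = faiss.IndexFlatIP(emb.shape[1])
              |   index.add(emb)
              |   
              |   q = model.encode(["a question"],
              |                    normalize_embeddings=True)
              |   scores, ids = index.search(q, 2)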
        
               | rockwotj wrote:
                | Swap out FAISS for usearch and you get all the
                | awesome SIMD acceleration (via dynamic dispatch)
                | plus optional compression. Not affiliated, but
                | really cool tech.
               | 
               | https://github.com/unum-cloud/usearch
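                | 
                | Something like this, if I remember the API right
                | (parameters are illustrative):
                | 
                |   import numpy as np
                |   from usearch.index import Index
                |   
                |   # f16 storage halves RAM; i8/b1 go further
                |   index = Index(ndim=768, metric="cos", dtype="f16")
                |   vecs = np.random.rand(10_000, 768)
                |   index.add(np.arange(len(vecs)),
                |             vecs.astype(np.float32))
                |   
                |   matches = index.search(vecs[0], 10)  # top-10
                |   index.save("vectors.usearch")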
        
               | zackangelo wrote:
               | fwiw faiss, although a bit unwieldy, has an optimized
               | full scan search built into it as well
        
               | abhgh wrote:
                | I strongly advocate this! If you're starting off in
                | this space, _check_ whether this barebones
                | implementation isn't all you need. You can't beat
                | the accuracy of a full search; in theory you're
                | trading off scalability, but _validate_ that you
                | need the scale at which this tradeoff begins to
                | show. And yes, sbert is great, and it gives you
                | options to choose [1] between accuracy (MPNet) and
                | speed (MiniLM). There are also multilingual
                | options. And remember, you can fine-tune MPNet with
                | SetFit. New and interesting embeddings are always
                | being released, so re-assess the fit of your
                | current embeddings once in a while, e.g. against
                | LLM2vec or ModernBERT. A good idea is to keep
                | checking MTEB [2].
               | 
               | [1] https://www.sbert.net/docs/sentence_transformer/pretr
               | ained_m...
               | 
               | [2] https://huggingface.co/spaces/mteb/leaderboard
        
         | LunaSea wrote:
          | VCs are terrible at technical due diligence, so the money
          | will keep pouring in.
          | 
          | However, CTOs are also terrible at assessing their
          | technical needs, and vector databases will find customers
          | the same way web-scale databases did.
        
         | montebicyclelo wrote:
          | This 100%. Vector DBs have been heavily pushed by vector
          | DB companies and cloud providers. Companies often have
          | mere MBs of documents to search through for their use
          | cases, amounts that trivially fit into RAM and can be
          | searched via dot product in milliseconds, yet managers
          | and engineers less familiar with the space think people
          | are doing it wrong if they don't use a vector DB. So
          | vector DBs end up getting used where a simpler in-memory,
          | non-approximate solution would be fine (but less
          | monetizable).
        
       | binarymax wrote:
        | HNSW became the de facto default specifically because you
        | don't need to precalculate the index and it updates as
        | writes come in.
        | 
        | This is a great article, and all the points are there, but
        | the truth is that most teams running million-scale vectors
        | don't want the operational complexity of quantizing offline
        | on some schedule. They'll gladly pay for RAM instead of
        | managing some IVFPQ calculation.
        | 
        | However, if there were a solution that "just worked",
        | handling PQ for shards in near real time as updates arrive,
        | and that also had sane filtering, that would be really
        | nice.
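        | 
        | To make the first point concrete, this is the entire
        | "build" story for HNSW via hnswlib (a sketch; parameters
        | are typical values, not a recommendation):
        | 
        |   import hnswlib
        |   import numpy as np
        |   
        |   index = hnswlib.Index(space="cosine", dim=768)
        |   index.init_index(max_elements=1_000_000,
        |                    ef_construction=200, M=16)
        |   
        |   # no offline training or quantization pass: writes go
        |   # straight in and are immediately queryable
        |   batch = np.random.rand(1_000, 768).astype(np.float32)
        |   index.add_items(batch, np.arange(1_000))
        |   labels, dists = index.knn_query(batch[:1], k=10)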
        
         | nostrebored wrote:
          | Yes, real-time updates, filtering, and multi-vector
          | support make most of these on-device, in-memory
          | approaches untenable. If you really are just doing a
          | similarity search against a fixed set of things, you
          | often know the queries ahead of time and can just build a
          | lookup table.
        
         | bddicken wrote:
          | At PlanetScale, we have a really nice solution for this
          | in MySQL using SPANN + SPFresh. The way we've implemented
          | it allows for pre- and post-filtering, full compatibility
          | with WHERE-clause filtering, and the same kind of ACID
          | compliance you'd expect from a relational database. You
          | can read about it here:
         | 
         | https://planetscale.com/blog/announcing-planetscale-vectors-...
        
       | generall wrote:
        | IVF, unfortunately, is barely compatible with filtered
        | search. It has to rely on post-filtering, retrieving more
        | and more candidates until the result set is big enough. If
        | the query is in some way correlated with the filter, this
        | approach quickly degrades to brute force.
        | 
        | Surprised that the article doesn't mention the filtering
        | use case at all.
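        | 
        | The degradation looks like this (pseudocode against a
        | hypothetical index API, just to show the loop):
        | 
        |   def filtered_search(index, query, predicate, k):
        |       fetch, nprobe = k, 8
        |       while True:
        |           # post-filter: over-fetch, drop non-matches
        |           hits = [c for c in
        |                   index.search(query, fetch, nprobe)
        |                   if predicate(c)]
        |           if len(hits) >= k or fetch >= index.size:
        |               return hits[:k]
        |           # a filter correlated with the query leaves few
        |           # survivors, so probe ever more lists:
        |           # brute force in the limit
        |           fetch *= 2
        |           nprobe *= 2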
        
       | intalentive wrote:
       | SIMD-accelerated search over binary vectors is very fast and you
       | can fit a lot in memory. Then rerank over f32 from disk.
       | 
       | I tried a few alternatives and found that SQLite + usearch
       | extension wins for my use cases (< 1 million records), as
       | measured by latency and simplicity. I believe this approach can
       | scale up to hundreds of millions of records.
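        | 
        | A minimal numpy sketch of the idea (everything in memory
        | here for brevity; the f32 rerank would read from disk):
        | 
        |   import numpy as np
        |   
        |   vecs = np.random.randn(100_000, 768).astype(np.float32)
        |   bits = np.packbits(vecs > 0, axis=1)  # 768 dims -> 96 B
        |   
        |   def search(q, k=10, shortlist=200):
        |       qb = np.packbits(q > 0)
        |       # Hamming = popcount(xor); real implementations use
        |       # SIMD popcount instead of unpackbits
        |       ham = np.unpackbits(bits ^ qb, axis=1).sum(axis=1)
        |       cand = np.argpartition(ham, shortlist)[:shortlist]
        |       scores = vecs[cand] @ q          # f32 rerank
        |       return cand[np.argsort(-scores)[:k]]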
        
       | tveita wrote:
       | > For example, the typical distance computation complexity
       | between two D-dimensional vectors is O(D^2), but compressing
       | floats into bits reduces this by a factor of 1024 (32x32).
       | 
        | This isn't right; the cosine distance between two vectors
        | is definitely O(D)...
        | 
        | Of course replacing float multiplications with xor+popcount
        | is a nice computational speedup, but assuming you're
        | memory-bandwidth limited, the speedup should be linear in
        | the compression factor (32x, not 1024x).
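        | 
        | The per-word kernel is just this (a sketch; int.bit_count
        | needs Python 3.10+):
        | 
        |   # one xor + one popcount covers 64 dimensions, replacing
        |   # 64 float multiply-adds and moving 32x fewer bytes
        |   def hamming(a_words, b_words):
        |       return sum((a ^ b).bit_count()
        |                  for a, b in zip(a_words, b_words))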
        
       | ayende wrote:
        | The problem with IVF is that you need to find the right
        | centroids, and that doesn't work well if your data grows
        | and mutates over time.
        | 
        | Splitting a centroid is a pretty complex issue, as is
        | handling new clusters forming in one area. For example,
        | let's assume you hold StackOverflow questions & answers.
        | Now you get a massive amount of additional data (> 25% of
        | the existing dataset) that talks about Rust.
        | 
        | You either need to re-calculate the centroids globally, or
        | find a good way to split.
        | 
        | The posting lists are easy to use, but if they're
        | unbalanced, it gets really bad.
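        | 
        | For illustration, in faiss the centroids are frozen at
        | train() time (a sketch; numbers are arbitrary):
        | 
        |   import faiss
        |   import numpy as np
        |   
        |   d, nlist = 128, 1024
        |   xb = np.random.rand(200_000, d).astype(np.float32)
        |   
        |   quantizer = faiss.IndexFlatL2(d)
        |   index = faiss.IndexIVFFlat(quantizer, d, nlist)
        |   index.train(xb)  # k-means runs once; centroids fixed
        |   index.add(xb)
        |   
        |   # a later burst of differently-distributed data (the
        |   # "Rust" questions) maps onto the old cells, so a few
        |   # posting lists balloon until you retrain everything
        |   index.add(np.random.rand(50_000, d)
        |               .astype(np.float32) + 2.0)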
        
       | nostrebored wrote:
        | Nit: the drawback of "not working well in disk-based
        | systems" isn't a drawback unless you're already using
        | disk-based systems.
        | 
        | The difference in recall is also significant -- what you
        | really get with HNSW is a system built for a good
        | cost-to-approximation-quality tradeoff. These IVFPQ-based
        | systems are the ones I've seen people rip out and replace
        | when the use case is high-value.
        | 
        | I really don't understand the push to make pg do
        | everything. It wasn't designed for search, and trying to
        | shove these features into the platform feels like a
        | misguided cost optimization that puts all of your data
        | infrastructure on the same critical path.
        
       | tw04 wrote:
       | > Its reliance on frequent random access patterns makes it
       | incompatible with the sequential access nature of disk storage.
       | 
       | So use SSD? Are people seriously still trying to use spinning
       | disk for database workloads in 2024?
        
         | JoeAltmaier wrote:
          | Data farms are all about cost per GB, and spinning media
          | is a fraction of the cost of flash per GB.
        
         | jandrewrogers wrote:
         | An SSD does not solve the problem of page fault chasing, it
         | just makes it slightly less bad. This is fundamentally a
         | software architecture problem.
         | 
         | This is solved with latency-hiding I/O schedulers, which don't
         | rely on cache locality for throughput.
        
       ___________________________________________________________________
       (page generated 2024-12-23 23:00 UTC)