[HN Gopher] VectorChord: Store 400k Vectors for $1 in PostgreSQL
       ___________________________________________________________________
        
       VectorChord: Store 400k Vectors for $1 in PostgreSQL
        
       Author : gaocegege
       Score  : 151 points
       Date   : 2024-12-05 02:01 UTC (21 hours ago)
        
 (HTM) web link (blog.pgvecto.rs)
 (TXT) w3m dump (blog.pgvecto.rs)
        
       | gaocegege wrote:
       | Hey everyone! We've developed a new PostgreSQL extension that
       | supports 400k vectors for just $1. Check it out!
        
       | 7qW24A wrote:
       | The "external index build" idea seems pretty interesting. How
       | does it work with updates to the underlying data (e.g., new
       | embeddings being added)? For that matter, I guess, how do
       | incremental updates to pgvector's HNSW indexes work?
        
         | VoVAllen wrote:
          | IVF indexing can be divided into two phases: computing the
          | centroids (KMeans), and assigning each point to a centroid to
          | form the inverted lists. The most time-consuming part is the
          | KMeans stage, which can be greatly accelerated with a GPU: 1M
          | 960-dim vectors can be clustered in less than 10s. We do the
          | KMeans phase externally and the assignment phase inside
          | Postgres. The KMeans result depends only on the data
          | distribution, not on any specific rows, so we can sample the
          | data, and inserting/deleting rows won't affect the KMeans
          | result significantly. For updates, it's just a matter of
          | assigning the new vector to a specific cluster and appending
          | it to the corresponding list. That's very light compared to
          | inserting into HNSW.
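          | A minimal sketch of the two phases in NumPy (all names are
          | illustrative; a toy Lloyd's iteration stands in for the
          | external GPU-accelerated KMeans):

```python
import numpy as np

def kmeans(sample, k, iters=20, seed=0):
    """Phase 1 (external): Lloyd's iterations on a sample of the data."""
    rng = np.random.default_rng(seed)
    centroids = sample[rng.choice(len(sample), size=k, replace=False)].copy()
    for _ in range(iters):
        # nearest centroid for every sampled point
        dists = np.linalg.norm(sample[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = sample[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def build_lists(ids, vectors, centroids):
    """Phase 2 (inside Postgres): assign every row to its nearest centroid."""
    dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=-1)
    labels = dists.argmin(axis=1)
    lists = {j: [] for j in range(len(centroids))}
    for row_id, j in zip(ids, labels):
        lists[int(j)].append(row_id)
    return lists

def insert(row_id, vec, centroids, lists):
    """Inserting is cheap: one nearest-centroid lookup plus an append."""
    j = int(np.linalg.norm(centroids - vec, axis=1).argmin())
    lists[j].append(row_id)
    return j
```

          | (In the real extension the centroids live in the index and
          | the posting lists are Postgres pages; this only shows the
          | shape of the two phases.)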
        
       | nextworddev wrote:
        | What dimension vectors are we talking about here?
        
         | jesperwe wrote:
         | First paragraph of tfa mentions "768-dimensional vectors"
        
       | jasonkester wrote:
       | In five pages of text, we never get to learn what a Vector is (in
       | this context), why we'd want to store one in pgsql, or why it
       | costs so much to store them compared to anything else you'd store
       | there.
       | 
       | For an example of how you can communicate with domain experts,
       | while still giving everyone else some form of clue as to what
        | the hell you're talking about, check out the link to the product
       | that this thing claims to be a successor to:
       | 
       | https://pgvecto.rs/
       | 
       | That starts off by telling us what it is and what it does.
        
         | lja wrote:
          | Then this product just isn't for you. My team has
         | been evaluating vector databases for years and everything on
         | the VectorChord page resonated with me. We run one of the
         | world's largest vector databases and we'll likely benchmark
         | vectorchord to see if it lives up to its promises here.
        
           | gaocegege wrote:
           | Hi, we're here to help if you need assistance (via GitHub
           | issue, Discord, or email). Could you let us know the scale of
           | your vectors--are they 1B or 10B?
        
             | lja wrote:
             | 10B, feel free to email me luke@ the domain in my profile.
        
           | thelittleone wrote:
           | Do you have a blog per chance? Or any recommended reading on
           | pre processing / data chunking strategies to improve results?
        
             | VoVAllen wrote:
             | I recently came across a project that looks promising: [Wor
             | dLlama](https://github.com/dleemiller/WordLlama?tab=readme-
             | ov-file#s...). It appears to be well-suited for semantic
             | chunking, though I haven't had a chance to try it out yet.
        
           | redskyluan wrote:
           | Maybe check this https://zilliz.com/pricing
           | 
            | You can easily store 1B vectors in Zilliz Serverless and the
            | cost is incredibly cheap.
        
             | lja wrote:
             | Zilliz gave us a near 6-figure a month quote for our
             | database.
        
         | curl-up wrote:
         | So they should start every one of their posts, on that same
         | site, with a summary of what is available on the homepage?
        
           | jasonkester wrote:
           | Wow. It never occurred to me that this might be anything but
           | the landing page of a product.
           | 
           | The title here, the presentation on the page itself.
           | Everything screams "landing page". I had to go back on a
           | desktop browser to see the word "blog" in the url bar, and
           | mentally shift those graphics and little islands of text
            | around until I could view it from that lens. If it's really
           | just a sub-product of the main product that they're talking
           | about, then yeah, it makes more sense in that context.
           | 
           | But my answer to your question would still be "Yes".
           | Absolutely. If you're a product, the job of your blog is to
           | convince people coming off the street that they need your
           | thing, even if they didn't realize it yet.
           | 
           | Step one of that process is to not bounce them back to the
           | street without any idea what they're looking at.
        
         | Raed667 wrote:
          | And no hosting provider starts by teaching you what
          | HTML is.
        
       | curl-up wrote:
       | Does this mean you won't support pgvecto.rs anymore?
        
         | VoVAllen wrote:
          | I'm the project lead for both projects. We're still in the
          | process of supporting all the functionality from pgvecto.rs in
          | VectorChord (int8, more than 2000-dim vectors, etc.), and
          | we'll provide migration docs for pgvecto.rs users moving to
          | VectorChord. Users will have a better experience with
          | VectorChord due to its tighter integration with the Postgres
          | storage system. We will stop supporting pgvecto.rs early next
          | year, once everything on VectorChord is ready.
        
       | marcyb5st wrote:
       | Awesome work! But aren't the comparisons missing ScaNN [1, 2]? I
       | think it's the overall SOTA [3] at the moment regarding vector
       | indexing.
       | 
       | [1] https://github.com/google-research/google-
       | research/tree/mast...
       | 
       | [2] Also available on something like AlloyDB on GCP:
       | https://cloud.google.com/alloydb/docs/ai/store-index-query-v...
       | 
       | [3] https://ann-benchmarks.com/glove-100-angular_10_angular.html
       | 
       | Disclaimer: Working for Google, but nowhere close to Databases.
        
         | VoVAllen wrote:
         | I'm the project lead for VectorChord. I have tested ScaNN on
         | AlloyDB Omni but have struggled to achieve reasonable recall on
         | the GIST 1M dataset, with results peaking at only around 0.8.
         | The limited documentation makes it challenging to understand
         | the underlying causes of this performance.
         | 
          | Additionally, I couldn't find any performance benchmarks for
          | ScaNN integrated with PostgreSQL, particularly in comparison
          | to pgvector or to standalone ScaNN. The publicly available
          | metrics focus exclusively on query-only indexing outside of
          | the database.
          | 
          | On our side, we've implemented the fastscan kernel for
          | bit-vector scanning, which is considered one of ScaNN's key
          | advantages.
        
           | marcyb5st wrote:
           | Thanks for the explanation!
           | 
           | Really appreciate it and it makes perfect sense.
        
       | tarasglek wrote:
        | I am still waiting for a good pattern for using multivector
        | embeddings like ColBERT and ColPali in Postgres. I get that it's
        | fun to optimize single-vector stuff, but multivector is that
        | happy middle ground between single vector and a reranker that
        | seems to only be validated in specialized, exotic search DBs
        | like Vespa.
        
         | tarasglek wrote:
          | Looks like there is recent work to make a pgvector example
         | for this https://github.com/pgvector/pgvector-
         | python/blob/master/exam...
        
         | VoVAllen wrote:
         | There's no easy way to index ColBert multi-vectors in a
         | scalable way that I know of. Vespa seems to rely heavily on
         | binary quantization, which can cost a lot in recall loss. And
          | for most cases, using ColBERT as a reranker is good enough, as
          | in the pgvector example you posted.
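          | The reranker pattern mentioned above can be sketched with
          | ColBERT-style late-interaction (MaxSim) scoring in NumPy
          | (shapes and names are illustrative; candidates would come
          | from a first-stage single-vector ANN search):

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    # Late interaction: for each query token embedding, take its
    # best-matching document token, then sum over query tokens.
    sims = query_tokens @ doc_tokens.T        # (n_query, n_doc)
    return float(sims.max(axis=1).sum())

def rerank(query_tokens, candidates):
    # candidates: list of (doc_id, doc_token_matrix) pairs; re-score
    # each with MaxSim and sort best-first.
    scored = [(doc_id, maxsim(query_tokens, toks)) for doc_id, toks in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```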
        
           | tarasglek wrote:
            | Seems like doing a proper relational 1:N chunk:multiple-
            | vectors foreign key, binarization, and a clever join or
            | multistage CTE would get us pretty close to useful.
           | 
           | I am ok with it being less efficient as the dev ux will be
           | amazing. Vespa ops (even in their cloud) are a complete
           | nightmare compared to postgres
        
         | jonathan-adly wrote:
         | I would like to throw our project in the ring that solves this
         | problem: https://github.com/tjmlabs/ColiVara
         | 
          | 1. Uses half-vecs, so you cut down everything by half with no
          | recall loss.
          | 2. Uses token pooling with hierarchical clustering at 3, so
          | you further cut things down by 2/3rds with <1% loss.
          | 3. Everything is on Postgres and pgvector, so you can do all
          | the Postgres stuff and decrease corpus size with document
          | metadata filtering.
          | 4. We have a 5000+ page corpus in production with <3 seconds
          | latency.
          | 5. We benchmark against the Vidore leaderboard, and are very
          | near SOTA.
         | 
         | You can read about half-vecs here:
         | https://jkatz05.com/post/postgres/pgvector-scalar-binary-qua...
         | 
         | Hierarchical token pooling:
         | https://www.answer.ai/posts/colbert-pooling.html
         | 
         | And how we implemented them here: https://blog.colivara.com/
        
           | __jl__ wrote:
           | I really like the idea of ColPali and products building on it
           | but I am still unsure about the applications for which it
           | makes most sense. We mostly deal with reports that are 80-90%
           | text, 10-20% figures and tables. Does a vision first approach
           | makes sense in this context? My sense is that text-based
           | embeddings are better in mostly text contexts. Layout, for
           | example, is pretty much irrelevant but plays into vision-
           | based approaches. What is your sense about this?
        
             | jonathan-adly wrote:
              | So - the synthetic QA datasets in Vidore are exactly like
              | that: 90% text, 10% charts/tables. OCR + BM25 is at ~90%
              | NDCG@5, which is pretty decent. ColPali/ours is at ~98%.
             | 
             | It is a small upgrade, but one nonetheless. The complexity,
             | and the cost of multi-vectors *might* not make this worth
             | it, really depends on how accuracy-critical the task is.
             | 
              | For example, one of our customers does this over FDA
              | monographs, which are like 95%+ text and 5% tables - the
              | misses were extremely painful, even though there weren't
              | that many in text-based pipelines. So the migration made
              | sense for them.
        
       | rkuzsma wrote:
       | Would you be willing to speculate on how VectorChord's ingestion
       | and query performance might compare to Elasticsearch/OpenSearch
       | for dense vector and sparse vector search use cases, particularly
       | when dealing with larger full text data sets (>5M records)?
        
         | VoVAllen wrote:
         | In the LAION-5M benchmark, we've compared our performance
         | against ElasticSearch and OpenSearch. However, comparing
         | ingestion performance is more challenging due to differences in
         | architecture. Both ElasticSearch and OpenSearch, like most
         | vector databases, use the concept of shards. Each shard
         | represents a separate vector index, and queries aggregate
         | results across these shards. Larger shards lead to faster
         | queries but come with higher resource requirements and slower
         | update speeds.
         | 
         | It's also worth noting that ElasticSearch has implemented
         | RaBitQ support for HNSW. So it's difficult to compare without
         | running actual benchmarks. However, ElasticSearch typically
         | requires at least double, if not triple, the memory size of the
         | vector dataset to maintain system stability. In contrast,
         | PostgreSQL can achieve a stable system with far fewer resources
         | --for example, 32GB of memory is sufficient to manage 100
         | million vectors efficiently.
         | 
          | From my perspective, queries would be faster than in
          | ElasticSearch due to the extensive optimizations, and updates
          | (insert and delete) would be much, much faster due to using
          | IVF instead of HNSW.
        
       | estebarb wrote:
        | It would have been nice to see a comparison with pgvectorscale,
        | which uses binary quantization and StreamingDiskANN.
        
       | whakim wrote:
       | Could you talk about how updates are handled? My understanding is
       | that IVF can struggle if you're doing a lot of inserts/updates
       | after index creation, as the data needs to be incrementally re-
       | clustered (or the entire index needs to be rebuilt) in order to
       | ensure the clusters continue to reflect the shape of your data?
        
         | VoVAllen wrote:
         | We don't perform any reclustering. As you said, users would
         | need to rebuild the index if they want to recluster. However,
         | based on our observations, the speed remains acceptable even
          | with significant data growth. In a simple experiment using
          | nlist=1 on the GIST dataset, top-10 retrieval took less than
          | twice the time compared to using nlist=4096. This is because
          | only the quantized vectors (with 32x compression) need to be
          | inserted into the posting list, so only the quantized-vector
          | distance computations increase, and they account for only a
          | small share of the total time.
         | Most of the time is spent on re-ranking using full-precision
         | vectors. Let's say the breakdown is approximately 20% for
         | quantized vector computations and 80% for full-precision vector
         | computations. So even if the time for quantized vector
         | computations triples, the overall increase in query time would
         | be only about 40%.
         | 
         | If the data distribution shifts, the optimal solution would be
         | to rebuild the index. We believe that HNSW also experiences
         | challenges with data distribution to some extent. However,
         | without rebuilding, our observations suggest that users are
         | more likely to experience slightly longer query times rather
         | than a significant loss in recall.
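          | A quick check of the arithmetic in the comment above, under
          | its assumed 20/80 split of query time:

```python
# Assumed split of query time (the figures stated in the comment above).
quantized_frac = 0.20   # fast scan over bit-quantized vectors
rerank_frac = 0.80      # re-ranking with full-precision vectors

# If the quantized-scan work triples (e.g. posting lists grow a lot)
# while re-ranking stays fixed, the total relative query time is:
new_total = 3 * quantized_frac + rerank_frac   # 1.4x the baseline
print(f"{new_total - 1:.0%} slower")           # prints "40% slower"
```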
        
       | dmezzetti wrote:
       | This looks like an interesting project.
       | 
       | Though it's worth noting that the license is AGPL. So if the idea
       | is for this to take over for pgvecto.rs, it's an important data
       | point for those building SaaS products.
       | 
       | It will make pgvector the only permissively licensed option,
       | given it has the same license as Postgres.
        
       | _mmarshall wrote:
       | The cost to store a static set of 400k 768-dimension vectors is
       | also $1 a month on Datastax's AstraDB. However, for that $1,
       | AstraDB replicates the data 3x instead of storing it on a single
       | machine.
       | 
       | Here is a link to the cost calculator. Note that the calculator
        | includes the cost of ingestion, but the article only mentions
        | storage costs, not ingestion costs:
       | costs, not ingestion costs:
       | https://www.datastax.com/pricing/vector-search?cloudProvider...
       | 
       | Disclaimer: I work on vectorsearch/AstraDB at DataStax.
        
         | VoVAllen wrote:
         | It's hard to compare the cost with the Serverless pricing
         | model, as write and read have extra costs. On the pricing page,
         | datastax costs $4000 to write 100M 768-dim vectors. And 10M
         | query will cost $300, which is only 4 QPS. As comparison,
         | VectorChord can achieve 100 QPS on $250 instance.
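          | The "4 QPS" figure appears to come from spreading the quoted
          | query volume over a month; a quick sanity check (the 30-day
          | month is our assumption, not DataStax's published definition):

```python
# Rough check of the quoted figures: 10M queries for $300.
queries = 10_000_000
seconds_per_month = 30 * 24 * 3600      # 2,592,000 s in a 30-day month
sustained_qps = queries / seconds_per_month
cost_per_query = 300 / queries          # $0.00003 per query
print(round(sustained_qps, 2))          # 3.86, i.e. roughly 4 QPS
```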
        
       ___________________________________________________________________
       (page generated 2024-12-05 23:02 UTC)