[HN Gopher] VectorChord: Store 400k Vectors for $1 in PostgreSQL
___________________________________________________________________
VectorChord: Store 400k Vectors for $1 in PostgreSQL
Author : gaocegege
Score : 151 points
Date : 2024-12-05 02:01 UTC (21 hours ago)
(HTM) web link (blog.pgvecto.rs)
(TXT) w3m dump (blog.pgvecto.rs)
| gaocegege wrote:
| Hey everyone! We've developed a new PostgreSQL extension that
| supports 400k vectors for just $1. Check it out!
| 7qW24A wrote:
| The "external index build" idea seems pretty interesting. How
| does it work with updates to the underlying data (e.g., new
| embeddings being added)? For that matter, I guess, how do
| incremental updates to pgvector's HNSW indexes work?
| VoVAllen wrote:
| IVF indexing has two phases: computing the centroids
| (KMeans), and assigning each point to a centroid to form the
| inverted lists. The most time-consuming part is the KMeans
| stage, and it can be greatly accelerated with a GPU -- 1M
| 960-dim vectors can be clustered in under 10s. We do the
| KMeans phase externally and the assignment phase inside
| Postgres. The KMeans result depends only on the data
| distribution, not on any specific data point, so we can sample
| the data, and inserting/deleting data won't change the KMeans
| result significantly. For an update, it's just a matter of
| assigning the new vector to a specific cluster and appending
| it to the corresponding list. That's very light compared to
| inserting into HNSW.
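| A minimal numpy sketch of the two phases described above
| (illustrative only, not VectorChord's actual implementation;
| the function names are made up):

```python
import numpy as np

def kmeans(sample, k, iters=20, seed=0):
    """Phase 1: compute centroids on a sample (could run externally, e.g. on GPU)."""
    rng = np.random.default_rng(seed)
    centroids = sample[rng.choice(len(sample), k, replace=False)]
    for _ in range(iters):
        # assign each sampled point to its nearest centroid
        d = np.linalg.norm(sample[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = sample[labels == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids

def assign(vectors, centroids):
    """Phase 2: build the inverted lists by assigning every vector to a centroid."""
    d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    return {j: np.where(labels == j)[0].tolist() for j in range(len(centroids))}

def insert(vector, centroids, lists, new_id):
    """An insert only finds the nearest centroid and appends -- no re-clustering."""
    j = int(np.linalg.norm(centroids - vector, axis=1).argmin())
    lists[j].append(new_id)
    return j

# build centroids on a sample, then assign the full set
data = np.random.default_rng(1).normal(size=(1000, 8)).astype(np.float32)
cents = kmeans(data[:200], k=4)
lists = assign(data, cents)
```

| Because phase 1 only needs a sample, it can run on a GPU
| outside the database and only the centroids are shipped back.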
| nextworddev wrote:
| What dimension vectors are we talking here
| jesperwe wrote:
| First paragraph of tfa mentions "768-dimensional vectors"
| jasonkester wrote:
| In five pages of text, we never get to learn what a Vector is (in
| this context), why we'd want to store one in pgsql, or why it
| costs so much to store them compared to anything else you'd store
| there.
|
| For an example of how you can communicate with domain experts,
| while still giving everyone else some form of clue as to what
| the hell you're talking about, check out the link to the product
| that this thing claims to be a successor to:
|
| https://pgvecto.rs/
|
| That starts off by telling us what it is and what it does.
| lja wrote:
| That's because this product isn't for you then. My team has
| been evaluating vector databases for years and everything on
| the VectorChord page resonated with me. We run one of the
| world's largest vector databases and we'll likely benchmark
| vectorchord to see if it lives up to its promises here.
| gaocegege wrote:
| Hi, we're here to help if you need assistance (via GitHub
| issue, Discord, or email). Could you let us know the scale of
| your vectors--are they 1B or 10B?
| lja wrote:
| 10B, feel free to email me luke@ the domain in my profile.
| thelittleone wrote:
| Do you have a blog per chance? Or any recommended reading on
| pre processing / data chunking strategies to improve results?
| VoVAllen wrote:
| I recently came across a project that looks promising:
| [WordLlama](https://github.com/dleemiller/WordLlama?tab=readme-
| ov-file#s...). It appears to be well-suited for semantic
| chunking, though I haven't had a chance to try it out yet.
| redskyluan wrote:
| Maybe check this https://zilliz.com/pricing
|
| You can easily store 1B vectors in Zilliz Serverless and the
| cost is incredibly cheap
| lja wrote:
| Zilliz gave us a near 6-figure a month quote for our
| database.
| curl-up wrote:
| So they should start every one of their posts, on that same
| site, with a summary of what is available on the homepage?
| jasonkester wrote:
| Wow. It never occurred to me that this might be anything but
| the landing page of a product.
|
| The title here, the presentation on the page itself:
| everything screams "landing page". I had to go back to a
| desktop browser to see the word "blog" in the URL bar, and
| mentally shift those graphics and little islands of text
| around until I could view it through that lens. If it's
| really just a sub-product of the main product they're talking
| about, then yeah, it makes more sense in that context.
|
| But my answer to your question would still be "Yes".
| Absolutely. If you're a product, the job of your blog is to
| convince people coming off the street that they need your
| thing, even if they didn't realize it yet.
|
| Step one of that process is to not bounce them back to the
| street without any idea what they're looking at.
| Raed667 wrote:
| And every hosting provider doesn't start by teaching you what
| HTML is
| curl-up wrote:
| Does this mean you won't support pgvecto.rs anymore?
| VoVAllen wrote:
| I'm the project lead for both projects. We're still in the
| process of supporting all the functions from pgvecto.rs in
| VectorChord (int8, vectors with more than 2000 dims, etc.).
| We'll provide migration docs for pgvecto.rs users moving to
| VectorChord. Users will have a better experience with
| VectorChord due to its better integration with the Postgres
| storage system. We will stop supporting pgvecto.rs early next
| year, once everything in VectorChord is ready.
| marcyb5st wrote:
| Awesome work! But aren't the comparisons missing ScaNN [1, 2]? I
| think it's the overall SOTA [3] at the moment regarding vector
| indexing.
|
| [1] https://github.com/google-research/google-
| research/tree/mast...
|
| [2] Also available on something like AlloyDB on GCP:
| https://cloud.google.com/alloydb/docs/ai/store-index-query-v...
|
| [3] https://ann-benchmarks.com/glove-100-angular_10_angular.html
|
| Disclaimer: Working for Google, but nowhere close to Databases.
| VoVAllen wrote:
| I'm the project lead for VectorChord. I have tested ScaNN on
| AlloyDB Omni but have struggled to achieve reasonable recall on
| the GIST 1M dataset, with results peaking at only around 0.8.
| The limited documentation makes it challenging to understand
| the underlying causes of this performance.
|
| Additionally, I couldn't find any performance benchmarks for
| ScaNN integrated with PostgreSQL, particularly in comparison
| to pgvector or to its standalone version. The publicly
| available metrics focus exclusively on query-only indexing
| outside of the database.
|
| On our side, we've implemented the fastscan kernel for bit-
| vector scanning, which is considered one of ScaNN's key
| advantages.
| marcyb5st wrote:
| Thanks for the explanation!
|
| Really appreciate it and it makes perfect sense.
| tarasglek wrote:
| I am still waiting for a good pattern for using multi-vector
| embeddings like ColBERT and ColPali in Postgres. I get that
| it's fun to optimize single-vector stuff, but multi-vector is
| that happy middle ground between single vector and a reranker,
| and it seems to be validated only in specialized, exotic
| search DBs like Vespa
| tarasglek wrote:
| Looks like there is recent work to make a pgvector example
| for this: https://github.com/pgvector/pgvector-
| python/blob/master/exam...
| VoVAllen wrote:
| There's no easy way I know of to index ColBERT multi-vectors
| in a scalable way. Vespa seems to rely heavily on binary
| quantization, which can cost a lot in recall. And for most
| cases, using ColBERT as a reranker is good enough, as in the
| pgvector example you posted.
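| For reference, the reranking step is just ColBERT's MaxSim
| late-interaction score applied to candidates from a cheap
| single-vector first stage. A small numpy sketch (hypothetical
| names, not from any specific library):

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    # Late interaction: each query token takes the similarity of its
    # best-matching document token; the score is the sum over query tokens.
    sims = query_vecs @ doc_vecs.T        # (q_tokens, d_tokens)
    return float(sims.max(axis=1).sum())

def rerank(query_vecs, candidates):
    # candidates: list of (doc_id, doc_token_matrix) from the first stage
    scored = [(doc_id, maxsim(query_vecs, dv)) for doc_id, dv in candidates]
    return sorted(scored, key=lambda s: s[1], reverse=True)

q = np.array([[1.0, 0.0], [0.0, 1.0]])        # 2 query token embeddings
doc_a = np.array([[0.9, 0.1], [0.1, 0.9]])    # matches both query tokens
doc_b = np.array([[0.5, 0.5]])                # matches neither well
ranked = rerank(q, [("b", doc_b), ("a", doc_a)])
```

| Only the candidate documents' token matrices need to be
| fetched, so this stays cheap even though multi-vector scoring
| is quadratic in token count.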
| tarasglek wrote:
| Seems like doing a proper relational 1:N chunk:multiple-
| vectors foreign key, binarization, and a clever join or
| multistage CTE would get us pretty close to useful.
|
| I am ok with it being less efficient as the dev ux will be
| amazing. Vespa ops (even in their cloud) are a complete
| nightmare compared to postgres
| jonathan-adly wrote:
| I would like to throw our project in the ring that solves this
| problem: https://github.com/tjmlabs/ColiVara
|
| 1. Uses half-vecs, so you cut everything down by half with no
| recall loss.
| 2. Uses token pooling with hierarchical clustering at 3, so
| you further cut things down by 2/3rds with <1% loss.
| 3. Everything is on Postgres and pgvector, so you can do all
| the Postgres stuff and decrease corpus size with document-
| metadata filtering.
| 4. We have a 5000+ page corpus in production with <3 seconds
| latency.
| 5. We benchmark against the Vidore leaderboard, and are very
| near SOTA.
|
| You can read about half-vecs here:
| https://jkatz05.com/post/postgres/pgvector-scalar-binary-qua...
|
| Hierarchical token pooling:
| https://www.answer.ai/posts/colbert-pooling.html
|
| And how we implemented them here: https://blog.colivara.com/
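| The token-pooling idea from point 2 can be sketched in a few
| lines: cluster similar token vectors hierarchically, then
| mean-pool each cluster. This is an assumed reading of the
| answer.ai post, not ColiVara's actual code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def pool_tokens(token_vecs, factor=3):
    # Shrink n token vectors to roughly n/factor: hierarchically cluster
    # similar tokens (Ward linkage), then mean-pool within each cluster.
    n = len(token_vecs)
    k = max(1, n // factor)
    labels = fcluster(linkage(token_vecs, method="ward"),
                      t=k, criterion="maxclust")
    return np.stack([token_vecs[labels == c].mean(axis=0)
                     for c in np.unique(labels)])

tokens = np.random.default_rng(0).normal(size=(12, 8))
pooled = pool_tokens(tokens, factor=3)   # 12 token vectors -> ~4 pooled
```

| Pooling at factor 3 cuts the multi-vector storage (and MaxSim
| cost) to about a third, which is where the "<1% loss" trade-off
| claimed above comes in.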
| __jl__ wrote:
| I really like the idea of ColPali and products building on it
| but I am still unsure about the applications for which it
| makes most sense. We mostly deal with reports that are 80-90%
| text, 10-20% figures and tables. Does a vision-first approach
| make sense in this context? My sense is that text-based
| embeddings are better in mostly-text contexts. Layout, for
| example, is pretty much irrelevant but plays into vision-
| based approaches. What is your sense about this?
| jonathan-adly wrote:
| So - the synthetic QA datasets in Vidore are exactly like
| that: 90% text, 10% charts/tables. OCR + BM25 is at ~90%
| NDCG@5, which is pretty decent. ColPali/ours is at ~98%.
|
| It is a small upgrade, but one nonetheless. The complexity,
| and the cost of multi-vectors *might* not make this worth
| it, really depends on how accuracy-critical the task is.
|
| For example, one of our customers runs this over FDA
| monographs, which are 95%+ text and 5% tables. For them, the
| misses were extremely painful, even though there weren't that
| many in text-based pipelines. So the migration made sense for
| them.
| rkuzsma wrote:
| Would you be willing to speculate on how VectorChord's ingestion
| and query performance might compare to Elasticsearch/OpenSearch
| for dense vector and sparse vector search use cases, particularly
| when dealing with larger full text data sets (>5M records)?
| VoVAllen wrote:
| In the LAION-5M benchmark, we've compared our performance
| against ElasticSearch and OpenSearch. However, comparing
| ingestion performance is more challenging due to differences in
| architecture. Both ElasticSearch and OpenSearch, like most
| vector databases, use the concept of shards. Each shard
| represents a separate vector index, and queries aggregate
| results across these shards. Larger shards lead to faster
| queries but come with higher resource requirements and slower
| update speeds.
|
| It's also worth noting that ElasticSearch has implemented
| RaBitQ support for HNSW. So it's difficult to compare without
| running actual benchmarks. However, ElasticSearch typically
| requires at least double, if not triple, the memory size of the
| vector dataset to maintain system stability. In contrast,
| PostgreSQL can achieve a stable system with far fewer resources
| --for example, 32GB of memory is sufficient to manage 100
| million vectors efficiently.
|
| From my perspective, it would be faster at queries compared
| to ElasticSearch due to our extensive optimizations, and
| much, much faster on updates (inserts and deletes) due to
| using IVF instead of HNSW.
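| A back-of-envelope calculation (my own arithmetic, not from
| the benchmark) shows why a 32GB box is plausible for 100M
| vectors when queries scan bit-quantized vectors in memory and
| only touch the full-precision copy on disk for re-ranking:

```python
# 100M 768-dim vectors: full precision vs. 1-bit quantization (32x smaller)
n, dim = 100_000_000, 768
full_gb = n * dim * 4 / 2**30     # float32 vectors: ~286 GB, far over 32 GB
quant_gb = n * dim / 8 / 2**30    # 1 bit per dimension: ~9 GB, fits easily
```

| The quantized copy fits in RAM with room to spare, which is
| the gap ElasticSearch's 2-3x-of-dataset memory requirement
| can't close.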
| estebarb wrote:
| A comparison with pgvectorscale, which uses binary
| quantization and StreamingDiskANN, would have been nice.
| whakim wrote:
| Could you talk about how updates are handled? My understanding is
| that IVF can struggle if you're doing a lot of inserts/updates
| after index creation, as the data needs to be incrementally re-
| clustered (or the entire index needs to be rebuilt) in order to
| ensure the clusters continue to reflect the shape of your data?
| VoVAllen wrote:
| We don't perform any reclustering. As you said, users would
| need to rebuild the index if they want to recluster. However,
| based on our observations, the speed remains acceptable even
| with significant data growth. In a simple experiment using
| nlist=1 on the GIST dataset, top-10 retrieval took less than
| twice the time of nlist=4096. This is because only the
| quantized vectors (at 32x compression) need to be inserted
| into the posting list, so only the quantized-vector distance
| computations increase, and those account for only a small
| fraction of the query time.
| Most of the time is spent on re-ranking using full-precision
| vectors. Let's say the breakdown is approximately 20% for
| quantized vector computations and 80% for full-precision vector
| computations. So even if the time for quantized vector
| computations triples, the overall increase in query time would
| be only about 40%.
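| The arithmetic behind that 40% figure, spelled out (the 20/80
| split is the illustrative breakdown from the comment above):

```python
# If quantized scanning is ~20% of query time and full-precision
# re-ranking is ~80%, tripling the quantized part raises the total
# query time by only a factor of 1.4.
quant, rerank = 0.20, 0.80
slowdown = (3 * quant + rerank) / (quant + rerank)   # = 1.4
```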
|
| If the data distribution shifts, the optimal solution would be
| to rebuild the index. We believe that HNSW also experiences
| challenges with data distribution to some extent. However,
| without rebuilding, our observations suggest that users are
| more likely to experience slightly longer query times rather
| than a significant loss in recall.
| dmezzetti wrote:
| This looks like an interesting project.
|
| Though it's worth noting that the license is AGPL. So if the idea
| is for this to take over for pgvecto.rs, it's an important data
| point for those building SaaS products.
|
| It will make pgvector the only permissively licensed option,
| given it has the same license as Postgres.
| _mmarshall wrote:
| The cost to store a static set of 400k 768-dimension vectors is
| also $1 a month on Datastax's AstraDB. However, for that $1,
| AstraDB replicates the data 3x instead of storing it on a single
| machine.
|
| Here is a link to the cost calculator. Note that the calculator
| includes cost of ingestion, but the article only mentions storage
| costs, not ingestion costs:
| https://www.datastax.com/pricing/vector-search?cloudProvider...
|
| Disclaimer: I work on vectorsearch/AstraDB at DataStax.
| VoVAllen wrote:
| It's hard to compare costs under a serverless pricing model,
| as writes and reads have extra costs. Per the pricing page,
| DataStax costs $4000 to write 100M 768-dim vectors, and 10M
| queries cost $300, which is only ~4 QPS sustained over a
| month. As a comparison, VectorChord can achieve 100 QPS on a
| $250 instance.
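| Where the ~4 QPS comes from (my own arithmetic, assuming the
| 10M queries are spread evenly over a 30-day month):

```python
# 10M queries / seconds in a 30-day month ~= 3.9 sustained QPS
queries = 10_000_000
qps = queries / (30 * 24 * 3600)
```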
___________________________________________________________________
(page generated 2024-12-05 23:02 UTC)