[HN Gopher] Vector databases are the wrong abstraction
___________________________________________________________________
Vector databases are the wrong abstraction
Author : jascha_eng
Score : 116 points
Date : 2024-10-29 15:40 UTC (7 hours ago)
(HTM) web link (www.timescale.com)
(TXT) w3m dump (www.timescale.com)
| avthar wrote:
| Hey HN! Post co-author here, excited to share our new open-source
| PostgreSQL tool that re-imagines vector embeddings as database
| indexes. It's not literally an index, but it functions like one,
| keeping embeddings up to date as source data is added, deleted,
| or changed.
|
| Right now the system only supports OpenAI as an embedding
| provider, but we plan to extend with local and OSS model support
| soon.
|
| Eager to hear your feedback and reactions. If you'd like to file
| an issue or, better yet, a PR, you can do so here [1].
|
| [1]: https://github.com/timescale/pgai
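|
| For a rough sense of the shape, a vectorizer declaration loosely
| modeled on the pgai README might look like the sketch below (run
| from Python via psycopg). The source table name is made up, and
| the exact ai.create_vectorizer arguments can differ between
| versions, so treat the names as assumptions rather than the
| definitive API.
|
|     import psycopg
|
|     with psycopg.connect("dbname=app") as conn:
|         conn.execute("""
|           SELECT ai.create_vectorizer(
|             'blog'::regclass,          -- hypothetical source table
|             destination => 'blog_embeddings',
|             embedding   => ai.embedding_openai(
|                              'text-embedding-3-small', 1536),
|             chunking    => ai.chunking_recursive_character_text_splitter(
|                              'content')
|           )
|         """)
|         # From here on, inserts, updates, and deletes on blog
|         # keep blog_embeddings in sync in the background.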
| hhdhdbdb wrote:
| Pretty smart. Why is the DB API the abstraction layer, though?
| Why not two columns and a microservice? I assume you are making
| async calls to get the embeddings?
|
| I ask because it seems unusual. An index would suit something
| synchronous better, but async things like embeddings, geocoding
| an address, or flagging an email as spam feel like app-level
| stuff.
| cevian wrote:
| (post co-author here)
|
| The DB is the right layer from an interface point of view --
| because that's where the data properties should be defined. We
| also use the DB to keep track of what needs to be done, because
| we can leverage transactions and triggers to make sure we never
| miss any data. From an implementation point of view, the actual
| embedding happens outside the database, in a Python worker or
| cloud function.
|
| Merging the embeddings and the original data into a single view
| gives you the full feature set of SQL rather than constraining
| you to a REST API.
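|
| A minimal sketch of that pattern (not pgai's actual
| implementation; table, trigger, and function names are
| hypothetical): a trigger fills a work-queue table inside the same
| transaction as the write, and an external Python worker drains
| the queue and writes embeddings back. It assumes a docs(id, body,
| embedding) table, and embed() stands in for the OpenAI call,
| returning a pgvector-style text literal.
|
|     import psycopg
|
|     SETUP_SQL = """
|     CREATE TABLE IF NOT EXISTS embedding_queue (
|         doc_id    bigint PRIMARY KEY,
|         queued_at timestamptz NOT NULL DEFAULT now()
|     );
|
|     CREATE OR REPLACE FUNCTION enqueue_embedding() RETURNS trigger AS $$
|     BEGIN
|         INSERT INTO embedding_queue (doc_id) VALUES (NEW.id)
|         ON CONFLICT (doc_id) DO UPDATE SET queued_at = now();
|         RETURN NEW;
|     END;
|     $$ LANGUAGE plpgsql;
|
|     CREATE OR REPLACE TRIGGER docs_enqueue
|         AFTER INSERT OR UPDATE OF body ON docs
|         FOR EACH ROW EXECUTE FUNCTION enqueue_embedding();
|     """
|
|     def drain_queue(conn: psycopg.Connection, embed) -> None:
|         # Claim queued rows and write embeddings back, all in one
|         # transaction so nothing is lost if the worker crashes.
|         with conn.transaction():
|             rows = conn.execute(
|                 """
|                 DELETE FROM embedding_queue USING docs
|                 WHERE docs.id = embedding_queue.doc_id
|                 RETURNING docs.id, docs.body
|                 """
|             ).fetchall()
|             for doc_id, body in rows:
|                 conn.execute(
|                     "UPDATE docs SET embedding = %s::vector WHERE id = %s",
|                     (embed(body), doc_id),
|                 )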
| dinobones wrote:
| Wow, actually a good point I haven't seen anyone make.
|
| Taking raw embeddings and storing them in vector databases would
| be like taking raw n-grams of your text and putting them into a
| database for search.
|
| Storing documents makes much more sense.
| choilive wrote:
| Been using pgvector for a while, and to me it was kind of obvious
| that the source document and the embeddings are fundamentally
| linked, so we always stored them "together". Basically anyone
| doing embeddings at scale is doing something similar to what Pgai
| Vectorizer does, so it's certainly a nice abstraction.
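|
| A minimal pgvector sketch of that "stored together" setup (the
| table name, dimensions, and embed() helper are hypothetical): the
| source text and its embedding live in the same row, so they can't
| drift apart.
|
|     import psycopg
|
|     conn = psycopg.connect("dbname=app", autocommit=True)
|     conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
|     conn.execute("""
|         CREATE TABLE IF NOT EXISTS documents (
|             id        bigserial PRIMARY KEY,
|             body      text NOT NULL,
|             embedding vector(1536)  -- lives next to the source text
|         )
|     """)
|
|     def vec_literal(v: list[float]) -> str:
|         # pgvector accepts the text form '[v1,v2,...]'
|         return "[" + ",".join(str(x) for x in v) + "]"
|
|     def add_document(body: str, embed) -> None:
|         conn.execute(
|             "INSERT INTO documents (body, embedding)"
|             " VALUES (%s, %s::vector)",
|             (body, vec_literal(embed(body))),
|         )
|
|     def search(query: str, embed, k: int = 5):
|         # <=> is pgvector's cosine-distance operator
|         return conn.execute(
|             "SELECT id, body FROM documents"
|             " ORDER BY embedding <=> %s::vector LIMIT %s",
|             (vec_literal(embed(query)), k),
|         ).fetchall()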
| jdthedisciple wrote:
| I used FAISS, as it also allowed me to trivially store them
| together.
|
| I don't know how well it scales, though; it's just doing its job
| at my hobby-project scale.
|
| For my few hundred thousand embeddings, I must say the
| performance was satisfactory.
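|
| For reference, a small FAISS sketch of this kind of setup: FAISS
| stores vectors keyed by integer ids, so a common approach is an
| IndexIDMap plus a side store keyed by the same ids. Names and the
| 384-dim size are illustrative.
|
|     import faiss
|     import numpy as np
|
|     dim = 384                              # embedding dimensionality
|     index = faiss.IndexIDMap(faiss.IndexFlatIP(dim))
|     docs: dict[int, str] = {}              # id -> source text/metadata
|
|     def add(doc_id: int, text: str, emb: np.ndarray) -> None:
|         vec = emb.astype("float32").reshape(1, -1)
|         faiss.normalize_L2(vec)            # cosine via inner product
|         index.add_with_ids(vec, np.array([doc_id], dtype="int64"))
|         docs[doc_id] = text
|
|     def update(doc_id: int, text: str, emb: np.ndarray) -> None:
|         # Re-embed and replace when the source changes.
|         index.remove_ids(np.array([doc_id], dtype="int64"))
|         add(doc_id, text, emb)
|
|     def search(query_emb: np.ndarray, k: int = 5):
|         q = query_emb.astype("float32").reshape(1, -1)
|         faiss.normalize_L2(q)
|         scores, ids = index.search(q, k)
|         return [(docs[int(i)], float(s))
|                 for i, s in zip(ids[0], scores[0]) if i != -1]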
| morgango wrote:
| Great point!
|
| (Disclaimer: I work for Elastic)
|
| Elasticsearch has recently added a data type called
| semantic_text, which automatically chunks text, calculates
| embeddings, and stores the chunks with sensible defaults.
|
| Queries are similarly simplified: vectors are calculated and
| compared internally, which means a lot less I/O and much simpler
| client code.
|
| https://www.elastic.co/search-labs/blog/semantic-search-simp...
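|
| A hedged sketch of that flow with the official Python client.
| The index and field names are made up, and depending on the
| cluster version the semantic_text mapping may also need an
| inference_id pointing at a configured endpoint, so check the
| linked post for exact parameters.
|
|     from elasticsearch import Elasticsearch
|
|     es = Elasticsearch("http://localhost:9200")
|
|     # Chunking and embedding happen server-side for semantic_text
|     # fields.
|     es.indices.create(
|         index="articles",
|         mappings={"properties": {"body": {"type": "semantic_text"}}},
|     )
|
|     es.index(index="articles",
|              document={"body": "Postgres can double as a vector store."})
|
|     resp = es.search(
|         index="articles",
|         query={"semantic": {"field": "body",
|                             "query": "vector search in postgres"}},
|     )
|     for hit in resp["hits"]["hits"]:
|         print(hit["_score"], hit["_source"]["body"])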
| jdthedisciple wrote:
| How does their embedding model compare in terms of retrieval
| accuracy to, say, `text-embedding-3-small` and
| `text-embedding-3-large`?
| binarymax wrote:
| It's impossible to answer that question without knowing what
| content/query domain you are embedding. Check out the MTEB
| leaderboard, dig into the retrieval benchmark, and look for
| analogous datasets.
| splike wrote:
| You can use OpenAI embeddings in Elastic if you don't want to use
| their ELSER sparse embeddings.
| pjot wrote:
| I made something similar, but used DuckDB as the vector store
| (and query engine)! It's impressively fast.
|
| https://github.com/patricktrainer/duckdb-embedding-search
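|
| In the same spirit, a small DuckDB sketch: embeddings live in a
| fixed-size array column next to the text, and similarity is
| computed in SQL with the built-in array_cosine_similarity. The
| 3-dimensional toy vectors stand in for real embeddings, and the
| list-to-FLOAT[3] parameter cast is written as I understand
| current DuckDB handles it, so adjust for your version.
|
|     import duckdb
|
|     con = duckdb.connect("vectors.duckdb")
|     con.execute(
|         "CREATE TABLE IF NOT EXISTS docs "
|         "(id INTEGER, body TEXT, emb FLOAT[3])"
|     )
|     con.execute(
|         "INSERT INTO docs VALUES "
|         "(1, 'hello world', [0.1, 0.9, 0.0]), "
|         "(2, 'goodbye',     [0.8, 0.1, 0.1])"
|     )
|
|     query = [0.1, 0.8, 0.1]   # stand-in for a query embedding
|     rows = con.execute(
|         """
|         SELECT id, body,
|                array_cosine_similarity(emb, ?::FLOAT[3]) AS score
|         FROM docs
|         ORDER BY score DESC
|         LIMIT 5
|         """,
|         [query],
|     ).fetchall()
|     print(rows)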
| markusw wrote:
| I'm using sqlite-vec along with FTS5 in (you guessed it) SQLite
| and it's pretty cool. :)
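|
| A minimal sketch of that combination, with hypothetical table
| names and a toy 4-dimensional embedding: one SQLite file holds
| the documents, an FTS5 index for lexical search, and a vec0
| virtual table for vectors.
|
|     import sqlite3
|     import sqlite_vec
|
|     db = sqlite3.connect("search.db")
|     db.enable_load_extension(True)
|     sqlite_vec.load(db)
|     db.enable_load_extension(False)
|
|     db.execute("CREATE TABLE IF NOT EXISTS docs "
|                "(id INTEGER PRIMARY KEY, body TEXT)")
|     db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs_fts "
|                "USING fts5(body)")
|     db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs_vec "
|                "USING vec0(embedding float[4])")
|
|     def add(doc_id, body, emb):
|         db.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, body))
|         db.execute("INSERT INTO docs_fts (rowid, body) VALUES (?, ?)",
|                    (doc_id, body))
|         db.execute("INSERT INTO docs_vec (rowid, embedding) VALUES (?, ?)",
|                    (doc_id, sqlite_vec.serialize_float32(emb)))
|
|     def lexical(q):
|         return db.execute(
|             "SELECT rowid FROM docs_fts WHERE docs_fts MATCH ?", (q,)
|         ).fetchall()
|
|     def semantic(q_emb):
|         return db.execute(
|             "SELECT rowid, distance FROM docs_vec "
|             "WHERE embedding MATCH ? ORDER BY distance LIMIT 3",
|             (sqlite_vec.serialize_float32(q_emb),),
|         ).fetchall()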
| mattxxx wrote:
| This reads solely as a sales pitch, which quickly cuts to the
| "we're selling this product so you don't have to think about it."
|
| ...when you actually do want to think about it (in 2024).
|
| Right now, we're collectively still figuring out:
| 1. Best chunking strategies for documents
| 2. Best ways to add context around chunks of documents
| 3. How to mix and match similarity search with hybrid search
| 4. Best ways to version and update your embeddings
| cevian wrote:
| (post co-author here)
|
| We agree a lot of stuff still needs to be figured out, which is
| why we made the vectorizer very configurable. You can configure
| chunking strategies and formatting (which is a way to add context
| back into chunks), and you can mix semantic and lexical search on
| the results. That handles your 1, 2, and 3. Versioning can mean a
| different version of the data (in which case the versioning info
| lives with the source data) OR a different embedding config,
| which we also support [1].
|
| Admittedly, right now we have predefined chunking strategies.
| But we plan to add custom-code options very soon.
|
| Our broader point is that the things you highlight above are
| the right things to worry about, not the data workflow ops and
| babysitting your lambda jobs. That's what we want to handle for
| you.
|
| [1]: https://www.timescale.com/blog/which-rag-chunking-and-
| format...
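|
| For the "mix semantic and lexical" part, one generic recipe (not
| the pgai API, just the common reciprocal rank fusion technique):
| run both queries, then fuse the two ranked id lists.
|
|     from collections import defaultdict
|
|     def reciprocal_rank_fusion(rankings, k=60):
|         # rankings: one ranked list of doc ids per retriever,
|         # best first; k=60 is the usual RRF constant.
|         scores = defaultdict(float)
|         for ranking in rankings:
|             for rank, doc_id in enumerate(ranking, start=1):
|                 scores[doc_id] += 1.0 / (k + rank)
|         return sorted(scores, key=scores.get, reverse=True)
|
|     # e.g. ids from a vector query and ids from a full-text query
|     semantic_ids = [42, 7, 13, 99]
|     lexical_ids  = [7, 55, 42]
|     print(reciprocal_rank_fusion([semantic_ids, lexical_ids]))
|     # -> [7, 42, 55, 13, 99]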
| jdthedisciple wrote:
| What's wrong with using FAISS as your single DB?
|
| It's like SQLite for vector embeddings, and you can store
| metadata (the _primary_ data, foreign keys, etc.) along with the
| vectors, preserving the relationship.
|
| Not sure if the metadata is indexed, but IIRC it's more or less
| trivial to update the embeddings when your data changes (though I
| haven't used it in a while, so not sure).
| avthar wrote:
| Good q. For most standalone vector search use cases, FAISS or a
| library like it is good.
|
| However, FAISS is not a database. It can store metadata alongside
| vectors, but it doesn't have things you'd want in your app DB
| like ACID compliance, non-vector indexing, and proper
| backup/recovery mechanisms. You're basically giving up all the
| DBMS capabilities.
|
| For new RAG and search apps, many teams prefer just using a
| single app DB with vector search capabilities included (Postgres,
| Mongo, MySQL, etc.) vs. managing an app DB and a separate vector
| DB.
| ok123456 wrote:
| Yes. Materialized Views are good.
___________________________________________________________________
(page generated 2024-10-29 23:00 UTC)