[HN Gopher] Vector databases are the wrong abstraction
       ___________________________________________________________________
        
       Vector databases are the wrong abstraction
        
       Author : jascha_eng
       Score  : 116 points
       Date   : 2024-10-29 15:40 UTC (7 hours ago)
        
 (HTM) web link (www.timescale.com)
 (TXT) w3m dump (www.timescale.com)
        
       | avthar wrote:
       | Hey HN! Post co-author here, excited to share our new open-source
       | PostgreSQL tool that re-imagines vector embeddings as database
        | indexes. It's not literally an index, but it functions like
        | one, keeping embeddings up to date as source data is added,
        | deleted, or changed.
       | 
       | Right now the system only supports OpenAI as an embedding
       | provider, but we plan to extend with local and OSS model support
       | soon.
       | 
       | Eager to hear your feedback and reactions. If you'd like to leave
       | an issue or better yet a PR, you can do so here [1]
       | 
       | [1]: https://github.com/timescale/pgai
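As a rough sketch of what the interface looks like (adapted from the project's announcement-era docs; the table name and option values are illustrative, and exact signatures may have changed in later releases):

```sql
-- Sketch of pgai's vectorizer interface; names and signatures may
-- differ in current releases. 'blog' is an example source table.
SELECT ai.create_vectorizer(
    'blog'::regclass,
    destination => 'blog_embeddings',
    embedding   => ai.embedding_openai('text-embedding-3-small', 768),
    chunking    => ai.chunking_recursive_character_text_splitter('contents')
);
```

One call declares how embeddings derive from the source column, and the system keeps them in sync from then on.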
        
         | hhdhdbdb wrote:
          | Pretty smart. Why is the DB API the abstraction layer,
          | though? Why not two columns and a microservice? I assume you
          | are making async calls to get the embeddings?
          | 
          | I ask because it seems unusual. An index suits synchronous
          | work better, but async things (embeddings, geocoding an
          | address, deciding whether an email is spam, etc.) feel like
          | app-level concerns.
        
           | cevian wrote:
           | (post co-author here)
           | 
            | The DB is the right layer from an interface point of view,
            | because that's where the data's properties should be
            | defined. We also use the DB to track what still needs
            | embedding, because transactions and triggers let us
            | guarantee we never miss any data. From an implementation
            | point of view, the actual embedding happens outside the
            | database, in a Python worker or cloud function.
           | 
            | Merging the embeddings and the original data into a single
            | view gives you the full feature set of SQL instead of
            | constraining you to a REST API.
        
       | dinobones wrote:
       | Wow, actually a good point I haven't seen anyone make.
       | 
        | Taking raw embeddings and storing them in a vector database
        | would be like taking raw n-grams of your text and putting them
        | into a database for search.
       | 
       | Storing documents makes much more sense.
        
         | choilive wrote:
          | Been using pgvector for a while, and to me it was kind of
          | obvious that the source document and the embeddings are
          | fundamentally linked, so we always stored them "together".
          | Basically anyone doing embeddings at scale is doing something
          | similar to what Pgai Vectorizer does, so it's certainly a
          | nice abstraction.
        
           | jdthedisciple wrote:
            | I used FAISS, as it also allowed me to trivially store
            | them together.
            | 
            | I don't know how well it scales, though; it just does its
            | job at my hobby-project scale.
            | 
            | For my few hundred thousand embeddings, I must say the
            | performance was satisfactory.
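Conceptually, a flat FAISS-style index is exhaustive nearest-neighbor search where vector positions map back to metadata. A pure-Python sketch of that pattern (toy two-dimensional vectors and made-up metadata, purely illustrative, not FAISS itself):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "index": embeddings and metadata stored side by side, linked by
# list position -- the pattern an ID-mapped FAISS index enables.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
metadata = [{"doc_id": 1}, {"doc_id": 2}, {"doc_id": 3}]

def search(query, k=2):
    # Exhaustive scan: rank every stored vector by similarity.
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: cosine(query, embeddings[i]),
                    reverse=True)
    return [metadata[i] for i in ranked[:k]]

print(search([1.0, 0.05]))  # → [{'doc_id': 1}, {'doc_id': 2}]
```

Real FAISS replaces the brute-force scan with optimized (and optionally approximate) index structures, but the vector-to-metadata linkage works the same way.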
        
       | morgango wrote:
       | Great point!
       | 
       | (Disclaimer: I work for Elastic)
       | 
       | Elasticsearch has recently added a data type called
       | semantic_text, which automatically chunks text, calculates
       | embeddings, and stores the chunks with sensible defaults.
       | 
        | Queries are similarly simplified: vectors are calculated and
        | compared internally, which means a lot less I/O and much
        | simpler client code.
       | 
       | https://www.elastic.co/search-labs/blog/semantic-search-simp...
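Roughly, per Elastic's documentation for the feature (Dev Tools console syntax; the index and field names here are illustrative):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "content": { "type": "semantic_text" }
    }
  }
}

GET my-index/_search
{
  "query": {
    "semantic": { "field": "content", "query": "what is vector search?" }
  }
}
```

Chunking, embedding, and vector comparison all happen server-side; the client only ever sends and receives text.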
        
         | jdthedisciple wrote:
         | How does their embedding model compare in terms of retrieval
         | accuracy to, say `text-embedding-3-small` and `text-
         | embedding-3-large`?
        
           | binarymax wrote:
            | It's impossible to answer that question without knowing
            | what content/query domain you are embedding. Check out the
            | MTEB leaderboard, dig into the retrieval benchmark, and
            | look for analogous datasets.
        
           | splike wrote:
            | You can use OpenAI embeddings in Elastic if you don't want
            | to use their ELSER sparse embeddings.
        
         | pjot wrote:
          | I made something similar, but used DuckDB as the vector
          | store (and query engine)! It's impressively fast.
         | 
         | https://github.com/patricktrainer/duckdb-embedding-search
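A sketch of what vector search in DuckDB can look like (illustrative schema; recent DuckDB versions provide similarity functions such as `array_cosine_similarity` for fixed-size arrays, though exact names vary by version):

```sql
-- Illustrative only; check your DuckDB version's array functions.
CREATE TABLE docs (id INTEGER, body TEXT, emb FLOAT[3]);

SELECT id, body
FROM docs
ORDER BY array_cosine_similarity(emb, [1.0, 0.0, 0.0]::FLOAT[3]) DESC
LIMIT 5;
```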
        
       | markusw wrote:
       | I'm using sqlite-vec along with FTS5 in (you guessed it) SQLite
       | and it's pretty cool. :)
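The lexical half of that setup needs nothing beyond Python's standard library, since FTS5 ships with most SQLite builds; the vector half would come from sqlite-vec, a loadable extension not shown here. A minimal FTS5 sketch:

```python
import sqlite3

# Lexical search via SQLite's built-in FTS5. In the commenter's setup,
# the sqlite-vec extension (loaded separately) would supply the vector
# side in the same database file.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
db.executemany("INSERT INTO docs(body) VALUES (?)",
               [("postgres vector search",),
                ("sqlite full text search",)])

# MATCH runs the full-text query; rank orders by relevance.
rows = db.execute(
    "SELECT body FROM docs WHERE docs MATCH 'sqlite' ORDER BY rank"
).fetchall()
print(rows)  # → [('sqlite full text search',)]
```

Keeping both in one file is what makes this attractive at hobby scale: one database, one backup, no sync job.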
        
       | mattxxx wrote:
       | This reads solely as a sales pitch, which quickly cuts to the
       | "we're selling this product so you don't have to think about it."
       | 
       | ...when you actually do want to think about it (in 2024).
       | 
        | Right now, we're collectively still figuring out:
        | 
        | 1. Best chunking strategies for documents
        | 2. Best ways to add context around chunks of documents
        | 3. How to mix and match similarity search with hybrid search
        | 4. Best way to version and update your embeddings
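Point 1 is easy to underestimate. Even the simplest strategy, fixed-size chunks with overlap, involves choices (chunk size, overlap width) that affect retrieval quality. A minimal sketch of that baseline (sizes here are arbitrary toy values):

```python
def chunk(text, size=40, overlap=10):
    # Naive fixed-size chunking with overlap: the simplest of the
    # chunking strategies the thread says are still being worked out.
    # Adjacent chunks share `overlap` characters of context.
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

doc = "".join(chr(65 + i % 26) for i in range(100))  # toy document
chunks = chunk(doc)
print(len(chunks))  # → 3
```

Fancier strategies (sentence-aware, recursive, semantic) are refinements of the same size/overlap trade-off.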
        
         | cevian wrote:
         | (post co-author here)
         | 
          | We agree a lot of stuff still needs to be figured out, which
          | is why we made the vectorizer very configurable. You can
          | configure chunking strategies and formatting (a way to add
          | context back into chunks), and you can mix semantic and
          | lexical search on the results. That handles your 1, 2, and
          | 3. Versioning can mean a different version of the data (in
          | which case the versioning info lives with the source data)
          | or a different embedding config, which we also support [1].
         | 
         | Admittedly, right now we have predefined chunking strategies.
         | But we plan to add custom-code options very soon.
         | 
         | Our broader point is that the things you highlight above are
         | the right things to worry about, not the data workflow ops and
         | babysitting your lambda jobs. That's what we want to handle for
         | you.
         | 
         | [1]: https://www.timescale.com/blog/which-rag-chunking-and-
         | format...
        
       | jdthedisciple wrote:
        | What's wrong with using FAISS as your single DB?
        | 
        | It's like SQLite for vector embeddings, and you can store
        | metadata (the _primary_ data, foreign keys, etc.) along with
        | the vectors, preserving the relationship.
        | 
        | I'm not sure whether the metadata is indexed, but IIRC it's
        | more or less trivial to update the embeddings when your data
        | changes (though I haven't used it in a while, so I'm not
        | sure).
        
         | avthar wrote:
          | Good question. For most standalone vector search use cases,
          | FAISS or a library like it is fine.
         | 
         | However, FAISS is not a database. It can store metadata
         | alongside vectors, but it doesn't have things you'd want in
         | your app db like ACID compliance, non-vector indexing, and
         | proper backup/recovery mechanisms. You're basically giving up
         | all the DBMS capabilities.
         | 
         | For new RAG and search apps, many teams prefer just using a
         | single app db with vector search capabilities included
         | (Postgres, Mongo, MySQL etc) vs managing an app db and a
         | separate vector db.
        
       | ok123456 wrote:
       | Yes. Materialized Views are good.
        
       ___________________________________________________________________
       (page generated 2024-10-29 23:00 UTC)