[HN Gopher] Vector Databases: A Technical Primer [pdf]
       ___________________________________________________________________
        
       Vector Databases: A Technical Primer [pdf]
        
       Author : jide_tracc
       Score  : 264 points
       Date   : 2024-01-12 17:38 UTC (5 hours ago)
        
 (HTM) web link (tge-data-web.nyc3.digitaloceanspaces.com)
 (TXT) w3m dump (tge-data-web.nyc3.digitaloceanspaces.com)
        
       | jide_tracc wrote:
       | A few months ago, I taught a class on Vector Databases for a TGE
       | Data private client and then decided to record it into a short
       | course for a wider audience.
       | 
        | The course is a mix of theory and demos, covering some of the
        | underlying concepts of vectors, vector databases, indexing, and
        | similarity search, and ending with demos specifically for the
        | Pinecone and Weaviate databases.
        
         | CharlesW wrote:
         | Cool! Is there a video as well, then?
        
           | jide_tracc wrote:
            | There is, but it's a Udemy course unfortunately. I don't
            | want to spam the group with links, but the link is on the
            | last page of the document.
        
             | codetrotter wrote:
             | If you have a discount code you can share, it would provide
             | a neat excuse for posting the link in a comment along side
             | it ;)
        
               | jide_tracc wrote:
               | Thank you! That's a great idea. I created a discount
               | coupon which is active for the next 5 days up until Jan
               | 17th
               | 
               | https://www.udemy.com/course/vector-databases-deep-
               | dive/?cou...
        
       | AtlasBarfed wrote:
        | ANNOY is essentially a binary space partitioning tree for
        | points, without having to worry about sets of points (a.k.a.
        | polygons)?
        
         | fzliu wrote:
         | Like IVF, Annoy partitions the entire embedding space into
         | high-dimensional polygons. The difference is how the two
         | algorithms do it - IVF (https://zilliz.com/learn/vector-index)
         | uses centroids, while Annoy
         | (https://zilliz.com/learn/approximate-nearest-neighbor-oh-
         | yea...) is basically just one big binary tree.
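          | 
          | For a feel of the API, a minimal Annoy sketch (the dimensions
          | and tree count here are arbitrary):
          | 
          |     import random
          |     from annoy import AnnoyIndex
          | 
          |     dim = 64
          |     index = AnnoyIndex(dim, "angular")  # angular ~ cosine
          |     for i in range(1000):
          |         index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
          |     index.build(10)  # 10 trees; more trees = better recall
          |     print(index.get_nns_by_item(0, 5))  # 5 approximate neighbors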
        
       | ignoramous wrote:
       | Some more resources:
       | 
       | - _A Comprehensive Survey on Vector Database: Storage and
        | Retrieval Technique, Challenge_,
       | https://arxiv.org/abs/2310.11703
       | 
        | - _Survey of Vector Database Management Systems_,
       | https://arxiv.org/abs/2310.14021
       | 
        | - _What are Embeddings_,
       | https://raw.githubusercontent.com/veekaybee/what_are_embeddi...
       | 
       | ---
       | 
       | h/t: https://twitter.com/eatonphil/status/1745524630624862314 and
       | https://twitter.com/ChristophMolnar/status/17457316026829826...
        
         | fzliu wrote:
         | Throwing a few more on here (mix of beginner and advanced):
         | 
         | - Wikipedia article:
         | https://en.wikipedia.org/wiki/Vector_database
         | 
         | - Vector Database 101: https://zilliz.com/learn/introduction-
         | to-unstructured-data
         | 
         | - ANN & Similarity search: https://vinija.ai/concepts/ann-
         | similarity-search/
         | 
         | - Distributed database:
         | https://15445.courses.cs.cmu.edu/fall2021/notes/21-distribut...
        
           | rammy1234 wrote:
           | Throwing one more - https://www.pinecone.io/learn/vector-
           | database/
        
         | simonw wrote:
         | Here's my attempt at this: Embeddings: What they are and why
         | they matter https://simonwillison.net/2023/Oct/23/embeddings/
         | 
         | Available as an annotated presentation + an optional 38 minute
         | video.
        
           | Scorpiion wrote:
            | Thanks for writing this one, Simon. I read it some time ago
            | and just wanted to say thanks and recommend it to folks
            | browsing the comments; it's really good!
        
           | AndyNemmity wrote:
            | Very helpful for making it clear, in concrete terms.
        
       | jpasmore wrote:
       | helpful - thx - building Latimer.ai using Pinecone
        
       | kgeist wrote:
        | When does a simple linear search become an issue? For example, a
        | cosine similarity function is surprisingly fast even over a
        | million vectors. We can further shard vectors by user (i.e. if
        | users can only search data they uploaded themselves), or even by
        | document (for things like "chat with this PDF"), so it's likely
        | that most searches will happen over small ranges of vectors.
        | 
        | I suspect the main downsides would be increased CPU usage and
        | having to load the vectors into RAM all the time. However, for
        | small projects with low RPS you could probably get away without
        | a specialized database?
        | 
        | Does anyone know of any benchmarks comparing vector databases to
        | a plain linear search? How do they scale with increasing vector
        | count, RPS, etc.?
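        | 
        | To be concrete, the kind of brute-force search I mean is a few
        | lines of numpy (the counts and dimensions below are made up):
        | 
        |     import numpy as np
        | 
        |     # 1M vectors, 384 dims, pre-normalized so dot product = cosine
        |     vectors = np.random.randn(1_000_000, 384).astype(np.float32)
        |     vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
        | 
        |     def top_k(query, k=10):
        |         query = query / np.linalg.norm(query)
        |         scores = vectors @ query  # one matrix-vector product
        |         return np.argpartition(-scores, k)[:k]  # unsorted top-k
        | 
        |     print(top_k(np.random.randn(384).astype(np.float32)))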
        
         | rolisz wrote:
         | Do you mean using brute force (exact search) and not
         | approximate nearest neighbor search? From my unscientific
         | benchmarks, doing an exact search with 100k vectors takes about
         | 1 second. Usually chatbots take much longer to generate the
         | text, so that's still an acceptable time for doing exact
         | search, which also means you will always find the best matches.
        
           | kgeist wrote:
           | >Do you mean using brute force (exact search) and not
           | approximate nearest neighbor search?
           | 
           | Yes.
           | 
           | >doing an exact search with 100k vectors takes about 1
           | second.
           | 
            | The speed can depend on the language/runtime of your choice
            | and the number of dimensions. What language/runtime did you
            | use, and how many dimensions per vector? (I've heard that
            | too many dimensions is overkill for vector search.)
        
         | Radim wrote:
         | Brute force search is both exact (100% accurate) and pleasantly
         | linear - a predictable algorithm. CPUs and caches like that, so
         | performance is much better than you might otherwise expect.
         | 
         | From my https://rare-technologies.com/performance-shootout-of-
         | neares... benchmark of kNN libs:
         | 
         | " _Brute force doesn't care about the number of neighbours, so
         | its performance is 679ms /query regardless of "k". When run in
         | batch mode (issuing all 100 queries at once), average query
         | time drops to 354ms/query._"
         | 
          | This was 500 dimensions over a 3.7M-document dataset (the
          | English Wikipedia), in 2014. So, ~700ms/search, or half that
          | if you can batch several searches together. YMMV.
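          | 
          | The batch win comes from turning many matrix-vector products
          | into one matrix-matrix product, e.g. (shapes illustrative):
          | 
          |     import numpy as np
          | 
          |     vectors = np.random.randn(100_000, 500).astype(np.float32)
          |     queries = np.random.randn(100, 500).astype(np.float32)
          |     scores = vectors @ queries.T  # (100k, 100) score matrix
          |     top10 = np.argpartition(-scores, 10, axis=0)[:10]  # per query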
        
       | m3kw9 wrote:
        | Is this how most AI services implement RAG?
        
         | gk1 wrote:
         | Yes, most RAG is done with a vector database. Or something that
         | approximates a vector database, like a traditional database
         | with a vector-index add-on.
        
         | charcircuit wrote:
          | Keyword or text-matching-based search is likely still more
          | popular due to how long it's been around, its simplicity, and
          | the tooling built around it. Most companies that have internal
          | search are most likely not using vector/semantic search, but
          | are doing something more basic.
        
       | esafak wrote:
       | Missing coverage of hybrid search (vector + lexical).
        
         | swyx wrote:
         | this is quite important. every vector db will provide lexical
         | features, every traditional db will provide vector features.
         | everything trends hybrid.
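          | 
          | a common glue layer is reciprocal rank fusion, which blends
          | the two rankings without reconciling score scales (sketch;
          | k=60 is the usual default):
          | 
          |     def rrf(rankings, k=60):
          |         scores = {}
          |         for ranked_ids in rankings:
          |             for rank, doc_id in enumerate(ranked_ids):
          |                 scores[doc_id] = (scores.get(doc_id, 0.0)
          |                                   + 1.0 / (k + rank + 1))
          |         return sorted(scores, key=scores.get, reverse=True)
          | 
          |     # e.g. rrf([bm25_ids, vector_ids])[:10]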
        
       | cranberryturkey wrote:
        | is surrealdb any good as a vector database?
        
         | m1117 wrote:
         | Pinecone is surreal!
        
         | esafak wrote:
         | That doesn't make sense. SurrealDB doesn't even tout itself as
         | having good performance. If you don't want a custom vector
         | database, you might as well use postgres.
        
       | gk1 wrote:
       | This is really cool, thanks for making this and sharing with the
       | community!
        
       | simonw wrote:
       | Since digitaloceanspaces.com is an S3-style hosting provider, it
       | would be neat if Hacker News could special-case it to display
       | something like tge-data-web.nyc3.digitaloceanspaces.com as the
       | domain instead of just digitaloceanspaces.com
       | 
       | Though it looks like S3 has the same problem:
       | https://news.ycombinator.com/item?id=38876761
       | 
        | There's precedent for this elsewhere though - sites on an
        | x.github.io subdomain get special treatment here:
        | https://news.ycombinator.com/from?site=lfranke.github.io
        
         | A_Duck wrote:
         | You may be interested in this project:
         | https://publicsuffix.org/
        
       | danjl wrote:
        | Great overview, but the final section doesn't address the
        | obvious question of how to decide between a "vector store", like
        | Postgres+pgvector, and a "vector database", like Pinecone. I'd
        | love to see another presentation discussing the various
        | tradeoffs (query speed, insertion/index-building speed, ease of
        | use, and so on) to help guide people trying to decide which of
        | the options is best for their application.
        
         | esafak wrote:
         | I'd call the former a vector _extension_. A database is a store
         | with bells and whistles.
        
       | takinola wrote:
        | When I started playing around with AI applications and learnt
        | about RAG techniques, it took me a while to grok vector
        | databases, and it was a bit of a pain to learn how to set one
        | up. So I built my own little pet project, RagTag [1], which
        | abstracts the vector database behind a simple CRUD API. I simply
        | POST documents to RagTag and they are automatically converted
        | into embeddings and made available for similarity-search
        | queries.
       | 
       | [1] RagTag - https://ragtag.weaveapi.com
        
       | stefanha wrote:
       | In the table on slide 15 the Indexing & Search Efficiency cells
       | for Traditional Databases and Vector Databases appear to be
       | swapped.
        
         | kevindamm wrote:
          | Yes, that last row looks swapped to me as well.
        
       | nostrebored wrote:
       | From looking at this, I think it's a very risky starting point
       | for an engineer to kick off from.
       | 
        | Claims like the vectors being clustered by meaning and optimized
        | for analytics are questionable.
       | 
        | The clustering depends on the embedding you calculate. If you
        | think the embedding is a good semantic approximation of the
        | data, then maybe this is a fine way of thinking about it. But
        | it's not hard to imagine embeddings that violate this -- e.g. if
        | I run an audio file and a text file that are identical in
        | meaning through the same embedding process, then unless the
        | model is multimodal they will likely be distant in the embedding
        | vector space.
       | 
        | I fully expect to see embeddings that put things close together
        | in the vector space based on utilization rather than semantic
        | similarity. If I'm creating a recommender system, I don't want
        | to group different varieties of one-off purchases closely. For
        | instance, the most semantically similar flight is going to be
        | another flight to the same destination at a different time, or a
        | flight to a nearby airport. But I would want to group hotels
        | often purchased by people who have previously bought the flight.
       | 
        | Vector databases also allow you to encode extra dimensionality
        | into the data, like time awareness. Nothing forces you to use a
        | vector that encodes semantic meaning.
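        | 
        | As a hypothetical sketch (the helper and weights below are made
        | up, not from any product):
        | 
        |     import numpy as np
        | 
        |     def with_recency(embedding, age_days, weight=0.3):
        |         # append one dimension encoding freshness, biasing
        |         # neighbors toward recent items; weight tunes influence
        |         recency = np.exp(-age_days / 30.0)  # 1.0 fresh -> 0.0 old
        |         v = np.append(embedding, weight * recency)
        |         return v / np.linalg.norm(v)  # renormalize for cosine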
       | 
        | And from this, you can see that vector databases are optimized
        | for lookups or searches based on an input vector. This is not
        | analogous to OLAP queries; it's more akin to Elasticsearch than
        | Snowflake. If you're using a vector database expecting reporting
        | or large-scale analytics on the vector space, AFAIK there isn't
        | a readily available offering.
        
         | chasd00 wrote:
            | Calculating the embeddings is still a mystery to me. I get
            | going from a picture of an apple to a vector representing
            | "appleness" and then comparing that vector to other vectors
            | using all the usual math. What I don't get is: who/what
            | takes the image as input and outputs the vector? Same goes
            | for documents. Let's say I want to add a dimension (another
            | number in the array): what part of the vector database do I
            | modify to include this dimension in the vector calculation?
            | Or is going from doc/image/whatever to the vector
            | representation done outside the database in some other way?
            | 
            | edit: it seems like calculating embeddings would be
            | something an ML algorithm would do, but then, again, you
            | have to train that one first. ...it's training all the way
            | down.
        
           | nostrebored wrote:
            | Yup, it happens outside of the system -- but there are a
            | number of perks to storing that data in a db, including
            | easily adding metadata, updating entries, etc.
           | 
            | I think in 10 years we will see retail systems heavily
            | utilizing vector dbs, and many embedding-as-a-service
            | products that take into account things like conversion. In
            | this model you can add metadata about products to the vector
            | db and direct program flow there, instead of querying back
            | out to one or more databases to retrieve relevant metadata.
           | 
            | They'll also start to enable things like search via image,
            | for features like "show us your favorite outfit" pulling up
            | a customized wardrobe based on individual items extracted
            | from the photo and run through the embedder.
           | 
            | Just one of many ways these products will exist outside of
            | RAG. I think we'll actually see a lot of the opposite --
            | GAR (generation-augmented retrieval).
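            | 
            | To make the "outside of the system" part concrete, here's a
            | minimal sketch with sentence-transformers (the model choice
            | is just an example); the database only ever sees the
            | resulting arrays:
            | 
            |     from sentence_transformers import SentenceTransformer
            | 
            |     model = SentenceTransformer("all-MiniLM-L6-v2")
            |     vectors = model.encode(["a photo of an apple",
            |                             "quarterly sales report"])
            |     print(vectors.shape)  # (2, 384): one vector per input
            | 
            | And to the "add a dimension" question above: the
            | dimensionality is fixed by the model, not by the database,
            | so you'd swap or retrain the model rather than modify the
            | db.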
        
       | dmezzetti wrote:
       | It's also important to understand why we'd even want vectors in
       | the first place.
       | 
       | This article covers the advantages of semantic search over
       | keyword search - https://medium.com/neuml/getting-started-with-
       | semantic-searc...
        
       | yagami_takayuki wrote:
        | For recognizing features such as hair and skin color, which
        | would do a better job: machine learning with image
        | classification, or a vector database?
       | I've had Weaviate return a dog given a human as input when doing
       | an image similarity search, so I was wondering if there's some
       | way to improve the results or whether I'm barking up the wrong
       | tree.
        
         | bfeynman wrote:
          | I would think you could improve your embedding space to
          | address that issue, at least partially. Similarity search (as
          | a result of some contrastive loss) definitely suffers at the
          | tails, and out-of-distribution (OOD) performance is pretty
          | bad. That being said, you're more likely to get higher recall
          | than with a more classical technique.
        
         | dimatura wrote:
         | For those two in particular? You'd definitely get a better
         | result with an ML model such as a convolutional neural net. In
         | some sense, using an image similarity query is a kind of ML
         | model - nearest neighbor - which can work in some scenarios.
         | But for this specifically, I'd recommend a CNN.
        
         | m00x wrote:
          | You don't use vector databases independently; you need to
          | input embeddings from an ML model.
          | 
          | For your use case, it should be pretty simple. You could use a
          | CNN and train it, or use YOLO, Deepface, or other face
          | detection algos, then within the face, find the hair and the
          | skin.
          | 
          | From there you can use a vector database to get the colors
          | that resemble other inputs, or you can use a simple CNN to
          | classify the hair and skin to the closest label.
        
       | dimatura wrote:
        | Any recommendations for an embedded embedding database (heh)?
        | Embedded as in SQLite. For smaller-scale problems, but hopefully
        | more convenient than, say, LMDB + FAISS.
        
         | ripley12 wrote:
         | FWIW, Simon Willison's `llm` tool just uses SQLite plus a few
         | UDFs. The simplicity of that approach is appealing to me but I
         | don't have a good sense of when+why it becomes insufficient.
        
           | dimatura wrote:
           | Thanks, I'll check it out.
        
         | pjot wrote:
         | I actually just finished a POC using DuckDB that does
         | similarity search for HN comments.
         | 
         | https://github.com/patricktrainer/hackernews-comment-search
        
         | PhilippGille wrote:
         | For Python I believe Chroma [1] can be used embedded.
         | 
         | For Go I recently started building chromem-go, inspired by the
         | Chroma interface: https://github.com/philippgille/chromem-go
         | 
         | It's neither advanced nor for scale yet, but the RAG demo
         | works.
         | 
         | [1] https://github.com/chroma-core/chroma
        
         | catketch wrote:
         | https://github.com/asg017/sqlite-vss
        
           | dimatura wrote:
           | awesome, how did I not find this :D
        
       | ripley12 wrote:
       | The "Do you need a Dedicated Vector Database?" slide is quite
       | interesting, but doesn't really answer its own question! This is
       | something I've been wondering myself, if anyone has any
       | guidelines or rules of thumb I would appreciate it.
       | 
       | I've recently been using Simon Willison's (hi simonw) excellent
       | `llm` tool that can help with embeddings, and it takes the
       | simplest approach possible: just store embeddings in SQLite with
       | a few UDFs for calculating distance etc.
       | 
       | The simplicity of that approach is very appealing, but presumably
       | at some level of traffic+data an application will outgrow it and
       | need a more specialized database. Does anyone have a good
       | intuition for where that cutoff might be?
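        | 
        | The pattern itself is tiny to reproduce (a rough sketch of the
        | idea, not `llm`'s actual implementation):
        | 
        |     import sqlite3, json, math
        | 
        |     def cosine_distance(a, b):
        |         # embeddings stored as JSON text; a real setup might
        |         # use BLOBs instead
        |         va, vb = json.loads(a), json.loads(b)
        |         dot = sum(x * y for x, y in zip(va, vb))
        |         na = math.sqrt(sum(x * x for x in va))
        |         nb = math.sqrt(sum(y * y for y in vb))
        |         return 1.0 - dot / (na * nb)
        | 
        |     db = sqlite3.connect(":memory:")
        |     db.create_function("cosine_distance", 2, cosine_distance)
        |     db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, "
        |                "embedding TEXT)")
        |     db.execute("INSERT INTO docs VALUES "
        |                "(1, '[1, 0]'), (2, '[0.6, 0.8]')")
        |     print(db.execute(
        |         "SELECT id FROM docs "
        |         "ORDER BY cosine_distance(embedding, ?) LIMIT 1",
        |         ("[1, 0]",)).fetchall())  # [(1,)]: exact NN via full scan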
        
         | summarity wrote:
          | I've had some success scaling a quick-and-dirty engine (the
          | one that powers findsight.ai) to tens of millions of vectors;
          | details in the talk here: https://youtu.be/elNrRU12xRc?t=1556
          | 
          | That's maybe 1kLOC, so I didn't need an external one after
          | all.
        
       | ok123456 wrote:
       | Are these dedicated vector databases doing anything more
       | complicated than what can be accomplished using Postgres with the
       | Cube extension?
        
         | m00x wrote:
          | You don't need a dedicated vector db; you can use pgvector.
          | 
          | You could maybe use Cube for Euclidean-space search, but
          | you're better off using algorithms optimized for
          | embedding-space search.
         | 
         | https://github.com/pgvector/pgvector
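          | 
          | Basic usage is plain SQL, e.g. from Python (the table, data,
          | and connection string are illustrative; assumes Postgres with
          | the extension installed):
          | 
          |     import psycopg2
          | 
          |     conn = psycopg2.connect("dbname=test")
          |     cur = conn.cursor()
          |     cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
          |     cur.execute("CREATE TABLE items ("
          |                 "id bigserial PRIMARY KEY, embedding vector(3))")
          |     cur.execute("INSERT INTO items (embedding) "
          |                 "VALUES ('[1,2,3]'), ('[4,5,6]')")
          |     # <-> is L2 distance; <=> is cosine distance
          |     cur.execute("SELECT id FROM items "
          |                 "ORDER BY embedding <-> '[2,3,4]' LIMIT 1")
          |     print(cur.fetchone())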
        
         | gk1 wrote:
         | Yes, see: https://www.pinecone.io/blog/hnsw-not-enough/
        
       | shreyas44 wrote:
       | From a few months ago -
       | https://news.ycombinator.com/item?id=35814381
        
       | jhj wrote:
        | I don't know why PQ is listed as an "indexing strategy". It's a
        | vector compression/quantization technique, not a means of
        | partitioning the search space. You could encode vectors with PQ
        | when using a brute-force/flat index, an IVF index, or HNSW (all
        | of which are present in Faiss with PQ encoding, as IndexPQ,
        | IndexIVFPQ and IndexHNSWPQ respectively), or even k-d trees or
        | ANNOY if someone wanted to do that.
       | 
       | "Use HNSW or Annoy for very large datasets where query speed is
       | more important than precision": Graph-based methods have huge
       | memory overhead and construction cost, and they aren't practical
       | for billion-scale datasets. Also, they will usually be more
       | accurate and faster than IVF techniques (as you would need to
       | visit a large number of IVF cells to get comparable accuracy),
       | though IVF can scale to trillion-sized databases without much
       | overhead yet with reasonable speed/accuracy tradeoffs unlike
       | other techniques. I'd say "use for medium-scale datasets where
       | query speed is important, yet high accuracy is still desired and
       | flat/brute-force indexing is impractical".
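        | 
        | To illustrate the point that PQ is an encoding layered on top of
        | an index rather than an index itself, a minimal Faiss sketch
        | (sizes are arbitrary):
        | 
        |     import faiss
        |     import numpy as np
        | 
        |     d, nlist, m = 128, 1024, 16  # dims, IVF cells, sub-quantizers
        |     xb = np.random.randn(100_000, d).astype(np.float32)
        | 
        |     quantizer = faiss.IndexFlatL2(d)  # coarse quantizer for IVF
        |     index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits
        |     index.train(xb)    # learns IVF centroids and PQ codebooks
        |     index.add(xb)
        |     index.nprobe = 16  # IVF cells visited per query
        |     D, I = index.search(xb[:5], 10)  # top-10 distances and ids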
        
         | m00x wrote:
          | It turns a continuous space into a discrete one. You would
          | first do PQ, then do kNN on the new discrete vectors. This way
          | you can compress the vocabulary to a fixed size.
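          | 
          | A toy version of the encoding step (the codebooks would
          | normally be learned with k-means; random ones here just show
          | the mechanics):
          | 
          |     import numpy as np
          | 
          |     m, ksub, dsub = 4, 256, 32  # 4 sub-spaces of 32 dims each
          |     codebooks = np.random.randn(m, ksub, dsub).astype(np.float32)
          | 
          |     def pq_encode(v):
          |         codes = []
          |         for i, sub in enumerate(np.split(v, m)):
          |             dists = np.linalg.norm(codebooks[i] - sub, axis=1)
          |             codes.append(np.argmin(dists))  # one byte per sub-vector
          |         return np.array(codes, dtype=np.uint8)  # 128 floats -> 4 bytes
          | 
          |     print(pq_encode(np.random.randn(128).astype(np.float32)))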
        
       | perone wrote:
        | I find it interesting how everyone ignores EuclidesDB
        | (https://euclidesdb.readthedocs.io), which came before Milvus
        | and the others in 2018, and it is free and open-source. The same
        | goes for all the presentations from the major DBs.
        
       ___________________________________________________________________
       (page generated 2024-01-12 23:00 UTC)