[HN Gopher] Vector Databases: A Technical Primer [pdf]
___________________________________________________________________
Vector Databases: A Technical Primer [pdf]
Author : jide_tracc
Score : 264 points
Date : 2024-01-12 17:38 UTC (5 hours ago)
(HTM) web link (tge-data-web.nyc3.digitaloceanspaces.com)
(TXT) w3m dump (tge-data-web.nyc3.digitaloceanspaces.com)
| jide_tracc wrote:
| A few months ago, I taught a class on Vector Databases for a TGE
| Data private client and then decided to record it into a short
| course for a wider audience.
|
| The course is a mix of theory and demos discussing some of the
| underlying concepts of Vectors, Vector Databases, Indexing, and
| Similarity Search, ending with demos specifically for Pinecone
| and Weaviate databases.
| CharlesW wrote:
| Cool! Is there a video as well, then?
| jide_tracc wrote:
| There is, but it's a Udemy course, unfortunately. I don't want
| to spam the group with links, but the link is on the last page
| of the document.
| codetrotter wrote:
| If you have a discount code you can share, it would provide
| a neat excuse for posting the link in a comment along side
| it ;)
| jide_tracc wrote:
| Thank you! That's a great idea. I created a discount
| coupon which is active for the next 5 days up until Jan
| 17th
|
| https://www.udemy.com/course/vector-databases-deep-
| dive/?cou...
| AtlasBarfed wrote:
| ANNOY is essentially a binary space partitioning tree for
| points, without having to worry about sets of points (aka
| polygons)?
| fzliu wrote:
| Like IVF, Annoy partitions the entire embedding space into
| high-dimensional polygons. The difference is how the two
| algorithms do it - IVF (https://zilliz.com/learn/vector-index)
| uses centroids, while Annoy
| (https://zilliz.com/learn/approximate-nearest-neighbor-oh-
| yea...) is basically a forest of random-projection binary trees.
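|
| A minimal sketch with the spotify/annoy Python package (toy
| data; the sizes are made up), showing the tree-building and
| query API:
|
|     import random
|     from annoy import AnnoyIndex
|
|     dim = 64
|     index = AnnoyIndex(dim, "angular")  # angular ~ cosine
|     for i in range(1000):
|         index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
|     index.build(10)  # 10 random-projection trees; more = better recall
|     neighbors = index.get_nns_by_item(0, 10)  # 10 nearest neighbors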
| ignoramous wrote:
| Some more resources:
|
| - _A Comprehensive Survey on Vector Database: Storage and
| Retrieval Technique, Challenge_ ,
| https://arxiv.org/abs/2310.11703
|
| - _Survey of Vector Database Management Systems_ ,
| https://arxiv.org/abs/2310.14021
|
| - _What are Embeddings_ ,
| https://raw.githubusercontent.com/veekaybee/what_are_embeddi...
|
| ---
|
| h/t: https://twitter.com/eatonphil/status/1745524630624862314 and
| https://twitter.com/ChristophMolnar/status/17457316026829826...
| fzliu wrote:
| Throwing a few more on here (mix of beginner and advanced):
|
| - Wikipedia article:
| https://en.wikipedia.org/wiki/Vector_database
|
| - Vector Database 101: https://zilliz.com/learn/introduction-
| to-unstructured-data
|
| - ANN & Similarity search: https://vinija.ai/concepts/ann-
| similarity-search/
|
| - Distributed database:
| https://15445.courses.cs.cmu.edu/fall2021/notes/21-distribut...
| rammy1234 wrote:
| Throwing one more - https://www.pinecone.io/learn/vector-
| database/
| simonw wrote:
| Here's my attempt at this: Embeddings: What they are and why
| they matter https://simonwillison.net/2023/Oct/23/embeddings/
|
| Available as an annotated presentation + an optional 38 minute
| video.
| Scorpiion wrote:
| Thanks for writing this one Simon, I read it some time ago
| and I just wanted to say thanks and recommend it to folks
| browsing the comments, it's really good!
| AndyNemmity wrote:
| Very helpful for making it clear, in concrete terms
| jpasmore wrote:
| helpful - thx - building Latimer.ai using Pinecone
| kgeist wrote:
| When does a simple linear search become an issue? For example, a
| cosine similarity scan over even a million vectors is
| surprisingly fast. We can further shard vectors by user (i.e.
| if they can only search in their own data they uploaded), or even
| by document (for things like "chat with this PDF") then it's
| likely that most searches will happen over small ranges of
| vectors.
|
| I suspect the main downsides would be increased CPU usage and
| having to load vectors to RAM all the time. However, for small
| projects with low RPS you could probably get away without a
| special database?
|
| Does anyone know of any benchmarks comparing vector databases to
| a plain linear search? How they scale with vector count, RPS,
| etc.?
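|
| Concretely, the kind of brute-force search I mean (a minimal
| NumPy sketch; sizes are hypothetical). With normalized vectors,
| cosine similarity is one matrix-vector product, and batched
| queries become one matrix-matrix product:
|
|     import numpy as np
|
|     n, dim = 1_000_000, 384
|     db = np.random.randn(n, dim).astype(np.float32)
|     db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize once
|
|     def search(query, k=10):
|         q = query / np.linalg.norm(query)
|         scores = db @ q  # all cosine similarities at once
|         top = np.argpartition(-scores, k)[:k]  # O(n) top-k
|         return top[np.argsort(-scores[top])]
|
|     hits = search(np.random.randn(dim).astype(np.float32))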
| rolisz wrote:
| Do you mean using brute force (exact search) and not
| approximate nearest neighbor search? From my unscientific
| benchmarks, doing an exact search with 100k vectors takes about
| 1 second. Usually chatbots take much longer to generate the
| text, so that's still an acceptable time for doing exact
| search, which also means you will always find the best matches.
| kgeist wrote:
| >Do you mean using brute force (exact search) and not
| approximate nearest neighbor search?
|
| Yes.
|
| >doing an exact search with 100k vectors takes about 1
| second.
|
| The speed can depend on the language/runtime of your choice
| and the number of dimensions. What language/runtime did you
| use, and how many dimensions per vector? (I've heard too many
| dimensions for vector search is overkill.)
| Radim wrote:
| Brute force search is both exact (100% accurate) and pleasantly
| linear - a predictable algorithm. CPUs and caches like that, so
| performance is much better than you might otherwise expect.
|
| From my https://rare-technologies.com/performance-shootout-of-
| neares... benchmark of kNN libs:
|
| " _Brute force doesn't care about the number of neighbours, so
| its performance is 679ms/query regardless of "k". When run in
| batch mode (issuing all 100 queries at once), average query
| time drops to 354ms/query._"
|
| This was 500 dimensions over a 3.7M-document dataset (the
| English Wikipedia), in 2014. So, ~700ms/search, or half that if
| you can
| batch several searches together at once. YMMV.
| m3kw9 wrote:
| Is this how most AI services implement RAG?
| gk1 wrote:
| Yes, most RAG is done with a vector database. Or something that
| approximates a vector database, like a traditional database
| with a vector-index add-on.
| charcircuit wrote:
| Keyword or text matching based search is likely still more
| popular due to how long it's been around, its simplicity, and
| the tooling built around it. Most companies who have internal
| search are most likely not using vector / semantic search, but
| are doing something more basic.
| esafak wrote:
| Missing coverage of hybrid search (vector + lexical).
| swyx wrote:
| this is quite important. every vector db will provide lexical
| features, every traditional db will provide vector features.
| everything trends hybrid.
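|
| a minimal sketch of one common merging scheme, reciprocal rank
| fusion (toy doc ids; k=60 is the usual constant -- this is just
| the general idea, not any particular db's implementation):
|
|     def rrf(rankings, k=60):
|         # rankings: e.g. [lexical_ids, vector_ids], best first
|         scores = {}
|         for ranked_ids in rankings:
|             for rank, doc_id in enumerate(ranked_ids):
|                 scores[doc_id] = (scores.get(doc_id, 0.0)
|                                   + 1.0 / (k + rank + 1))
|         return sorted(scores, key=scores.get, reverse=True)
|
|     merged = rrf([["d3", "d1", "d2"], ["d1", "d4", "d3"]])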
| cranberryturkey wrote:
| is surrealdb any good as a vector database?
| m1117 wrote:
| Pinecone is surreal!
| esafak wrote:
| That doesn't make sense. SurrealDB doesn't even tout itself as
| having good performance. If you don't want a custom vector
| database, you might as well use postgres.
| gk1 wrote:
| This is really cool, thanks for making this and sharing with the
| community!
| simonw wrote:
| Since digitaloceanspaces.com is an S3-style hosting provider, it
| would be neat if Hacker News could special-case it to display
| something like tge-data-web.nyc3.digitaloceanspaces.com as the
| domain instead of just digitaloceanspaces.com
|
| Though it looks like S3 has the same problem:
| https://news.ycombinator.com/item?id=38876761
|
| There's precedent for this elsewhere though - sites on a
| x.github.io subdomain have special treatment here:
| https://news.ycombinator.com/from?site=lfranke.github.io
| A_Duck wrote:
| You may be interested in this project:
| https://publicsuffix.org/
| danjl wrote:
| Great overview, but the final section doesn't address the obvious
| question about how to decide between using a "vector store", like
| Postgres+pgvector vs a "vector database", like Pinecone. I'd love
| to see another presentation which discusses the various
| tradeoffs, like query speed, insertion/index-building speed,
| ease-of-use, and others to help guide people trying to decide
| which of the options is best for their application.
| esafak wrote:
| I'd call the former a vector _extension_. A database is a store
| with bells and whistles.
| takinola wrote:
| When I started playing around with AI applications and learnt
| about RAG techniques, it took me a while to grok vector databases
| and it was a bit of a pain to learn how to set one up. So I
| built my own little pet project, RagTag [1], which abstracts
| the vector
| database behind a simple CRUD API. I simply POST documents to
| RagTag and they are automatically converted into embeddings and
| made available to be queried for similarity searches.
|
| [1] RagTag - https://ragtag.weaveapi.com
| stefanha wrote:
| In the table on slide 15 the Indexing & Search Efficiency cells
| for Traditional Databases and Vector Databases appear to be
| swapped.
| kevindamm wrote:
| Yes that last row looks swapped to me as well.
| nostrebored wrote:
| From looking at this, I think it's a very risky starting point
| for an engineer to kick off from.
|
| Things like mentioning they're clustered by meaning and optimized
| for analytics are questionable.
|
| The clustering depends on the embedding you calculate. If you
| think that the embedding is a good semantic approximation of the
| data then maybe this is a fine way of thinking about it. But it's
| not hard to imagine embeddings that may violate this -- e.g. if
| I run an audio file and a text file that are identical in
| meaning through the same embedding process, then unless the
| model is multimodal they will likely be distant in the
| embedding vector space.
|
| I fully expect to see embeddings that put things close together
| in the vector space based on utilization rather than semantic
| similarity. If I'm creating a recommender system, I don't want to
| group different varieties of one-off purchases closely. For
| instance, the most semantically similar flight is going to be
| another flight to the same destination at a different time or a
| flight to a nearby airport. But I would want to group hotels
| often purchased by people who have previously bought the flight.
|
| Vector databases also allow you to provide extra dimensionality
| into the data, like time awareness. Nothing is forcing you to use
| a vector that encodes semantic meaning.
|
| And from this, you can see that they're optimized for lookups or
| searches based on an input vector. This is not analogous to OLAP
| queries. This is more akin to elasticsearch than snowflake. If
| you are using a vector database thinking it's going to give you
| reporting or large scale analytics on the vector space afaik
| there isn't a readily available offering.
| chasd00 wrote:
| calculating the embeddings is still a mystery to me. I get
| going from a picture of an apple to a vector representing
| "appleness" and then comparing that vector to other vectors
| using all the usual math. What I don't get is who/what takes
| the image as input and outputs the vector. Same goes for
| documents: let's say I want to add a dimension (another number
| in the array) -- what part of the vector database do I modify to
| include this dimension in the vector calculation? Or is going
| from doc/image/whatever to the vector representation done
| outside the database in some other way?
|
| edit: it seems like calculating embeddings would be something
| an ML algorithm would do but then, again, you have to train
| that one first. ...it's training all the way down.
| nostrebored wrote:
| Yup it happens outside of the system -- but there are a
| number of perks to being able to store that data in a db --
| including easily adding metadata, updating entries, etc.
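|
| For example, a minimal sketch with the sentence-transformers
| package (the model name is just a common default, not
| prescribed by anything above):
|
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     vecs = model.encode(["the quick brown fox", "a lazy dog"])
|     print(vecs.shape)  # (2, 384): the model, not the DB,
|                        # fixes the dimensionality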
|
| I think in 10y we will see retail systems heavily utilizing
| vector dbs and many embedding-as-a-service products that take
| into account things like conversion. In this model you can
| add metadata about products to the vector db and direct
| program flow instead of querying back out to one or more
| databases to retrieve relevant metadata.
|
| They'll also start to enable things like search via image for
| features like "show us your favorite outfit" pulling up a
| customized wardrobe based on individual items extracted from
| the photo and run through the embedder.
|
| Just one of many ways these products will exist outside of
| RAG. I think we'll actually see a lot of the opposite -- GAR.
| dmezzetti wrote:
| It's also important to understand why we'd even want vectors in
| the first place.
|
| This article covers the advantages of semantic search over
| keyword search - https://medium.com/neuml/getting-started-with-
| semantic-searc...
| yagami_takayuki wrote:
| For recognizing features such as hair and skin color, which would
| do a better job? Machine learning with image classification? Or a
| vector database?
|
| I've had Weaviate return a dog given a human as input when doing
| an image similarity search, so I was wondering if there's some
| way to improve the results or whether I'm barking up the wrong
| tree.
| bfeynman wrote:
| I would think you could improve your embedding space to address
| that issue, partially. Similarity search (as a result of some
| contrastive loss) definitely suffers at the tails, and the OOD
| (out-of-distribution) behavior is pretty bad. That being said,
| you're more likely to have
| higher recall than a more classical technique.
| dimatura wrote:
| For those two in particular? You'd definitely get a better
| result with an ML model such as a convolutional neural net. In
| some sense, using an image similarity query is a kind of ML
| model - nearest neighbor - which can work in some scenarios.
| But for this specifically, I'd recommend a CNN.
| m00x wrote:
| You don't use vector databases independently; you need to input
| the embedding from an ML model.
|
| For your use-case, it should be pretty simple. You could use a
| CNN and train it, or use YOLO, Deepface, or other face
| detection algos, then within the face, find the hair and find
| the skin.
|
| From there you can use a vector database to get the colors that
| resemble other inputs, or you can use a simple CNN to classify
| the hair and skin to the closest label.
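|
| A minimal sketch of the generic version: use a pretrained CNN
| as the embedder by dropping its classification head
| (torchvision here; the input tensor is a stand-in for a real
| preprocessed image):
|
|     import torch
|     from torchvision.models import resnet18, ResNet18_Weights
|
|     model = resnet18(weights=ResNet18_Weights.DEFAULT)
|     model.fc = torch.nn.Identity()  # strip classifier -> 512-d output
|     model.eval()
|
|     with torch.no_grad():
|         img = torch.randn(1, 3, 224, 224)  # stand-in image tensor
|         embedding = model(img)             # shape (1, 512)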
| dimatura wrote:
| Any recommendations for an embedded embedding database (heh)?
| Embedded, as in sqlite. For smaller-scale problems, but hopefully
| more convenient than, say, LMDB + FAISS.
| ripley12 wrote:
| FWIW, Simon Willison's `llm` tool just uses SQLite plus a few
| UDFs. The simplicity of that approach is appealing to me but I
| don't have a good sense of when+why it becomes insufficient.
| dimatura wrote:
| Thanks, I'll check it out.
| pjot wrote:
| I actually just finished a POC using DuckDB that does
| similarity search for HN comments.
|
| https://github.com/patricktrainer/hackernews-comment-search
| PhilippGille wrote:
| For Python I believe Chroma [1] can be used embedded.
|
| For Go I recently started building chromem-go, inspired by the
| Chroma interface: https://github.com/philippgille/chromem-go
|
| It's neither advanced nor for scale yet, but the RAG demo
| works.
|
| [1] https://github.com/chroma-core/chroma
| catketch wrote:
| https://github.com/asg017/sqlite-vss
| dimatura wrote:
| awesome, how did I not find this :D
| ripley12 wrote:
| The "Do you need a Dedicated Vector Database?" slide is quite
| interesting, but doesn't really answer its own question! This is
| something I've been wondering myself, if anyone has any
| guidelines or rules of thumb I would appreciate it.
|
| I've recently been using Simon Willison's (hi simonw) excellent
| `llm` tool that can help with embeddings, and it takes the
| simplest approach possible: just store embeddings in SQLite with
| a few UDFs for calculating distance etc.
|
| The simplicity of that approach is very appealing, but presumably
| at some level of traffic+data an application will outgrow it and
| need a more specialized database. Does anyone have a good
| intuition for where that cutoff might be?
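|
| To make that concrete, a minimal sketch of the general idea (my
| guess at the shape of it, not llm's actual schema): vectors as
| blobs plus a distance UDF. Every query scores all rows, which
| is exactly the scaling limit I'm wondering about:
|
|     import sqlite3
|     import numpy as np
|
|     def cosine_distance(a, b):
|         x = np.frombuffer(a, dtype=np.float32)
|         y = np.frombuffer(b, dtype=np.float32)
|         sim = float(x @ y) / float(
|             np.linalg.norm(x) * np.linalg.norm(y))
|         return 1.0 - sim
|
|     db = sqlite3.connect(":memory:")
|     db.create_function("cosine_distance", 2, cosine_distance)
|     db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, vec BLOB)")
|     vec = np.random.randn(384).astype(np.float32)
|     db.execute("INSERT INTO docs (vec) VALUES (?)", (vec.tobytes(),))
|     # O(n) per search: SQLite calls the UDF for every row
|     rows = db.execute(
|         "SELECT id FROM docs ORDER BY cosine_distance(vec, ?) LIMIT 5",
|         (vec.tobytes(),)).fetchall()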
| summarity wrote:
| I've had some success scaling a quick and dirty engine (that
| powers findsight.ai) to tens of millions of vectors, details in
| the talk here: https://youtu.be/elNrRU12xRc?t=1556
|
| That's maybe 1kLOC so I didn't need an external one after all.
| ok123456 wrote:
| Are these dedicated vector databases doing anything more
| complicated than what can be accomplished using Postgres with the
| Cube extension?
| m00x wrote:
| You don't need a dedicated vector db, you can use pgvector.
|
| You could maybe use Cube for Euclidean-space search, but you're
| better off using optimized algorithms for embedding space
| search.
|
| https://github.com/pgvector/pgvector
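|
| A minimal pgvector sketch via psycopg2 (assumes a running
| Postgres with the extension installed; the table and connection
| details are made up):
|
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=test")  # made-up connection
|     cur = conn.cursor()
|     cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
|     cur.execute("CREATE TABLE IF NOT EXISTS items "
|                 "(id bigserial PRIMARY KEY, embedding vector(3))")
|     cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]')")
|     # <=> is pgvector's cosine-distance operator; <-> is L2
|     cur.execute("SELECT id FROM items "
|                 "ORDER BY embedding <=> '[1,1,1]' LIMIT 5")
|     print(cur.fetchall())
|     conn.commit()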
| gk1 wrote:
| Yes, see: https://www.pinecone.io/blog/hnsw-not-enough/
| shreyas44 wrote:
| From a few months ago -
| https://news.ycombinator.com/item?id=35814381
| jhj wrote:
| I don't know why PQ is listed as an "indexing strategy". It's a
| vector compression/quantization technique, not a means of
| partitioning the search space. You could encode vectors with PQ
| when using brute-force/flat index, an IVF index, with HNSW (all
| of which are present in Faiss with PQ encoding as IndexPQ,
| IndexIVFPQ and IndexHNSWPQ respectively), or even k-D trees or
| ANNOY if someone wanted to do that.
|
| "Use HNSW or Annoy for very large datasets where query speed is
| more important than precision": Graph-based methods have huge
| memory overhead and construction cost, and they aren't practical
| for billion-scale datasets. Also, they will usually be more
| accurate and faster than IVF techniques (as you would need to
| visit a large number of IVF cells to get comparable accuracy),
| though IVF can scale to trillion-sized databases without much
| overhead yet with reasonable speed/accuracy tradeoffs unlike
| other techniques. I'd say "use for medium-scale datasets where
| query speed is important, yet high accuracy is still desired and
| flat/brute-force indexing is impractical".
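|
| To illustrate in Faiss (toy sizes): the same PQ encoding sits
| under either a flat index or an IVF index -- PQ only compresses
| the stored vectors, the partitioning is separate:
|
|     import faiss
|     import numpy as np
|
|     d, nlist, M, nbits = 64, 100, 8, 8  # dims, IVF cells, PQ params
|     xb = np.random.randn(10_000, d).astype(np.float32)
|
|     flat_pq = faiss.IndexPQ(d, M, nbits)  # brute force over PQ codes
|     quantizer = faiss.IndexFlatL2(d)      # coarse quantizer for IVF
|     ivf_pq = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)
|
|     for index in (flat_pq, ivf_pq):
|         index.train(xb)  # PQ codebooks (and IVF centroids) need training
|         index.add(xb)
|
|     D, I = ivf_pq.search(xb[:1], 5)  # distances and ids of top-5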
| m00x wrote:
| It turns a continuous space into a discrete space. You would
| first do PQ, then do kNN on the quantized vectors. This way you
| can compress the vocabulary to a fixed size.
| perone wrote:
| I find it interesting how everyone ignores EuclidesDB
| (https://euclidesdb.readthedocs.io), which came before Milvus
| and others in 2018, and it is free and open-source. The same
| goes for all the presentations from the major DBs.
___________________________________________________________________
(page generated 2024-01-12 23:00 UTC)