[HN Gopher] Embeddings are a good starting point for the AI curi...
___________________________________________________________________
Embeddings are a good starting point for the AI curious app
developer
Author : bryantwolf
Score : 333 points
Date : 2024-04-17 17:09 UTC (5 hours ago)
(HTM) web link (bawolf.substack.com)
(TXT) w3m dump (bawolf.substack.com)
| dullcrisp wrote:
| Is there any easy way to run the embedding logic locally? Maybe
| even locally to the database? My understanding is that they're
| hitting OpenAI's API to get the embedding for each search query
| and then storing that in the database. I wouldn't want my search
| function to be dependent on OpenAI if I could help it.
| dvt wrote:
| Yes, I use fastembed-rs[1] in a project I'm working on and it
| runs flawlessly. You can store the embeddings in any boring
| database (it's just an array of f32s at the end of the day).
| But for fast vector math (which you need for similarity
| search), a vector database is recommended, e.g. the pgvector[2]
| postgres extension.
|
| [1] https://github.com/Anush008/fastembed-rs
|
| [2] https://github.com/pgvector/pgvector
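|
| If you go the pgvector route, a minimal sketch from Python might
| look like this (assuming Postgres with the pgvector extension
| available and the psycopg driver installed; the table name and
| the 3-dimensional toy vectors are purely illustrative):
|
|     import psycopg
|
|     conn = psycopg.connect("dbname=demo")
|     with conn, conn.cursor() as cur:
|         cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
|         cur.execute("CREATE TABLE IF NOT EXISTS items "
|                     "(id bigserial PRIMARY KEY, embedding vector(3))")
|         # In practice the vector comes from an embedding model;
|         # 3 dimensions keeps the demo readable.
|         cur.execute("INSERT INTO items (embedding) VALUES ('[0.1,0.2,0.3]')")
|         # <=> is pgvector's cosine distance operator; smaller = more similar.
|         cur.execute("SELECT id FROM items "
|                     "ORDER BY embedding <=> '[0.1,0.2,0.25]' LIMIT 5")
|         print(cur.fetchall())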
| J_Shelby_J wrote:
| Fun timing!
|
| I literally just published my first crate: candle_embed[1]
|
| It uses Candle under the hood (the crate is more of a user
| friendly wrapper) and lets you use any model on HF like the
| new SoTA model from Snowflake[2].
|
| [1] https://github.com/ShelbyJenkins/candle_embed [2]
| https://huggingface.co/Snowflake/snowflake-arctic-embed-l
| simonw wrote:
| There are a bunch of embedding models you can run on your own
| machine. My LLM tool has plugins for some of those:
|
| -
| https://llm.datasette.io/en/stable/plugins/directory.html#em...
|
| Here's how to use them:
| https://simonwillison.net/2023/Sep/4/llm-embeddings/
| notakash wrote:
| If you're building an iOS app, I've had success storing vectors
| in coredata and using a tiny coreml model that runs on device
| for embedding and then doing cosine similarity.
| ngalstyan4 wrote:
| We provide this functionality in Lantern cloud via our Lantern
| Extras extension:
| <https://github.com/lanterndata/lantern_extras>
|
| You can generate CLIP embeddings locally on the DB server via:
|
|     SELECT abstract, introduction, figure1,
|            clip_text(abstract) AS abstract_ai,
|            clip_text(introduction) AS introduction_ai,
|            clip_image(figure1) AS figure1_ai
|     INTO papers_augmented
|     FROM papers;
|
| Then you can search for embeddings via:
|
|     SELECT abstract, introduction
|     FROM papers_augmented
|     ORDER BY clip_text(query) <=> abstract_ai
|     LIMIT 10;
|
| The approach significantly decreases search latency and results
| in cleaner code. As an added bonus, EXPLAIN ANALYZE can now
| tell you the percentage of time spent in embedding generation vs
| search.
|
| The linked library enables embedding generation for a dozen
| open source models and proprietary APIs (list here:
| <https://lantern.dev/docs/develop/generate>), and adding new
| ones is really easy.
| charlieyuan wrote:
| Lantern seems really cool! Interestingly we did try CLIP
| (openclip) image embeddings but the results were poor for
| 24px by 24px icons. Any ideas?
|
| Charlie @ v0.app
| ngalstyan4 wrote:
| I have tried CLIP on my personal photo album collection and
| it worked really well there - I could write detailed scene
| descriptions of past road trips, and the photos I had in
| mind would pop up. Probably the model is better for
| everyday photos than for icons
| jmorgan wrote:
| Support for _some_ embedding models works in Ollama (and
| llama.cpp - Bert models specifically):
|
|     ollama pull all-minilm
|
|     curl http://localhost:11434/api/embeddings -d '{
|       "model": "all-minilm",
|       "prompt": "Here is an article about llamas..."
|     }'
|
| Embedding models run quite well even on CPU since they are
| smaller models. There are other implementations with a library
| form factor like transformers.js
| https://xenova.github.io/transformers.js/ and sentence-
| transformers https://pypi.org/project/sentence-transformers/
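|
| A minimal sentence-transformers sketch, for reference (the model
| name is just one common small choice, not the only option):
|
|     # Local embeddings on CPU with sentence-transformers.
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     docs = ["Here is an article about llamas...",
|             "A guide to alpaca husbandry"]
|     embeddings = model.encode(docs)   # numpy array, shape (2, 384)
|     print(embeddings.shape)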
| laktek wrote:
| If you are building using Supabase stack (Postgres as DB with
| pgVector), we just released a built-in embedding generation API
| yesterday. This works both locally (on CPUs) and you can deploy
| it without any modifications.
|
| Check this video on building Semantic Search in Supabase:
| https://youtu.be/w4Rr_1whU-U
|
| Also, the announcement blog post with links to text versions of
| the tutorials: https://supabase.com/blog/ai-inference-now-
| available-in-supa...
| jonplackett wrote:
| So handy! I already got some embeddings working with supabase
| pgvector and OpenAI and it worked great.
|
| What would the cost of running this be like compared to the
| OpenAI embedding api?
| laktek wrote:
| There are no extra costs other than what we'd normally
| charge for Edge Function invocations (you get up to 500K in
| the free plan and 2M in the Pro plan)
| bryantwolf wrote:
| This is a good call out. OpenAI embeddings were simple to stand
| up, pretty good, cheap at this scale, and accessible to
| everyone. I think that makes them a good starting point for
| many people. That said, they're closed-source, and there are
| open-source embeddings you can run on your infrastructure to
| reduce external dependencies.
| jonnycoder wrote:
| The MTEB leaderboard has you covered. It's the go-to for
| finding the leading embedding models and I believe many of them
| can run locally.
|
| https://huggingface.co/spaces/mteb/leaderboard
| xyc wrote:
| Have just done this recently for local chat with pdf feature in
| https://recurse.chat. (It's a macOS app that bundles a
| llama.cpp server and local vector database)
|
| Running an embedding server locally is pretty straightforward:
|
| - Get llama.cpp release binary:
| https://github.com/ggerganov/llama.cpp/releases
|
| - Get a GGUF file:
|   https://huggingface.co/CompendiumLabs/bge-base-en-v1.5-gguf and
|   https://huggingface.co/nomic-ai/nomic-embed-text-v1-GGUF are
|   some good choices to start
|
| - Point the server to your embedding GGUF file (-m your.gguf) and
|   start the server (documentation:
|   https://github.com/ggerganov/llama.cpp/blob/master/examples/...).
|   Make sure you also add --embedding to the server startup params.
|
| Then use it from an HTTP client:
|
|     curl http://localhost:8080/v1/embeddings \
|       -H "Content-Type: application/json" \
|       -H "Authorization: Bearer no-key" \
|       -d '{ "input": "hello" }'
|
| You can then store the embedding vector in local vector DBs of
| your choice. You can refer to this comprehensive comparison:
| https://superlinked.com/vector-db-comparison.
| thisiszilff wrote:
| One straightforward way to get started is to understand embeddings
| without any AI/deep learning magic. Just pick a vocabulary of
| words (say, some 50k words), pick a unique index between 0 and
| 49,999 for each of the words, and then produce an embedding by
| adding +1 at the given word's index each time that word occurs
| in a text. Then normalize the embedding so it adds up to one.
|
| Presto -- embeddings! And you can use cosine similarity with them
| and all that good stuff and the results aren't totally terrible.
|
| The rest of "embeddings" builds on top of this basic strategy
| (smaller vectors, filtering out words/tokens that occur
| frequently enough that they don't signify similarity, handling
| synonyms or words that are related to one another, etc. etc.).
| But stripping out the deep learning bits really does make it
| easier to understand.
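|
| A toy version in Python, in case it helps (made-up five-word
| vocabulary, and normalizing to unit length rather than to a sum
| of one, since we take cosine similarity right after):
|
|     import numpy as np
|
|     # One slot per vocabulary word; tiny stand-in for a 50k vocab.
|     vocab = {"the": 0, "cat": 1, "sat": 2, "dog": 3, "ran": 4}
|
|     def embed(text):
|         vec = np.zeros(len(vocab))
|         for word in text.lower().split():
|             if word in vocab:   # out-of-vocabulary words are dropped
|                 vec[vocab[word]] += 1
|         norm = np.linalg.norm(vec)
|         return vec / norm if norm else vec
|
|     a, b = embed("the cat sat"), embed("the dog sat")
|     print(float(a @ b))   # cosine similarity, since both are unit length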
| pstorm wrote:
| I'm trying to understand this approach. Maybe I am expecting
| too much out of this basic approach, but how does this create a
| similarity between words with indices close to each other?
| Wouldn't it just be a popularity contest - the more common
| words have higher indices and vice versa? For instance, "king"
| and "prince" wouldn't necessarily have similar indices, but
| they are semantically very similar.
| zachrose wrote:
| Maybe the idea is to order your vocabulary into some kind of
| "semantic rainbow"? Like a one-dimensional embedding?
| svieira wrote:
| You are expecting too much out of this basic approach. The
| "simple" similarity search in word2vec (used in
| https://semantle.com/ if you haven't seen it) is based on
| _multiple_ embeddings like this one (it's a simple neural
| network not a simple embedding).
| sdwr wrote:
| It doesn't even work as described for popularity - one word
| starts at 49,999 and one starts at 0.
| itronitron wrote:
| Yeah, that is a poorly written description. I think they
| meant that each word gets a unique index location into an
| array, and the value at that word's index location is
| incremented whenever the word occurs.
| jncfhnb wrote:
| King doesn't need to appear commonly with prince. It just
| needs to appear in the same context as prince.
|
| It also leaves out the old "tf idf" normalization of
| considering how common a word is broadly (less interesting)
| vs in that particular document. Kind of like a shittier
| attention. Used to make a big difference.
| im3w1l wrote:
| It's a document embedding, not a word embedding.
| afro88 wrote:
| How does this enable cosine similarity usage? I don't get the
| link between incrementing a word's index by its count in a
| text and how this ends up with words that have similar meanings
| having a high cosine similarity value
| sell_dennis wrote:
| You're right, that approach doesn't enable getting embeddings
| for an individual word. But it would work for comparing
| similarity of documents - not that well of course, but it's a
| toy example that might feel more intuitive
| twelfthnight wrote:
| I think they are talking about bag-of-words. If you apply a
| dimensionality reduction technique like SVD or even random
| projection on bag-of-words, you can effectively create a
| basic embedding. Check out latent semantic indexing / latent
| semantic analysis.
| mschulkind wrote:
| Aren't you just describing a bag-of-words model?
|
| https://en.wikipedia.org/wiki/Bag-of-words_model
| thisiszilff wrote:
| Yes! And the follow up that cosine similarity (for BoW) is a
| super simple similarity metric based on counting up the
| number of words the two vectors have in common.
| dekhn wrote:
| Is that really an embedding? I normally think of an embedding
| as an approximate lower-dimensional matrix of coefficients that
| operate on a reduced set of composite variables that map the
| data from a nonlinear to linear space.
| thisiszilff wrote:
| You're right that what I described isn't what people commonly
| think about as embeddings (given how far things have moved beyond
| the above description), but broadly an embedding is anything (in
| nlp at least) that maps text into a fixed-length vector. When
| you make embeddings like this, the nice thing is that cosine
| similarity has an easy-to-understand meaning:
| count the number of words two documents have in common
| (subject to some normalization constant).
|
| Most fancy modern embedding strategies basically start with
| this and then proceed to build on top of it to reduce
| dimensions, represent words as vectors in their own right,
| pass this into some neural layer, etc.
| HarHarVeryFunny wrote:
| Those would really just be identifiers. I think the key
| property of embeddings is that the dimensions each individually
| mean/measure something, and therefore the dot product of two
| embeddings (similarity of direction of the vectors) is a
| meaningful similarity measure of the things being represented.
|
| The classic example is word embeddings such as word2vec, or
| GloVE, where due to the embeddings being meaningful in this
| way, one can see vector relationships such as "man - woman" =
| "king - queen".
| IanCal wrote:
| They're not, I get why you think that though.
|
| They're making a vector for a text that's the term
| frequencies in the document.
|
| It's one step simpler than tfidf which is a great starting
| point.
| OmarShehata wrote:
| Are you saying it's pure chance that operations like "man -
| woman" = "king - queen" (and many, many other similar
| relationships and analogies) work?
|
| If not please explain this comment to those of us ignorant
| in these matters :)
| StrangeDoctor wrote:
| It's not pure chance that the above calculus shakes out,
| but it doesn't have to be that way. If you are embedding
| on a word-by-word level then it can happen; if the unit is a
| little smaller or larger than word-by-word, it's not
| immediately clear what the calculation is doing.
|
| But the main difference here is you get 1 embedding for
| the document in question, not an embedding per word like
| word2vec. So it's something more like "document about
| OS/2 warp" - "wiki page for ibm" + "wiki page for
| Microsoft" = "document on windows 3.1"
| HarHarVeryFunny wrote:
| OK, sounds counter-intuitive, but I'll take your word for
| it!
|
| It seems odd since the basis of word similarity captured in
| this type of way is that word meanings are associated with
| local context, which doesn't seem related to these global
| occurrence counts.
|
| Perhaps it works because two words with similar occurrence
| counts are more likely to often appear close to each other
| than two words where one has a high count, and another a
| small count? But this wouldn't seem to work for small
| counts, and anyways the counts are just being added to the
| base index rather than making similar-count words closer in
| the embedding space.
|
| Do you have any explanation for why this captures any
| similarity in meaning?
| IanCal wrote:
| > rather than making similar-count words closer in the
| embedding space.
|
| Ah I think I see the confusion here. They are describing
| creating an embedding of a _document_ or piece of text.
| At the base, the embedding of a single word would just be
| a single 1. There is absolutely no help with word
| similarity.
|
| The problem of multiple meanings isn't solved by this
| approach at all, at least not directly.
|
| Talking about the "gravity of a situation" in a political
| piece makes the text _a bit more similar_ to physics
| discussions about gravity. But most of the words won't
| match as well, so your document vector is still more
| similar to other political pieces than physics.
|
| Going up the scale, here's a few basic starting points
| that were (are?) the backbone of many production text
| AI/ML systems.
|
| 1. Bag of words. Here your vector has a 1 for words that
| are present, and 0 for ones that aren't.
|
| 2. Bag of words with a count. A little better, now we've
| got the information that you said "gravity" fifty times
| not once. Normalise it so text length doesn't matter and
| everything fits into 0-1.
|
| 3. TF-IDF. It's not very useful to know that you said a
| common word a lot. Most texts do; what we care about are the
| ones that say it more than you'd expect, so we take into
| account how often the words appear in the entire corpus.
|
| These don't help with words, but given how simple they
| are they are shockingly useful. They have their stupid
| moments, although one benefit is that it's very easy to
| debug why they cause a problem.
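|
| For what it's worth, all three baselines are a few lines with
| scikit-learn (the documents here are made up):
|
|     from sklearn.feature_extraction.text import (
|         CountVectorizer, TfidfVectorizer)
|     from sklearn.metrics.pairwise import cosine_similarity
|
|     docs = ["gravity of the political situation",
|             "gravity bends light",
|             "the situation in parliament"]
|
|     bow = CountVectorizer(binary=True).fit_transform(docs)  # 1. presence
|     counts = CountVectorizer().fit_transform(docs)          # 2. counts
|     tfidf = TfidfVectorizer().fit_transform(docs)           # 3. tf-idf
|
|     # Pairwise similarity of the first document against the others.
|     print(cosine_similarity(tfidf[0], tfidf[1:]))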
| thisiszilff wrote:
| > I think the key property of embeddings is that the
| dimensions each individually mean/measure something, and
| therefore the dot product of two embeddings (similarity of
| direction of the vectors) is a meaningful similarity measure
| of the things being represented.
|
| In this case each dimension is the presence of a word in a
| particular text. So when you take the dot product of two
| texts you are effectively counting the number of words the
| two texts have in common (subject to some normalization
| constants depending on how you normalize the embedding).
| Cosine similarity still works for even these super naive
| embeddings which makes it slightly easier to understand
| before getting into any mathy stuff.
|
| You are 100% right this won't give you the word embedding
| analogies like king - man = queen or stuff like that. This
| embedding has no concept of relationships between words.
| HarHarVeryFunny wrote:
| But that doesn't seem to be what you are describing in
| terms of using incrementing indices and adding occurrence
| counts.
|
| If you want to create a bag of words text embedding then
| you set the number of embedding dimensions to the
| vocabulary size and the value of each dimension to the
| global count of the corresponding word.
| thisiszilff wrote:
| Heh -- my explanation isn't the clearest I realize, but
| yes, it is BoW.
|
| Eg fix your vocab of 50k words (or whatever) and
| enumerate it.
|
| Then to make an embedding for some piece of text
|
| 1. initialize an all-zero vector of size 50k
|
| 2. for each word in the text, add one at the index of the
| corresponding word (per our enumeration). If the word isn't in
| the 50k words of your vocabulary, discard it
|
| 3. (optionally) normalize the embedding to 1 (though you don't
| really need this and can leave it off for the toy example)
| m1117 wrote:
| ah pgvector is kind of annoying to start with, you have to set it
| up and maintain, and then it starts falling apart when you have
| more vectors
| sdesol wrote:
| Can you elaborate more on the falling apart? I can see pgvector
| being intimidating for users with no experience standing up a
| DB, but I don't see how Postgres or pgvector would fall apart.
| Note, my reason for asking is I'm planning on going all in with
| Postgres, so pgvector makes sense for me.
| cargobuild wrote:
| https://www.pinecone.io/blog/pinecone-vs-pgvector/ check it
| out :)
| hackernoteng wrote:
| What is "more vectors"? How many are we talking about? We've
| been using pgvector in production for more than 1 year without
| any issues. We don't have a ton of vectors, less than 100,000,
| and we filter queries by other fields so our total per cosine
| function is probably more like max of 5000. Performance is fine
| and no issues.
| patrick-fitz wrote:
| Nice project! I find it can be hard to think of an idea that is
| well suited to AI. Using embeddings for search is definitely
| a good option to start with.
| ParanoidShroom wrote:
| I made a reverse image search when I learned about embeddings.
| It's pretty fun to work with images
| https://medium.com/@christophe.smet1/finding-dirty-xtc-with-...
| LunaSea wrote:
| Does anyone have examples of word (ngram) disambiguation when
| doing Approximate Nearest Neighbour (ANN) on word vector
| embeddings?
| Imnimo wrote:
| One of the challenges here is handling homonyms. If I search in
| the app for "king", most of the top ten results are "ruler" icons
| - showing a measuring stick. Rodent returns mostly computer mice,
| etc.
|
| https://www.v0.app/search?q=king
|
| https://www.v0.app/search?q=rodent
|
| This isn't a criticism of the app - I'd rather get a few funny
| mismatches in exchange for being able to find related icons. But
| it's an interesting puzzle to think about.
| charlieyuan wrote:
| Good call out! We think of this as a two part problem.
|
| 1. The intent of the user. Is it a description of the look of
| the icon or the utility of the icon?
|
| 2. How best to rank the results, which is a combination of
| intent, CTR of past search queries, bootstrapping popularity via
| usage on open source projects, etc.
|
| - Charlie of v0.app
| itronitron wrote:
| >> If I search in the app for "king", most of the top ten
| results are "ruler" icons
|
| I believe that's the measure of a man.
| bryantwolf wrote:
| Yeah, these can be cute, but they're not ideal. I think the
| user feedback mechanism could help naturally align this over
| time, but it would also be gameable. It's all interesting stuff
| jonnycoder wrote:
| As the op, you can do both semantic search (embedding) and
| keyword search. Some RAG techniques call out using both for
| better results. Nice product by the way!
| dceddia wrote:
| I was reading this article and thinking about things like, in
| the case of doing transcription, if you heard the spoken word
| "sign" in isolation you couldn't be sure whether it meant road
| sign, spiritual sign, +/- sign, or even the sine function. This
| seems like a similar problem where you pretty much require
| context to make a good guess, otherwise the best it could do is
| go off of how many times the word appears in the dataset right?
| Is there something smarter it could do?
| joshspankit wrote:
| This is imo the worst part of embedding search.
|
| Somehow Amazon continues to be the leader in muddy results
| which is a sign that it's a huge problem domain and not easily
| fixable even if you have massive resources.
| EcommerceFlow wrote:
| Embeddings have a special place in my heart since I learned about
| them 2 years ago. Working in SEO, it felt like everything finally
| "clicked" and I understood, on a lower level, how Google search
| actually works, how they're able to show specific content
| snippets directly on the search results page, etc. I never found
| any "SEO Guru" discussing this at all back then (maybe even
| now?), even though this was complete gold. It explains "topical
| authority" and gave you clues on how Google itself understands
| it.
| minimaxir wrote:
| One of my biggest annoyances with the modern AI tooling hype is
| the implication that you need a vector store just to work with
| embeddings. You don't.
|
| The reason vector stores are important for production use-cases
| is mostly latency-related for larger sets of data (100k+
| records), but if you're working on a toy project just learning
| how to use embeddings, you can compute cosine distance with a
| couple lines of numpy by doing a dot product of a normalized
| query vector with a matrix of normalized records.
|
| Best of all, it gives you a reason to use Python's @ operator,
| which with numpy matrices does a dot product.
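|
| Something like this, to make it concrete (random data standing
| in for real embeddings):
|
|     import numpy as np
|
|     records = np.random.randn(10_000, 384)
|     records /= np.linalg.norm(records, axis=1, keepdims=True)  # unit rows
|
|     query = np.random.randn(384)
|     query /= np.linalg.norm(query)
|
|     scores = records @ query           # cosine similarities, shape (10000,)
|     top_k = np.argsort(-scores)[:10]   # indices of the 10 closest records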
| twelfthnight wrote:
| Even in production my guess is most teams would be better off
| just rolling their own embedding model (huggingface) + caching
| (redis/rocksdb) + FAISS (nearest neighbor) and be good to go. I
| suppose there is some expertise needed, but working with a
| vector database vendor has major drawbacks too.
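|
| Rough sketch of the FAISS piece (sizes and data are made up;
| normalizing first so inner product equals cosine similarity):
|
|     import faiss
|     import numpy as np
|
|     dim = 384
|     vectors = np.random.randn(100_000, dim).astype("float32")
|     faiss.normalize_L2(vectors)            # in-place L2 normalization
|
|     index = faiss.IndexFlatIP(dim)         # exact inner-product search
|     index.add(vectors)
|
|     query = np.random.randn(1, dim).astype("float32")
|     faiss.normalize_L2(query)
|     scores, ids = index.search(query, 10)  # top-10 most similar vectors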
| hackernoteng wrote:
| Using Postgres with pgvector is trivial and cheap. It's also
| available on AWS RDS.
| jonplackett wrote:
| Also on supabase!
| danielbln wrote:
| Or you just shove it into Postgres + pg_vector and use
| the DBMS you already use anyway.
| christiangenco wrote:
| Yup. I was just playing around with this in Javascript
| yesterday and with ChatGPT's help it was surprisingly simple to
| go from text => embedding (via `openai.embeddings.create`) and
| then to compare the embedding similarity with the cosine
| distance (which ChatGPT wrote for me):
| https://gist.github.com/christiangenco/3e23925885e3127f2c177...
|
| Seems like the next standard feature in every app is going to
| be natural language search powered by embeddings.
| minimaxir wrote:
| For posterity, OpenAI embeddings come pre-normalized so you
| can immediately dot-product.
|
| Most embeddings providers do normalization by default, and
| SentenceTransformers has a normalize_embeddings parameter
| which does that. (it's a wrapper around PyTorch's
| F.normalize)
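|
| e.g. with sentence-transformers (model name is just an example),
| normalized embeddings mean a plain dot product is already the
| cosine similarity:
|
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     vecs = model.encode(["ruler icon", "measuring stick"],
|                         normalize_embeddings=True)
|     print(vecs[0] @ vecs[1])   # unit vectors, so this is cosine similarity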
| itronitron wrote:
| Does anyone know the provenance for when vectors started to be
| called embeddings?
| minimaxir wrote:
| I think it was due to GloVe embeddings back then: I don't
| recall them ever being called GloVe vectors, although the
| "Ve" does stand for vector so it could have been RAS
| syndrome.
| itronitron wrote:
| >> https://nlp.stanford.edu/projects/glove/
|
| A quick scan of the project website yields zero uses of
| 'embedding' and 23 of 'vector'
| minimaxir wrote:
| It's how I remember it when I was working with them back
| in the day (word embeddings): I could be wrong.
| bryantwolf wrote:
| As an individual, I love the idea of pushing to simplify even
| further to understand these core concepts. For the ecosystem, I
| like that vector stores make these features accessible to
| environments outside of Python.
| simonw wrote:
| If you ask ChatGPT to give you a cosine similarity function
| that works against two arrays of floating numbers in any
| programming language you'll get the code that you need.
|
| Here's one in JavaScript (my prompt was "cosine similarity
| function for two javascript arrays of floating point
| numbers"): function cosineSimilarity(vecA,
| vecB) { if (vecA.length !== vecB.length) {
| throw "Vectors do not have the same dimensions";
| } let dotProduct = 0.0; let normA =
| 0.0; let normB = 0.0; for (let i = 0;
| i < vecA.length; i++) { dotProduct += vecA[i]
| * vecB[i]; normA += vecA[i] ** 2;
| normB += vecB[i] ** 2; } if (normA
| === 0 || normB === 0) { throw "One of the
| vectors is zero, cannot compute similarity"; }
| return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
| }
|
| Vector stores really aren't necessary if you're dealing with
| less than a few hundred thousand vectors - load them up in a
| bunch of in-memory arrays and run a function like that
| against them using brute-force.
| bryantwolf wrote:
| I love it!
| hereonout2 wrote:
| 100k records is still pretty small!
|
| It feels a bit like the hype that happened with "big data".
| People ended up creating spark clusters to query a few million
| records. Or using Hadoop for a dataset you could process with
| awk.
|
| Professionally I've only ever worked with dataset sizes in the
| region of low millions and have never needed specialist tooling
| to cope.
|
| I assume these tools do serve a purpose but perhaps one that
| only kicks in at a scale approaching billions.
| smahs wrote:
| This sentiment is pretty common I guess. Outside of a niche,
| the massive scale for which a vast majority of the data tech
| was designed doesn't exist and KISS wins outright. Though I
| guess that's evolution, we want to test the limits in pursuit
| of grandeur before mastering the utility (ex. pyramids).
| jonnycoder wrote:
| KISS doesn't get me employed though. I narrowly missed
| being the chosen candidate for a State job which called for
| Apache Spark experience. I missed two questions relating to
| Spark and "what is a parquet file?" but otherwise did great
| on the remaining behavioral questions (the hiring manager
| gave me feedback after requesting it). Too bad they did not
| have a question about processing data using command-line
| tools.
| mritchie712 wrote:
| yeah, glad the hype around big data is dead. Not a lot of
| solid numbers in here, but this post covers it well[0].
|
| We have duckdb embedded in our product[1] and it works
| perfectly well for billions of rows of data without the
| hadoop overhead.
|
| 0 - https://motherduck.com/blog/big-data-is-dead/
|
| 1 - https://www.definite.app/
| hot_gril wrote:
| I've been in the "mid-sized" area a lot where Numpy etc
| cannot handle it, so I had to go to Postgres or more
| specialized tooling like Spark. But I always started with the
| simple thing and only moved up if it didn't suffice.
|
| Similarly, I read how Postgres won't scale for a backend
| application and I should use Citus, Spanner, or some NoSQL
| thing. But that day has not yet arrived.
| jerrygenser wrote:
| Numpy might not be able to handle a full o(n^2) comparison
| of vectors but you can use a lib with HNSW and it can have
| great performance on medium (and large) datasets.
| cornel_io wrote:
| Right on: I've used a single Postgres database on AWS to
| handle 1M+ concurrent users. If you're Google, sure, not
| gonna cut it, but for most people these things scale
| vertically a _lot_ further than you'd expect (especially
| if, like me, you grew up in the pre-SSD days and couldn't
| get hundreds of gigs of RAM on a cloud instance).
|
| Even when you do pass that point, you can often shard to
| achieve horizontal scalability to at least some degree,
| since the real heavy lifting is usually easy to break out
| on a per-user basis. Some apps won't permit that (if you've
| got cross-user joins then it's going to be a bit of a
| headache), but at that point you've at least earned the
| right to start building up a more complex stack and
| complicating your queries to let things grow horizontally.
|
| Horizontal scaling _is_ a huge headache, any way you cut
| it, and TBH going with something like Spanner is just as
| much of a headache because you have to understand its
| limitations extremely well if you want _it_ to scale. It
| doesn't just magically make all your SQL infinitely
| scalable; things that are hard to shard are typically also
| hard to make fast on Spanner. What it's really good at is
| taking an app with huge traffic where a) all the hot
| queries _would_ be easy to shard, but b) you don't want
| the complexity of adding sharding logic (+re-sharding,
| migration, failure handling, etc), and c) the tough to
| shard queries are low frequency enough that you don't
| really care if they're slow (I guess also d) you don't care
| that it's hella expensive compared to a normal Postgres or
| MySQL box). You still need to understand a lot more than
| when using a normal DB, but it can add a lot of value in
| those cases.
| leobg wrote:
| hnswlib, usearch. Both handle tens of millions of vectors
| easily. The latter even without holding them in RAM.
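|
| A small hnswlib sketch, if anyone wants to try it (sizes are
| illustrative):
|
|     import hnswlib
|     import numpy as np
|
|     dim = 384
|     data = np.random.randn(50_000, dim).astype("float32")
|
|     index = hnswlib.Index(space="cosine", dim=dim)
|     index.init_index(max_elements=len(data), ef_construction=200, M=16)
|     index.add_items(data, np.arange(len(data)))
|
|     # 10 approximate nearest neighbours of the first vector.
|     labels, distances = index.knn_query(data[:1], k=10)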
| ertgbnm wrote:
| When I'm messing around, I normally have everything in a Pandas
| DataFrame already so I just add embeddings as a column and
| calculate cosine similarity on the fly. Even with a hundred
| thousand rows, it's fast enough to calculate before I can even
| move my eyes down on the screen to read the output.
|
| I regret ever messing around with Pinecone for my tiny and
| infrequently used set ups.
| m1117 wrote:
| Actually, I had a pretty good experience with Pinecone.
| cargobuild wrote:
| seeing comments about using pgvector... at pinecone, we spent
| some time understanding its limitations and pain points.
| pinecone eliminates these pain points entirely and makes things
| simple at any scale. check it out:
| https://www.pinecone.io/blog/pinecone-vs-pgvector/
| gregorymichael wrote:
| Has Pinecone gotten any cheaper? Last time I tried it was
| $75/month for the starter plan / single vector store.
| cargobuild wrote:
| yep. pinecone serverless has reduced costs significantly for
| many workloads.
| dvaun wrote:
| I'd love to build a suite of local tooling to play around with
| different embedding approaches.
|
| I've had great results using SentenceTransformers for quick one-
| off tasks at work for unique data asks.
|
| I'm curious about clustering within the embeddings and seeing
| what different approaches can yield and what applications they
| work best for.
| PaulHoule wrote:
| If I take 50,000 historical articles and 5,000 new articles,
| apply SBERT and then k-means with N=20, I get great results in
| terms of articles about Ukraine, sports, chemistry, and
| nerdcore from Lobsters ending up in distinct clusters.
|
| I've used DBSCAN for finding duplicate content; this is less
| successful. With the parameters I am using it is rare for there
| to be false positives, but there aren't that many true
| positives. I'm sure I could do better if I tuned it up, but
| I'm not sure if there is an operating point I'd really like.
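|
| Roughly this kind of pipeline, for anyone curious (model name,
| texts, and cluster count here are just placeholders):
|
|     from sentence_transformers import SentenceTransformer
|     from sklearn.cluster import KMeans
|
|     articles = ["Ukraine update ...", "Chemistry result ...",
|                 "Playoff recap ..."]   # stand-in corpus
|     embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(articles)
|
|     # Cluster the article embeddings; with a real corpus you'd use
|     # a larger n_clusters (e.g. 20, as above).
|     labels = KMeans(n_clusters=3, random_state=0).fit_predict(embeddings)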
| kaycebasques wrote:
| I have been saying similar things to my fellow technical writers
| ever since the ChatGPT explosion. We now have a tool that makes
| semantic search on arbitrary, diverse input much easier. Improved
| semantic search could make a lot of common technical writing
| workflows much more efficient. E.g. speeding up the mandatory
| research that you must do before it's even possible to write an
| effective doc.
| gchadwick wrote:
| For an article extolling the benefits of embeddings for
| developers looking to dip their toe into the waters of AI, it's
| odd they don't actually have an intro to embeddings or to vector
| databases. They just assume the reader already knows these
| concepts and dive right in to how they use them.
|
| Sure many do know these concepts already but they're probably not
| the people wondering about a 'good starting point for the AI
| curious app developer'.
| charlieyuan wrote:
| Apologies!
|
| Here's a good primer on embeddings from openai:
| https://platform.openai.com/docs/guides/embeddings
| simonw wrote:
| I published this pretty comprehensive intro to embeddings last
| year: https://simonwillison.net/2023/Oct/23/embeddings/
| gk1 wrote:
| To add to the other recommendations, here's a primer on vector
| DB's: https://www.pinecone.io/learn/vector-database/
| hot_gril wrote:
| This is where I got started too. GloVe embeddings stored in
| Postgres.
|
| Pgvector is nice, and it's cool seeing quick tutorials using it.
| Back then, we only had cube, which didn't do cosine similarity
| indexing out of the box (you had to normalize vectors and use
| euclidean indexes) and only supported up to 100 dimensions. And
| there were maybe other inconveniences I don't remember, cause
| front page AI tutorials weren't using it.
| isoprophlex wrote:
| PGvector is very nice indeed. And you get to store your vectors
| close to the rest of your data. I'm yet to understand the
| unique use case for dedicated vector dbs. It seems so annoying,
| having to query your vectors in a separate database without
| being able to easily join/filter based on the rest of your
| tables.
|
| I stored ~6 million hacker news posts, their metadata, and the
| vector embeddings in a cheap $20/month VM running pgvector.
| Querying is very fast. Maybe there's some penalty to pay when
| you get to the billion+ row counts, but I'm happy so far.
| hot_gril wrote:
| You can also store vectors or matrices in a split-up fashion
| as separate rows in a table, which is particularly useful if
| they're sparse. I've handled huge sparse matrix expressions
| (add, subtract, multiply, transpose) that way, cause numpy
| couldn't deal with them.
| crowcroft wrote:
| My smooth brain might not understand this properly, but the idea
| is we generate embeddings, store them, then use retrieval each
| time we want to use them.
|
| For simple things we might not need to worry about storing much,
| we can generate the embeddings and just cache them or send them
| straight to retrieval as an array or something...
|
| The storing of embeddings seems the hard part, do I need a
| special database or PG extension? Is there any reason I can't
| store them as blobs in SQLite if I don't have THAT much data,
| and I don't care too much about speed? Do generated embeddings
| ever 'expire'?
| H1Supreme wrote:
| Vector databases are used to store embeddings.
| crowcroft wrote:
| But why is that? I'm sure it's the 'best' way to do things,
| but it also means more infrastructure which for simple apps
| isn't worth the hassle.
|
| I should use redis for queues but often I'll just use a table
| in a SQLite database. For small scale projects I find it
| works fine, I'm wondering what an equivalent simple option
| for embeddings would be.
| laborcontract wrote:
| A KV store is both good enough and highly performant. I use
| Redis for storing embeddings and expire them after a while.
| Unless you have a highly specialized use case it's not
| economical to persistently store chunk embeddings.
|
| Redis also has vector search capability. However,
| the most popular answer you'll get here is to use Postgres
| (pgvector).
| crowcroft wrote:
| Redis sounds like a good option. I like that it's not more
| infrastructure, I already have redis setup for my app so I'm
| not adding more to the stack.
| alexgarcia-xyz wrote:
| Re storing vectors in BLOB columns: ya, if it's not a lot of
| data and it's fast enough for you, then there's no problem
| doing it like that. I'd even just store them in JSON/npy files
| first and see how long you can get away with it. Once that gets
| too slow, then try SQLite/redis/valkey, and when that gets too
| slow, look into pgvector or other vector database solutions.
|
| For SQLite specifically, very large BLOB columns might affect
| query performance, especially for large embeddings. For
| example, a 1536-dimension vector from OpenAI would take 1536 *
| 4 = 6144 bytes of space, if stored in a compact BLOB format.
| That's larger than SQLite's default page size of 4096, so that
| extra data will overflow into overflow pages. Which again,
| isn't too big of a deal, but if the original table had small
| values before, then table scans can be slower.
|
| One solution is to move it to a separate table, ex on an
| original `users` table, you can make a new `CREATE TABLE
| users_embeddings(user_id, embedding)` table and just LEFT JOIN
| that when you need it. Or you can use new techniques like
| Matryoshka embeddings[0] or scalar/binary quantization[1] to
| reduce the size of individual vectors, at the cost of lower
| accuracy. Or you can bump the page size of your SQLite database
| with `PRAGMA page_size=8192`.
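|
| A minimal version of the BLOB approach with the standard library
| plus numpy (the table name is made up):
|
|     import sqlite3
|     import numpy as np
|
|     conn = sqlite3.connect("demo.db")
|     conn.execute("CREATE TABLE IF NOT EXISTS user_embeddings "
|                  "(user_id INTEGER PRIMARY KEY, embedding BLOB)")
|
|     vec = np.random.randn(1536).astype(np.float32)  # OpenAI-sized vector
|     conn.execute("INSERT OR REPLACE INTO user_embeddings VALUES (?, ?)",
|                  (1, vec.tobytes()))
|     conn.commit()
|
|     blob = conn.execute("SELECT embedding FROM user_embeddings "
|                         "WHERE user_id = 1").fetchone()[0]
|     restored = np.frombuffer(blob, dtype=np.float32)  # back to 1536 dims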
|
| I also have a SQLite extension for vector search[2], but
| there's a number of usability/ergonomic issues with it. I'm
| making a new one that I hope to release soon, which will
| hopefully be a great middle ground between "store vectors in a
| .npy files" and "use pgvector".
|
| Re "do embeddings ever expire": nope! As long as you have
| access to the same model, the same text input should give the
| same embedding output. It's not like LLMs that have
| temperatures/meta prompts/a million other dials that make
| outputs non-deterministic; most embedding models should be
| deterministic and should work forever.
|
| [0] https://huggingface.co/blog/matryoshka
|
| [1] https://huggingface.co/blog/embedding-quantization
|
| [2] https://github.com/asg017/sqlite-vss
| crowcroft wrote:
| This is very useful, appreciate the insight. Storing
| embeddings in a table and joining when needed feels like a
| really nice solution for what I'm trying to do.
| kmeisthax wrote:
| Yes, you can shove the embeddings in a BLOB, but then you can't
| do the kinds of query operations you expect to be able to do
| with embeddings.
| crowcroft wrote:
| Right, like you could use it sort of like a cache and send the
| blobs to OpenAI to use their similarity API, but you couldn't
| really use SQL to do cosine similarity operations?
|
| My understanding of what's going on at a technical level
| might be a bit limited.
| kmeisthax wrote:
| Yes.
|
| Although if you really wanted to, and normalized your data
| like a good little Edgar F. Codd devotee, you could write
| something like this:
|
|     SELECT SUM(v.dot) /
|            (SQRT(SUM(v.v1 * v.v1)) * SQRT(SUM(v.v2 * v.v2)))
|     FROM (SELECT v1.dimension AS dim,
|                  v1.value AS v1,
|                  v2.value AS v2,
|                  v1.value * v2.value AS dot
|           FROM vectors AS v1
|           INNER JOIN vectors AS v2 ON v1.dimension = v2.dimension
|           WHERE v1.vector_id = "?" AND v2.vector_id = "?") AS v;
|
| This assumes one table called "vectors" with columns
| vector_id, dimension, and value; vector_id and dimension
| being primary. The inner query grabs two vectors as
| separate columns with some self-join trickery, computes the
| product of each component, and then the outer query
| computes aggregate functions on the inner query to do the
| actual cosine similarity.
|
| No I have not tested this on an actual database engine, I
| probably screwed up the SQL somehow. And obviously it's
| easier to just have a database (or Postgres extension) that
| recognizes vector data as a distinct data type and gives
| you a dedicated cosine-similarity function.
| crowcroft wrote:
| Thanks for the explanation! Appreciate that you took the
| time to give an example. Makes a lot more sense why we
| reach for specific tools for this.
| bryantwolf wrote:
| You'd have to update the embedding every time the data used to
| generate it changes. For example, if you had an embedding for
| user profiles and they updated their bio, you would want to
| make a new embedding.
|
| I don't expect to have to change the embeddings for each icon
| all that often, so storing them seemed like a good choice.
| However, you probably don't need to cache the embedding for
| each search query since there will be long-tail ones that don't
| change that much.
|
| The reason to use pgvector over blobs is if you want to use the
| distance functions in your queries.
| thorum wrote:
| Can embeddings be used to capture stylistic features of text,
| rather than semantic? Like writing style?
| levocardia wrote:
| Probably, but you might need something more sophisticated than
| cosine distance. For example, you might take a dataset of
| business letters, diary entries, and fiction stories and train
| some classifier on top of the embeddings of each of the three
| types of text, then run (embeddings --> your classifier) on new
| text. But at that point you might just want to ask an LLM
| directly with a prompt like - "Classify the style of the
| following text as business, personal, or fiction: $YOUR TEXT$"
| vladimirzaytsev wrote:
| You may get way more accurate results from relatively small
| models as well as logits for each class if you ask one
| question per class instead.
| vladimirzaytsev wrote:
| Likely not, embeddings are very crude. The embedding of a text is
| just an average of the "meanings" of its words.
|
| As is, embeddings lack a lot of the tricks that made transformers
| so efficient.
| aidenn0 wrote:
| Can someone give a qualitative explanation of what the vector of
| a word with 2 unrelated meanings would look like compared to the
| vector of a synonym of each of those meanings?
| base698 wrote:
| If you think about it like a point on a graph, and the vectors
| as just 2D points (x,y), then the synonyms would be close and
| the unrelated meanings would be further away.
| aidenn0 wrote:
| I'm guessing 2 dimensions isn't enough for this.
|
| Here's a concrete example: "bow" would need to be close to
| "ribbon" (as in a bow on a present) and also close to "gun"
| (as a weapon that shoots a projectile), but "ribbon" and
| "gun" would seem to need be far from each other. How does
| something like word2vec resolve this? Any transitive
| relationship would seem to fall afoul of this.
| base698 wrote:
| Yes, only more sophisticated embeddings can capture that
| and it takes 300+ dimensions.
| clementmas wrote:
| Embeddings are indeed a good starting point. Next step is
| choosing the model and the database. The comments here have been
| taken over by database companies so I'm skeptical about the
| opinions. I wish MySQL had a cosine search feature built in
| bootsmann wrote:
| pg_vector has you covered
| mrkeen wrote:
|     Given not because they're sufficiently advanced technology
|     indistinguishable from magic, but the opposite. Unlike LLMs,
|     working with embeddings feels like regular deterministic code.
|
| Under the "Creating embeddings" heading, I was hoping for a bit
| more than:
|
|     They're a bit of a black box
|
|     Next, we chose an embedding model. OpenAI's embedding models
|     will probably work just fine.
| mehulashah wrote:
| I think he is saying: embeddings are deterministic, so they are
| more predictable in production.
|
| They're still magic, with little explainability or adaptability
| when they don't work.
| benreesman wrote:
| Without getting into any big debates about whether or not RAG is
| medium-term interesting or whatever, you can 'pip install
| sentence-transformers faiss' and just immediately start having
| fun. I recommend using straightforward cosine similarity to just
| crush the NYT's recommender as a fun project for two reasons:
| there's an API and plenty of corpus, and it's like, whoa, that's
| better than the New York Times.
|
| He's trying to sell a SaaS product (Pinecone), but he's doing it
| the right way: it's ok to be an influencer if you know what
| you're talking about.
|
| James Briggs has great stuff on this:
| https://youtube.com/@jamesbriggs
___________________________________________________________________
(page generated 2024-04-17 23:00 UTC)