[HN Gopher] Embeddings are a good starting point for the AI curi...
       ___________________________________________________________________
        
       Embeddings are a good starting point for the AI curious app
       developer
        
       Author : bryantwolf
       Score  : 333 points
       Date   : 2024-04-17 17:09 UTC (5 hours ago)
        
 (HTM) web link (bawolf.substack.com)
 (TXT) w3m dump (bawolf.substack.com)
        
       | dullcrisp wrote:
       | Is there any easy way to run the embedding logic locally? Maybe
       | even locally to the database? My understanding is that they're
       | hitting OpenAI's API to get the embedding for each search query
       | and then storing that in the database. I wouldn't want my search
       | function to be dependent on OpenAI if I could help it.
        
         | dvt wrote:
         | Yes, I use fastembed-rs[1] in a project I'm working on and it
         | runs flawlessly. You can store the embeddings in any boring
         | database (it's just an array of f32s at the end of the day).
         | But for fast vector math (which you need for similarity
         | search), a vector database is recommended, e.g. the pgvector[2]
         | postgres extension.
         | 
         | [1] https://github.com/Anush008/fastembed-rs
         | 
         | [2] https://github.com/pgvector/pgvector
        
           | J_Shelby_J wrote:
           | Fun timing!
           | 
           | I literally just published my first crate: candle_embed[1]
           | 
           | It uses Candle under the hood (the crate is more of a user
           | friendly wrapper) and lets you use any model on HF like the
           | new SoTA model from Snowflake[2].
           | 
            | [1] https://github.com/ShelbyJenkins/candle_embed
            | 
            | [2] https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        
         | simonw wrote:
         | There are a bunch of embedding models you can run on your own
          | machine. My LLM tool has plugins for some of those:
         | 
         | -
         | https://llm.datasette.io/en/stable/plugins/directory.html#em...
         | 
         | Here's how to use them:
         | https://simonwillison.net/2023/Sep/4/llm-embeddings/
        
         | notakash wrote:
         | If you're building an iOS app, I've had success storing vectors
         | in coredata and using a tiny coreml model that runs on device
         | for embedding and then doing cosine similarity.
        
         | ngalstyan4 wrote:
         | We provide this functionality in Lantern cloud via our Lantern
         | Extras extension:
         | <https://github.com/lanterndata/lantern_extras>
         | 
         | You can generate CLIP embeddings locally on the DB server via:
          | 
          |     SELECT abstract,
          |            introduction,
          |            figure1,
          |            clip_text(abstract) AS abstract_ai,
          |            clip_text(introduction) AS introduction_ai,
          |            clip_image(figure1) AS figure1_ai
          |     INTO papers_augmented
          |     FROM papers;
          | 
          | Then you can search for embeddings via:
          | 
          |     SELECT abstract, introduction
          |     FROM papers_augmented
          |     ORDER BY clip_text(query) <=> abstract_ai
          |     LIMIT 10;
         | 
         | The approach significantly decreases search latency and results
         | in cleaner code. As an added bonus, EXPLAIN ANALYZE can now
         | tell percentage of time spent in embedding generation vs
         | search.
         | 
         | The linked library enables embedding generation for a dozen
         | open source models and proprietary APIs (list here:
          | <https://lantern.dev/docs/develop/generate>), and adding new
          | ones is really easy.
        
           | charlieyuan wrote:
           | Lantern seems really cool! Interestingly we did try CLIP
           | (openclip) image embeddings but the results were poor for
           | 24px by 24px icons. Any ideas?
           | 
           | Charlie @ v0.app
        
             | ngalstyan4 wrote:
             | I have tried CLIP on my personal photo album collection and
             | it worked really well there - I could write detailed scene
             | descriptions of past road trips, and the photos I had in
             | mind would pop up. Probably the model is better for
             | everyday photos than for icons
        
         | jmorgan wrote:
          | Support for _some_ embedding models works in Ollama (and
          | llama.cpp - Bert models specifically):
          | 
          |     ollama pull all-minilm
          | 
          |     curl http://localhost:11434/api/embeddings -d '{
          |       "model": "all-minilm",
          |       "prompt": "Here is an article about llamas..."
          |     }'
         | 
         | Embedding models run quite well even on CPU since they are
         | smaller models. There are other implementations with a library
         | form factor like transformers.js
         | https://xenova.github.io/transformers.js/ and sentence-
         | transformers https://pypi.org/project/sentence-transformers/
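          | 
          | If you'd rather stay in Python, a minimal sketch with the
          | sentence-transformers library mentioned above (the model
          | name is just an example):
          | 
          |     from sentence_transformers import SentenceTransformer
          | 
          |     model = SentenceTransformer("all-MiniLM-L6-v2")
          |     docs = ["Here is an article about llamas...",
          |             "A post about GPUs"]
          |     vecs = model.encode(docs, normalize_embeddings=True)
          |     q = model.encode("llama farming",
          |                      normalize_embeddings=True)
          |     print(vecs @ q)  # cosine similarities (unit vectors)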
        
         | laktek wrote:
         | If you are building using Supabase stack (Postgres as DB with
         | pgVector), we just released a built-in embedding generation API
         | yesterday. This works both locally (in CPUs) and you can deploy
         | it without any modifications.
         | 
         | Check this video on building Semantic Search in Supabase:
         | https://youtu.be/w4Rr_1whU-U
         | 
         | Also, the blog on announcement with links to text versions of
         | the tutorials: https://supabase.com/blog/ai-inference-now-
         | available-in-supa...
        
           | jonplackett wrote:
           | So handy! I already got some embeddings working with supabase
           | pgvector and OpenAI and it worked great.
           | 
           | What would the cost of running this be like compared to the
           | OpenAI embedding api?
        
             | laktek wrote:
              | There are no extra costs other than what we'd normally
             | charge for Edge Function invocations (you get up to 500K in
             | the free plan and 2M in the Pro plan)
        
         | bryantwolf wrote:
         | This is a good call out. OpenAI embeddings were simple to stand
         | up, pretty good, cheap at this scale, and accessible to
         | everyone. I think that makes them a good starting point for
         | many people. That said, they're closed-source, and there are
         | open-source embeddings you can run on your infrastructure to
         | reduce external dependencies.
        
         | jonnycoder wrote:
          | The MTEB leaderboard has you covered. It's the go-to for
          | finding the leading embedding models, and I believe many of
          | them can run locally.
         | 
         | https://huggingface.co/spaces/mteb/leaderboard
        
         | xyc wrote:
         | Have just done this recently for local chat with pdf feature in
         | https://recurse.chat. (It's a macOS app that bundles a
         | llama.cpp server and local vector database)
         | 
         | Running an embedding server locally is pretty straightforward:
         | 
         | - Get llama.cpp release binary:
         | https://github.com/ggerganov/llama.cpp/releases
         | 
         | - Get a GGUF file: https://huggingface.co/CompendiumLabs/bge-
         | base-en-v1.5-gguf and https://huggingface.co/nomic-ai/nomic-
         | embed-text-v1-GGUF are some good choices to start
         | 
         | - Point the server to your embedding GGUF file (-m your.gguf)
         | and start the server (documentation: https://github.com/ggergan
         | ov/llama.cpp/blob/master/examples/...). Make sure you also add
          | --embedding to the server startup params.
         | 
          | Then use it from an HTTP client:
          | 
          |     curl http://localhost:8080/v1/embeddings \
          |       -H "Content-Type: application/json" \
          |       -H "Authorization: Bearer no-key" \
          |       -d '{
          |         "input": "hello"
          |       }'
         | 
         | You can then store the embedding vector in local vector DBs of
         | your choice. You can refer to this comprehensive comparison:
         | https://superlinked.com/vector-db-comparison.
        
       | thisiszilff wrote:
       | One straightforward way to get started is to understand embedding
       | without any AI/deep learning magic. Just pick a vocabulary of
       | words (say, some 50k words), pick a unique index between 0 and
        | 49,999 for each of the words, and then produce an embedding by
       | adding +1 to the given index for a given word each time it occurs
       | in a text. Then normalize the embedding so it adds up to one.
       | 
       | Presto -- embeddings! And you can use cosine similarity with them
       | and all that good stuff and the results aren't totally terrible.
       | 
       | The rest of "embeddings" builds on top of this basic strategy
       | (smaller vectors, filtering out words/tokens that occur
       | frequently enough that they don't signify similarity, handling
       | synonyms or words that are related to one another, etc. etc.).
       | But stripping out the deep learning bits really does make it
       | easier to understand.
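        | 
        | A minimal sketch of that idea in Python (toy vocabulary; I
        | normalize to unit length rather than to a sum of one so the
        | dot product is directly the cosine similarity):
        | 
        |     import numpy as np
        | 
        |     vocab = {"king": 0, "queen": 1, "ruler": 2, "mouse": 3}
        | 
        |     def embed(text):
        |         v = np.zeros(len(vocab))
        |         for word in text.lower().split():
        |             if word in vocab:
        |                 v[vocab[word]] += 1
        |         return v / (np.linalg.norm(v) or 1)
        | 
        |     a = embed("the king and the queen")
        |     b = embed("queen ruler of the realm")
        |     print(a @ b)  # cosine similarity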
        
         | pstorm wrote:
         | I'm trying to understand this approach. Maybe I am expecting
         | too much out of this basic approach, but how does this create a
         | similarity between words with indices close to each other?
         | Wouldn't it just be a popularity contest - the more common
         | words have higher indices and vice versa? For instance, "king"
         | and "prince" wouldn't necessarily have similar indices, but
         | they are semantically very similar.
        
           | zachrose wrote:
           | Maybe the idea is to order your vocabulary into some kind of
           | "semantic rainbow"? Like a one-dimensional embedding?
        
           | svieira wrote:
           | You are expecting too much out of this basic approach. The
           | "simple" similarity search in word2vec (used in
           | https://semantle.com/ if you haven't seen it) is based on
           | _multiple_ embeddings like this one (it's a simple neural
           | network not a simple embedding).
        
           | sdwr wrote:
           | It doesn't even work as described for popularity - one word
           | starts at 49,999 and one starts at 0.
        
             | itronitron wrote:
             | Yeah, that is a poorly written description. I think they
             | meant that each word gets a unique index location into an
             | array, and the value at that word's index location is
             | incremented whenever the word occurs.
        
           | jncfhnb wrote:
           | King doesn't need to appear commonly with prince. It just
           | needs to appear in the same context as prince.
           | 
           | It also leaves out the old "tf idf" normalization of
           | considering how common a word is broadly (less interesting)
           | vs in that particular document. Kind of like a shittier
           | attention. Used to make a big difference.
        
           | im3w1l wrote:
           | It's a document embedding, not a word embedding.
        
         | afro88 wrote:
          | How does this enable cosine similarity usage? I don't get the
          | link between incrementing a word's index by its count in a
          | text and how this ends up with words that have similar
          | meanings having a high cosine similarity value
        
           | sell_dennis wrote:
           | You're right, that approach doesn't enable getting embeddings
           | for an individual word. But it would work for comparing
           | similarity of documents - not that well of course, but it's a
           | toy example that might feel more intuitive
        
           | twelfthnight wrote:
           | I think they are talking about bag-of-words. If you apply a
           | dimensionality reduction technique like SVD or even random
           | projection on bag-of-words, you can effectively create a
           | basic embedding. Check out latent semantic indexing / latent
           | semantic analysis.
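            | 
            | For instance, a rough latent semantic analysis sketch with
            | scikit-learn (toy documents):
            | 
            |     from sklearn.decomposition import TruncatedSVD
            |     from sklearn.feature_extraction.text import (
            |         TfidfVectorizer)
            | 
            |     docs = ["the king and the queen",
            |             "a ruler measures things",
            |             "the queen rules the kingdom"]
            |     X = TfidfVectorizer().fit_transform(docs)
            |     emb = TruncatedSVD(n_components=2).fit_transform(X)
            |     print(emb)  # one dense 2-d vector per document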
        
         | mschulkind wrote:
         | Aren't you just describing a bag-of-words model?
         | 
         | https://en.wikipedia.org/wiki/Bag-of-words_model
        
           | thisiszilff wrote:
            | Yes! And the follow-up: cosine similarity (for BoW) is a
            | super simple similarity metric based on counting up the
            | number of words the two vectors have in common.
        
         | dekhn wrote:
         | Is that really an embedding? I normally think of an embedding
         | as an approximate lower-dimensional matrix of coefficients that
         | operate on a reduced set of composite variables that map the
         | data from a nonlinear to linear space.
        
           | thisiszilff wrote:
            | You're right that what I described isn't what people commonly
            | think of as embeddings (given we've advanced well past the
            | above description), but broadly an embedding is anything (in
           | nlp at least) that maps text into a fixed length vector. When
           | you make embedding like this, the nice thing is that cosine
           | similarity has an easy to understand similarity meaning:
           | count the number of words two documents have in common
           | (subject to some normalization constant).
           | 
           | Most fancy modern embedding strategies basically start with
           | this and then proceed to build on top of it to reduce
           | dimensions, represent words as vectors in their own right,
           | pass this into some neural layer, etc.
        
         | HarHarVeryFunny wrote:
         | Those would really just be identifiers. I think the key
         | property of embeddings is that the dimensions each individually
         | mean/measure something, and therefore the dot product of two
         | embeddings (similarity of direction of the vectors) is a
         | meaningful similarity measure of the things being represented.
         | 
         | The classic example is word embeddings such as word2vec, or
         | GloVE, where due to the embeddings being meaningful in this
         | way, one can see vector relationships such as "man - woman" =
         | "king - queen".
        
           | IanCal wrote:
           | They're not, I get why you think that though.
           | 
           | They're making a vector for a text that's the term
           | frequencies in the document.
           | 
           | It's one step simpler than tfidf which is a great starting
           | point.
        
             | OmarShehata wrote:
             | Are you saying it's pure chance that operations like "man -
             | woman" = "king - queen" (and many, many other similar
             | relationships and analogies) work?
             | 
             | If not please explain this comment to those of us ignorant
             | in these matters :)
        
               | StrangeDoctor wrote:
                | It's not pure chance that the above calculus shakes out,
                | but it doesn't have to be that way. If you are embedding
                | on a word-by-word level then it can happen; if the units
                | are a little smaller or larger than words, it's not
                | immediately clear what the calculation is doing.
               | 
               | But the main difference here is you get 1 embedding for
               | the document in question, not an embedding per word like
               | word2vec. So it's something more like "document about
               | OS/2 warp" - "wiki page for ibm" + "wiki page for
               | Microsoft" = "document on windows 3.1"
        
             | HarHarVeryFunny wrote:
             | OK, sounds counter-intuitive, but I'll take your word for
             | it!
             | 
             | It seems odd since the basis of word similarity captured in
             | this type of way is that word meanings are associated with
             | local context, which doesn't seem related to these global
             | occurrence counts.
             | 
             | Perhaps it works because two words with similar occurrence
             | counts are more likely to often appear close to each other
             | than two words where one has a high count, and another a
             | small count? But this wouldn't seem to work for small
             | counts, and anyways the counts are just being added to the
             | base index rather than making similar-count words closer in
             | the embedding space.
             | 
             | Do you have any explanation for why this captures any
             | similarity in meaning?
        
               | IanCal wrote:
               | > rather than making similar-count words closer in the
               | embedding space.
               | 
               | Ah I think I see the confusion here. They are describing
               | creating an embedding of a _document_ or piece of text.
               | At the base, the embedding of a single word would just be
               | a single 1. There is absolutely no help with word
               | similarity.
               | 
               | The problem of multiple meanings isn't solved by this
               | approach at all, at least not directly.
               | 
               | Talking about the "gravity of a situation" in a political
               | piece makes the text _a bit more similar_ to physics
                | discussions about gravity. But most of the words won't
               | match as well, so your document vector is still more
               | similar to other political pieces than physics.
               | 
               | Going up the scale, here's a few basic starting points
               | that were (are?) the backbone of many production text
               | AI/ML systems.
               | 
               | 1. Bag of words. Here your vector has a 1 for words that
               | are present, and 0 for ones that aren't.
               | 
               | 2. Bag of words with a count. A little better, now we've
               | got the information that you said "gravity" fifty times
               | not once. Normalise it so text length doesn't matter and
               | everything fits into 0-1.
               | 
               | 3. TF-IDF. It's not very useful to know that you said a
               | common word a lot. Most texts do, what we care about is
               | ones that say it more than you'd expect so we take into
               | account how often the words appear in the entire corpus.
               | 
               | These don't help with words, but given how simple they
               | are they are shockingly useful. They have their stupid
               | moments, although one benefit is that it's very easy to
               | debug why they cause a problem.
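                | 
                | A minimal sketch of those three steps with
                | scikit-learn (toy documents; the library choice is
                | just an example, any BoW implementation works):
                | 
                |     from sklearn.feature_extraction.text import (
                |         CountVectorizer, TfidfVectorizer)
                | 
                |     docs = ["gravity wins again",
                |             "the gravity of the situation"]
                | 
                |     # 1. bag of words: 1 if the word is present
                |     print(CountVectorizer(binary=True)
                |           .fit_transform(docs).toarray())
                |     # 2. counts (normalize yourself if needed)
                |     print(CountVectorizer()
                |           .fit_transform(docs).toarray())
                |     # 3. tf-idf: down-weight globally common words
                |     print(TfidfVectorizer()
                |           .fit_transform(docs).toarray())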
        
           | thisiszilff wrote:
           | > I think the key property of embeddings is that the
           | dimensions each individually mean/measure something, and
           | therefore the dot product of two embeddings (similarity of
           | direction of the vectors) is a meaningful similarity measure
           | of the things being represented.
           | 
           | In this case each dimension is the presence of a word in a
           | particular text. So when you take the dot product of two
           | texts you are effectively counting the number of words the
           | two texts have in common (subject to some normalization
           | constants depending on how you normalize the embedding).
           | Cosine similarity still works for even these super naive
           | embeddings which makes it slightly easier to understand
           | before getting into any mathy stuff.
           | 
           | You are 100% right this won't give you the word embedding
           | analogies like king - man = queen or stuff like that. This
           | embedding has no concept of relationships between words.
        
             | HarHarVeryFunny wrote:
             | But that doesn't seem to be what you are describing in
             | terms of using incrementing indices and adding occurrence
             | counts.
             | 
             | If you want to create a bag of words text embedding then
             | you set the number of embedding dimensions to the
             | vocabulary size and the value of each dimension to the
             | global count of the corresponding word.
        
               | thisiszilff wrote:
               | Heh -- my explanation isn't the clearest I realize, but
               | yes, it is BoW.
               | 
               | Eg fix your vocab of 50k words (or whatever) and
               | enumerate it.
               | 
               | Then to make an embedding for some piece of text
               | 
                | 1. Initialize an all-zero vector of size 50k.
                | 
                | 2. For each word in the text, add one at the index of
                | the corresponding word (per our enumeration). If the
                | word isn't in the 50k words in your vocabulary,
                | discard it.
                | 
                | 3. (Optionally) normalize the embedding to 1 (though
                | you don't really need this and can leave it off for
                | the toy example).
        
       | m1117 wrote:
        | ah pgvector is kind of annoying to start with, you have to set it
        | up and maintain it, and then it starts falling apart when you
        | have more vectors
        
         | sdesol wrote:
         | Can you elaborate more on the falling apart? I can see pgvector
         | being intimidating for users with no experience standing up a
         | DB, but I don't see how Postgres or pgvector would fall apart.
         | Note, my reason for asking is I'm planning on going all in with
         | Postgres, so pgvector makes sense for me.
        
           | cargobuild wrote:
           | https://www.pinecone.io/blog/pinecone-vs-pgvector/ check it
           | out :)
        
         | hackernoteng wrote:
         | What is "more vectors"? How many are we talking about? We've
         | been using pgvector in production for more than 1 year without
          | any issues. We don't have a ton of vectors, less than 100,000,
         | and we filter queries by other fields so our total per cosine
         | function is probably more like max of 5000. Performance is fine
         | and no issues.
        
       | patrick-fitz wrote:
        | Nice project! I find it can be hard to think of an idea that is
        | well suited to AI. Using embeddings for search is definitely a
        | good option to start with.
        
         | ParanoidShroom wrote:
         | I made a reverse image search when I learned about embeddings.
         | It's pretty fun to work with images
         | https://medium.com/@christophe.smet1/finding-dirty-xtc-with-...
        
       | LunaSea wrote:
       | Does anyone have examples of word (ngram) disambiguation when
       | doing Approximate Nearest Neighbour (ANN) on word vector
       | embeddings?
        
       | Imnimo wrote:
       | One of the challenges here is handling homonyms. If I search in
       | the app for "king", most of the top ten results are "ruler" icons
       | - showing a measuring stick. Rodent returns mostly computer mice,
       | etc.
       | 
       | https://www.v0.app/search?q=king
       | 
       | https://www.v0.app/search?q=rodent
       | 
       | This isn't a criticism of the app - I'd rather get a few funny
       | mismatches in exchange for being able to find related icons. But
       | it's an interesting puzzle to think about.
        
         | charlieyuan wrote:
         | Good call out! We think of this as a two part problem.
         | 
          | 1. The intent of the user. Is it a description of the look
          | of the icon or the utility of the icon?
          | 
          | 2. How best to rank the results, which is a combination of
          | intent, CTR of past search queries, bootstrapping popularity
          | via usage on open source projects, etc.
         | 
         | - Charlie of v0.app
        
         | itronitron wrote:
         | >> If I search in the app for "king", most of the top ten
         | results are "ruler" icons
         | 
         | I believe that's the measure of a man.
        
         | bryantwolf wrote:
         | Yeah, these can be cute, but they're not ideal. I think the
         | user feedback mechanism could help naturally align this over
         | time, but it would also be gameable. It's all interesting stuff
        
           | jonnycoder wrote:
            | As the OP, you can do both semantic search (embedding) and
            | keyword search. Some RAG techniques call out using both for
            | better results. Nice product by the way!
        
         | dceddia wrote:
         | I was reading this article and thinking about things like, in
         | the case of doing transcription, if you heard the spoken word
         | "sign" in isolation you couldn't be sure whether it meant road
         | sign, spiritual sign, +/- sign, or even the sine function. This
         | seems like a similar problem where you pretty much require
         | context to make a good guess, otherwise the best it could do is
         | go off of how many times the word appears in the dataset right?
         | Is there something smarter it could do?
        
         | joshspankit wrote:
         | This is imo the worst part of embedding search.
         | 
         | Somehow Amazon continues to be the leader in muddy results
         | which is a sign that it's a huge problem domain and not easily
         | fixable even if you have massive resources.
        
       | EcommerceFlow wrote:
       | Embeddings have a special place in my heart since I learned about
       | them 2 years ago. Working in SEO, it felt like everything finally
       | "clicked" and I understood, on a lower level, how Google search
       | actually works, how they're able to show specific content
       | snippets directly on the search results page, etc. I never found
       | any "SEO Guru" discussing this at all back then (maybe even
       | now?), even though this was complete gold. It explains "topical
       | authority" and gave you clues on how Google itself understands
       | it.
        
       | minimaxir wrote:
       | One of my biggest annoyances with the modern AI tooling hype is
       | that you need to use a vector store for just working with
       | embeddings. You don't.
       | 
       | The reason vector stores are important for production use-cases
       | are mostly latency-related for larger sets of data (100k+
       | records), but if you're working on a toy project just learning
       | how to use embeddings, you can compute cosine distance with a
       | couple lines of numpy by doing a dot product of a normalized
        | query vector with a matrix of normalized records.
       | 
       | Best of all, it gives you a reason to use Python's @ operator,
       | which with numpy matrices does a dot product.
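        | 
        | Something like this, with made-up shapes and random data in
        | place of real embeddings:
        | 
        |     import numpy as np
        | 
        |     records = np.random.rand(1000, 384)
        |     records /= np.linalg.norm(records, axis=1, keepdims=True)
        |     query = np.random.rand(384)
        |     query /= np.linalg.norm(query)
        | 
        |     scores = records @ query        # cosine similarities
        |     top10 = np.argsort(-scores)[:10]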
        
         | twelfthnight wrote:
         | Even in production my guess is most teams would be better off
         | just rolling their own embedding model (huggingface) + caching
         | (redis/rocksdb) + FAISS (nearest neighbor) and be good to go. I
         | suppose there is some expertise needed, but working with a
         | vector database vendor has major drawbacks too.
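          | 
          | A rough sketch of the embed + FAISS part (random vectors
          | stand in for real embeddings; a flat index here, though
          | FAISS also ships HNSW/IVF variants):
          | 
          |     import faiss
          |     import numpy as np
          | 
          |     dim = 384
          |     vecs = np.random.rand(10000, dim).astype("float32")
          |     faiss.normalize_L2(vecs)  # unit length, so IP == cosine
          | 
          |     index = faiss.IndexFlatIP(dim)
          |     index.add(vecs)
          | 
          |     q = np.random.rand(1, dim).astype("float32")
          |     faiss.normalize_L2(q)
          |     scores, ids = index.search(q, 10)  # top-10 neighbors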
        
           | hackernoteng wrote:
            | Using Postgres with pgvector is trivial and cheap. It's also
           | available on AWS RDS.
        
             | jonplackett wrote:
             | Also on supabase!
        
           | danielbln wrote:
           | Or you just shove it into Postgres + pg_vector and just use
           | the DBMS you already use anyway.
        
         | christiangenco wrote:
         | Yup. I was just playing around with this in Javascript
         | yesterday and with ChatGPT's help it was surprisingly simple to
         | go from text => embedding (via. `openai.embeddings.create`) and
         | then to compare the embedding similarity with the cosine
         | distance (which ChatGPT wrote for me):
         | https://gist.github.com/christiangenco/3e23925885e3127f2c177...
         | 
         | Seems like the next standard feature in every app is going to
         | be natural language search powered by embeddings.
        
           | minimaxir wrote:
           | For posterity, OpenAI embeddings come pre-normalized so you
           | can immediately dot-product.
           | 
            | Most embedding providers do normalization by default, and
           | SentenceTransformers has a normalize_embeddings parameter
           | which does that. (it's a wrapper around PyTorch's
           | F.normalize)
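            | 
            | For example, with the OpenAI Python client (model name is
            | just the current small model; assumes OPENAI_API_KEY is
            | set in the environment):
            | 
            |     import numpy as np
            |     from openai import OpenAI
            | 
            |     client = OpenAI()
            |     resp = client.embeddings.create(
            |         model="text-embedding-3-small",
            |         input=["a red king icon", "a wooden ruler icon"])
            |     a, b = (np.array(d.embedding) for d in resp.data)
            |     print(a @ b)  # already unit length, so this is cosine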
        
         | itronitron wrote:
         | Does anyone know the provenance for when vectors started to be
         | called embeddings?
        
           | minimaxir wrote:
           | I think it was due to GloVe embeddings back then: I don't
           | recall them ever being called GloVe vectors, although the
           | "Ve" does stand for vector so it could have been RAS
           | syndrome.
        
             | itronitron wrote:
             | >> https://nlp.stanford.edu/projects/glove/
             | 
             | A quick scan of the project website yields zero uses of
             | 'embedding' and 23 of 'vector'
        
               | minimaxir wrote:
               | It's how I remember it when I was working with them back
               | in the day (word embeddings): I could be wrong.
        
         | bryantwolf wrote:
         | As an individual, I love the idea of pushing to simplify even
         | further to understand these core concepts. For the ecosystem, I
         | like that vector stores make these features accessible to
         | environments outside of Python.
        
           | simonw wrote:
           | If you ask ChatGPT to give you a cosine similarity function
            | that works against two arrays of floating point numbers in any
           | programming language you'll get the code that you need.
           | 
            | Here's one in JavaScript (my prompt was "cosine similarity
            | function for two javascript arrays of floating point
            | numbers"):
            | 
            |     function cosineSimilarity(vecA, vecB) {
            |       if (vecA.length !== vecB.length) {
            |         throw "Vectors do not have the same dimensions";
            |       }
            |       let dotProduct = 0.0;
            |       let normA = 0.0;
            |       let normB = 0.0;
            |       for (let i = 0; i < vecA.length; i++) {
            |         dotProduct += vecA[i] * vecB[i];
            |         normA += vecA[i] ** 2;
            |         normB += vecB[i] ** 2;
            |       }
            |       if (normA === 0 || normB === 0) {
            |         throw "One of the vectors is zero, cannot compute similarity";
            |       }
            |       return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
            |     }
           | 
           | Vector stores really aren't necessary if you're dealing with
           | less than a few hundred thousand vectors - load them up in a
           | bunch of in-memory arrays and run a function like that
           | against them using brute-force.
        
             | bryantwolf wrote:
             | I love it!
        
         | hereonout2 wrote:
         | 100k records is still pretty small!
         | 
          | It feels a bit like the hype that happened with "big data".
         | People ended up creating spark clusters to query a few million
         | records. Or using Hadoop for a dataset you could process with
         | awk.
         | 
         | Professionally I've only ever worked with dataset sizes in the
         | region of low millions and have never needed specialist tooling
         | to cope.
         | 
         | I assume these tools do serve a purpose but perhaps one that
         | only kicks in at a scale approaching billions.
        
           | smahs wrote:
           | This sentiment is pretty common I guess. Outside of a niche,
           | the massive scale for which a vast majority of the data tech
           | was designed doesn't exist and KISS wins outright. Though I
           | guess that's evolution, we want to test the limits in pursuit
           | of grandeur before mastering the utility (ex. pyramids).
        
             | jonnycoder wrote:
             | KISS doesn't get me employed though. I narrowly missed
             | being the chosen candidate for a State job which called for
             | Apache Spark experience. I missed two questions relating to
             | Spark and "what is a parquet file?" but otherwise did great
             | on the remaining behavioral questions (the hiring manager
             | gave me feedback after requesting it). Too bad they did not
              | have a question about processing data using command line
             | tools.
        
           | mritchie712 wrote:
           | yeah, glad the hype around big data is dead. Not a lot of
           | solid numbers in here, but this post covers it well[0].
           | 
           | We have duckdb embedded in our product[1] and it works
            | perfectly well for billions of rows of data without the
           | hadoop overhead.
           | 
           | 0 - https://motherduck.com/blog/big-data-is-dead/
           | 
           | 1 - https://www.definite.app/
        
           | hot_gril wrote:
           | I've been in the "mid-sized" area a lot where Numpy etc
           | cannot handle it, so I had to go to Postgres or more
           | specialized tooling like Spark. But I always started with the
           | simple thing and only moved up if it didn't suffice.
           | 
           | Similarly, I read how Postgres won't scale for a backend
           | application and I should use Citus, Spanner, or some NoSQL
           | thing. But that day has not yet arrived.
        
             | jerrygenser wrote:
              | Numpy might not be able to handle a full O(n^2) comparison
              | of vectors, but you can use a lib with HNSW and it can have
              | great performance on medium (and large) datasets.
        
             | cornel_io wrote:
             | Right on: I've used a single Postgres database on AWS to
             | handle 1M+ concurrent users. If you're Google, sure, not
             | gonna cut it, but for most people these things scale
              | vertically a _lot_ further than you'd expect (especially
             | if, like me, you grew up in the pre-SSD days and couldn't
             | get hundreds of gigs of RAM on a cloud instance).
             | 
             | Even when you do pass that point, you can often shard to
             | achieve horizontal scalability to at least some degree,
             | since the real heavy lifting is usually easy to break out
             | on a per-user basis. Some apps won't permit that (if you've
             | got cross-user joins then it's going to be a bit of a
             | headache), but at that point you've at least earned the
             | right to start building up a more complex stack and
             | complicating your queries to let things grow horizontally.
             | 
             | Horizontal scaling _is_ a huge headache, any way you cut
             | it, and TBH going with something like Spanner is just as
             | much of a headache because you have to understand its
             | limitations extremely well if you want _it_ to scale. It
              | doesn't just magically make all your SQL infinitely
             | scalable, things that are hard to shard are typically also
             | hard to make fast on Spanner. What it's really good at is
             | taking an app with huge traffic where a) all the hot
              | queries _would_ be easy to shard, but b) you don't want
             | the complexity of adding sharding logic (+re-sharding,
             | migration, failure handling, etc), and c) the tough to
             | shard queries are low frequency enough that you don't
             | really care if they're slow (I guess also d) you don't care
             | that it's hella expensive compared to a normal Postgres or
             | MySQL box). You still need to understand a lot more than
             | when using a normal DB, but it can add a lot of value in
             | those cases.
        
         | leobg wrote:
         | hnswlib, usearch. Both handle tens of millions of vectors
         | easily. The latter even without holding them in RAM.
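          | 
          | A small sketch with hnswlib (random vectors as stand-ins
          | for real embeddings):
          | 
          |     import hnswlib
          |     import numpy as np
          | 
          |     dim = 384
          |     vecs = np.random.rand(100_000, dim).astype("float32")
          | 
          |     index = hnswlib.Index(space="cosine", dim=dim)
          |     index.init_index(max_elements=len(vecs),
          |                      ef_construction=200, M=16)
          |     index.add_items(vecs, np.arange(len(vecs)))
          | 
          |     labels, dists = index.knn_query(vecs[:1], k=10)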
        
         | ertgbnm wrote:
         | When I'm messing around, I normally have everything in a Pandas
         | DataFrame already so I just add embeddings as a column and
         | calculate cosine similarity on the fly. Even with a hundred
         | thousand rows, it's fast enough to calculate before I can even
         | move my eyes down on the screen to read the output.
         | 
         | I regret ever messing around with Pinecone for my tiny and
         | infrequently used set ups.
        
           | m1117 wrote:
           | Actually, I had a pretty good experience with Pinecone.
        
       | cargobuild wrote:
       | seeing comments about using pgvector... at pinecone, we spent
        | some time understanding its limitations and pain points.
       | pinecone eliminates these pain points entirely and makes things
       | simple at any scale. check it out:
       | https://www.pinecone.io/blog/pinecone-vs-pgvector/
        
         | gregorymichael wrote:
         | Has Pinecone gotten any cheaper? Last time I tried it was
         | $75/month for the starter plan / single vector store.
        
           | cargobuild wrote:
           | yep. pinecone serverless has reduced costs significantly for
           | many workloads.
        
       | dvaun wrote:
       | I'd love to build a suite of local tooling to play around with
       | different embedding approaches.
       | 
       | I've had great results using SentenceTransformers for quick one-
       | off tasks at work for unique data asks.
       | 
       | I'm curious about clustering within the embeddings and seeing
       | what different approaches can yield and what applications they
       | work best for.
        
         | PaulHoule wrote:
          | If I have 50,000 historical articles and 5,000 new articles,
          | apply SBERT, and then run k-means with N=20, I get great
          | results: articles about Ukraine, sports, chemistry, and
          | nerdcore from Lobsters end up in distinct clusters.
         | 
          | I've used DBSCAN for finding duplicate content; this is less
          | successful. With the parameters I am using it is rare for there
          | to be false positives, but there aren't that many true
          | positives. I'm sure I could do better if I tuned it up, but
          | I'm not sure if there is an operating point I'd really like.
        
       | kaycebasques wrote:
       | I have been saying similar things to my fellow technical writers
       | ever since the ChatGPT explosion. We now have a tool that makes
       | semantic search on arbitrary, diverse input much easier. Improved
       | semantic search could make a lot of common technical writing
       | workflows much more efficient. E.g. speeding up the mandatory
       | research that you must do before it's even possible to write an
       | effective doc.
        
       | gchadwick wrote:
       | For an article extolling the benefits of embeddings for
       | developers looking to dip their toe into the waters of AI it's
       | odd they don't actually have an intro to embeddings or to vector
       | databases. They just assume the reader already knows these
        | concepts and dive right in to how they use them.
       | 
       | Sure many do know these concepts already but they're probably not
       | the people wondering about a 'good starting point for the AI
       | curious app developer'.
        
         | charlieyuan wrote:
         | Apologies!
         | 
         | Here's a good primer on embeddings from openai:
         | https://platform.openai.com/docs/guides/embeddings
        
         | simonw wrote:
         | I published this pretty comprehensive intro to embeddings last
         | year: https://simonwillison.net/2023/Oct/23/embeddings/
        
         | gk1 wrote:
         | To add to the other recommendations, here's a primer on vector
         | DB's: https://www.pinecone.io/learn/vector-database/
        
       | hot_gril wrote:
        | This is where I got started too. GloVe embeddings stored in
       | Postgres.
       | 
       | Pgvector is nice, and it's cool seeing quick tutorials using it.
       | Back then, we only had cube, which didn't do cosine similarity
       | indexing out of the box (you had to normalize vectors and use
       | euclidean indexes) and only supported up to 100 dimensions. And
       | there were maybe other inconveniences I don't remember, cause
       | front page AI tutorials weren't using it.
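        | 
        | (The normalization trick works because, for unit vectors a and
        | b, ||a - b||^2 = 2 - 2(a . b), so ranking by Euclidean distance
        | gives the same order as ranking by cosine similarity.)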
        
         | isoprophlex wrote:
         | PGvector is very nice indeed. And you get to store your vectors
          | close to the rest of your data. I've yet to understand the
         | unique use case for dedicated vector dbs. It seems so annoying,
         | having to query your vectors in a separate database without
         | being able to easily join/filter based on the rest of your
         | tables.
         | 
         | I stored ~6 million hacker news posts, their metadata, and the
          | vector embeddings in a cheap $20/month VM running pgvector.
         | Querying is very fast. Maybe there's some penalty to pay when
         | you get to the billion+ row counts, but I'm happy so far.
        
           | hot_gril wrote:
           | You can also store vectors or matrices in a split-up fashion
           | as separate rows in a table, which is particularly useful if
           | they're sparse. I've handled huge sparse matrix expressions
           | (add, subtract, multiply, transpose) that way, cause numpy
           | couldn't deal with them.
        
       | crowcroft wrote:
       | My smooth brain might not understand this properly, but the idea
       | is we generate embeddings, store them, then use retrieval each
       | time we want to use them.
       | 
       | For simple things we might not need to worry about storing much,
       | we can generate the embeddings and just cache them or send them
       | straight to retrieval as an array or something...
       | 
        | The storing of embeddings seems like the hard part. Do I need a
        | special database or PG extension? Is there any reason I can't
        | store them as blobs in SQLite if I don't have THAT much data,
        | and I don't care too much about speed? Do generated embeddings
        | ever 'expire'?
        
         | H1Supreme wrote:
         | Vector databases are used to store embeddings.
        
           | crowcroft wrote:
           | But why is that? I'm sure it's the 'best' way to do things,
           | but it also means more infrastructure which for simple apps
           | isn't worth the hassle.
           | 
           | I should use redis for queues but often I'll just use a table
           | in a SQLite database. For small scale projects I find it
           | works fine, I'm wondering what an equivalent simple option
           | for embeddings would be.
        
         | laborcontract wrote:
         | A KV store is both good enough and highly performant. I use
         | Redis for storing embeddings and expire them after a while.
          | Unless you have a highly specialized use case, it's not
          | economical to persistently store chunk embeddings.
         | 
          | Redis also has vector search capability. However, the most
          | popular answer you'll get here is to use Postgres (pgvector).
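          | 
          | A minimal sketch of that caching approach with redis-py (key
          | name and TTL are made up):
          | 
          |     import numpy as np
          |     import redis
          | 
          |     r = redis.Redis()  # assumes a local Redis instance
          |     vec = np.random.rand(384).astype("float32")
          | 
          |     # cache the raw bytes and expire them after a day
          |     r.set("emb:chunk:42", vec.tobytes(), ex=86400)
          | 
          |     cached = r.get("emb:chunk:42")
          |     if cached is not None:
          |         vec = np.frombuffer(cached, dtype="float32")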
        
           | crowcroft wrote:
           | Redis sounds like a good option. I like that it's not more
           | infrastructure, I already have redis setup for my app so I'm
           | not adding more to the stack.
        
         | alexgarcia-xyz wrote:
         | Re storing vectors in BLOB columns: ya, if it's not a lot of
         | data and it's fast enough for you, then there's no problem
          | doing it like that. I'd even just store them in JSON/npy files
         | first and see how long you can get away with it. Once that gets
         | too slow, then try SQLite/redis/valkey, and when that gets too
         | slow, look into pgvector or other vector database solutions.
         | 
          | For SQLite specifically, very large BLOB columns might affect
         | query performance, especially for large embeddings. For
         | example, a 1536-dimension vector from OpenAI would take 1536 *
         | 4 = 6144 bytes of space, if stored in a compact BLOB format.
         | That's larger than SQLite default page size of 4096, so that
         | extra data will overflow into overflow pages. Which again,
         | isn't too big of a deal, but if the original table had small
         | values before, then table scans can be slower.
         | 
         | One solution is to move it to a separate table, ex on an
         | original `users` table, you can make a new `CREATE TABLE
         | users_embeddings(user_id, embedding)` table and just LEFT JOIN
         | that when you need it. Or you can use new techniques like
         | Matryoshka embeddings[0] or scalar/binary quantization[1] to
         | reduce the size of individual vectors, at the cost of lower
         | accuracy. Or you can bump the page size of your SQLite database
         | with `PRAGMA page_size=8192`.
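          | 
          | For what it's worth, that separate-table approach is only a
          | few lines of Python with the standard sqlite3 module (the
          | 1536-dim random vector stands in for a real embedding):
          | 
          |     import sqlite3
          |     import numpy as np
          | 
          |     db = sqlite3.connect("app.db")
          |     db.execute("CREATE TABLE IF NOT EXISTS users_embeddings"
          |                "(user_id INTEGER PRIMARY KEY, embedding BLOB)")
          | 
          |     vec = np.random.rand(1536).astype("float32")
          |     sql = "INSERT OR REPLACE INTO users_embeddings VALUES (?, ?)"
          |     db.execute(sql, (42, vec.tobytes()))
          |     db.commit()
          | 
          |     row = db.execute("SELECT embedding FROM users_embeddings"
          |                      " WHERE user_id = ?", (42,)).fetchone()
          |     restored = np.frombuffer(row[0], dtype="float32")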
         | 
         | I also have a SQLite extension for vector search[2], but
         | there's a number of usability/ergonomic issues with it. I'm
         | making a new one that I hope to release soon, which will
         | hopefully be a great middle ground between "store vectors in a
         | .npy files" and "use pgvector".
         | 
         | Re "do embeddings ever expire": nope! As long as you have
         | access to the same model, the same text input should give the
         | same embedding output. It's not like LLMs that have
         | temperatures/meta prompts/a million other dials that make
         | outputs non-deterministic, most embedding models should be
         | deterministic and should work forever.
         | 
         | [0] https://huggingface.co/blog/matryoshka
         | 
         | [1] https://huggingface.co/blog/embedding-quantization
         | 
         | [2] https://github.com/asg017/sqlite-vss
        
           | crowcroft wrote:
           | This is very useful appreciate the insight. Storing
           | embeddings in a table and joining when needed feels like a
           | really nice solution for what I'm trying to do.
        
         | kmeisthax wrote:
         | Yes, you can shove the embeddings in a BLOB, but then you can't
         | do the kinds of query operations you expect to be able to do
         | with embeddings.
        
           | crowcroft wrote:
            | Right, like you could use it sort of like a cache and send
            | the blobs to OpenAI to use their similarity API, but you
            | couldn't really use SQL to do cosine similarity operations?
           | 
           | My understanding of what's going on at a technical level
           | might be a bit limited.
        
             | kmeisthax wrote:
             | Yes.
             | 
             | Although if you really wanted to, and normalized your data
             | like a good little Edgar F. Codd devotee, you could write
             | something like this:
             | 
              |     SELECT SUM(v.dot) /
              |            (SQRT(SUM(v.v1 * v.v1)) * SQRT(SUM(v.v2 * v.v2)))
              |     FROM (
              |       SELECT v1.dimension AS dim,
              |              v1.value AS v1,
              |              v2.value AS v2,
              |              v1.value * v2.value AS dot
              |       FROM vectors AS v1
              |       INNER JOIN vectors AS v2
              |         ON v1.dimension = v2.dimension
              |       WHERE v1.vector_id = ? AND v2.vector_id = ?
              |     ) AS v;
             | 
             | This assumes one table called "vectors" with columns
             | vector_id, dimension, and value; vector_id and dimension
             | being primary. The inner query grabs two vectors as
             | separate columns with some self-join trickery, computes the
             | product of each component, and then the outer query
             | computes aggregate functions on the inner query to do the
             | actual cosine similarity.
             | 
             | No I have not tested this on an actual database engine, I
             | probably screwed up the SQL somehow. And obviously it's
             | easier to just have a database (or Postgres extension) that
             | recognizes vector data as a distinct data type and gives
             | you a dedicated cosine-similarity function.
        
               | crowcroft wrote:
               | Thanks for the explanation! Appreciate that you took the
               | time to give an example. Makes a lot more sense why we
               | reach for specific tools for this.
        
         | bryantwolf wrote:
         | You'd have to update the embedding every time the data used to
         | generate it changes. For example, if you had an embedding for
         | user profiles and they updated their bio, you would want to
         | make a new embedding.
         | 
         | I don't expect to have to change the embeddings for each icon
         | all that often, so storing them seemed like a good choice.
         | However, you probably don't need to cache the embedding for
         | each search query since there will be long-tail ones that don't
         | change that much.
         | 
         | The reason to use pgvector over blobs is if you want to use the
         | distance functions in your queries.
        
       | thorum wrote:
       | Can embeddings be used to capture stylistic features of text,
       | rather than semantic? Like writing style?
        
         | levocardia wrote:
         | Probably, but you might need something more sophisticated than
         | cosine distance. For example, you might take a dataset of
         | business letters, diary entries, and fiction stories and train
         | some classifier on top of the embeddings of each of the three
         | types of text, then run (embeddings --> your classifier) on new
         | text. But at that point you might just want to ask an LLM
         | directly with a prompt like - "Classify the style of the
         | following text as business, personal, or fiction: $YOUR TEXT$"
        
           | vladimirzaytsev wrote:
           | You may get way more accurate results from relatively small
           | models as well as logits for each class if you ask one
           | question per class instead.
        
         | vladimirzaytsev wrote:
          | Likely not; embeddings are very crude. The embedding of a text
          | is just an average of the "meanings" of its words.
          | 
          | As is, embeddings lack a lot of the tricks that made
          | transformers so efficient.
        
       | aidenn0 wrote:
       | Can someone give a qualitative explanation of what the vector of
       | a word with 2 unrelated meanings would look like compared to the
       | vector of a synonym of each of those meanings?
        
         | base698 wrote:
         | If you think about it like a point on a graph, and the vectors
         | as just 2D points (x,y), then the synonyms would be close and
         | the unrelated meanings would be further away.
        
           | aidenn0 wrote:
            | I'm guessing 2 dimensions isn't enough for this.
           | 
           | Here's a concrete example: "bow" would need to be close to
           | "ribbon" (as in a bow on a present) and also close to "gun"
           | (as a weapon that shoots a projectile), but "ribbon" and
           | "gun" would seem to need be far from each other. How does
           | something like word2vec resolve this? Any transitive
           | relationship would seem to fall afoul of this.
        
             | base698 wrote:
              | Yes, only more sophisticated embeddings can capture that,
              | and they use 300+ dimensions.
        
       | clementmas wrote:
       | Embeddings are indeed a good starting point. Next step is
       | choosing the model and the database. The comments here have been
       | taken over by database companies so I'm skeptical about the
       | opinions. I wish MySQL had a cosine search feature built in
        
         | bootsmann wrote:
         | pg_vector has you covered
        
       | mrkeen wrote:
        | Given
        | 
        |     not because they're sufficiently advanced technology
        |     indistinguishable from magic, but the opposite. Unlike
        |     LLMs, working with embeddings feels like regular
        |     deterministic code.
        | 
        | I was hoping the "Creating embeddings" section would offer a
        | bit more than:
        | 
        |     They're a bit of a black box
        | 
        |     Next, we chose an embedding model. OpenAI's embedding
        |     models will probably work just fine.
        
       | mehulashah wrote:
       | I think he is saying: embeddings are deterministic, so they are
       | more predictable in production.
       | 
        | They're still magic, with little explainability or adaptability
       | when they don't work.
        
       | benreesman wrote:
       | Without getting into any big debates about whether or not RAG is
       | medium-term interesting or whatever, you can 'pip install
       | sentence-transformers faiss' and just immediately start having
       | fun. I recommend using straightforward cosine similarity to just
       | crush the NYT's recommender as a fun project for two reasons:
       | there's an API and plenty of corpus, and it's like, whoa, that's
       | better than the New York Times.
       | 
       | He's trying to sell a SaaS product (Pinecone), but he's doing it
       | the right way: it's ok to be an influencer if you know what
        | you're talking about.
       | 
       | James Briggs has great stuff on this:
       | https://youtube.com/@jamesbriggs
        
       ___________________________________________________________________
       (page generated 2024-04-17 23:00 UTC)