[HN Gopher] The Illustrated Word2Vec (2019)
___________________________________________________________________
The Illustrated Word2Vec (2019)
Author : wcedmisten
Score : 89 points
Date : 2024-04-18 13:08 UTC (1 day ago)
(HTM) web link (jalammar.github.io)
(TXT) w3m dump (jalammar.github.io)
| russfink wrote:
| "Embedding" --> representation(?)
|
| I do not think that word means what *I* think it means.
| epistasis wrote:
| That is essentially correct. You take an object and "embed" it
| in a high-dimensional vector space to represent it.
|
| For a deep dive, I highly recommend Vicki Boykis's free
| materials:
|
| https://vickiboykis.com/what_are_embeddings/
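To make the parent comment concrete, here is a minimal sketch of an
embedding lookup table in Python. The 4-dimensional vectors are
invented for illustration; real trained embeddings have hundreds of
dimensions, as noted downthread.

```python
import numpy as np

# Toy embedding table: each word is "embedded" as a point in a
# 4-dimensional vector space. Values are made up, not trained.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "apple": np.array([0.1, 0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In this toy space, "king" sits closer to "queen" than to "apple".
print(cosine(embeddings["king"], embeddings["queen"]))
print(cosine(embeddings["king"], embeddings["apple"]))
```

The point is only that "embedding" means representing an object as a
vector so that geometric distance can stand in for similarity.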
| coreyp_1 wrote:
| Quite frankly, it's stuff like this that makes me love the HN
| community.
|
| Thank you for the additional resource!
| mercurybee wrote:
| It's more common to refer to embeddings as low-dimensional.
| epistasis wrote:
| Can you give an example of that?
|
| I've rarely seen embeddings with fewer than hundreds of
| dimensions.
|
| UMAP/t-SNE are dimensionality reduction techniques that
| could _maybe_ be considered embeddings, but I haven't
| encountered that usage in anything related to word2vec,
| LLMs, or much of the current AI fashion.
| DISCURSIVE wrote:
| Yeah, we usually just say vector embeddings are the numerical
| representation of a piece of unstructured data. This
| glossary page puts it together quite nicely.
| https://zilliz.com/glossary/vector-embeddings
| dang wrote:
| Discussed at the time:
|
| _The Illustrated Word2vec_ -
| https://news.ycombinator.com/item?id=19498356 - March 2019 (37
| comments)
| VHRanger wrote:
| This is a great guide.
|
| Also - despite the fact that language model embeddings [1]
| are currently all the rage, good old embedding models are
| more than good enough for most tasks.
|
| With just a bit of tuning, they're generally as good at many
| sentence embedding tasks [2], and with good libraries [3] you're
| getting something like 400k sentence/sec on laptop CPU versus
| ~4k-15k sentences/sec on a v100 for LM embeddings.
|
| When you should use language model embeddings:
|
| - Multilingual tasks. While some embedding models are
| multilingually aligned (e.g. MUSE [4]), you still need to
| route the sentence to the correct embedding model file (you
| need something like langdetect). It's also cumbersome, with
| one 400MB file per language.
|
| Many LM embedding models, by contrast, are multilingually
| aligned out of the box.
|
| - Tasks that are very context specific or require fine-tuning.
| For instance, if you're making a RAG system for medical
| documents, the embedding space is best when it creates larger
| deviations for the difference between seemingly-related medical
| words.
|
| This means models with more embedding dimensions, and heavily
| favors LM models over classic embedding models.
|
| 1. sbert.net
|
| 2.
| https://collaborate.princeton.edu/en/publications/a-simple-b...
|
| 3. https://github.com/oborchers/Fast_Sentence_Embeddings
|
| 4. https://github.com/facebookresearch/MUSE
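For context on what a "good old" sentence embedding looks like: the
cheapest classic recipe, which the Princeton baseline in [2] refines
with weighting, is simply averaging the word vectors of a sentence. A
toy sketch with made-up 2-dimensional vectors (real ones would be
loaded from word2vec or GloVe files):

```python
import numpy as np

# Hypothetical pretrained word vectors, invented for illustration.
word_vecs = {
    "the":   np.array([0.1, 0.0]),
    "dog":   np.array([0.9, 0.3]),
    "barks": np.array([0.7, 0.6]),
    "tree":  np.array([0.2, 0.9]),
}

def sentence_embedding(sentence):
    """Mean of word vectors: the classic cheap sentence embedding."""
    vecs = [word_vecs[w] for w in sentence.lower().split()
            if w in word_vecs]
    return np.mean(vecs, axis=0)

print(sentence_embedding("the dog barks"))
```

Because this is just a dictionary lookup plus a mean, it is trivially
fast, which is where throughput figures like 400k sentences/sec on a
CPU come from.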
| iman453 wrote:
| Could you explain what you mean by
|
| > Tasks that are very context specific or require fine-tuning.
| For instance, if you're making a RAG system for medical
| documents, the embedding space is best when it creates larger
| deviations for the difference between seemingly-related medical
| words.
|
| (sorry I'm very new to ML stuff :))
| SgtBastard wrote:
| Not the person you're replying to, but:
|
| Foundational models (GPT-4, Llama 3, etc.) effectively
| compress "some" human knowledge into their neural network
| weights so that they can generate outputs from inputs.
|
| However, they can't compress ALL human knowledge, for
| obvious time and cost reasons, but also because not all
| knowledge is publicly available (it's either personal
| information, such as your medical records, or otherwise
| proprietary).
|
| So we build Retrieval-Augmented Generation (RAG) systems,
| where we retrieve additional knowledge the model wouldn't
| know about to help answer the query.
|
| We found early on that LLMs are very effective at in-context
| learning (see one-shot and few-shot learning), so if you can
| include the right reference material and/or private
| information, the foundational models can demonstrate that
| they've "learnt" something and answer far more effectively.
|
| The challenge is how you find the right content to pass to
| the foundational model. One very effective way is vector
| search, which basically means:
|
| Pass your query to an embedding model, get a vector back.
| Then use that vector to perform a cosine-similarity search on
| all of the private data you have, that you've previously also
| generated an embedding vector for.
|
| The closest vectors are likely to be the most similar (and
| relevant) _if_ the embedding model is able to generate very
| different vectors for sources that superficially seem
| related but are actually very different.
|
| A good embedding model returns very different vectors for
| "University" and "Universe", but similar vectors for
| "University" and "College".
| tanananinena wrote:
| Classical word embeddings are static - their value doesn't
| change depending on the context they appear in.
|
| You can think of the word embedding as a weighted average of
| embeddings of words which co-occur with the initial word.
|
| So the meaning it captures is a bit blurry.
|
| Is "bark" related to a dog? Or to a tree?
|
| Well, a bit of both, really. The embedding doesn't care about
| the context of the word - once it's been trained.
|
| So if you search for related documents based on word
| embeddings of your query - it can happen that you miss the
| mark. The embeddings simply don't encode the semantics you
| need.
|
| In fact, this can happen even with contextual embeddings,
| when you look for something specific or in a specialized
| domain. With word embeddings it's just much more apparent.
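The "bark" ambiguity can be seen in a tiny sketch. The vectors are
invented, but they illustrate the point made above: a static model
assigns one vector per word, blended across all the contexts (and
senses) it was trained on.

```python
import numpy as np

# Hypothetical sense directions, invented for illustration.
dog  = np.array([1.0, 0.0])
tree = np.array([0.0, 1.0])

# A static model gives "bark" ONE vector, roughly a blend of its
# dog sense and its tree sense, regardless of the query's context.
bark = 0.5 * dog + 0.5 * tree

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Whichever sense the query actually means, "bark" scores the same.
print(cosine(bark, dog), cosine(bark, tree))
```

A contextual model would instead produce different vectors for "bark"
in "the dog's bark" versus "the tree's bark", which is why static
embeddings can miss the mark on sense-specific retrieval.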
___________________________________________________________________
(page generated 2024-04-19 23:00 UTC)