[HN Gopher] The Illustrated Word2Vec (2019)
       ___________________________________________________________________
        
       The Illustrated Word2Vec (2019)
        
       Author : wcedmisten
       Score  : 89 points
       Date   : 2024-04-18 13:08 UTC (1 day ago)
        
 (HTM) web link (jalammar.github.io)
 (TXT) w3m dump (jalammar.github.io)
        
       | russfink wrote:
       | "Embedding" --> representation(?)
       | 
       | I do not think that word means what *I* think it means.
        
         | epistasis wrote:
         | That is essentially correct. You take an object and "embed" it
         | in a high-dimensional vector space to represent it.
         | 
         | For a deep dive, I highly recommend Vicki Boykis's free
         | materials:
         | 
         | https://vickiboykis.com/what_are_embeddings/
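          | 
          | A minimal sketch of the idea (assuming numpy; the 4-d
          | vectors below are made up for illustration -- real word2vec
          | vectors typically have 100-300 dimensions):
          | 
          |   import numpy as np
          | 
          |   # Each word is "embedded" as a point in a vector space.
          |   # Toy vectors, invented purely for illustration.
          |   emb = {
          |       "king":  np.array([0.9, 0.8, 0.1, 0.3]),
          |       "queen": np.array([0.9, 0.7, 0.2, 0.9]),
          |       "apple": np.array([0.1, 0.2, 0.9, 0.4]),
          |   }
          | 
          |   def cosine(a, b):
          |       # 1.0 = same direction, 0.0 = unrelated (orthogonal)
          |       return float(a @ b / (np.linalg.norm(a) *
          |                             np.linalg.norm(b)))
          | 
          |   print(cosine(emb["king"], emb["queen"]))  # high
          |   print(cosine(emb["king"], emb["apple"]))  # lower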
        
           | coreyp_1 wrote:
           | Quite frankly, it's stuff like this that makes me love the HN
           | community.
           | 
           | Thank you for the additional resource!
        
           | mercurybee wrote:
           | It's more common to refer to embeddings as low-dimensional.
        
             | epistasis wrote:
             | Can you give an example of that?
             | 
             | I've rarely seen embeddings with fewer than hundreds of
             | dimensions.
             | 
              | UMAP/t-SNE are dimensionality reduction techniques that
              | could _maybe_ be considered embeddings, but I haven't
              | encountered that in anything related to word2vec or LLMs
              | or much of the current AI fashion.
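              | 
              | For what it's worth, the usual direction is the reverse:
              | project the hundreds of dimensions down to 2-D just for
              | plotting. A rough sketch (assuming umap-learn; the input
              | matrix here is random, standing in for real word
              | vectors):
              | 
              |   import numpy as np
              |   import umap  # pip install umap-learn
              | 
              |   # Placeholder for a real (n_words, 300) matrix of
              |   # word vectors loaded from word2vec/GloVe.
              |   vecs = np.random.rand(1000, 300)
              | 
              |   # Reduce to 2 dimensions purely for visualization.
              |   coords = umap.UMAP(n_components=2).fit_transform(vecs)
              |   print(coords.shape)  # (1000, 2)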
        
         | DISCURSIVE wrote:
          | Yeah, we usually just say vector embeddings are the numerical
          | representation of a piece of unstructured data. This glossary
          | page puts it together quite nicely.
         | https://zilliz.com/glossary/vector-embeddings
        
       | dang wrote:
       | Discussed at the time:
       | 
       |  _The Illustrated Word2vec_ -
       | https://news.ycombinator.com/item?id=19498356 - March 2019 (37
       | comments)
        
       | VHRanger wrote:
       | This is a great guide.
       | 
        | Also - even though language model embeddings [1] are currently
        | all the rage, good old embedding models are more than good
        | enough for most tasks.
        | 
        | With just a bit of tuning, they're generally as good at many
        | sentence embedding tasks [2], and with good libraries [3] you're
        | getting something like 400k sentences/sec on a laptop CPU versus
        | ~4k-15k sentences/sec on a V100 for LM embeddings.
       | 
       | When you should use language model embeddings:
       | 
        | - Multilingual tasks. While some embedding models are
        | multilingually aligned (e.g. MUSE [4]), you still need to route
        | each sentence to the correct embedding model file (you need
        | something like langdetect). It's also cumbersome, with one
        | ~400 MB file per language.
        | 
        | Many LM embedding models, by contrast, are multilingually
        | aligned out of the box.
       | 
        | - Tasks that are very context-specific or require fine-tuning.
        | For instance, if you're making a RAG system for medical
        | documents, the embedding space works best when it puts larger
        | distances between seemingly related medical terms that actually
        | mean different things.
        | 
        | This calls for models with more embedding dimensions, and
        | heavily favors LM models over classic embedding models.
       | 
       | 1. sbert.net
       | 
       | 2.
       | https://collaborate.princeton.edu/en/publications/a-simple-b...
       | 
       | 3. https://github.com/oborchers/Fast_Sentence_Embeddings
       | 
       | 4. https://github.com/facebookresearch/MUSE
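        | 
        | To make "classic embedding model" concrete, here's a rough
        | sketch of the usual baseline -- averaging static word vectors
        | into a sentence embedding (assuming gensim; the pretrained
        | model name is just one of gensim's downloadable options):
        | 
        |   import numpy as np
        |   import gensim.downloader as api
        | 
        |   # Static pretrained word vectors (downloads on first use).
        |   wv = api.load("glove-wiki-gigaword-100")
        | 
        |   def embed_sentence(sentence):
        |       # Average the vectors of known words; crude but fast.
        |       words = [w for w in sentence.lower().split() if w in wv]
        |       return np.mean([wv[w] for w in words], axis=0)
        | 
        |   a = embed_sentence("the patient went into cardiac arrest")
        |   b = embed_sentence("the patient's heart stopped beating")
        |   print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))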
        
         | iman453 wrote:
         | Could you explain what you mean by
         | 
          | > Tasks that are very context-specific or require fine-tuning.
          | For instance, if you're making a RAG system for medical
          | documents, the embedding space works best when it puts larger
          | distances between seemingly related medical terms that
          | actually mean different things.
         | 
         | (sorry I'm very new to ML stuff :))
        
           | SgtBastard wrote:
            | Not the person you're replying to, but:
           | 
            | Foundational models (GPT-4, Llama 3, etc.) effectively
            | compress "some" human knowledge into their neural network
            | weights so that they can generate outputs from inputs.
           | 
            | However, they can't compress ALL human knowledge, for
            | obvious time and cost reasons, but also because not all
            | knowledge is publicly available (it's either personal
            | information such as your medical records, or otherwise
            | proprietary).
           | 
            | So we build Retrieval Augmented Generation (RAG) systems,
            | where we retrieve additional knowledge that the model
            | wouldn't otherwise know about to help answer the query.
           | 
            | We found early on that LLMs are very effective at in-context
            | learning (see one-shot and few-shot learning), so if you can
            | include the right reference material and/or private
            | information in the prompt, the foundational models can
            | demonstrate that they've "learnt" something and answer far
            | more effectively.
           | 
            | The challenge is: how do you find the right content to pass
            | to the foundational model? One very effective way is vector
            | search, which basically means:
           | 
            | Pass your query to an embedding model and get a vector back.
            | Then use that vector to perform a cosine-similarity search
            | against all of the private data you have, for which you've
            | previously also generated embedding vectors.
           | 
            | The closest vectors are likely to be the most similar (and
            | relevant) _if_ the embedding model is able to generate very
            | different vectors for sources that cover superficially
            | related topics but are actually very different.
           | 
            | A good embedding model returns very different vectors for
            | "University" and "Universe", but similar vectors for
            | "University" and "College".
        
           | tanananinena wrote:
            | Classical word embeddings are static - a word's vector
            | doesn't change depending on the context it appears in.
           | 
           | You can think of the word embedding as a weighted average of
           | embeddings of words which co-occur with the initial word.
           | 
            | So the meaning it captures is a bit blurry.
           | 
           | Is "bark" related to a dog? Or to a tree?
           | 
           | Well, a bit of both, really. The embedding doesn't care about
           | the context of the word - once it's been trained.
           | 
           | So if you search for related documents based on word
           | embeddings of your query - it can happen that you miss the
           | mark. The embeddings simply don't encode the semantics you
           | need.
           | 
           | In fact, this can happen even with contextual embeddings,
           | when you look for something specific or in a specialized
           | domain. With word embeddings it's just much more apparent.
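            | 
            | A quick way to see the one-vector-per-word limitation
            | (assuming gensim and its downloadable GloVe vectors; the
            | exact neighbours will vary by model):
            | 
            |   import gensim.downloader as api
            | 
            |   # Static vectors: "bark" gets exactly one vector,
            |   # whatever the surrounding text was.
            |   wv = api.load("glove-wiki-gigaword-100")
            |   print(wv.most_similar("bark", topn=10))
            |   # The neighbours typically mix the dog sense and the
            |   # tree sense, because the single vector blends every
            |   # training context.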
        
       ___________________________________________________________________
       (page generated 2024-04-19 23:00 UTC)