[HN Gopher] Word Embeddings Explained
       ___________________________________________________________________
        
       Word Embeddings Explained
        
       Author : nalgeon
       Score  : 45 points
       Date   : 2022-01-06 17:07 UTC (5 hours ago)
        
 (HTM) web link (lena-voita.github.io)
 (TXT) w3m dump (lena-voita.github.io)
        
       | nalgeon wrote:
       | Actually, that's not a single article, but a whole book (kind
       | of):
       | 
       | 1. Word Embeddings (this article)
       | 
       | 2. Text Classification
       | 
       | https://lena-voita.github.io/nlp_course/text_classification....
       | 
       | 3. Language Modeling
       | 
       | https://lena-voita.github.io/nlp_course/language_modeling.ht...
       | 
       | 4. Seq2Seq and Attention
       | 
       | https://lena-voita.github.io/nlp_course/seq2seq_and_attentio...
       | 
       | 5. ELMo, GPT, BERT
       | 
       | https://lena-voita.github.io/nlp_course/transfer_learning.ht...
       | 
       | All by Lena Voita, who has a real talent for explaining complex
       | stuff.
        
       | thaumasiotes wrote:
       | Hmm. There was some discussion on the chatbot thread about the
       | use of models of reality in language processing, and that looks
       | like the best way to understand this idea.
       | 
       | > The easiest you can do is to represent words as one-hot
       | vectors: for the i-th word in the vocabulary, the vector has 1 on
       | the i-th dimension and 0 on the rest. In Machine Learning, this
       | is the most simple way to represent categorical features.
       | 
       | > You probably can guess why one-hot vectors are not the best way
       | to represent words. One of the problems is that for large
       | vocabularies, these vectors will be very long: vector
       | dimensionality is equal to the vocabulary size.
       | 
       | > What is really important, is that these vectors know nothing
       | about the words they represent. For example, one-hot vectors
       | "think" that cat is as close to dog as it is to table! We can say
       | that one-hot vectors do not capture meaning.
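        | 
        | (To make the quoted point concrete, here is a minimal sketch with
        | a made-up toy vocabulary; any two distinct one-hot vectors have a
        | dot product of 0, so "cat" ends up exactly as close to "dog" as
        | to "table".)
        | 
        |     import numpy as np
        | 
        |     # toy vocabulary: a word's number is the position of its set bit
        |     vocab = ["cat", "dog", "table", "wine"]
        |     word_to_index = {word: i for i, word in enumerate(vocab)}
        | 
        |     def one_hot(word):
        |         # vector of vocabulary size: 1 at the word's index, 0 elsewhere
        |         vec = np.zeros(len(vocab))
        |         vec[word_to_index[word]] = 1.0
        |         return vec
        | 
        |     # every pair of distinct words is equally (dis)similar
        |     print(one_hot("cat") @ one_hot("dog"))    # 0.0
        |     print(one_hot("cat") @ one_hot("table"))  # 0.0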
       | 
        | Well, the obvious improvement to this would be to use plain
        | numeric identifiers, rather than the implicit ones you get by
        | deriving the number from the position of the single set bit in a
        | very long bitstring. That approach is considered and rejected:
       | 
       | > we can easily understand the text "I saw a cat", but _our
       | models can not - they need vectors of features_. [my emphasis]
       | 
       | but I would suggest that what this means is that the whole
       | approach to model design is flawed. Get a model that can handle
       | opaque identifiers on their own terms.
       | 
       | However, the approach described here has some valuable things to
       | say about modeling the world, even if it pretends that what it's
       | really modeling are words.
       | 
       | The little slideshow working through an example ends up
       | undermining the theory that it means to support. The examples are
       | carefully chosen to suggest that the approach has much more
       | validity than it really does. But if we add "syrup" to the list
       | of words to which "tezguino" is being compared, we immediately
       | see that it satisfies conditions (1), (2), and (4), but not (3).
       | This puts it on an equal footing with "wine", which satisfies
       | conditions (1), (2), and (3), but not (4). The problem is that
       | tezguino, a grain alcohol, is much more similar to wine, a fruit
       | alcohol, than it is to corn syrup. We shouldn't be calling this a
       | tossup.
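        | 
        | (A quick sketch of that tossup in numbers, treating the four
        | conditions from the slideshow as binary features and comparing
        | rows with cosine similarity; the feature values follow the
        | discussion above:)
        | 
        |     import numpy as np
        | 
        |     # features: (1) liquid-like context, (2) "everybody likes ___",
        |     # (3) alcoholic context, (4) made-from-corn context
        |     rows = {
        |         "tezguino": np.array([1, 1, 1, 1]),
        |         "wine":     np.array([1, 1, 1, 0]),   # all but (4)
        |         "syrup":    np.array([1, 1, 0, 1]),   # all but (3)
        |     }
        | 
        |     def cosine(a, b):
        |         return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        | 
        |     for word in ("wine", "syrup"):
        |         print(word, cosine(rows["tezguino"], rows[word]))
        |     # both come out to ~0.87 -- wine and syrup are "equally
        |     # similar" to tezguino under this scheme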
       | 
       | So we've highlighted a couple of problems:
       | 
       | - Condition (2) is worthless. "Everybody likes ____" can be
       | applied to any noun phrase. This yields exactly zero information
       | about the semantics of the phrase. By including it as 1/4 of our
       | test battery, we're making ourselves stupider, seeing
       | similarities and dissimilarities that aren't there.
       | 
       | - Condition (3), which tells us that something is alcoholic, is
       | much more valuable than condition (1), which tells us that
       | something is a liquid, which is in turn more valuable than
       | condition (4), which tells us that something is derived from
       | corn. It makes no sense to weight these equally.
       | 
       | This is the problem in the distributional hypothesis as
       | described: similarity of rows in the table doesn't lead to any
       | useful inferences, because which rows are similar to which other
       | rows is subject to too much discretion on the part of the person
       | who assembles the model. The conditions being checked, considered
       | as a set, _are_ our model of reality, and we need to be careful
       | what that looks like. In this example, we have a very badly
       | designed model (which more or less defines liquids, alcohol, and
       | corn products, plus making a semantically-irrelevant grammatical
       | distinction) which has to be saved by a very carefully chosen set
       | of words being modeled. (Also, the example claims that it is more
       | incorrect to say  "everyone likes motor oil" than it is to say
       | "everyone likes wine". This is false.)
       | 
       | The post goes on to discuss how we can derive a model of reality
       | automatically from a textual corpus, and I think this approach is
       | a big problem in language processing today. You need the model of
       | reality to exist independently, and then you can hook up a model
       | of language to the model of reality. Without an independent model
       | of reality, you'll always be limited to language models that can
       | produce and understand grammatical sentences at the same level as
       | someone who is severely mentally retarded. (In fact, there is a
       | mental disorder of exactly this kind:
       | https://en.wikipedia.org/wiki/Receptive_aphasia .) Reality exists
       | outside of language, and language refers to it.
       | 
       | Reading this post has actually significantly lowered my opinion
       | of vector representations of words, which I previously knew
       | essentially nothing about. I've seen people raving about how cool
       | it is that "king" - [male] + [female] is "queen", but again that
       | just highlights that the vector representation is a shoddy
       | approximation of the model you want. What's "king" if you
       | subtract [male] without simultaneously adding [female]? Why does
       | the model even consider that a possibility? In a more accurate
       | model, "king" (along with many other words) has a gender
       | attribute, and the possible values of the gender attribute are
       | [male] and [female], and "king" requires the [male] value.
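        | 
        | (For reference, what that arithmetic literally computes in
        | practice: the offset is usually taken between the word vectors
        | for "man" and "woman" rather than abstract [male]/[female]
        | features. A sketch, assuming gensim and its downloadable GloVe
        | vectors are available:)
        | 
        |     import gensim.downloader as api
        | 
        |     # pretrained 50-dimensional GloVe vectors (downloads on first use)
        |     vectors = api.load("glove-wiki-gigaword-50")
        | 
        |     # "king" - "man" + "woman", then find the nearest word vectors
        |     print(vectors.most_similar(positive=["king", "woman"],
        |                                negative=["man"], topn=3))
        |     # "queen" is typically the top result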
        
         | bluefox wrote:
         | 1. thaumasiotes is __________.
         | 
         | 2. everyone likes being __________, but...
         | 
         | 3. ... not everyone likes criticism, even when it's __________.
         | 
         | 4. therefore, often HN comments are downvoted despite being
         | __________.
        
       | rfw300 wrote:
       | This was a wonderfully accessible intro to word embeddings and
       | how they work. Worth a read for anyone just starting out with NLP
       | or just with a passing interest in it.
        
         | tsumnia wrote:
          | Agreed, and appropriately timed! I just finished using one-hot
          | encoding for a predictor on student practice. The results were
          | better than traditional methods, but only marginally better.
          | When discussing the results I specifically mentioned how the
          | model didn't understand meaning or the levels of difficulty
          | between exercises. This helps add more context to my own
          | understanding and will definitely guide my next study.
        
         | omegalulw wrote:
          | It doesn't seem to cover contextual word embeddings, though,
          | which have been all the rage since BERT came in.
        
           | gillesjacobs wrote:
            | You're right, I too was a bit disappointed that the page
            | didn't mention contextual embeddings, since those have been
            | the best-performing and most robust approach in NLP for
            | nearly four years now.
            | 
            | It's still a clear introduction to word vectors, though, and
            | it is good to know where we came from.
        
       ___________________________________________________________________
       (page generated 2022-01-06 23:01 UTC)