[HN Gopher] Word Embeddings Explained
___________________________________________________________________
Word Embeddings Explained
Author : nalgeon
Score : 45 points
Date : 2022-01-06 17:07 UTC (5 hours ago)
(HTM) web link (lena-voita.github.io)
(TXT) w3m dump (lena-voita.github.io)
| nalgeon wrote:
| Actually, that's not a single article, but a whole book (kind
| of):
|
| 1. Word Embeddings (this article)
|
| 2. Text Classification
|
| https://lena-voita.github.io/nlp_course/text_classification....
|
| 3. Language Modeling
|
| https://lena-voita.github.io/nlp_course/language_modeling.ht...
|
| 4. Seq2Seq and Attention
|
| https://lena-voita.github.io/nlp_course/seq2seq_and_attentio...
|
| 5. ELMo, GPT, BERT
|
| https://lena-voita.github.io/nlp_course/transfer_learning.ht...
|
| All by Lena Voita, who has a real talent for explaining complex
| stuff.
| thaumasiotes wrote:
| Hmm. There was some discussion on the chatbot thread about the
| use of models of reality in language processing, and that looks
| like the best way to understand this idea.
|
| > The easiest thing you can do is to represent words as one-hot
| vectors: for the i-th word in the vocabulary, the vector has 1 in
| the i-th dimension and 0 in the rest. In Machine Learning, this
| is the simplest way to represent categorical features.
|
| > You probably can guess why one-hot vectors are not the best way
| to represent words. One of the problems is that for large
| vocabularies, these vectors will be very long: vector
| dimensionality is equal to the vocabulary size.
|
| > What is really important is that these vectors know nothing
| about the words they represent. For example, one-hot vectors
| "think" that cat is as close to dog as it is to table! We can say
| that one-hot vectors do not capture meaning.
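|
| A tiny sketch of that point (the three-word vocabulary below is
| made up, not taken from the article): every pair of distinct
| one-hot vectors is equally far apart, so the representation
| carries no notion of similarity.
|
|     import numpy as np
|
|     vocab = ["cat", "dog", "table"]
|     # one-hot vector: 1 at the word's index, 0 everywhere else
|     one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
|
|     # every distinct pair has dot product 0, so "cat" is exactly
|     # as close to "dog" as it is to "table"
|     print(one_hot["cat"] @ one_hot["dog"])    # 0.0
|     print(one_hot["cat"] @ one_hot["table"])  # 0.0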
|
| Well, the obvious improvement to this would be to use numeric
| identifiers rather than implicit numeric identifiers where you
| derive the number from the position of the only set bit in a very
| long bitstring. That approach is considered and rejected:
|
| > we can easily understand the text "I saw a cat", but _our
| models can not - they need vectors of features_. [my emphasis]
|
| but I would suggest that what this means is that the whole
| approach to model design is flawed. Get a model that can handle
| opaque identifiers on their own terms.
|
| However, the approach described here has some valuable things to
| say about modeling the world, even if it pretends that what it's
| really modeling are words.
|
| The little slideshow working through an example ends up
| undermining the theory that it means to support. The examples are
| carefully chosen to suggest that the approach has much more
| validity than it really does. But if we add "syrup" to the list
| of words to which "tezguino" is being compared, we immediately
| see that it satisfies conditions (1), (2), and (4), but not (3).
| This puts it on an equal footing with "wine", which satisfies
| conditions (1), (2), and (3), but not (4). The problem is that
| tezguino, a grain alcohol, is much more similar to wine, a fruit
| alcohol, than it is to corn syrup. We shouldn't be calling this a
| tossup.
|
| So we've highlighted a couple of problems:
|
| - Condition (2) is worthless. "Everybody likes ____" can be
| applied to any noun phrase. This yields exactly zero information
| about the semantics of the phrase. By including it as 1/4 of our
| test battery, we're making ourselves stupider, seeing
| similarities and dissimilarities that aren't there.
|
| - Condition (3), which tells us that something is alcoholic, is
| much more valuable than condition (1), which tells us that
| something is a liquid, which is in turn more valuable than
| condition (4), which tells us that something is derived from
| corn. It makes no sense to weight these equally.
|
| This is the problem in the distributional hypothesis as
| described: similarity of rows in the table doesn't lead to any
| useful inferences, because which rows are similar to which other
| rows is subject to too much discretion on the part of the person
| who assembles the model. The conditions being checked, considered
| as a set, _are_ our model of reality, and we need to be careful
| what that looks like. In this example, we have a very badly
| designed model (which more or less defines liquids, alcohol, and
| corn products, plus making a semantically-irrelevant grammatical
| distinction) which has to be saved by a very carefully chosen set
| of words being modeled. (Also, the example claims that it is more
| incorrect to say "everyone likes motor oil" than it is to say
| "everyone likes wine". This is false.)
|
| The post goes on to discuss how we can derive a model of reality
| automatically from a textual corpus, and I think this approach is
| a big problem in language processing today. You need the model of
| reality to exist independently, and then you can hook up a model
| of language to the model of reality. Without an independent model
| of reality, you'll always be limited to language models that can
| produce and understand grammatical sentences at the same level as
| someone who is severely mentally retarded. (In fact, there is a
| mental disorder of exactly this kind:
| https://en.wikipedia.org/wiki/Receptive_aphasia .) Reality exists
| outside of language, and language refers to it.
|
| Reading this post has actually significantly lowered my opinion
| of vector representations of words, which I previously knew
| essentially nothing about. I've seen people raving about how cool
| it is that "king" - [male] + [female] is "queen", but again that
| just highlights that the vector representation is a shoddy
| approximation of the model you want. What's "king" if you
| subtract [male] without simultaneously adding [female]? Why does
| the model even consider that a possibility? In a more accurate
| model, "king" (along with many other words) has a gender
| attribute, and the possible values of the gender attribute are
| [male] and [female], and "king" requires the [male] value.
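|
| For what it's worth, the "king" trick is just nearest-neighbour
| search after vector arithmetic. The numbers below are invented to
| show the mechanics; real embeddings are dense vectors with a few
| hundred dimensions learned from a corpus:
|
|     import numpy as np
|
|     # made-up 3-d vectors (roughly: royalty, gender, commonness)
|     vecs = {
|         "king":  np.array([0.9,  0.8, 0.1]),
|         "queen": np.array([0.9, -0.8, 0.1]),
|         "man":   np.array([0.1,  0.8, 0.9]),
|         "woman": np.array([0.1, -0.8, 0.9]),
|     }
|
|     def cos(a, b):
|         return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
|
|     target = vecs["king"] - vecs["man"] + vecs["woman"]
|     print(max(vecs, key=lambda w: cos(target, vecs[w])))  # queen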
| bluefox wrote:
| 1. thaumasiotes is __________.
|
| 2. everyone likes being __________, but...
|
| 3. ... not everyone likes criticism, even when it's __________.
|
| 4. therefore, often HN comments are downvoted despite being
| __________.
| rfw300 wrote:
| This was a wonderfully accessible intro to word embeddings and
| how they work. Worth a read for anyone just starting out with NLP
| or just with a passing interest in it.
| tsumnia wrote:
| Agreed, and appropriately timed! I just finished using one-hot
| encoding for a predictor on student practice. The results were
| better than traditional methods, but only marginally better. When
| discussing the results I specifically mentioned how the NN
| didn't understand meaning or levels of difficulty between
| exercises. This helps add more context to my own understanding
| and will definitely guide my next study.
| omegalulw wrote:
| It doesn't seem to cover contextual word embeddings though,
| which have been all the rage since BERT came in.
| gillesjacobs wrote:
| You're right, I too was a bit disappointed that the page
| didn't mention contextual embeddings, since they have been the
| best-performing and most robust approach in NLP for nearly
| four years now.
|
| It's still a clear introduction to word vectors, though, and
| it's good to know where we came from.
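|
| For anyone wondering what "contextual" means in practice: the
| vector for a word depends on the sentence around it. A rough
| sketch with the HuggingFace transformers library (model name and
| sentences are illustrative, not from the article):
|
|     import torch
|     from transformers import AutoModel, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("bert-base-uncased")
|     model = AutoModel.from_pretrained("bert-base-uncased")
|
|     def vector_for(sentence, word):
|         # hidden state of `word`'s token in this sentence
|         enc = tok(sentence, return_tensors="pt")
|         idx = enc.input_ids[0].tolist().index(
|             tok.convert_tokens_to_ids(word))
|         with torch.no_grad():
|             return model(**enc).last_hidden_state[0, idx]
|
|     a = vector_for("I deposited money at the bank.", "bank")
|     b = vector_for("We sat on the bank of the river.", "bank")
|     # unlike a static embedding, the two "bank" vectors differ
|     print(torch.cosine_similarity(a, b, dim=0).item())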
___________________________________________________________________
(page generated 2022-01-06 23:01 UTC)