[HN Gopher] Koan: A word2vec negative sampling implementation wi...
___________________________________________________________________
Koan: A word2vec negative sampling implementation with correct CBOW
update
Author : polm23
Score : 47 points
Date : 2021-01-02 08:15 UTC (14 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| polm23 wrote:
| This is an interesting surprise about good old word vectors. From
| the README:
|
| > Although continuous bag of words (CBOW) embeddings can be
| trained more quickly than skipgram (SG) embeddings, it is a
| common belief that SG embeddings tend to perform better in
| practice. This was observed by the original authors of Word2Vec
| [1] and also in subsequent work [2]. However, we found that
| popular implementations of word2vec with negative sampling such
| as word2vec and gensim do not implement the CBOW update
| correctly, thus potentially leading to misconceptions about the
| performance of CBOW embeddings when trained correctly.
|
| The upshot is that they get similar results with CBOW while
| training three times faster than skipgram.
|
| Given the popularity of Transformers, and that Fasttext exists,
| I'm curious as to what inspired them to even try this, but it's
| certainly an interesting result. There's so much word vector
| research that relies on quirks of the word2vec implementation.
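A rough numpy sketch of the update in question, based on the paper's
description rather than koan's actual C++ code (names and structure
here are illustrative): because CBOW averages the context vectors, the
chain rule gives each context word a 1/C share of the gradient, whereas
word2vec.c-style code applies the full gradient to every context word.

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def cbow_ns_step(W_in, W_out, ctx_ids, target_id, neg_ids, lr,
                   corrected=True):
      # W_in / W_out: (vocab, dim) input and output embedding matrices.
      C = len(ctx_ids)
      h = W_in[ctx_ids].mean(axis=0)           # averaged context vector

      grad_h = np.zeros_like(h)
      pairs = [(target_id, 1.0)] + [(n, 0.0) for n in neg_ids]
      for wid, label in pairs:                 # 1 positive + K negatives
          g = sigmoid(W_out[wid] @ h) - label  # dL/d(score)
          grad_h += g * W_out[wid]
          W_out[wid] -= lr * g * h             # output-vector update

      # The disputed detail: since h is an *average* of C context
      # vectors, the chain rule gives each of them a 1/C share of
      # grad_h; word2vec.c-style code applies grad_h in full.
      scale = 1.0 / C if corrected else 1.0
      for cid in ctx_ids:
          W_in[cid] -= lr * scale * grad_h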
| jarym wrote:
| With all the eyeballs on word2vec and gensim, how did this not
| get picked up before?
| gojomo wrote:
| Gensim considers the `word2vec.c` code, from the original
| authors of the Word2Vec paper, as canonical and seeks to
| match its behavior exactly, even in ways it might deviate
| from some interpretations of the paper.
|
| If there's an actual benefit to be had here, Gensim could add
| it as an option - but would likely always default to the same
| CBOW behavior as in `word2vec.c` (& similarly, FastText) -
| rather than this 'koan' variant.
| logram wrote:
| Apparently it did: https://github.com/RaRe-Technologies/gensim/issues/1873
| gojomo wrote:
| While I still need to read this paper in detail, I'm not
| sure their only change is to this scaling of the update.
|
| The `koan` CBOW change has mixed effects on benchmarks, and
| makes their implementation no longer match the choices of
| the original, canonical `word2vec.c` release from the
| original Google authors of the word2vec paper. (Or, by my
| understanding, the CBOW mode of the FastText code.)
|
| So all the reasoning in that issue for why Gensim didn't
| want to make any change stands. Of course, if there's an
| alternate mode that offers proven benefits, it'd be a
| welcome suggestion/addition. (At this point, it's possible
| that simply using the `cbow_mean=0` sum-rather-than-average
| mode, or a different starting `alpha`, matches any claimed
| benefits of koan_CBOW.)
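For anyone who wants to try the comparison gojomo suggests, those
Gensim knobs are ordinary constructor arguments; a minimal,
illustrative example with a toy corpus (all other settings left at
their defaults):

  from gensim.models import Word2Vec

  sentences = [["the", "quick", "brown", "fox"],
               ["jumps", "over", "the", "lazy", "dog"]]

  model = Word2Vec(sentences,
                   sg=0,         # CBOW (sg=1 would be skip-gram)
                   negative=5,   # negative sampling
                   cbow_mean=0,  # sum context vectors instead of averaging
                   alpha=0.025,  # starting learning rate
                   window=5,
                   min_count=1)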
| mlthoughts2018 wrote:
| The paper itself says the only change is normalizing by
| the context window size C.
| nmfisher wrote:
| I'm not surprised that industry prefers W2V over Transformers,
| given how heavy-duty the latter can be at inference time.
|
| It's been a few years since I looked at it, but IIRC fastText
| is basically just w2v with subwords, so it's also possible this
| negative sampling fix applies to w2v and fastText equally.
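A small illustration of the subword point: fastText represents a word
by its character n-grams (3- to 6-grams by default, with boundary
markers) plus a whole-word vector, and sums their embeddings. The
helper below only sketches that decomposition; it is not fastText's
own code:

  def char_ngrams(word, n_min=3, n_max=6):
      # fastText-style boundary markers around the word.
      marked = "<" + word + ">"
      grams = []
      for n in range(n_min, n_max + 1):
          for i in range(len(marked) - n + 1):
              grams.append(marked[i:i + n])
      return grams

  # e.g. char_ngrams("where") starts with: <wh, whe, her, ere, re>, ...
  # The word's embedding is the sum of the vectors of these n-grams
  # (plus a vector for the whole word), which is what lets fastText
  # build vectors for rare or out-of-vocabulary words.
  print(char_ngrams("where"))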
| riku_iki wrote:
| One can also use shallow transformer models if inference
| throughput is important.
| steve_g wrote:
| To save a few clicks, here's the paper that describes the fix
| and gives some comparisons with the supposedly broken
| implementation.
|
| https://arxiv.org/pdf/2012.15332.pdf
| gojomo wrote:
| While I'm unsure of this paper/implementation's main claims
| without a closer reading, the Appendix D 'alias method for
| negative sampling' looks like it might be a nice standalone
| performance improvement to Gensim (& others') negative-
| sampling code.
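For reference, the alias method mentioned there (Walker's method)
precomputes two tables so that each negative sample costs O(1) instead
of a binary search over the cumulative unigram^0.75 table. A generic
sketch, not koan's or Gensim's actual code:

  import random

  def build_alias_table(probs):
      # O(n) setup; probs must sum to 1.
      n = len(probs)
      scaled = [p * n for p in probs]
      prob, alias = [0.0] * n, [0] * n
      small = [i for i, p in enumerate(scaled) if p < 1.0]
      large = [i for i, p in enumerate(scaled) if p >= 1.0]
      while small and large:
          s, l = small.pop(), large.pop()
          prob[s], alias[s] = scaled[s], l
          scaled[l] -= 1.0 - scaled[s]
          (small if scaled[l] < 1.0 else large).append(l)
      for i in small + large:          # leftovers are ~1.0 up to rounding
          prob[i] = 1.0
      return prob, alias

  def alias_draw(prob, alias):
      # O(1) per draw: pick a column, then flip a biased coin.
      i = random.randrange(len(prob))
      return i if random.random() < prob[i] else alias[i]

  # Toy noise distribution proportional to count**0.75, as in word2vec.
  counts = {"the": 1000, "cat": 50, "sat": 30, "mat": 20}
  words = list(counts)
  weights = [c ** 0.75 for c in counts.values()]
  total = sum(weights)
  prob, alias = build_alias_table([w / total for w in weights])
  negatives = [words[alias_draw(prob, alias)] for _ in range(5)]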
| Der_Einzige wrote:
| This is awesome! Especially because word2vec and its derivatives
| are much more useful in some cases than transformers are.
|
| For instance, they store the vocabulary. I can query for similar
| words, or do vector math and convert it back to words. That is
| much harder to do with transformers.
|
| Also, not surprised at all that this kind of bug made it through
| in spite of how popular word2vec is. NLP is chock-full of tiny
| bugs like this, and there are all sorts of low-hanging fruit for
| sufficiently interested researchers...
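The similarity and vector-arithmetic queries described above look
roughly like this in Gensim (any trained or pretrained vectors would
do; the downloadable GloVe set is just a convenient, runnable example):

  import gensim.downloader as api

  # Pretrained KeyedVectors (downloaded on first use).
  wv = api.load("glove-wiki-gigaword-100")

  # Query for similar words.
  print(wv.most_similar("cat", topn=5))

  # Vector math mapped back to words: king - man + woman ~= queen.
  print(wv.most_similar(positive=["king", "woman"], negative=["man"],
                        topn=1))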
| gwenzek wrote:
| Why not a pull request?
| mlthoughts2018 wrote:
| Why do work when you can milk it for status?
| mlthoughts2018 wrote:
| I'm not convinced by this paper. I've trained a lot of from-
| scratch word, sentence, and query embeddings in my career; it's
| probably the single main thing I've done. I've never observed
| rescaling the average context vector to have an impact on
| application performance. It amounts to rescaling gradient terms,
| but most of those are being backprop'd from layers with batch
| normalization, strict activation functions, clipping, etc. There
| are many, many non-linear effects contributing to how that
| rescaling constant plays a role, and in anything other than a
| completely shallow word2vec model with no further layers, where
| you just want to extract the embeddings in some
| application-agnostic way, that normalizing constant is not going
| to matter.
| piker wrote:
| Noticed there's no Huffman tree in the code. Then FN2 in the
| paper: "In this work, we always use the negative sampling
| formulations of Word2vec objectives which are consistently more
| efficient and effective than the hierarchical softmax
| formulations." Is that consensus?
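For context on what that footnote compares: negative sampling scores
the true target against K sampled noise words, so each update touches
K+1 output vectors, while hierarchical softmax walks a Huffman-tree
path of roughly log2|V| nodes. The standard negative-sampling
objective (Mikolov et al.), for target word w_O and context
representation h, is

  \log \sigma\left(u_{w_O}^{\top} h\right)
    + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)}
      \left[ \log \sigma\left(-u_{w_k}^{\top} h\right) \right]

where P_n(w) is typically the unigram distribution raised to the 3/4
power.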
___________________________________________________________________
(page generated 2021-01-02 23:01 UTC)