[HN Gopher] Koan: A word2vec negative sampling implementation wi...
       ___________________________________________________________________
        
       Koan: A word2vec negative sampling implementation with correct CBOW
       update
        
       Author : polm23
       Score  : 47 points
       Date   : 2021-01-02 08:15 UTC (14 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | polm23 wrote:
       | This is an interesting surprise about good old word vectors. From
       | the README:
       | 
        | > Although continuous bag of words (CBOW) embeddings can be
       | trained more quickly than skipgram (SG) embeddings, it is a
       | common belief that SG embeddings tend to perform better in
       | practice. This was observed by the original authors of Word2Vec
       | [1] and also in subsequent work [2]. However, we found that
       | popular implementations of word2vec with negative sampling such
       | as word2vec and gensim do not implement the CBOW update
       | correctly, thus potentially leading to misconceptions about the
       | performance of CBOW embeddings when trained correctly.
       | 
       | The upshot is that they get similar results with CBOW while
       | training three times faster than skipgram.
       | 
       | Given the popularity of Transformers, and that Fasttext exists,
       | I'm curious as to what inspired them to even try this, but it's
       | certainly an interesting result. There's so much word vector
       | research that relies on quirks of the word2vec implementation.
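        | 
        | In case it helps to see the shape of the fix, here's a rough
        | numpy sketch of one CBOW negative-sampling step with averaged
        | context vectors (not the actual koan or word2vec.c code; the
        | names are mine):
        | 
        |   import numpy as np
        | 
        |   def cbow_ns_step(W_in, W_out, ctx_ids, center_id, neg_ids,
        |                    lr=0.025):
        |       C = len(ctx_ids)
        |       h = W_in[ctx_ids].mean(axis=0)   # average of context vecs
        |       grad_h = np.zeros_like(h)
        |       pairs = [(center_id, 1.0)] + [(n, 0.0) for n in neg_ids]
        |       for wid, label in pairs:
        |           p = 1.0 / (1.0 + np.exp(-W_out[wid] @ h))  # sigmoid
        |           g = (p - label) * lr
        |           grad_h += g * W_out[wid]
        |           W_out[wid] -= g * h
        |       for cid in ctx_ids:
        |           # word2vec.c/gensim apply grad_h to each context vector
        |           # as-is; the corrected update divides by C, since the
        |           # chain rule through the average contributes a 1/C.
        |           W_in[cid] -= grad_h / C
        | 
        | The entire disagreement is in that final division by C.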
        
         | jarym wrote:
          | With all the eyeballs on word2vec and gensim, how did this not
         | get picked up before?
        
           | gojomo wrote:
           | Gensim considers the `word2vec.c` code, from the original
           | authors of the Word2Vec paper, as canonical and seeks to
           | match its behavior exactly, even in ways it might deviate
           | from some interpretations of the paper.
           | 
           | If there's an actual benefit to be had here, Gensim could add
           | it as an option - but would likely always default to the same
           | CBOW behavior as in `word2vec.c` (& similarly, FastText) -
           | rather than this 'koan' variant.
        
           | logram wrote:
            | Apparently it did:
            | https://github.com/RaRe-Technologies/gensim/issues/1873
        
             | gojomo wrote:
             | While I still need to read this paper in detail, I'm not
             | sure their only change is to this scaling of the update.
             | 
             | The `koan` CBOW change has mixed effects on benchmarks, and
             | makes their implementation no longer match the choices of
             | the original, canonical `word2vec.c` release from the
             | original Google authors of the word2vec paper. (Or, by my
             | understanding, the CBOW mode of the FastText code.)
             | 
             | So all the reasoning in that issue for why Gensim didn't
             | want to make any change stands. Of course, if there's an
             | alternate mode that offers proven benefits, it'd be a
             | welcome suggestion/addition. (At this point, it's possible
             | that simply using the `cbow_mean=0` sum-rather-than-average
             | mode, or a different starting `alpha`, matches any claimed
             | benefits of koan_CBOW.)
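              | 
              | (Both of those knobs are just constructor arguments in
              | Gensim, so the comparison is easy to try - a rough sketch
              | with a toy placeholder corpus:)
              | 
              |   from gensim.models import Word2Vec
              | 
              |   toy_corpus = [["the", "cat", "sat"],
              |                 ["the", "dog", "sat"]]
              | 
              |   model = Word2Vec(
              |       sentences=toy_corpus,  # placeholder corpus
              |       sg=0,                  # CBOW
              |       negative=5,            # negative sampling
              |       cbow_mean=0,           # sum rather than average
              |       alpha=0.025,           # starting learning rate
              |       min_count=1,
              |   )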
        
               | mlthoughts2018 wrote:
               | The paper itself says the only change is normalizing by
               | the context window size C.
        
         | nmfisher wrote:
         | I'm not surprised that industry prefers W2V over Transformers,
         | given how heavy-duty the latter can be at inference time.
         | 
         | It's been a few years since I looked at it, but IIRC fastText
         | is basically just w2v with subwords, so it's also possible this
         | negative sampling fix applies to w2v and fastText equally.
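          | 
          | (The subword part is roughly this: a toy sketch of
          | fastText-style character n-grams, ignoring the hashing and
          | bucketing the real thing does:)
          | 
          |   def char_ngrams(word, n_min=3, n_max=6):
          |       w = f"<{word}>"
          |       grams = [w[i:i + n] for n in range(n_min, n_max + 1)
          |                for i in range(len(w) - n + 1)]
          |       return grams + [w]
          | 
          |   char_ngrams("where")
          |   # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ..., '<where>']
          | 
          | A word's vector is the sum of its n-gram vectors, with the same
          | CBOW/SG training on top, so the same CBOW question would
          | presumably carry over.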
        
           | riku_iki wrote:
           | One can also use shallow transformer models if inference
           | throughput is important.
        
         | steve_g wrote:
         | To save a few clicks, here's the paper that describes the fix
         | and gives some comparisons with the supposedly broken
         | implementation.
         | 
         | https://arxiv.org/pdf/2012.15332.pdf
        
           | gojomo wrote:
           | While I'm unsure of this paper/implementation's main claims
           | without a closer reading, the Appendix D 'alias method for
           | negative sampling' looks like it might be a nice standalone
           | performance improvement to Gensim (& others') negative-
           | sampling code.
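            | 
            | (For context, the alias method is a generic trick for O(1)
            | sampling from a fixed discrete distribution - roughly along
            | these lines, not their actual Appendix D code:)
            | 
            |   import random
            | 
            |   def build_alias(probs):
            |       # probs: the noise distribution, summing to 1
            |       n = len(probs)
            |       scaled = [p * n for p in probs]
            |       prob, alias = [0.0] * n, [0] * n
            |       small = [i for i, s in enumerate(scaled) if s < 1.0]
            |       large = [i for i, s in enumerate(scaled) if s >= 1.0]
            |       while small and large:
            |           s, l = small.pop(), large.pop()
            |           prob[s], alias[s] = scaled[s], l
            |           scaled[l] += scaled[s] - 1.0
            |           (small if scaled[l] < 1.0 else large).append(l)
            |       for i in small + large:  # leftovers get probability 1
            |           prob[i] = 1.0
            |       return prob, alias
            | 
            |   def draw_negative(prob, alias):
            |       # one sample in O(1) instead of a binary search or a
            |       # huge precomputed table
            |       i = random.randrange(len(prob))
            |       return i if random.random() < prob[i] else alias[i]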
        
       | Der_Einzige wrote:
        | This is awesome! Especially because word2vec and its derivatives
       | are much more useful in some cases than transformers are.
       | 
       | For instance, they store the vocabulary. I can query for similar
       | words, or do vector math and convert it back to words. That is
       | much harder to do with transformers.
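        | 
        | (With e.g. Gensim that really is a one-liner or two - assuming a
        | trained model object named `model` here:)
        | 
        |   # nearest neighbours of a word
        |   model.wv.most_similar("paris", topn=5)
        | 
        |   # analogy-style arithmetic: king - man + woman ~= queen
        |   model.wv.most_similar(positive=["king", "woman"],
        |                         negative=["man"])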
       | 
        | Also, not surprised at all that this kind of bug made it through
        | in spite of how popular word2vec is. NLP is chock full of tiny
        | bugs like this, and there are all sorts of low-hanging fruit for
        | sufficiently interested researchers...
        
       | gwenzek wrote:
       | Why not a pull request?
        
         | mlthoughts2018 wrote:
         | Why do work when you can milk it for status?
        
       | mlthoughts2018 wrote:
       | I'm not convinced by this paper. I've trained a lot of from-
        | scratch word, sentence, and query embeddings in my career; it's
       | probably the single main thing I've done. I've never observed
       | rescaling the average context vector to have an impact on
       | application performance. It amounts to rescaling gradient terms,
       | but most of those are being backprop'd from layers with batch
       | normalization, strict activation functions, clipping, etc. There
       | are many, many non-linear effects contributing to how that
        | rescaling constant plays a role, and in anything other than a
       | completely shallow word2vec model with no further layers and
       | where you just want to extract the embeddings in some
       | application-agnostic way, that normalizing constant is not going
       | to matter.
        
       | piker wrote:
        | I noticed there's no Huffman tree in the code. Then footnote 2
        | in the paper: "In this work, we always use the negative sampling
        | formulations of Word2vec objectives which are consistently more
        | efficient and effective than the hierarchical softmax
        | formulations." Is that consensus?
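        | 
        | (For what it's worth, the negative-sampling formulation they mean
        | is the standard objective from the word2vec papers - for a
        | predicted word o, input/context representation h, and K noise
        | words drawn from a noise distribution P_n(w):
        | 
        |   \log \sigma(u_o^T h) + \sum_{k=1}^{K}
        |     E_{w_k \sim P_n(w)} [ \log \sigma(-u_{w_k}^T h) ]
        | 
        | Hierarchical softmax instead scores a path through a binary
        | Huffman tree over the vocabulary, which is why that tree shows up
        | in word2vec.c but isn't needed here.)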
        
       ___________________________________________________________________
       (page generated 2021-01-02 23:01 UTC)