[HN Gopher] Word2Vec Explained. Explaining the Intuition of Word...
       ___________________________________________________________________
        
       Word2Vec Explained. Explaining the Intuition of Word2Vec
        
       Author : ColinWright
       Score  : 90 points
       Date   : 2022-03-27 14:00 UTC (9 hours ago)
        
 (HTM) web link (towardsdatascience.com)
 (TXT) w3m dump (towardsdatascience.com)
        
       | ehsankia wrote:
       | Great fun use of word2vec
       | 
       | https://semantle.novalis.org/
       | 
       | https://semantle.pimanrul.es/
        
         | svcrunch wrote:
         | Also take a look at Semantris:
         | 
         | https://research.google.com/semantris
        
         | gojomo wrote:
         | Also:
         | 
         | https://transorthogonal-linguistics.herokuapp.com/TOL/boy/ma...
         | 
         | (Which can reproduce an old XKCD about the 'purity' of other
         | scientific fields compared to math:
         | https://twitter.com/RadimRehurek/status/638531775333949440)
        
       | gibsonf1 wrote:
        | The problem is that words are not the issue; concepts are. And
        | understanding meaning requires both a causal and a conceptual
        | understanding of a spacetime model of the world. That's why the
        | word2vec approach to NLP is truly a dead end, although some
        | associations can be gleaned.
        
       | it_does_follow wrote:
       | Am I alone in really disliking Towards Data Science?
       | 
        | While their articles always look nice, the content is written
        | quickly by data scientists wanting to polish their resumes, with
        | the ultimate aim of rapidly generating content for TDS that will
        | match every conceivable data-science-related search. This post
        | clearly exists solely so that TDS can get the top spot for
        | "Word2vec explained" (which they have). As evidence of this
        | tactic, you can see that there is already a TDS post, "Word2vec
        | made easy" [0], offering nothing substantially different from
        | this one.
       | 
        | The problem is that the content is almost never useful; it just
        | looks nice at a first skim. The authors, through no real fault of
        | their own, are just eager novices who rarely have a new
        | perspective to add to a topic. It's not uncommon to find huge
        | conceptual errors (or at least gaps) in the content there.
       | 
        | I personally encourage everyone at every level to write about
        | what they can, but the issue is that TDS has manipulated this
        | population of eager data scientists in order to dominate search
        | results on nearly every single DS-related topic they can cover,
        | which has made searching for anything tedious.
       | 
       | Compare this post to the fantastic work of Jay Alammar [1]. Jay's
       | post is truly excellent, covering a lot of interesting details
       | about word2vec and providing excellent visuals as well.
       | 
        | I'm assuming TDS will fold as soon as DS stops being a "hot"
        | topic (which I think will happen in the relatively near future),
        | and will personally be glad to see the web rid of their
        | low-signal blog spam.
       | 
       | 0. https://towardsdatascience.com/word2vec-made-
       | easy-139a31a4b8... 1. https://jalammar.github.io/illustrated-
       | word2vec/
        
         | heyhihello wrote:
         | Agree on poor TDS quality.
         | 
          | You should check out Amazon's MLU (Machine Learning University)
          | interactive explainers - they're like mini NYT articles on
          | different algorithms:
         | 
         | https://mlu-explain.github.io/
        
           | bllguo wrote:
            | As another recommendation, Distill is higher-level and covers
            | fewer topics, but their article quality is fantastic:
           | 
           | https://distill.pub/
        
         | minimaxir wrote:
         | TDS is a banned domain on HN:
         | https://news.ycombinator.com/from?site=towardsdatascience.co...
         | 
         | It's unusual that this article got vouched.
        
           | ColinWright wrote:
           | I thought this particular article gave a balanced, high-level
           | overview, along with enough detail and references to provide
           | a good starting point.
           | 
            | Yes, perhaps it's a bit lightweight, but as an introduction
           | I thought it did a good job.
        
             | minimaxir wrote:
              | Unusual in the sense that getting vouched is an outlier, not
              | necessarily an indication of the article's quality.
        
       | kqr wrote:
        | I have been tempted to try word2vec-like techniques on e-commerce
        | shopping carts as a way to find a particular type of
        | recommendation. I suspect the data will be too sparse, though.
        | 
        | Has anyone applied similar techniques to non-text corpora?
        
         | random314 wrote:
          | I have applied it to e-commerce shopping carts and it works
          | quite well :). The item IDs (words) viewed in sequence in a
          | session can be thought of as a sentence.
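          | 
          | A minimal sketch of that idea, assuming gensim is installed;
          | the sessions and SKU names below are made up:
          | 
          |     from gensim.models import Word2Vec
          | 
          |     # Each shopping session is a "sentence"; item IDs are the "words".
          |     sessions = [
          |         ["sku_123", "sku_456", "sku_789"],
          |         ["sku_456", "sku_789", "sku_111"],
          |         ["sku_123", "sku_111", "sku_456"],
          |     ]
          | 
          |     model = Word2Vec(
          |         sentences=sessions,
          |         vector_size=64,   # embedding dimension
          |         window=5,         # context window within a session
          |         min_count=1,      # real data usually needs a higher cutoff
          |         sg=1,             # skip-gram
          |     )
          | 
          |     # Items viewed in similar contexts end up with similar vectors.
          |     print(model.wv.most_similar("sku_456", topn=3))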
        
         | rdedev wrote:
          | I've been looking at a way to use transformer-based models on
          | tabular data. The hope is that these models have a much better
          | contextual understanding of words, so embeddings from them
          | should be of better quality than plain word2vec ones.
        
           | kevin948 wrote:
            | Same here. Did you find any good resources? I've been leaning
            | on autoencoders to get better encodings than word2vec and its
            | ilk.
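            | 
            | A rough sketch of the autoencoder route, assuming PyTorch and
            | with random placeholder data standing in for real tabular
            | rows:
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     class TabularAutoencoder(nn.Module):
            |         def __init__(self, n_features, latent_dim=8):
            |             super().__init__()
            |             self.encoder = nn.Sequential(
            |                 nn.Linear(n_features, 32), nn.ReLU(),
            |                 nn.Linear(32, latent_dim))
            |             self.decoder = nn.Sequential(
            |                 nn.Linear(latent_dim, 32), nn.ReLU(),
            |                 nn.Linear(32, n_features))
            | 
            |         def forward(self, x):
            |             z = self.encoder(x)
            |             return self.decoder(z), z
            | 
            |     X = torch.rand(256, 20)            # placeholder tabular data
            |     model = TabularAutoencoder(n_features=20)
            |     opt = torch.optim.Adam(model.parameters(), lr=1e-3)
            |     for _ in range(200):
            |         recon, _ = model(X)
            |         loss = nn.functional.mse_loss(recon, X)   # reconstruction loss
            |         opt.zero_grad()
            |         loss.backward()
            |         opt.step()
            | 
            |     # The latent vectors act as learned row embeddings.
            |     embeddings = model.encoder(X).detach()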
        
             | VHRanger wrote:
              | Network node embeddings are the best for tabular data. I
              | maintain a library for this here, but there are plenty of
              | good alternatives:
             | 
             | https://github.com/VHRanger/nodevectors
        
           | VHRanger wrote:
           | For sparser data, you should just do normal network node
           | embeddings.
           | 
            | Look into node2vec libraries, for instance.
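            | 
            | A simplified sketch of the idea (plain DeepWalk-style walks
            | rather than node2vec's biased walks), assuming networkx and
            | gensim:
            | 
            |     import random
            |     import networkx as nx
            |     from gensim.models import Word2Vec
            | 
            |     G = nx.karate_club_graph()   # toy graph
            | 
            |     def random_walk(graph, start, length=10):
            |         walk = [start]
            |         for _ in range(length - 1):
            |             neighbors = list(graph.neighbors(walk[-1]))
            |             if not neighbors:
            |                 break
            |             walk.append(random.choice(neighbors))
            |         return [str(n) for n in walk]
            | 
            |     # Walks play the role of sentences; nodes are the "words".
            |     walks = [random_walk(G, n) for n in G.nodes() for _ in range(20)]
            | 
            |     model = Word2Vec(walks, vector_size=32, window=5,
            |                      min_count=1, sg=1)
            |     print(model.wv.most_similar("0", topn=5))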
        
         | OccamsRazr wrote:
          | You may find this Airbnb paper relevant. They use skip-grams to
         | generate feature vectors for their listings.
         | 
         | https://www.kdd.org/kdd2018/accepted-papers/view/real-time-p...
        
         | cgearhart wrote:
         | It works...for some definition of "works". It's been applied to
         | all kinds of problems--including graphs (Node2Vec) and many
         | other cases where the input isn't "words"--to the point that
         | I'd consider it a weak baseline for any embedding task. In my
         | experience it is unreasonably effective for simple problems
         | (make a binary classifier for tweets), but the effectiveness
         | drops quickly as the problem gets more complicated.
         | 
         | In your proposed use case I would bet that you will "see" the
         | kind of similarity you're looking for based on vector
         | similarity, but I also expect it to largely be an illusion due
         | to confirmation bias. It will be much harder to make that
         | similarity actionable to solve the actual business use case.
         | (Like 30% of the time it'll work like magic; 60% of the time
         | it'll be "meh"; 10% of the time it'll be hilariously wrong.)
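          | 
          | For reference, the "weak baseline" version of the tweet example
          | is roughly: average pretrained word vectors per text, then fit
          | a linear classifier. Assuming gensim and scikit-learn; the toy
          | texts and labels are made up:
          | 
          |     import numpy as np
          |     import gensim.downloader as api
          |     from sklearn.linear_model import LogisticRegression
          | 
          |     wv = api.load("glove-twitter-25")   # small pretrained vectors
          | 
          |     def embed(text):
          |         # Average the vectors of known words; zeros if none match.
          |         vecs = [wv[w] for w in text.lower().split() if w in wv]
          |         return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)
          | 
          |     texts = ["i love this phone", "worst purchase ever",
          |              "great battery life", "totally broken on arrival"]
          |     labels = [1, 0, 1, 0]
          | 
          |     clf = LogisticRegression().fit([embed(t) for t in texts], labels)
          |     print(clf.predict([embed("love the battery")]))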
        
         | laughy wrote:
          | I have applied it to the names in a population database. It
          | learnt interesting, and expected, structure. Visualized with
          | UMAP, it clustered by gender first, and then by something that
          | could probably be described as the cultural origin of the name.
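          | 
          | The visualization step is roughly this, assuming umap-learn and
          | a matrix of name vectors from such a model (random numbers
          | stand in for them here):
          | 
          |     import numpy as np
          |     import umap                      # pip install umap-learn
          |     import matplotlib.pyplot as plt
          | 
          |     name_vectors = np.random.rand(500, 64)   # placeholder embeddings
          | 
          |     coords = umap.UMAP(n_components=2).fit_transform(name_vectors)
          |     plt.scatter(coords[:, 0], coords[:, 1], s=5)
          |     plt.title("UMAP projection of name embeddings")
          |     plt.show()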
        
       | samuel wrote:
        | For me the key point to understanding what's going on (assuming
        | I got it) is that the hidden layer "has" to produce similar
        | representations for words that appear in the same contexts, so
        | that the output layer can predict them.
        | 
        | The intuition behind doc2vec is a bit harder to grasp. I
        | understand the role of the "paragraph vector": it provides
        | context to the prediction, so for "the ball hit the ---" in a
        | basketball text the classifier would predict "rim", and in a
        | football one "goalpost" (simplifying). But I still don't get why
        | similar texts get similar latent representations.
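        | 
        | A tiny sketch of that setup with gensim's Doc2Vec (the toy
        | documents are made up for illustration):
        | 
        |     from gensim.models.doc2vec import Doc2Vec, TaggedDocument
        | 
        |     docs = [
        |         TaggedDocument("the ball hit the rim and bounced out".split(),
        |                        ["basketball_1"]),
        |         TaggedDocument("he dunked the ball through the rim".split(),
        |                        ["basketball_2"]),
        |         TaggedDocument("the ball hit the goalpost and went wide".split(),
        |                        ["football_1"]),
        |         TaggedDocument("she kicked the ball past the goalpost".split(),
        |                        ["football_2"]),
        |     ]
        | 
        |     model = Doc2Vec(docs, vector_size=32, min_count=1, epochs=100)
        | 
        |     # Each document vector is trained to help predict that document's
        |     # words, so documents with similar word usage end up close together.
        |     print(model.dv.most_similar("basketball_1"))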
        
       | mgaunard wrote:
       | word2vec is 9 years old, hardly a "recent breakthrough".
        
         | ColinWright wrote:
         | I'm a little surprised that that's the most useful and
         | constructive thing you can say about the article.
        
           | visarga wrote:
            | Since it's 2022, use Sentence Transformers to embed short
            | phrases. They are leaps ahead of w2v. Or just use any model
            | from Hugging Face. It's just 10 lines of code, really easy to
            | start with.
           | 
           | https://sbert.net
           | 
           | https://huggingface.co
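            | 
            | For example, something like this (assuming the commonly used
            | all-MiniLM-L6-v2 checkpoint):
            | 
            |     from sentence_transformers import SentenceTransformer, util
            | 
            |     model = SentenceTransformer("all-MiniLM-L6-v2")
            | 
            |     sentences = ["word2vec learns word embeddings",
            |                  "skip-gram trains vectors for words",
            |                  "the weather is nice today"]
            | 
            |     embeddings = model.encode(sentences)
            |     # The first two sentences score much closer than the third.
            |     print(util.cos_sim(embeddings[0], embeddings[1:]))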
        
             | minimaxir wrote:
              | Although Hacker News can reach Stack Overflow levels of
              | "just use X lol" reductiveness...using Transformers for NLP
              | is indeed the best answer overall in terms of performance,
              | speed, and ease of implementation.
             | 
             | A weakness of TDS monopolizing data science SEO is that
             | it's hard for better techniques to surface.
        
       ___________________________________________________________________
       (page generated 2022-03-27 23:01 UTC)