[HN Gopher] Word2Vec Explained. Explaining the Intuition of Word...
___________________________________________________________________
Word2Vec Explained. Explaining the Intuition of Word2Vec
Author : ColinWright
Score : 90 points
Date : 2022-03-27 14:00 UTC (9 hours ago)
(HTM) web link (towardsdatascience.com)
(TXT) w3m dump (towardsdatascience.com)
| ehsankia wrote:
| Great fun use of word2vec
|
| https://semantle.novalis.org/
|
| https://semantle.pimanrul.es/
| svcrunch wrote:
| Also take a look at Semantris:
|
| https://research.google.com/semantris
| gojomo wrote:
| Also:
|
| https://transorthogonal-linguistics.herokuapp.com/TOL/boy/ma...
|
| (Which can reproduce an old XKCD about the 'purity' of other
| scientific fields compared to math:
| https://twitter.com/RadimRehurek/status/638531775333949440)
| gibsonf1 wrote:
| The problem is that words are not the issue; concepts are. And
| for understanding meaning, both causal and conceptual
| understanding of a spacetime model of the world is needed. That's
| why the word2vec approach to NLP is truly a dead end, although
| some associations can be gleaned.
| it_does_follow wrote:
| Am I alone in really disliking Towards Data Science?
|
| While their articles always look nice, the content is all written
| quickly by data scientists wanting to polish their resumes, with
| the ultimate aim of rapidly generating content for TDS that will
| match every conceivable data-science-related search. This post
| clearly exists solely so that TDS can get the top spot for
| "Word2vec explained" (which they have). As evidence of this
| tactic, note that there is already a TDS post, "Word2vec made
| easy" [0], offering nothing substantially different from this
| one.
|
| The problem is that the content is almost never useful; it just
| looks nice at a first skim. The authors, through no real fault of
| their own, are eager novices who rarely have a new perspective to
| add to a topic. It's not uncommon to find huge conceptual errors
| (or at least gaps) in the content there.
|
| I personally encourage everyone at every level to write about
| what they can, but the issue is that TDS has manipulated this
| population of eager data scientists in order to dominate search
| results on nearly every single DS-related topic they can cover,
| which has made searching for anything tedious.
|
| Compare this post to the fantastic work of Jay Alammar [1]. Jay's
| post is truly excellent, covering a lot of interesting details
| about word2vec and providing excellent visuals as well.
|
| I'm assuming TDS will fold as soon as DS stops being a "hot"
| topic (which I think will happen in the relatively near future),
| and I will personally be glad to see the web rid of their
| low-signal blog spam.
|
| 0. https://towardsdatascience.com/word2vec-made-easy-139a31a4b8...
| 1. https://jalammar.github.io/illustrated-word2vec/
| heyhihello wrote:
| Agree on poor TDS quality.
|
| You should check out Amazon MLU's interactives - they're like
| mini NYT articles on different algorithms:
|
| https://mlu-explain.github.io/
| bllguo wrote:
| As another recommendation, Distill is higher level, and has
| less topic coverage, but their article quality is fantastic:
|
| https://distill.pub/
| minimaxir wrote:
| TDS is a banned domain on HN:
| https://news.ycombinator.com/from?site=towardsdatascience.co...
|
| It's unusual that this article got vouched.
| ColinWright wrote:
| I thought this particular article gave a balanced, high-level
| overview, along with enough detail and references to provide
| a good starting point.
|
| Yes, perhaps it's a bit light-weight, but as an introduction
| I thought it did a good job.
| minimaxir wrote:
| Unusual in the sense that getting vouched is an outlier, not
| necessarily an indication of the article's quality.
| kqr wrote:
| I have been tempted to try word2vec-like techniques on e-commerce
| shopping carts as a way to find a particular type of
| recommendation. I suspect the data will be too sparse, though.
|
| Has anyone applied similar techniques to non-text corpora?
| random314 wrote:
| I have applied it to e-commerce shopping carts and it works quite
| well :). The item IDs (the "words") viewed in sequence within a
| session can be thought of as a sentence.
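|
| A minimal sketch of that setup, assuming gensim's Word2Vec API
| (the session data and all parameters here are illustrative):
|
|   from gensim.models import Word2Vec
|
|   # each shopping session's sequence of item IDs is one "sentence"
|   sessions = [
|       ["sku_123", "sku_456", "sku_789"],
|       ["sku_456", "sku_789", "sku_111"],
|       ["sku_123", "sku_111"],
|   ]
|
|   model = Word2Vec(sessions, vector_size=64, window=5,
|                    min_count=1, sg=1, epochs=50)
|
|   # items that co-occur in sessions tend to get similar vectors
|   print(model.wv.most_similar("sku_456", topn=3))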
| rdedev wrote:
| I've been looking for a way to use transformer-based models on
| tabular data. The hope is that these models have a much better
| contextual understanding of words, so embeddings from them should
| be of better quality than plain word2vec ones.
| kevin948 wrote:
| Same here. Did you find any good resources? I've been leaning on
| autoencoders to get better encodings than word2vec and its ilk.
| VHRanger wrote:
| Network node embeddings are the best for tabular data. I maintain
| a library for this here, but there are plenty of good
| alternatives:
|
| https://github.com/VHRanger/nodevectors
| VHRanger wrote:
| For sparser data, you should just do normal network node
| embeddings.
|
| Look into node2vec libraries, for instance.
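|
| A rough sketch of the underlying idea (DeepWalk/node2vec-style:
| random walks over the graph fed to word2vec), using networkx and
| gensim. The graph and parameters are placeholders, and real
| node2vec additionally biases the walks with its p/q parameters:
|
|   import random
|   import networkx as nx
|   from gensim.models import Word2Vec
|
|   G = nx.karate_club_graph()  # stand-in for your item/user graph
|
|   def random_walk(G, start, length=10):
|       walk = [start]
|       for _ in range(length - 1):
|           nbrs = list(G.neighbors(walk[-1]))
|           if not nbrs:
|               break
|           walk.append(random.choice(nbrs))
|       return [str(n) for n in walk]
|
|   # many walks per node play the role of "sentences"
|   walks = [random_walk(G, n) for n in G.nodes() for _ in range(20)]
|   model = Word2Vec(walks, vector_size=32, window=5,
|                    min_count=1, sg=1)
|   print(model.wv.most_similar("0", topn=3))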
| OccamsRazr wrote:
| You may find this Airbnb paper relevant. They use skip-grams to
| generate feature vectors for their listings.
|
| https://www.kdd.org/kdd2018/accepted-papers/view/real-time-p...
| cgearhart wrote:
| It works...for some definition of "works". It's been applied to
| all kinds of problems--including graphs (Node2Vec) and many
| other cases where the input isn't "words"--to the point that
| I'd consider it a weak baseline for any embedding task. In my
| experience it is unreasonably effective for simple problems
| (make a binary classifier for tweets), but the effectiveness
| drops quickly as the problem gets more complicated.
|
| In your proposed use case I would bet that you will "see" the
| kind of similarity you're looking for based on vector
| similarity, but I also expect it to largely be an illusion due
| to confirmation bias. It will be much harder to make that
| similarity actionable to solve the actual business use case.
| (Like 30% of the time it'll work like magic; 60% of the time
| it'll be "meh"; 10% of the time it'll be hilariously wrong.)
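|
| The "weak baseline" version is usually just averaged word vectors
| plus a linear model. A toy sketch, assuming gensim's downloader
| and its pretrained glove-twitter-25 vectors (the example texts
| and labels are made up):
|
|   import numpy as np
|   import gensim.downloader as api
|   from sklearn.linear_model import LogisticRegression
|
|   wv = api.load("glove-twitter-25")  # small pretrained vectors
|
|   def doc_vector(text):
|       # average the vectors of the words we have embeddings for
|       toks = [t for t in text.lower().split() if t in wv]
|       return np.mean([wv[t] for t in toks], axis=0)
|
|   texts = ["love this phone", "the battery is great",
|            "screen broke in a day", "awful support"]
|   labels = [1, 1, 0, 0]
|   X = np.stack([doc_vector(t) for t in texts])
|   clf = LogisticRegression().fit(X, labels)
|   print(clf.predict([doc_vector("really love the battery")]))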
| laughy wrote:
| I have applied it to the names in a population database. It
| learned interesting, and expected, structure. Visualized with
| UMAP, it clustered by gender first, and then by something that
| could probably be described as the cultural origin of the name.
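|
| The visualization step is roughly just this, assuming umap-learn
| and matplotlib (synthetic vectors stand in for the actual name
| embeddings):
|
|   import numpy as np
|   import umap
|   import matplotlib.pyplot as plt
|
|   rng = np.random.default_rng(0)
|   # stand-in for an (n_names, dim) array of name embeddings
|   vectors = np.vstack([rng.normal(0, 1, (200, 64)),
|                        rng.normal(3, 1, (200, 64))])
|   labels = [0] * 200 + [1] * 200  # e.g. gender
|
|   reducer = umap.UMAP(n_neighbors=15, min_dist=0.1)
|   coords = reducer.fit_transform(vectors)
|   plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
|   plt.show()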
| samuel wrote:
| For me the key point to understanding what's going on (assuming I
| got it) is that the hidden layer "has" to produce similar
| representations for words that appear in the same contexts, so
| that the output layer can predict them.
|
| The intuition behind doc2vec is a bit harder to grasp. I
| understand the role of the "paragraph word": it provides context
| to the prediction, so for "the ball hit the ---" in a basketball
| text the classifier would predict "rim", and in a football one
| "goalpost" (simplifying). But I still don't get why similar texts
| get similar latent representations.
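|
| To make the setup concrete, a minimal gensim doc2vec sketch of
| what I mean (the toy texts and parameters are illustrative):
|
|   from gensim.models.doc2vec import Doc2Vec, TaggedDocument
|
|   texts = ["the ball hit the rim",
|            "the ball hit the goalpost",
|            "the shares fell three percent"]
|   docs = [TaggedDocument(t.split(), [str(i)])
|           for i, t in enumerate(texts)]
|
|   model = Doc2Vec(docs, vector_size=32, min_count=1, epochs=200)
|
|   # in practice the first two documents usually end up with
|   # closer vectors than either has to the third
|   print(model.dv.most_similar("0", topn=2))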
| mgaunard wrote:
| word2vec is 9 years old, hardly a "recent breakthrough".
| ColinWright wrote:
| I'm a little surprised that that's the most useful and
| constructive thing you can say about the article.
| visarga wrote:
| Since it's 2022, use Sentence Transformers to embed short
| phrases. They are leaps and bounds ahead of w2v. Or just use any
| model from Hugging Face. It's just 10 lines of code, and really
| easy to start with.
|
| https://sbert.net
|
| https://huggingface.co
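|
| Something like this, to give an idea (the model name is just one
| common example from sbert.net; any sentence-transformers model
| works similarly):
|
|   from sentence_transformers import SentenceTransformer, util
|
|   model = SentenceTransformer("all-MiniLM-L6-v2")
|   sentences = ["The ball hit the rim.",
|                "He missed the three-pointer.",
|                "The stock fell three percent."]
|   emb = model.encode(sentences, convert_to_tensor=True)
|
|   # cosine similarity is usually much higher for the related pair
|   print(util.cos_sim(emb[0], emb[1]))
|   print(util.cos_sim(emb[0], emb[2]))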
| minimaxir wrote:
| Although Hacker News can reach Stack Overflow levels of
| reductiveness with "just use X lol"...using Transformers for NLP
| is indeed the best all-around answer in terms of performance,
| speed, and ease of implementation.
|
| A weakness of TDS monopolizing data science SEO is that it makes
| it hard for better techniques to surface.
___________________________________________________________________
(page generated 2022-03-27 23:01 UTC)