[HN Gopher] A visual exploration of vector embeddings
___________________________________________________________________
A visual exploration of vector embeddings
Author : pamelafox
Score : 93 points
Date   : 2025-05-28 20:21 UTC (1 day ago)
(HTM) web link (blog.pamelafox.org)
(TXT) w3m dump (blog.pamelafox.org)
| isjustintime wrote:
| I love the visual approaches used to explain these concepts.
| Words and math hurt my brain, but when accompanied by charts and
| diagrams, my brain hurts much less.
| cratermoon wrote:
| If you like this you'll love Grant Sanderson's series on linear
| algebra and LLMs. https://www.youtube.com/watch?v=wjZofJX0v4M
| minimaxir wrote:
| Since this was oriented toward a Python audience, it may have
| also been useful to demonstrate on the poster _how_ you can
| create the embeddings in Python (e.g. using requests or the
| OpenAI client and hitting OpenAI's embeddings API) and
| calculate the similarities (e.g. using numpy), since most won't
| read the linked notebooks. Mostly as a good excuse to show off
| Python's rare @ operator for dot products in numpy.
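| Something like this, as an untested sketch (assuming the
| official openai and numpy packages, an API key in the
| environment, and an example model name):
|
|     import numpy as np
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|     resp = client.embeddings.create(
|         model="text-embedding-3-small",
|         input=["god", "dog"],
|     )
|     a, b = (np.array(d.embedding) for d in resp.data)
|
|     # @ is matrix multiplication; on 1-D arrays it's a dot product
|     similarity = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
|     print(similarity)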
|
| As a tangent, what root data source are you using to calculate
| the movie embeddings?
| pamelafox wrote:
| I thought I'd make this blog post language-agnostic, but
| agreed that a Python-specific version would be helpful.
|
| Here's where I calculate cosine without numpy:
| https://github.com/pamelafox/vector-embeddings-demos/blob/ma...
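| In simplified form it's along these lines (a sketch, not the
| notebook's exact code):
|
|     import math
|
|     def cosine_similarity(a: list[float], b: list[float]) -> float:
|         dot = sum(x * y for x, y in zip(a, b))
|         norm_a = math.sqrt(sum(x * x for x in a))
|         norm_b = math.sqrt(sum(x * x for x in b))
|         return dot / (norm_a * norm_b)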
|
| And in the distance notebook, I calculate with numpy:
| https://github.com/pamelafox/vector-embeddings-demos/blob/ma...
| I didn't use the @ operator! TIL.
|
| I forget where I originally got the Disney movie titles, but it
| is notably _just_ the titles. A better ranking would be based
| on a movie synopsis as well. Here's where I calculated their
| embeddings using OpenAI: https://github.com/pamelafox/vector-
| embeddings-demos/blob/ma...
|
| Maybe I can submit a poster to PyTorch that would include the
| Python code as well.
| godelski wrote:
| The more I've studied this stuff, the less useful I actually
| think the visualizations are. Pamela uses the classic approach
| and I'm not trying to call her wrong, but I think our
| intuitions really fail us outside 2D and 3D.
|
| Once you move up in dimensionality, things get really messy
| really fast. There's a contraction in variance and the meaning
| of distance becomes much more fuzzy. You can't differentiate
| your nearest neighbor from your furthest. Angles get much
| harder too: everything is orthogonal, in most directions too!
| I'm not all that surprised that "god" and "dog" come out
| similar. I EXPECT them to be. After all, they are the reverse
| of one another. The question rather is about
| "similar in which direction?"
|
| There's no reason to believe you've measured along a direction
| that is humanly meaningful. It doesn't have to be semantics.
| It doesn't have to be permutations either. It's just like
| rotating your xy axes: you end up traveling along both
| directions at once.
|
| So these things can really trick us. At best, be very careful
| not to become overly reliant upon them.
| pamelafox wrote:
| I actually explicitly left the PCA graphs out of the blog post
| version, as I think they lose so much information as to be
| deceiving. That's what I told folks in person at the poster
| session as well.
|
| I think the other graphs I included aren't deceiving; they're
| just not quite as fun as an attempt to visualize the
| similarity space.
| godelski wrote:
| Yeah PCA gets tough. It isn't great for non-linear
| relationships and I mean that's the whole reason we use
| activation functions haha. And don't get me started on how
| people refer to t-SNE as dimensionality reduction instead of
| visualization...
|
| I don't think the other graphs are necessarily deceiving, but
| they don't capture as much information as we often imply, and
| that ends up leading people to make wrong assumptions about
| what is happening in the data.
|
| Embeddings and immersions get really fucking weird at high
| dimensions. I mean it gets weird at like 4D and insane by
| 10D. The spaces we're talking about are incomprehensible.
| Every piece of geometric intuition you have should be thrown
| out the window. It won't help you; it harms you. If you start
| digging into high-dimensional statistics and metric theory
| you'll quickly see what I'm talking about. Like the craziness
| of Lp distances and the contraction of variance. You have to
| really dig into why we prefer L1 over L2 and why even
| fractional values of p are of interest. We run into all kinds
| of problems with i.i.d. assumptions and all that. It is wild
| how many assumptions are being made that we generally don't
| even think about. They seem obvious and natural to use, but
| they don't work very well when D>3. I do think the
| visualizations become useful once you get used to all this,
| but more in the sense that you interpret them with far less
| generalization in meaning.
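| Here's a toy numpy illustration of the Lp contrast problem
| (uniform random data, purely illustrative):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     for d in (2, 20, 200):
|         x = rng.random((1000, d))
|         q = rng.random(d)
|         for p in (0.5, 1, 2):
|             dists = np.sum(np.abs(x - q) ** p, axis=1) ** (1 / p)
|             # relative contrast collapses as d grows, faster for larger p
|             contrast = (dists.max() - dists.min()) / dists.min()
|             print(d, p, contrast)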
|
| I'm not trying to dunk on your post. I think it is fine. But
| I think our ML community needs to be having more
| conversations about these limits. We're really running into
| issues with them.
| been-jammin wrote:
| I think visually, so this is very helpful, thanks. I also
| agree that once you get into higher dimensionality it becomes
| difficult to represent things visually. Nevertheless it's
| helpful for an 'old' (50) computer scientist wrapping my head
| around AI concepts so I can keep up with my team.
| minimaxir wrote:
| > I EXPECT them to be. After all, they are the reverse of one
| another.
|
| That isn't how tokenized inputs work. It's partially the same
| reason why "how many r's are in strawberry" is a hard problem
| for LLMs.
|
| All these models are trained for semantic similarity by how
| they are actually _used_ in relation to other words, so a data
| point where that doesn't follow intuitively is indeed weird.
| godelski wrote:
| I'm not talking about Tokenization.
|
| It can get confusing because we usually roll tokenization and
| embedding up into a single process, but tokenization is the
| translation of our characters into numeric representations.
| There's self-discovery of what the atomic units should be
| (bounded by our vocabulary size).
|
| The process is, at a high level: string -> integer ->
| vec<float>. You are learning the string splits, integer IDs,
| and vector embeddings. You are literally building a
| dictionary. The BPE paper is a good place to start[0], but it
| is far from where we are now.
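| In toy form (made-up vocabulary and a random table, just to
| show the shape of the pipeline):
|
|     import numpy as np
|
|     vocab = {"god": 0, "dog": 1, "cat": 2}  # learned splits -> IDs
|     table = np.random.default_rng(0).standard_normal((len(vocab), 4))
|
|     token_id = vocab["dog"]   # string -> integer
|     vector = table[token_id]  # integer -> vec<float>
|     print(token_id, vector)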
|
| The embedding is this data in that latent representation
| space.
|
| > All these models are trained for semantic similarity
|
| Citation needed...
|
| There's no real good measure of semantic similarity, so it
| would be really naive to assume that this must be happening.
| There is a natural pressure for this to occur because words
| are generated in a biased way, but that's different from
| saying they're trained to be semantically similar. There's
| even a bit of discussion about this in the Word2Vec paper[1],
| but you should also follow some of the citations to dig
| deeper.
|
| You need to think VERY carefully about the vector basis[2].
| You can very easily create an infinite number of bases that
| are isomorphic to the standard cartesian coordinates. We
| usually use [[1,0],[0,1]], but there's no reason you can't
| use some rotation like [[1/sqrt(2),
| -1/sqrt(2)],[1/sqrt(2),1/sqrt(2)]]. Our (x,y) space is
| isomorphic to our new (u,v) space, but traveling along our u
| basis vector is not equivalent to traveling along the x basis
| vector (\hat{i}) or even the y one (\hat{j}). You are
| traveling along them both equally! u is still orthogonal to v
| and x is still orthogonal to y, but it is a rotation. We can
| also do something more complex like using polar coordinates.
| All this stuff is equivalent! They all provide linearly
| independent unit vectors that span our space.
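| In numpy terms, the rotation point looks like this (a toy 2D
| sketch):
|
|     import numpy as np
|
|     theta = np.pi / 4
|     # columns are the rotated basis vectors u and v
|     R = np.array([[np.cos(theta), -np.sin(theta)],
|                   [np.sin(theta),  np.cos(theta)]])
|
|     x = np.array([1.0, 0.0])  # coordinates in the standard basis
|     x_uv = R.T @ x            # the same vector in the (u,v) basis
|
|     # different coordinates, same geometry: the length is unchanged
|     print(x_uv, np.linalg.norm(x), np.linalg.norm(x_uv))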
|
| The point is, the semantics is a happy outcome, not a
| guaranteed or even specifically trained-for outcome. We
| should expect it to happen frequently because of how our
| languages evolved, but the "god"/"dog" example perfectly
| illustrates how this is naive.
|
| You *CANNOT* train for semantic similarity until you *DEFINE*
| semantic similarity. That definition needs to be a strong,
| rigorous, mathematical one, not an ad-hoc Justice Potter
| "know it when I see it" kind of policy. The way words are
| used in relation to other words is definitely not well
| aligned to semantics. I can talk about cats and dogs or cats
| and potatoes all day long. The real similarity we'll come up
| with there is "nouns", and that's not much in the way of
| semantics. Even the examples I gave aren't strictly nouns.
| Shit gets real fucking messy real fast[3]. It's not just
| English; it happens in every language[4].
|
| We can get WAY more into this, but no, sorry, that's not how
| this works.
|
| [0] https://arxiv.org/abs/1508.07909
|
| [1] https://arxiv.org/abs/1301.3781
|
| [2] https://en.wikipedia.org/wiki/Basis_(linear_algebra)
|
| [3] I'll leave you with my favorite example of linguistic
| ambiguity:
|
|     Read rhymes with lead and lead rhymes with read, but
|     read doesn't rhyme with lead and lead doesn't rhyme
|     with read.
|
| [4] https://en.wikipedia.org/wiki/Lion-
| Eating_Poet_in_the_Stone_...
| PaulHoule wrote:
| The thing about high dimensional vector spaces is that when N
| is large they are strangely different from the N=2 and N=3
| cases we are familiar with. For instance, when N=3 you could
| imagine that a cube is not all that different from a sphere:
| just sand away the corners. If N=10,000 though, the unit
| "cube" has 2^10,000 corners, each at a distance of
| sqrt(10,000) = 100 from the origin, whereas the unit sphere
| never gets past 1. Hypercubes look something like this:
|
| https://www.amazon.com/Torre-Tagus-901918B-Spike-Sphere/dp/B...
|
| A consequence of that is that many visualizations give people
| the wrong idea so I wouldn't try too hard.
|
| Of everything in the article I like the histograms of
| similarity the best, but they are in the weeds a lot with
| things like "god" ~ "dog". When I was building search engines
| I looked a lot at graphs that showed the similarity
| distribution of relevant vs irrelevant results.
| I'll argue bitterly about word embeddings being "very good"
| for anything. Actually, that similarity distribution looks
| pretty good, but my experience is that when you are looking
| at N words, word vectors look promising when N=5 but break
| down completely when N>50 or so. I've worked on teams that
| were considering both RNN and CNN models. My thinking was
| that if word embeddings had any knowledge in them that a deep
| model could benefit from, you could also train a classical ML
| model (say some kind of SVM) to classify words on some
| characteristic like "is a color" or "is a kind of person" or
| "can be used as a verb", but I could never get it to work.
|
| Now, I went looking and never found that anyone had published
| positive or negative results for such a classifier. My
| feeling was that it was a terrible tarpit: when N was tiny it
| would almost seem to work, but when N increased it would
| always fall apart. Between the bias that people don't publish
| negative results, and the fact that people who got negative
| results might blame themselves rather than word embeddings or
| the hype around word embeddings, they didn't get published.
|
| I do collect papers from arXiv where people do some boring
| text classification task, because I do boring text
| classification tasks, and I facepalm so often because people
| often try 15 or so algos, most of which never work well, and
| word embeddings are always in that category. If people tried
| some classical ML algos with bag-of-words and pooled
| ModernBERT, they'd sample a good segment of the efficient
| frontier. A BERT embedding doesn't just capture the word, it
| captures the meaning of the word in context, which is night
| and day different when it comes to relevance, because
| matching the synonyms of all the different word senses brings
| as many or more irrelevant matches as it does relevant ones.
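| (i.e., something like this as the bag-of-words baseline, with
| toy data; "pooled ModernBERT" would mean mean-pooling its
| token vectors as the features instead:)
|
|     from sklearn.feature_extraction.text import CountVectorizer
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.pipeline import make_pipeline
|
|     texts = ["great movie", "terrible movie",
|              "great book", "awful plot"]
|     labels = [1, 0, 1, 0]
|
|     clf = make_pipeline(CountVectorizer(), LogisticRegression())
|     clf.fit(texts, labels)
|     print(clf.predict(["great plot"]))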
| ithkuil wrote:
| Geometry in higher dimensions is not only hard to imagine, it's
| straight up weird.
|
| Take a cube in N dimensions and pack N-dimensional spheres
| inside that cube. Then fit a sphere inside the cube so that
| it touches but doesn't overlap with any of the other spheres.
|
| In 2D and 3D it's easy to visualize, and you can see that the
| sphere in the center is smaller than the other spheres and
| _of course_ it's smaller than the cube itself; after all,
| it's surrounded by the other spheres that are by construction
| inside the cube.
|
| From 10 dimensions up, though, the inner hypersphere actually
| pokes outside the hypercube, despite being surrounded by
| hyperspheres that are contained inside the hypercube!
|
| The math behind it is straightforward, but the implication is
| as counterintuitive as it gets.
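| (The arithmetic, for the usual construction of a cube of side
| 4 with 2^N unit spheres centered at (+-1, ..., +-1):)
|
|     import math
|
|     for n in (2, 3, 9, 10, 100):
|         inner_radius = math.sqrt(n) - 1  # touches all corner spheres
|         # the cube's faces sit at distance 2 from the origin
|         print(n, inner_radius, inner_radius > 2)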
| godelski wrote:
| Or how the volume of an n-ball goes to 0[0,1]
|
| Or how gaussian balls are like soap bubbles[2]
|
| The latter of which is highly relevant to vector embeddings.
| If you aren't dealing with a uniform distribution, the
| density of your mass isn't uniform, MEANING that if you
| linearly interpolate between two points in the space you are
| likely to get things that are not representative of your
| distribution. It happens because it is easy to confuse a
| straight line with a geodesic[3]. It's like trying to draw a
| straight line between Los Angeles and Paris: you're going to
| be going through the dirt most of the time, and it looks
| nothing like cities or even habitable land.
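| The soap bubble shows up in a couple of lines of numpy (toy
| Gaussian samples standing in for embeddings):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d = 512
|     a, b = rng.standard_normal(d), rng.standard_normal(d)
|     mid = (a + b) / 2
|
|     # a and b sit near the shell at radius sqrt(512) ~ 22.6;
|     # their linear midpoint lands near sqrt(512/2) ~ 16, off the shell
|     print(np.linalg.norm(a), np.linalg.norm(b), np.linalg.norm(mid))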
|
| I think the basic math is straightforward, but there's a lot
| of depth that is straight up ignored in most of our
| discussions about this stuff. There's a lot of deep math
| here, and we really need to talk about the algebraic
| structures and topologies, and get deep into metric theory
| and set theory, to push forward in answering these questions.
| I think this belief that "the math is easy" is holding us
| back. I like to say "you don't need to know math to train
| good models, but you do need math to know why your models are
| wrong." (Obvious reference to "all models are wrong, but some
| are useful.") Especially in CS we have this tendency to
| oversimplify things, and it really is just arrogance that
| doesn't help us.
|
| [0] https://davidegerosa.com/nsphere/
|
| [1] https://en.wikipedia.org/wiki/Volume_of_an_n-ball
|
| [2] https://www.inference.vc/high-dimensional-gaussian-
| distribut...
|
| [3] https://en.wikipedia.org/wiki/Geodesic
| terranmott wrote:
| Agreed, and you might like 4D Toys! It's one of my favorite
| things for building a little intuition in different
| dimensions.
|
| I like how it shows shadows and cross-sections from 2D->3D,
| and then from 3D->4D. It really captures the uncanny
| playfulness of it all.
|
| https://4dtoys.com/
| godelski wrote:
| I remember when that dropped! It nicely complements Flatland,
| going in the other direction. Highly recommend, though it is
| easier to miss the limitations of our intuitions here than it
| is when looking at Flatland. But still, highly recommend. I
| think it helps highlight how non-intuitive things are. I'd
| suggest people try to predict the movements of things before
| the movement happens. It helps to see how wrong you are,
| because we have a tendency to post hoc justify why we
| actually knew something was going to happen lol
| antirez wrote:
| Here I tried to use 2D visualization, and it may be more
| immediate:
|
| https://antirez.com/news/150
| janalsncm wrote:
| Where can I find the original dataset for doing stylometry?
| pamelafox wrote:
| Thanks for sharing that too!
|
| I do have a notebook that does a PCA reduction to plot a
| similarity space: https://github.com/pamelafox/vector-
| embeddings-demos/blob/ma...
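| Schematically, that reduction is just this (toy stand-in
| data, not the notebook's actual code):
|
|     import numpy as np
|     import matplotlib.pyplot as plt
|     from sklearn.decomposition import PCA
|
|     labels = ["god", "dog", "cat", "potato"]
|     embeddings = np.random.default_rng(0).standard_normal((4, 256))
|
|     coords = PCA(n_components=2).fit_transform(embeddings)
|     plt.scatter(coords[:, 0], coords[:, 1])
|     for (x, y), label in zip(coords, labels):
|         plt.annotate(label, (x, y))
|     plt.show()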
|
| But as I noted in another comment, I think it loses so much
| information as to be deceiving.
|
| I also find this 3d visualization to be fun:
| https://projector.tensorflow.org/
|
| But once again, huge loss in information.
|
| I personally learn more by actually seeing the relative
| similarity ranking and scores within a dataset, versus trying
| to visualize all of the nodes on the same graph with a massive
| dimension simplification.
|
| That 3d visualization is what originally intrigued me though,
| to see how else I could visualize. :)
| pamelafox wrote:
| I forgot that I also put together this little website, if you
| want to compare vectors for word2vec versus text-embedding-
| ada-002: https://pamelafox.github.io/vectors-comparison/
|
| (I never added text-embedding-3 to it)
| podgietaru wrote:
| I wrote the traditional blog post about this, and used it to
| create an RSS aggregator website using AWS Bedrock.
|
| https://aws.amazon.com/blogs/machine-learning/use-language-e...
|
| The website is unfortunately down now, because I no longer
| work at Amazon, but the code is still readily available if
| you want to run it yourself.
|
| https://github.com/aws-samples/rss-aggregator-using-cohere-e...
___________________________________________________________________
(page generated 2025-05-29 23:00 UTC)