[HN Gopher] A visual exploration of vector embeddings
___________________________________________________________________
A visual exploration of vector embeddings
Author : pamelafox
Score : 93 points
Date   : 2025-05-28 20:21 UTC (1 day ago)
(HTM) web link (blog.pamelafox.org)
(TXT) w3m dump (blog.pamelafox.org)
| isjustintime wrote:
| I love the visual approaches used to explain these concepts.
| Words and math hurt my brain, but when accompanied by charts and
| diagrams, my brain hurts much less.
| cratermoon wrote:
| If you like this you'll love Grant Sanderson's series on linear
| algebra and LLMs. https://www.youtube.com/watch?v=wjZofJX0v4M
| minimaxir wrote:
| Since this was oriented toward a Python audience, it may have
| also been useful to demonstrate on the poster _how_ you can
| create the embeddings in Python (e.g. using requests or the
| OpenAI client and hitting OpenAI's embeddings API) and
| calculate the similarities (e.g. using numpy), since most won't
| read the linked notebooks. Mostly as a good excuse to show off
| Python's rare @ operator for dot products in numpy.
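| Something like this, as an untested sketch (assuming the
| official openai and numpy packages, an API key in the
| environment, and an example model name):
|
|     import numpy as np
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|     resp = client.embeddings.create(
|         model="text-embedding-3-small",
|         input=["god", "dog"],
|     )
|     a, b = (np.array(d.embedding) for d in resp.data)
|
|     # @ is matrix multiplication; on 1-D arrays it's a dot product
|     similarity = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
|     print(similarity)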
|
| As a tangent, what root data source are you using to calculate
| the movie embeddings?
| pamelafox wrote:
| I thought I'd make this blog post language-agnostic, but
| agreed that a Python-specific version would be helpful.
|
| Here's where I calculate cosine without numpy:
| https://github.com/pamelafox/vector-embeddings-demos/blob/ma...
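| In simplified form it's along these lines (a sketch, not the
| notebook's exact code):
|
|     import math
|
|     def cosine_similarity(a: list[float], b: list[float]) -> float:
|         dot = sum(x * y for x, y in zip(a, b))
|         norm_a = math.sqrt(sum(x * x for x in a))
|         norm_b = math.sqrt(sum(x * x for x in b))
|         return dot / (norm_a * norm_b)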
|
| And in the distance notebook, I calculate with numpy:
| https://github.com/pamelafox/vector-embeddings-demos/blob/ma...
| I didn't use the @ operator! TIL.
|
| I forget where I originally got the Disney movie titles, but it
| is notably _just_ the titles. A better ranking would be based
| on a movie synopsis as well. Here's where I calculated their
| embeddings using OpenAI: https://github.com/pamelafox/vector-
| embeddings-demos/blob/ma...
|
| Maybe I can submit a poster to PyTorch that would include the
| Python code as well.
| godelski wrote:
| The more I've studied this stuff, the less useful I actually
| think the visualizations are. Pamela uses the classic approach
| and I'm not trying to call her wrong, but I think our
| intuitions really fail us outside 2D and 3D.
|
| Once you move up in dimensionality, things get really messy
| really fast. There's a contraction in variance and the meaning
| of distance becomes much more fuzzy. You can't differentiate
| your nearest neighbor from your furthest. Angles get much
| harder too: everything is orthogonal, in most directions too!
| I'm not all that surprised that "god" and "dog" come out
| similar. I EXPECT them to be. After all, they are the reverse
| of one another. The question rather is about
| "similar in which direction?"
|
| There's no reason to believe you've measured along a direction
| that is humanly meaningful. It doesn't have to be semantics.
| It doesn't have to be permutations either. It's just like
| rotating your xy axes: you end up traveling along both
| directions at once.
|
| So these things can really trick us. At best, be very careful
| not to become overly reliant upon them.
| pamelafox wrote:
| I actually explicitly left the PCA graphs out of the blog post
| version, as I think they lose so much information as to be
| deceiving. That's what I told folks in person at the poster
| session as well.
|
| I think the other graphs I included aren't deceiving; they're
| just not quite as fun as an attempt to visualize the
| similarity space.
| godelski wrote:
| Yeah PCA gets tough. It isn't great for non-linear
| relationships and I mean that's the whole reason we use
| activation functions haha. And don't get me started on how
| people refer to t-SNE as dimensionality reduction instead of
| visualization...
|
| I don't think the other graphs are necessarily deceiving, but
| they don't capture as much information as we often imply, and
| that ends up leading people to make wrong assumptions about
| what is happening in the data.
|
| Embeddings and immersions get really fucking weird at high
| dimensions. I mean it gets weird at like 4D and insane by
| 10D. The spaces we're talking about are incomprehensible.
| Every piece of geometric intuition you have should be thrown
| out the window. It won't help you; it harms you. If you start
| digging into high-dimensional statistics and metric theory
| you'll quickly see what I'm talking about. Like the craziness
| of Lp distances and the contraction of variance. You have to
| really dig into why we prefer L1 over L2 and why even
| fractional values of p are of interest. We run into all kinds
| of problems with i.i.d. assumptions and all that. It is wild
| how many assumptions are being made that we generally don't
| even think about. They seem obvious and natural to use, but
| they don't work very well when D>3. I do think the
| visualizations become useful once you get used to all this,
| but more in the sense that you interpret them with far less
| generalization in meaning.
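| Here's a toy numpy illustration of the Lp contrast problem
| (uniform random data, purely illustrative):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     for d in (2, 20, 200):
|         x = rng.random((1000, d))
|         q = rng.random(d)
|         for p in (0.5, 1, 2):
|             dists = np.sum(np.abs(x - q) ** p, axis=1) ** (1 / p)
|             # relative contrast collapses as d grows, faster for larger p
|             contrast = (dists.max() - dists.min()) / dists.min()
|             print(d, p, contrast)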
|
| I'm not trying to dunk on your post. I think it is fine. But
| I think our ML community needs to be having more
| conversations about these limits. We're really running into
| issues with them.
| been-jammin wrote:
| I think visually, so this is very helpful, thanks. I also
| agree that once you get into higher dimensionality it becomes
| difficult to represent things visually. Nevertheless it's
| helpful for an 'old' (50) computer scientist wrapping my head
| around AI concepts so I can keep up with my team.
| minimaxir wrote:
| > I EXPECT them to be. After all, they are the reverse of one
| another.
|
| That isn't how tokenized inputs work. It's partially the same
| reason why "how many r's are in strawberry" is a hard problem
| for LLMs.
|
| All these models are trained for semantic similarity by how
| they are actually _used_ in relation to other words, so a data
| point where that doesn't follow intuitively is indeed weird.
| godelski wrote:
| I'm not talking about Tokenization.
|
| It can get confusing because we usually roll tokenization and
| embedding up into a single process, but tokenization is the
| translation of our characters into numeric representations.
| There's self-discovery of what the atomic units should be
| (bounded by our vocabulary size).
|
| The process is, at a high level: string -> integer ->
| vec<float>. You are learning the string splits, integer IDs,
| and vector embeddings. You are literally building a
| dictionary. The BPE paper is a good place to start[0], but it
| is far from where we are now.
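| In toy form (made-up vocabulary and a random table, just to
| show the shape of the pipeline):
|
|     import numpy as np
|
|     vocab = {"god": 0, "dog": 1, "cat": 2}  # learned splits -> IDs
|     table = np.random.default_rng(0).standard_normal((len(vocab), 4))
|
|     token_id = vocab["dog"]   # string -> integer
|     vector = table[token_id]  # integer -> vec<float>
|     print(token_id, vector)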
|
| The embedding is this data in that latent representation
| space.
|
| > All these models are trained for semantic similarity
|
| Citation needed...
|
| There's no real good measure of semantic similarity, so it
| would be really naive to assume that this must be happening.
| There is a natural pressure for this to occur because words
| are generated in a biased way, but that's different from
| saying they're trained to be semantically similar. There's
| even a bit of discussion about this in the Word2Vec paper[1],
| but you should also follow some of the citations to dig
| deeper.
|
| You need to think VERY carefully about the vector basis[2].
| You can very easily create an infinite number of bases that
| are isomorphic to the standard cartesian coordinates. We
| usually use [[1,0],[0,1]], but there's no reason you can't
| use some rotation like [[1/sqrt(2),
| -1/sqrt(2)],[1/sqrt(2),1/sqrt(2)]]. Our (x,y) space is
| isomorphic to our new (u,v) space, but traveling along our u
| basis vector is not equivalent to traveling along the x basis
| vector (\hat{i}) or even the y one (\hat{j}). You are
| traveling along them both equally! u is still orthogonal to v
| and x is still orthogonal to y, but it is a rotation. We can
| also do something more complex like using polar coordinates.
| All this stuff is equivalent! They all provide linearly
| independent unit vectors that span our space.
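| In numpy terms, the rotation point looks like this (a toy 2D
| sketch):
|
|     import numpy as np
|
|     theta = np.pi / 4
|     # columns are the rotated basis vectors u and v
|     R = np.array([[np.cos(theta), -np.sin(theta)],
|                   [np.sin(theta),  np.cos(theta)]])
|
|     x = np.array([1.0, 0.0])  # coordinates in the standard basis
|     x_uv = R.T @ x            # the same vector in the (u,v) basis
|
|     # different coordinates, same geometry: the length is unchanged
|     print(x_uv, np.linalg.norm(x), np.linalg.norm(x_uv))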
|
| The point is, the semantics is a happy outcome, not a
| guaranteed or even specifically trained-for outcome. We
| should expect it to happen frequently because of how our
| languages evolved, but the "god"/"dog" example perfectly
| illustrates how this is naive.
|
| You *CANNOT* train for semantic similarity until you *DEFINE*
| semantic similarity. That definition needs to be a strong,
| rigorous, mathematical one, not an ad-hoc Justice Potter
| "know it when I see it" kind of policy. The way words are
| used in relation to other words is definitely not well
| aligned to semantics. I can talk about cats and dogs or cats
| and potatoes all day long. The real similarity we'll come up
| with there is "nouns", and that's not much in the way of
| semantics. Even the examples I gave aren't strictly nouns.
| Shit gets real fucking messy real fast[3]. It's not just
| English; it happens in every language[4].
|
| We can get WAY more into this, but no, sorry, that's not how
| this works.
|
| [0] https://arxiv.org/abs/1508.07909
|
| [1] https://arxiv.org/abs/1301.3781
|
| [2] https://en.wikipedia.org/wiki/Basis_(linear_algebra)
|
| [3] I'll leave you with my favorite example of linguistic
| ambiguity:
|
|     Read rhymes with lead and lead rhymes with read, but
|     read doesn't rhyme with lead and lead doesn't rhyme
|     with read.
|
| [4] https://en.wikipedia.org/wiki/Lion-
| Eating_Poet_in_the_Stone_...
| PaulHoule wrote:
| The thing about high dimensional vector spaces is that when N
| is large they are strangely different from the N=2 and N=3
| cases we are familiar with. For instance, when N=3 you could
| imagine that a cube is not all that different from a sphere:
| just sand away the corners. If N=10,000 though, the unit
| "cube" has 2^10,000 corners, each at a distance of
| sqrt(10,000) = 100 from the origin, whereas the unit sphere
| never gets past 1. Hypercubes look something like this:
|
| https://www.amazon.com/Torre-Tagus-901918B-Spike-Sphere/dp/B...
|
| A consequence of that is that many visualizations give people
| the wrong idea so I wouldn't try too hard.
|
| Of everything in the article I like the histograms of
| similarity the best, but they are in the weeds a lot with
| things like "god" ~ "dog". When I was building search engines
| I looked a lot at graphs that showed the similarity
| distribution of relevant vs irrelevant results.
| I'll argue bitterly about word embeddings being "very good"
| for anything. Actually, that similarity distribution looks
| pretty good, but my experience is that when you are looking
| at N words, word vectors look promising when N=5 but break
| down completely when N>50 or so. I've worked on teams that
| were considering both RNN and CNN models. My thinking was
| that if word embeddings had any knowledge in them that a deep
| model could benefit from, you could also train a classical ML
| model (say some kind of SVM) to classify words on some
| characteristic like "is a color" or "is a kind of person" or
| "can be used as a verb", but I could never get it to work.
|
| Now, I went looking and never found that anyone had published
| positive or negative results for such a classifier. My
| feeling was that it was a terrible tarpit: when N was tiny it
| would almost seem to work, but when N increased it would
| always fall apart. Between the bias that people don't publish
| negative results, and the fact that people who got negative
| results might blame themselves rather than word embeddings or
| the hype around word embeddings, they didn't get published.
|
| I do collect papers from arXiv where people do some boring
| text classification task, because I do boring text
| classification tasks, and I facepalm so often because people
| often try 15 or so algos, most of which never work well, and
| word embeddings are always in that category. If people tried
| some classical ML algos with bag-of-words and pooled
| ModernBERT, they'd sample a good segment of the efficient
| frontier. A BERT embedding doesn't just capture the word, it
| captures the meaning of the word in context, which is night
| and day different when it comes to relevance, because
| matching the synonyms of all the different word senses brings
| as many or more irrelevant matches as it does relevant ones.
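| (i.e., something like this as the bag-of-words baseline, with
| toy data; "pooled ModernBERT" would mean mean-pooling its
| token vectors as the features instead:)
|
|     from sklearn.feature_extraction.text import CountVectorizer
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.pipeline import make_pipeline
|
|     texts = ["great movie", "terrible movie",
|              "great book", "awful plot"]
|     labels = [1, 0, 1, 0]
|
|     clf = make_pipeline(CountVectorizer(), LogisticRegression())
|     clf.fit(texts, labels)
|     print(clf.predict(["great plot"]))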
| ithkuil wrote:
| Geometry in higher dimensions is not only hard to imagine, it's
| straight up weird.
|
| Take a cube in N dimensions and pack N-dimensional spheres
| inside that cube. Then fit a sphere inside the cube so that
| it touches but doesn't overlap with any of the other spheres.
|
| In 2D and 3D it's easy to visualize, and you can see that the
| sphere in the center is smaller than the other spheres and
| _of course_ it's smaller than the cube itself; after all,
| it's surrounded by the other spheres that are by construction
| inside the cube.
|
| From 10 dimensions up, though, the inner hypersphere actually
| pokes outside the hypercube, despite being surrounded by
| hyperspheres that are contained inside the hypercube!
|
| The math behind it is straightforward, but the implication is
| as counterintuitive as it gets.
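| (The arithmetic, for the usual construction of a cube of side
| 4 with 2^N unit spheres centered at (+-1, ..., +-1):)
|
|     import math
|
|     for n in (2, 3, 9, 10, 100):
|         inner_radius = math.sqrt(n) - 1  # touches all corner spheres
|         # the cube's faces sit at distance 2 from the origin
|         print(n, inner_radius, inner_radius > 2)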
| godelski wrote:
| Or how the volume of an n-ball goes to 0[0,1]
|
| Or how gaussian balls are like soap bubbles[2]
|
| The latter of which is highly relevant to vector embeddings.
| If you aren't dealing with a uniform distribution, the
| density of your mass isn't uniform, MEANING that if you
| linearly interpolate between two points in the space you are
| likely to get things that are not representative of your
| distribution. It happens because it is easy to confuse a
| straight line with a geodesic[3]. It's like trying to draw a
| straight line between Los Angeles and Paris: you're going to
| be going through the dirt most of the time, and it looks
| nothing like cities or even habitable land.
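| The soap bubble shows up in a couple of lines of numpy (toy
| Gaussian samples standing in for embeddings):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d = 512
|     a, b = rng.standard_normal(d), rng.standard_normal(d)
|     mid = (a + b) / 2
|
|     # a and b sit near the shell at radius sqrt(512) ~ 22.6;
|     # their linear midpoint lands near sqrt(512/2) ~ 16, off the shell
|     print(np.linalg.norm(a), np.linalg.norm(b), np.linalg.norm(mid))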
|
| I think the basic math is straightforward, but there's a lot
| of depth that is straight up ignored in most of our
| discussions about this stuff. There's a lot of deep math
| here, and we really need to talk about the algebraic
| structures and topologies, and get deep into metric theory
| and set theory, to push forward in answering these questions.
| I think this belief that "the math is easy" is holding us
| back. I like to say "you don't need to know math to train
| good models, but you do need math to know why your models are
| wrong." (Obvious reference to "all models are wrong, but some
| are useful.") Especially in CS we have this tendency to
| oversimplify things, and it really is just arrogance that
| doesn't help us.
|
| [0] https://davidegerosa.com/nsphere/
|
| [1] https://en.wikipedia.org/wiki/Volume_of_an_n-ball
|
| [2] https://www.inference.vc/high-dimensional-gaussian-
| distribut...
|
| [3] https://en.wikipedia.org/wiki/Geodesic
| terranmott wrote:
| Agreed, and you might like 4D Toys! It's one of my favorite
| things for building a little intuition in different
| dimensions.
|
| I like how it shows shadows and cross-sections from 2D->3D,
| and then from 3D->4D. It really captures the uncanny
| playfulness of it all.
|
| https://4dtoys.com/
| godelski wrote:
| I remember when that dropped! It nicely complements Flatland,
| going in the other direction. Highly recommend, though it is
| easier to miss the limitations of our intuitions here than it
| is when looking at Flatland. But still, highly recommend. I
| think it helps highlight how non-intuitive things are. I'd
| suggest people try to predict the movements of things before
| the movement happens. It helps to see how wrong you are,
| because we have a tendency to post hoc justify why we
| actually knew something was going to happen lol
| antirez wrote:
| Here I tried to use 2D visualization, and it may be more
| immediate:
|
| https://antirez.com/news/150
| janalsncm wrote:
| Where can I find the original dataset for doing stylometry?
| pamelafox wrote:
| Thanks for sharing that too!
|
| I do have a notebook that does a PCA reduction to plot a
| similarity space: https://github.com/pamelafox/vector-
| embeddings-demos/blob/ma...
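| Schematically, that reduction is just this (toy stand-in
| data, not the notebook's actual code):
|
|     import numpy as np
|     import matplotlib.pyplot as plt
|     from sklearn.decomposition import PCA
|
|     labels = ["god", "dog", "cat", "potato"]
|     embeddings = np.random.default_rng(0).standard_normal((4, 256))
|
|     coords = PCA(n_components=2).fit_transform(embeddings)
|     plt.scatter(coords[:, 0], coords[:, 1])
|     for (x, y), label in zip(coords, labels):
|         plt.annotate(label, (x, y))
|     plt.show()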
|
| But as I noted in another comment, I think it loses so much
| information as to be deceiving.
|
| I also find this 3d visualization to be fun:
| https://projector.tensorflow.org/
|
| But once again, huge loss in information.
|
| I personally learn more by actually seeing the relative
| similarity ranking and scores within a dataset, versus trying
| to visualize all of the nodes on the same graph with a massive
| dimension simplification.
|
| That 3d visualization is what originally intrigued me though,
| to see how else I could visualize. :)
| pamelafox wrote:
| I forgot that I also put together this little website, if you
| want to compare vectors for word2vec versus text-embedding-
| ada-002: https://pamelafox.github.io/vectors-comparison/
|
| (I never added text-embedding-3 to it)
| podgietaru wrote:
| I wrote the traditional blog post about this, and used it to
| create an RSS aggregator website using AWS Bedrock.
|
| https://aws.amazon.com/blogs/machine-learning/use-language-e...
|
| The website is unfortunately down now, because I no longer
| work at Amazon, but the code is still readily available if
| you want to run it yourself.
|
| https://github.com/aws-samples/rss-aggregator-using-cohere-e...
___________________________________________________________________
(page generated 2025-05-29 23:00 UTC)