[HN Gopher] How does cosine similarity work?
       ___________________________________________________________________
        
       How does cosine similarity work?
        
       Author : tomhazledine
       Score  : 60 points
       Date   : 2024-09-04 11:52 UTC (2 days ago)
        
 (HTM) web link (tomhazledine.com)
 (TXT) w3m dump (tomhazledine.com)
        
       | OutOfHere wrote:
       | For those who don't want to use a full-blown RAG database,
       | scipy.spatial.distance has a convenient cosine distance function.
        | And for those who don't even want to use SciPy, the formula is
        | in the linked post.
       | 
        | For anyone new to the topic, note that cosine distance moves in
        | the opposite direction to cosine similarity: the distance is 1
        | minus the similarity, so a smaller distance means more similar.
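        | 
        | For example, a minimal sketch with made-up vectors:
        | 
        |   import numpy as np
        |   from scipy.spatial.distance import cosine
        | 
        |   u = np.array([1.0, 2.0, 3.0])
        |   v = np.array([2.0, 4.0, 6.0])
        | 
        |   dist = cosine(u, v)  # cosine distance, ~0.0 here
        |   sim = 1.0 - dist     # cosine similarity, ~1.0 here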
        
         | ashvardanian wrote:
          | SciPy's distances module has its own problems. It's pretty
          | slow, it constantly overflows in mixed-precision scenarios, it
          | raises the wrong type of error when it overflows, and it uses
          | the general-purpose `math` package instead of `numpy` for
          | square roots. So use it with caution.
         | 
         | I've outlined some of the related issues here:
         | https://github.com/ashvardanian/SimSIMD#cosine-similarity-re...
        
           | OutOfHere wrote:
           | Noted, and thanks for your great work. My experience with it
            | is limited to working with LLM embeddings, whose values I
            | believe have been cleanly between 0 and 1. As such, I have
            | yet to encounter these issues.
           | 
           | Regarding the speed, yes, I wouldn't use it with big data. Up
           | to a few thousand items has been fine for me, or perhaps a
           | few hundred if pairwise.
        
       | dbfclark wrote:
       | A good way to understand why cosine similarity is so common in
       | NLP is to think in terms of a keyword search. A bag-of-words
       | vector represents a document as a sparse vector of its word
       | counts; counting the number of occurrences of some set of query
       | words is the dot product of the query vector with the document
       | vector; normalizing for length gives you cosine similarity. If
        | you have word embedding vectors instead of discrete words, the
        | same game applies, except that now the "count" of one word
        | against another is the similarity of their embeddings instead of
        | a 0/1. Finally, LLMs give sentence embeddings as weighted sums of
       | contextual word vectors, so it's all just fuzzy word counting
       | again.
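        | 
        | A tiny made-up illustration of that view in Python (bag-of-words
        | counts, then the length-normalized dot product):
        | 
        |   import numpy as np
        | 
        |   # vocabulary: ["cosine", "vector", "pizza"]
        |   query = np.array([1, 1, 0])  # asks about "cosine", "vector"
        |   doc_a = np.array([4, 2, 0])  # a math document
        |   doc_b = np.array([0, 0, 5])  # a pizza document
        | 
        |   def cos_sim(a, b):
        |       den = np.linalg.norm(a) * np.linalg.norm(b)
        |       return np.dot(a, b) / den
        | 
        |   print(cos_sim(query, doc_a))  # high: shared words
        |   print(cos_sim(query, doc_b))  # 0.0: no shared words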
        
       | heyitsguay wrote:
        | Sorta related -- whenever I'm doing something with embeddings, I
       | just normalize them to length one, at which point cosine
       | similarity becomes a simple dot product. Is there ever a reason
       | to not normalize embedding length? An application where that
       | length matters?
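        | 
        | For reference, the trick I mean, sketched with made-up vectors:
        | 
        |   import numpy as np
        | 
        |   a = np.random.rand(768)
        |   b = np.random.rand(768)
        | 
        |   a_unit = a / np.linalg.norm(a)
        |   b_unit = b / np.linalg.norm(b)
        | 
        |   # once normalized, the dot product is the cosine similarity
        |   den = np.linalg.norm(a) * np.linalg.norm(b)
        |   assert np.isclose(np.dot(a_unit, b_unit), np.dot(a, b) / den)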
        
         | psyklic wrote:
         | For the LLM itself, length matters. For example, the final
         | logits are computed as the un-normalized dot product, making
         | them a function of both direction and magnitude. This means
         | that if you embed then immediately un-embed (using the same
         | embeddings for both), a different token might be obtained. In
         | models such as GPT2, the embedding vector magnitude is loosely
         | correlated with token frequency.
        
         | ashvardanian wrote:
         | On the practical side, dot products are great, but break in
         | mixed precision and integer representations, where accurately
         | normalizing to unit length isn't feasible.
         | 
         | In other cases people prefer L2 distances for embeddings, where
         | the magnitude can have a serious impact on the distance between
         | a pair of points.
        
         | janalsncm wrote:
         | If you're feeling guilty about it you can usually store the un-
         | normalized lengths separately.
        
       | ashvardanian wrote:
       | Three separate passes over JavaScript arrays are quite costly,
       | especially for high-dimensional vectors. I'd recommend using
        | `TypedArray` with vanilla `for` loops. It will make things
        | faster, and it will allow using C extensions if you want to
        | benefit from modern hardware features while still implementing
        | the logic in JavaScript:
        | https://ashvardanian.com/posts/javascript-ai-vector-search/
        
       | acjohnson55 wrote:
       | I might have missed this, but I think the post might bury the
       | lede that in a high dimensional space, two randomly chosen
       | vectors are very unlikely to have high cosine similarity. Or
       | maybe another way to put it is that the expected value of the
       | cosine of two random vectors approaches zero as the
       | dimensionality increases.
       | 
       | Most similarity metrics will be very low if vectors don't even
       | point in the same direction, so cosine similarity is a cheap way
       | to filter out the vast majority of the data set.
       | 
       | It's been a while since I've studied this stuff, so I might be
       | off target.
        
         | OutOfHere wrote:
          | Even if two random vectors rarely have high cosine similarity
          | (and I have not run into this as a problem in 3000 dimensions),
          | the cosine similarity is still usable in relative terms, i.e.
          | relative to other items in the dataset. This keeps it useful.
        
         | DavidSJ wrote:
         | Nitpick: The expected value of the cosine is 0 even in low-
         | dimensional spaces. It's the expected _square_ of that (i.e.
         | the variance) which gets smaller with the dimension.
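          | 
          | A quick made-up simulation illustrates this: the mean stays
          | near 0 at every dimension, while the variance shrinks roughly
          | as 1/d.
          | 
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(0)
          | 
          |   def cos_stats(dim, n=10000):
          |       a = rng.normal(size=(n, dim))
          |       b = rng.normal(size=(n, dim))
          |       na = np.linalg.norm(a, axis=1)
          |       nb = np.linalg.norm(b, axis=1)
          |       cos = np.sum(a * b, axis=1) / (na * nb)
          |       return cos.mean(), cos.var()
          | 
          |   for d in (3, 30, 3000):
          |       m, v = cos_stats(d)
          |       print(d, round(m, 4), round(v, 5))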
        
       | naijaboiler wrote:
        | Imagine 2 points in 3-dimensional space, with a vector being the
        | line from the origin to each point. So you have 2 vectors going
        | from the origin to the 2 points.
        | 
        | If those points are really close together, then the angle between
        | the two vectors is very small. Loosely speaking, cosine is a way
        | to quantify how close two lines with a shared origin are. If both
        | lines are the same, the angle between them is 0, and the cosine
        | of 0 is 1. If two lines are 90 degrees apart, their cosine is 0.
        | If two lines are 180 degrees apart, their cosine is -1. So cosine
        | is a way to quantify the closeness of two lines which share the
        | same origin.
        | 
        | To go back to the 2 points in space that we started with, we can
        | measure how close those 2 points are by taking the cosine of the
        | angle between the lines going from the origin to the two points.
        | If they are close, the angle between them is small. If they are
        | the exact same point, the angle between the lines is 0. Each such
        | line is called a vector.
        | 
        | Cosine similarity measures how close two vectors are in Euclidean
        | space. That's why we end up using it a lot. It's not the only way
        | to measure closeness; there are many others.
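        | 
        | For concreteness, a small made-up 3D example in Python:
        | 
        |   import numpy as np
        | 
        |   p1 = np.array([1.0, 2.0, 2.0])  # two points in 3D space,
        |   p2 = np.array([2.0, 4.0, 4.1])  # seen as vectors from origin
        | 
        |   num = np.dot(p1, p2)
        |   den = np.linalg.norm(p1) * np.linalg.norm(p2)
        |   cos = num / den
        |   angle = np.degrees(np.arccos(cos))
        |   print(cos, angle)  # cos near 1, angle near 0 degrees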
        
         | stouset wrote:
         | Are all the points in question one unit distant from the
         | origin?
        
           | viciousvoxel wrote:
            | Yes, cosine similarity amounts to normalizing the points (by
            | the L2 norm) and then taking the dot product. In other words,
            | the normalized points lie on the unit (hyper)sphere.
        
       | derbOac wrote:
        | Maybe I missed this, but I was surprised they didn't mention the
        | connection to correlation. Cosine similarity can be thought of as
        | an uncentered correlation (Pearson correlation is the cosine
        | similarity of mean-centered vectors; quick check below), and some
        | bivariate distributions (normal, I think?) can be re-expressed in
        | terms of cosine similarity.
        | 
        | There are also some generalizations to higher-dimensional notions
        | of cosines that are kind of interesting.
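        | 
        | Quick check of the correlation connection with made-up data: the
        | Pearson correlation of two vectors equals the cosine similarity
        | of their mean-centered versions.
        | 
        |   import numpy as np
        | 
        |   x = np.array([1.0, 3.0, 2.0, 5.0])
        |   y = np.array([2.0, 4.0, 1.0, 6.0])
        | 
        |   xc = x - x.mean()
        |   yc = y - y.mean()
        |   den = np.linalg.norm(xc) * np.linalg.norm(yc)
        |   cos_centered = np.dot(xc, yc) / den
        | 
        |   assert np.isclose(cos_centered, np.corrcoef(x, y)[0, 1])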
        
       | lern_too_spel wrote:
       | This doesn't properly explain what it says it explains. To
        | explain it correctly, you have to explain why the dot product of
        | two vectors, computed as the sum of the products of their
        | coordinates in an orthonormal basis, is a scalar equal to the
        | product of the Euclidean magnitudes of the vectors and the cosine
        | of the angle between them. The Wikipedia article on dot product
       | explains this reasonably well, so just read that.
        
       | esafak wrote:
        | It is the same as the dot product in the case of spherical (unit-
        | norm) embeddings. Normalizing the embeddings makes it easier to
        | understand too.
        
       | cproctor wrote:
       | One thing I've wondered for a while: Is there a principled reason
       | (e.g. explainable in terms of embedding training) why a vector's
       | magnitude can be ignored within a pretrained embedding, such that
       | cosine similarity is a good measure of semantic distance? Or is
       | it just a computationally-inexpensive trick that works well in
       | practice?
       | 
       | For example, if I have a set of words and I want to consider
       | their relative location on an axis between two anchor words (e.g.
       | "good" and "evil"), it makes sense to me to project all the words
       | onto the vector from "good" to "evil." Would comparing each
       | word's "good" and "evil" cosine similarity be equivalent, or even
       | preferable? (I know there are questions about the
       | interpretability of this kind of geometry.)
        
         | marginalia_nu wrote:
         | Dunno if I have the full answer, but it seems in high
         | dimensional spaces, you can typically throw away a lot of
         | information and still preserve distance.
         | 
         | The J-L lemma is at least somewhat related, even though it
         | doesn't to my understanding quite describe the same
         | transformation.
         | 
         | https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...
         | 
         | see also https://en.m.wikipedia.org/wiki/Random_projection
        
         | nostrademons wrote:
         | So I first learned about cosine similarity in the context of
         | traditional information retrieval, and the simplified models
         | used in that field before the development of LLMs, TensorFlow,
         | and large-scale machine learning might prove instructive.
         | 
         | Imagine you have a simple bag-of-words model of a document,
         | where you just count the number of occurrences of each word in
         | the document. Numerically, this is represented as a vector
         | where each dimension is one _token_ (so, you might have one
         | number for the word  "number", another for "cosine", another
         | for "the", and so on), and the magnitude of that component is
         | the count of the number of times it occurs. Intuitively, cosine
         | similarity is a measure of _how frequently the same word
         | appears in both documents_. Words that appear in both documents
         | get multiplied together, but words that are only in one get
         | multiplied by zero and drop out of the cosine sum. So because
         | "cosine", "number", and "vector" appear frequently in my post,
         | it will appear similar to other documents about math. Because
         | "words" and "documents" appear frequently, it will appear
         | similar to other documents about metalanguage or information
         | retrieval.
         | 
         | And intuitively, the reason the magnitude doesn't matter is
         | that _those counts will be much higher in longer documents, but
          | the length of the document doesn't say much about what the
          | document is about_. The reason you take the cosine (which has
          | the product of the two vectors' magnitudes in its denominator)
          | is that it acts as a form of length normalization, so that you
          | can get sensible results without biasing toward shorter or
          | longer documents.
         | 
         | Most machine-learned embeddings are similar. The components of
         | the vector are _features_ that your ML model has determined are
          | important. If the product of the same dimension of two items
          | is large, it indicates that they are similar in that dimension.
          | If it's zero, it indicates that that feature is not
          | particularly representative of the item. Embeddings are often
          | normalized, and for normalized vectors the fact that magnitude
          | drops out doesn't really matter. But it doesn't hurt either:
          | each magnitude is one, so their product is also 1, and the
          | cosine is just the sum of the pair-wise products of the
          | components.
        
           | d110af5ccf wrote:
           | > the reason the magnitude doesn't matter is that those
           | counts will be much higher in longer documents ...
           | 
            | To make my intuition a bit more explicit: the vector is
            | encoding a ratio, isn't it? You want to treat 3:2, 6:4, 12:8,
            | ... as equivalent in this case; normalization does exactly
            | that.
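            | 
            | E.g., a quick made-up check:
            | 
            |   import numpy as np
            | 
            |   def cos_sim(a, b):
            |       a = np.asarray(a, float)
            |       b = np.asarray(b, float)
            |       den = np.linalg.norm(a) * np.linalg.norm(b)
            |       return np.dot(a, b) / den
            | 
            |   # same 3:2 ratio at different "lengths"
            |   print(cos_sim([3, 2], [6, 4]))   # 1.0
            |   print(cos_sim([3, 2], [12, 8]))  # 1.0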
        
         | magicalhippo wrote:
         | I've been wondering the same.
         | 
         | When I dabbled with latent semantic indexing[1], using cosine
         | similarity made sense as the dimensions of the input vectors
         | were words, for example a 1 if a word was present or 0 if not.
         | So one would expect vectors that point in a similar direction
         | to be related.
         | 
          | I haven't studied LLM embedding layers in depth, so yeah, I've
          | been wondering about using certain norms[2] instead to
          | determine if two embeddings are similar. Does it depend on the
          | embedding layer, for example?
         | 
         | Should be noted it's been many years since I learned linear
         | algebra, so getting somewhat rusty.
         | 
         | [1]: https://en.wikipedia.org/wiki/Latent_semantic_analysis
         | 
         | [2]: https://en.m.wikipedia.org/wiki/Norm_(mathematics)
        
         | Scene_Cast2 wrote:
         | Some embedding models are explicitly trained on cosine
         | similarity. Otherwise, if you have a 512D vector, discarding
         | magnitude is like discarding just a single dimension (i.e. you
         | get 511 independent dimensions).
        
           | extasia wrote:
           | This is not quite right; you are actually losing information
           | about each of the dimensions and your mental model of
           | reducing the dimensionality by one is misleading.
           | 
            | Consider [1,0] and [x,x]. Normalised, we get [1,0] and
            | [sqrt(.5),sqrt(.5)] -- clearly something has changed, because
            | the first vector is now larger in dimension zero than the
            | second even though x was an arbitrary value that could have
            | been larger than 1. As such we have lost information about
            | x's magnitude, which we cannot recover from just the
            | normalized vector.
        
       | dheera wrote:
       | It's just a normalized dot product. People use "cosine
       | similarity" to sound knowledgeable
        
       | seydor wrote:
        | Dot product is easiest to understand as the projection of one
        | vector onto the other. The rest of it is self-explanatory.
        
       | janalsncm wrote:
       | Cosine similarity for unit vectors looks like the angle between
       | hands on a high dimensional clock. 12 and 6 have -1 cosine sim. 5
       | and 6 are pretty close.
       | 
        | Cosine similarity works if the model has been deliberately
        | trained with cosine similarity as the distance metric. If it was
        | trained with Euclidean distance, the results aren't reliable.
       | 
       | Example: (0,1) and (0,2) have a cosine similarity of 1 but
       | nonzero Euclidean distance.
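        | 
        | A quick check of that example:
        | 
        |   import numpy as np
        | 
        |   a = np.array([0.0, 1.0])
        |   b = np.array([0.0, 2.0])
        | 
        |   den = np.linalg.norm(a) * np.linalg.norm(b)
        |   print(np.dot(a, b) / den)     # 1.0, same direction
        |   print(np.linalg.norm(a - b))  # 1.0, different length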
        
       ___________________________________________________________________
       (page generated 2024-09-06 23:00 UTC)