[HN Gopher] How does cosine similarity work?
___________________________________________________________________
How does cosine similarity work?
Author : tomhazledine
Score : 60 points
Date : 2024-09-04 11:52 UTC (2 days ago)
(HTM) web link (tomhazledine.com)
(TXT) w3m dump (tomhazledine.com)
| OutOfHere wrote:
| For those who don't want to use a full-blown RAG database,
| scipy.spatial.distance has a convenient cosine distance function.
| And for those who don't even want to use SciPy, the formula is
| in the linked post.
|
| For anyone new to the topic, note that the monotonic
| interpretation of cosine distance is opposite to that of cosine
| similarity.
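|
| For reference, a minimal sketch of both routes (NumPy and SciPy
| assumed); note that SciPy's cosine() returns the distance, i.e.
| 1 minus the similarity:
|
|     import numpy as np
|     from scipy.spatial.distance import cosine
|
|     a = np.array([1.0, 2.0, 3.0])
|     b = np.array([2.0, 4.0, 6.0])
|
|     # SciPy gives cosine *distance* = 1 - cosine similarity
|     dist = cosine(a, b)
|
|     # The formula from the post, written out by hand
|     sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
|
|     print(dist, sim)  # ~0.0 and ~1.0 for parallel vectors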
| ashvardanian wrote:
| SciPy's distance module has its own problems. It's pretty slow,
| and constantly overflows in mixed-precision scenarios. It also
| raises the wrong type of errors when it overflows, and uses the
| general-purpose `math` package instead of `numpy` for square
| roots. So use it with caution.
|
| I've outlined some of the related issues here:
| https://github.com/ashvardanian/SimSIMD#cosine-similarity-re...
| OutOfHere wrote:
| Noted, and thanks for your great work. My experience with it
| is limited to working with LLM embeddings, whose values I
| believe have been cleanly between 0 and 1. As such, I have yet
| to encounter these issues.
|
| Regarding the speed, yes, I wouldn't use it with big data. Up
| to a few thousand items has been fine for me, or perhaps a
| few hundred if pairwise.
| dbfclark wrote:
| A good way to understand why cosine similarity is so common in
| NLP is to think in terms of a keyword search. A bag-of-words
| vector represents a document as a sparse vector of its word
| counts; counting the number of occurrences of some set of query
| words is the dot product of the query vector with the document
| vector; normalizing for length gives you cosine similarity. If
| you have word embedding vectors instead of discrete words, you
| can think of the same game, just now the "count" of a word with
| another word is the similarity of the word embeddings instead of
| a 0/1. Finally, LLMs give sentence embeddings as weighted sums of
| contextual word vectors, so it's all just fuzzy word counting
| again.
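|
| A toy version of that view, with a made-up three-word vocabulary
| (NumPy assumed):
|
|     import numpy as np
|
|     # Bag-of-words counts over a made-up toy vocabulary:
|     # ["cosine", "vector", "pizza"]
|     query = np.array([1, 1, 0])   # query: "cosine vector"
|     doc_a = np.array([4, 3, 0])   # a math-heavy document
|     doc_b = np.array([0, 0, 5])   # a document about pizza
|
|     def cosine_sim(u, v):
|         return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
|
|     # Raw dot product = matching-word count;
|     # normalizing for length turns it into cosine similarity.
|     print(np.dot(query, doc_a), cosine_sim(query, doc_a))  # high
|     print(np.dot(query, doc_b), cosine_sim(query, doc_b))  # zero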
| heyitsguay wrote:
| Sorta related -- whenever I'm doing something with embeddings, I
| just normalize them to length one, at which point cosine
| similarity becomes a simple dot product. Is there ever a reason
| to not normalize embedding length? An application where that
| length matters?
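|
| (For anyone following along, a quick sketch of the equivalence
| being described, with random stand-in embeddings and NumPy
| assumed:)
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     a, b = rng.normal(size=384), rng.normal(size=384)
|
|     def cosine_sim(u, v):
|         return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
|
|     # Pre-normalize once; cosine similarity is then a plain dot product
|     a_hat = a / np.linalg.norm(a)
|     b_hat = b / np.linalg.norm(b)
|
|     assert np.isclose(cosine_sim(a, b), np.dot(a_hat, b_hat))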
| psyklic wrote:
| For the LLM itself, length matters. For example, the final
| logits are computed as the un-normalized dot product, making
| them a function of both direction and magnitude. This means
| that if you embed then immediately un-embed (using the same
| embeddings for both), a different token might be obtained. In
| models such as GPT2, the embedding vector magnitude is loosely
| correlated with token frequency.
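|
| A toy numerical illustration of that point (the 3-token, 2-d
| embedding table below is made up, not GPT2's actual weights):
| with tied input/output embeddings, the logits depend on magnitude
| as well as direction, so embedding a token and immediately
| un-embedding it need not return the same token.
|
|     import numpy as np
|
|     # Hypothetical tied embedding matrix: one row per token
|     E = np.array([[1.0, 0.0],    # token 0: short vector
|                   [3.0, 0.3],    # token 1: longer, similar direction
|                   [0.0, 1.0]])   # token 2
|
|     token = 0
|     e = E[token]                 # "embed"
|     logits = E @ e               # "un-embed" with the same matrix
|     print(np.argmax(logits))     # prints 1, not 0: magnitude wins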
| ashvardanian wrote:
| On the practical side, dot products are great, but break in
| mixed precision and integer representations, where accurately
| normalizing to unit length isn't feasible.
|
| In other cases people prefer L2 distances for embeddings, where
| the magnitude can have a serious impact on the distance between
| a pair of points.
| janalsncm wrote:
| If you're feeling guilty about it you can usually store the un-
| normalized lengths separately.
| ashvardanian wrote:
| Three separate passes over JavaScript arrays are quite costly,
| especially for high-dimensional vectors. I'd recommend using
| `TypedArray` with vanilla `for` loops. It will make things
| faster, and will allow using C extensions, if you want to benefit
| from modern hardware features, while still implementing the logic
| in JavaScript: https://ashvardanian.com/posts/javascript-ai-
| vector-search/
| acjohnson55 wrote:
| I might have missed this, but I think the post might bury the
| lede that in a high dimensional space, two randomly chosen
| vectors are very unlikely to have high cosine similarity. Or
| maybe another way to put it is that the expected value of the
| cosine of two random vectors approaches zero as the
| dimensionality increases.
|
| Most similarity metrics will be very low if vectors don't even
| point in the same direction, so cosine similarity is a cheap way
| to filter out the vast majority of the data set.
|
| It's been a while since I've studied this stuff, so I might be
| off target.
| OutOfHere wrote:
| Even if two random vectors don't have high cosine similarity
| (and I have not had this issue in 3000 dimensions), the cosine
| similarity is still usable in relative terms, i.e. relative to
| other items in the dataset. This keeps it useful.
| DavidSJ wrote:
| Nitpick: The expected value of the cosine is 0 even in low-
| dimensional spaces. It's the expected _square_ of that (i.e.
| the variance) which gets smaller with the dimension.
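|
| A quick simulation makes the distinction concrete (a sketch with
| random Gaussian vectors, NumPy assumed): the mean stays near 0
| at every dimension, while the spread shrinks as the dimension
| grows.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|
|     def cosine_sim(u, v):
|         return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
|
|     for d in (3, 30, 300, 3000):
|         sims = [cosine_sim(rng.normal(size=d), rng.normal(size=d))
|                 for _ in range(2000)]
|         print(d, round(np.mean(sims), 3), round(np.std(sims), 3))
|     # mean stays near 0; std shrinks roughly like 1/sqrt(d)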
| naijaboiler wrote:
| Imagine 2 points in 3-dimensional space, with a vector being the
| line from the origin to each point. So you have 2 vectors going
| from the origin to the 2 points.
|
| If those points are really close together, then the angle between
| the two vectors is very small. Loosely speaking, cosine is a way
| to quantify how close two lines with a shared origin are. If both
| lines are the same, the angle between them is 0, and the cosine
| of 0 is 1. If two lines are 90 degrees apart, their cosine is 0.
| If two lines are 180 degrees apart, their cosine is -1. So cosine
| is a way to quantify the closeness of two lines which share the
| same origin.
|
| To go back to the 2 points in space that we started with, we can
| measure how close in direction those 2 points are by taking the
| cosine of the angle between the lines going from the origin to
| the two points. If they are close, the angle between them is
| small. If they are the exact same point, the angle between the
| lines is 0. Each such line from the origin is what we call a
| vector.
|
| Cosine similarity measures how close in direction two vectors in
| Euclidean space are. That's why we end up using it a lot. It's
| not the only way to measure closeness; there are many others.
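|
| In code, with two made-up nearby points (NumPy assumed):
|
|     import numpy as np
|
|     p1 = np.array([1.0, 2.0, 0.5])
|     p2 = np.array([1.1, 2.1, 0.4])   # a nearby point
|
|     cos = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
|     angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
|
|     print(cos, angle)  # cosine near 1, angle near 0 degrees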
| stouset wrote:
| Are all the points in question one unit distant from the
| origin?
| viciousvoxel wrote:
| Yes, cosine similarity involves normalizing the points (by
| the L2 norm) and then taking the dot product. In other words,
| the points lie on the unit (hyper)sphere.
| derbOac wrote:
| Maybe I missed this but I was surprised they didn't mention the
| connection to correlation. Cosine similarity can be thought of as
| a correlation, and some bivariate distributions (normal, I
| think?) can be re-expressed in terms of cosine similarity.
|
| There's also some generalizations to higher dimensional notions
| of cosines that are kind of interesting.
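|
| Concretely, the correlation connection is that Pearson
| correlation is just the cosine similarity of the two vectors
| after each has been mean-centered (a quick check, NumPy assumed):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     x = rng.normal(size=100)
|     y = 0.5 * x + rng.normal(size=100)
|
|     def cosine_sim(u, v):
|         return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
|
|     r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation
|     c = cosine_sim(x - x.mean(), y - y.mean())   # cosine of centered data
|
|     print(r, c)  # the two numbers agree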
| lern_too_spel wrote:
| This doesn't properly explain what it says it explains. To
| explain it correctly, you have to explain why the dot product of
| two vectors, computed as the sum of the products of their
| coordinates in an orthonormal basis, is a scalar equal to the
| product of the Euclidean magnitudes of the vectors and the cosine
| of the angle between them. The Wikipedia article on dot product
| explains this reasonably well, so just read that.
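|
| A numerical sanity check of that identity, in 2-d where the angle
| is known by construction (NumPy assumed; a sketch, not the
| derivation -- the Wikipedia article has the actual proof):
|
|     import numpy as np
|
|     theta = np.deg2rad(40.0)      # known angle between a and b
|     a = np.array([2.0, 0.0])      # length 2, along the x-axis
|     b = 3.0 * np.array([np.cos(theta), np.sin(theta)])  # length 3
|
|     # Coordinate definition: sum of products of the coefficients
|     lhs = np.dot(a, b)
|
|     # Geometric definition: |a| * |b| * cos(angle between them)
|     rhs = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)
|
|     assert np.isclose(lhs, rhs)   # 2 * 3 * cos(40 deg) either way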
| esafak wrote:
| It works like the dot product in the case of spherical
| embeddings. Normalizing the embeddings makes it easier to
| understand too.
| cproctor wrote:
| One thing I've wondered for a while: Is there a principled reason
| (e.g. explainable in terms of embedding training) why a vector's
| magnitude can be ignored within a pretrained embedding, such that
| cosine similarity is a good measure of semantic distance? Or is
| it just a computationally-inexpensive trick that works well in
| practice?
|
| For example, if I have a set of words and I want to consider
| their relative location on an axis between two anchor words (e.g.
| "good" and "evil"), it makes sense to me to project all the words
| onto the vector from "good" to "evil." Would comparing each
| word's "good" and "evil" cosine similarity be equivalent, or even
| preferable? (I know there are questions about the
| interpretability of this kind of geometry.)
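|
| For what it's worth, the two computations are related but not
| identical. A sketch with made-up stand-in vectors (the names
| good, evil and word below are hypothetical, not real embeddings;
| NumPy assumed):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     good, evil, word = rng.normal(size=(3, 64))  # stand-in embeddings
|
|     def cosine_sim(u, v):
|         return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
|
|     # Option 1: project the word onto the good -> evil axis
|     axis = evil - good
|     projection = np.dot(word - good, axis) / np.linalg.norm(axis)
|
|     # Option 2: compare cosine similarities to the two anchors
|     score = cosine_sim(word, good) - cosine_sim(word, evil)
|
|     print(projection, score)  # related, but generally not equal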
| marginalia_nu wrote:
| Dunno if I have the full answer, but it seems in high
| dimensional spaces, you can typically throw away a lot of
| information and still preserve distance.
|
| The J-L lemma is at least somewhat related, even though it
| doesn't to my understanding quite describe the same
| transformation.
|
| https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...
|
| see also https://en.m.wikipedia.org/wiki/Random_projection
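|
| A small sketch of the random-projection flavor (not the J-L
| construction itself; made-up 1024-d vectors, NumPy assumed):
| projecting down to 128 dimensions with a random Gaussian matrix
| roughly preserves the cosine similarity.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     a = rng.normal(size=1024)
|     b = a + 0.5 * rng.normal(size=1024)   # a noisy copy of a
|
|     def cosine_sim(u, v):
|         return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
|
|     # Random projection from 1024 dims down to 128 dims
|     P = rng.normal(size=(128, 1024)) / np.sqrt(128)
|
|     print(cosine_sim(a, b), cosine_sim(P @ a, P @ b))  # close values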
| nostrademons wrote:
| So I first learned about cosine similarity in the context of
| traditional information retrieval, and the simplified models
| used in that field before the development of LLMs, TensorFlow,
| and large-scale machine learning might prove instructive.
|
| Imagine you have a simple bag-of-words model of a document,
| where you just count the number of occurrences of each word in
| the document. Numerically, this is represented as a vector
| where each dimension is one _token_ (so, you might have one
| number for the word "number", another for "cosine", another
| for "the", and so on), and the magnitude of that component is
| the count of the number of times it occurs. Intuitively, cosine
| similarity is a measure of _how frequently the same word
| appears in both documents_. Words that appear in both documents
| get multiplied together, but words that are only in one get
| multiplied by zero and drop out of the cosine sum. So because
| "cosine", "number", and "vector" appear frequently in my post,
| it will appear similar to other documents about math. Because
| "words" and "documents" appear frequently, it will appear
| similar to other documents about metalanguage or information
| retrieval.
|
| And intuitively, the reason the magnitude doesn't matter is
| that _those counts will be much higher in longer documents, but
| the length of the document doesn't say much about what the
| document is about_. Taking the cosine (which divides by the
| product of the two vectors' magnitudes) is a form of length
| normalization, so that you can get sensible results without
| biasing toward shorter or longer documents.
|
| Most machine-learned embeddings are similar. The components of
| the vector are _features_ that your ML model has determined are
| important. If the product of the same dimension of two items is
| large, it indicates that they are similar in that dimension. If
| it's zero, it indicates that that feature is not particularly
| representative of the item. Embeddings are often normalized,
| and for normalized vectors the fact that magnitude drops out
| doesn't really matter. But it doesn't hurt either: the
| magnitudes are one, so their product is also 1 and you just
| take the plain dot product of the vectors.
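|
| To make the length-normalization point concrete, here's a toy
| check (NumPy assumed): doubling a document's word counts doubles
| the raw dot product with a query but leaves the cosine similarity
| unchanged.
|
|     import numpy as np
|
|     def cosine_sim(u, v):
|         return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
|
|     query = np.array([1, 1, 0])
|     short_doc = np.array([3, 2, 1])
|     long_doc = 2 * short_doc        # same content, twice as long
|
|     print(np.dot(query, short_doc), np.dot(query, long_doc))  # 5, 10
|     print(cosine_sim(query, short_doc),
|           cosine_sim(query, long_doc))                        # equal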
| d110af5ccf wrote:
| > the reason the magnitude doesn't matter is that those
| counts will be much higher in longer documents ...
|
| To make my intuition a bit more explicit: the vector is
| encoding a ratio, isn't it? You want to treat 3:2, 6:4, 12:8,
| ... as equivalent in this case; normalization does exactly
| that.
| magicalhippo wrote:
| I've been wondering the same.
|
| When I dabbled with latent semantic indexing[1], using cosine
| similarity made sense as the dimensions of the input vectors
| were words, for example a 1 if a word was present or 0 if not.
| So one would expect vectors that point in a similar direction
| to be related.
|
| I haven't studied LLM embedding layers in depth, so yeah, I've
| been wondering about using certain norms[2] instead to determine
| whether two embeddings are similar. Does it depend on the
| embedding layer, for example?
|
| It should be noted that it's been many years since I learned
| linear algebra, so I'm getting somewhat rusty.
|
| [1]: https://en.wikipedia.org/wiki/Latent_semantic_analysis
|
| [2]: https://en.m.wikipedia.org/wiki/Norm_(mathematics)
| Scene_Cast2 wrote:
| Some embedding models are explicitly trained on cosine
| similarity. Otherwise, if you have a 512D vector, discarding
| magnitude is like discarding just a single dimension (i.e. you
| get 511 independent dimensions).
| extasia wrote:
| This is not quite right; you are actually losing information
| about each of the dimensions, and your mental model of
| reducing the dimensionality by one is misleading.
|
| Consider [1,0] and [x,x]. Normalized, we get [1,0] and
| [sqrt(.5),sqrt(.5)] -- clearly something has changed, because
| the first vector is now larger in dimension zero than the
| second, even though x started off as an arbitrary value, which
| could have been smaller than 1. As such we have lost
| information about x's magnitude which we cannot recover from
| just the normalized vector.
| dheera wrote:
| It's just a normalized dot product. People use "cosine
| similarity" to sound knowledgeable.
| seydor wrote:
| The dot product is easiest to understand as the projection of
| one vector onto the other. The rest of it is self-explanatory.
| janalsncm wrote:
| Cosine similarity for unit vectors looks like the angle between
| hands on a high dimensional clock. 12 and 6 have -1 cosine sim. 5
| and 6 are pretty close.
|
| Cosine similarity works if the model has been deliberately
| trained with cosine similarity as the distance metric. If it was
| trained with Euclidean distance, the results aren't reliable.
|
| Example: (0,1) and (0,2) have a cosine similarity of 1 but
| nonzero Euclidean distance.
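|
| That example in code (NumPy assumed):
|
|     import numpy as np
|
|     a = np.array([0.0, 1.0])
|     b = np.array([0.0, 2.0])
|
|     cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
|     l2 = np.linalg.norm(a - b)
|
|     print(cos, l2)  # cosine similarity 1.0, Euclidean distance 1.0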
___________________________________________________________________
(page generated 2024-09-06 23:00 UTC)