[HN Gopher] Harnessing the Universal Geometry of Embeddings
       ___________________________________________________________________
        
       Harnessing the Universal Geometry of Embeddings
        
       Author : jxmorris12
       Score  : 63 points
       Date   : 2025-05-21 18:15 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | jxmorris12 wrote:
       | Hi HN, I'm Jack, the last author of this paper. It feels good to
       | release this, the fruit of a two-year quest to "align" two vector
       | spaces without any paired data. It's fun to look back a bit and
       | note that at least two people told me this wasn't possible:
       | 
       | 1. An MIT professor who works on similar geometry alignment
       | problems didn't want to work on this with me because he was
       | certain we would need at least a little bit of paired data
       | 
       | 2. A vector database startup founder who told me about his plan
       | to randomly rotate embeddings to guarantee user security (and
       | ignored me when I said it might not be a good idea)
       | 
        | The practical takeaway is something that many people already
        | understood, which is that _embeddings are not encrypted_, even
        | if you don't have access to the model that produced them.
       | 
       | As one example, in the Cursor security policy
       | (https://www.cursor.com/security#codebase-indexing) they state:
       | 
       | > Embedding reversal: academic work has shown that reversing
       | embeddings is possible in some cases. Current attacks rely on
       | having access to the model [...]
       | 
        | This is no longer the case. Since all embedding models are
        | learning ~the same thing, we can decode any embedding vectors,
        | provided we have at least a few thousand of them.
        
         | Lerc wrote:
         | I must admit reading the abstract made me think to myself that
         | I should read the paper in skeptical mode.
         | 
          | Does this extend to being able to analytically determine which
          | concepts are encodable in one embedding but not another? An
          | embedding from an LLM trained only on TinyStories presumably
          | cannot encode concepts about RNA replication.
         | 
          | Assuming that is true: if you can detect when you are trying
          | to put a square peg into a round hole, does this mean you have
          | the ability to remove square holes from a system?
        
           | jxmorris12 wrote:
           | Very fair!
           | 
           | > Does this extend to being able to analytically determine
           | which concepts are encodable in one embedding but not
           | another? An embedding from a deft tiny stories LLM presumably
           | cannot encode concepts about RNA replication.
           | 
            | Yeah, this is a great point. We're mostly building off of
            | this prior work on the Platonic Representation Hypothesis
            | (https://arxiv.org/abs/2405.07987). I think our findings only
            | go so far as to apply to large-enough models that are well-
            | enough trained on The Internet. So, text and images. Maybe
            | audio, too, if the audio is scraped from the Internet.
           | 
           | So I don't think your tinystories example qualifies for the
           | PRH, since it's not enough data and it's not representative
           | of the whole Internet. And RNA data is (I would guess)
           | something very different altogether.
           | 
           | > Assuming that is true. If you can detect when you are
           | trying to put a square peg into a round hole, does this mean
           | you have the ability to remove square holes from a system?
           | 
           | Not sure I follow this part.
        
             | Lerc wrote:
             | >So I don't think your tinystories example qualifies for
             | the PRH, since it's not enough data and it's not
             | representative of the whole Internet. And RNA data is (I
             | would guess) something very different altogether.
             | 
             | My thought there was that you'd be comparing tinystories to
             | a model that trained on the entire internet. The RNA
             | related information would be a subset of the second
             | representation that has no comparable encoding in the
              | tinystories space. Can you detect that? If both models
              | have to be of sufficient scale for this to work, the
              | question becomes "what is the scale, and is it sliding or
              | a threshold?"
             | 
              | > _Assuming that is true: if you can detect when you are
              | trying to put a square peg into a round hole, does this
              | mean you have the ability to remove square holes from a
              | system?_
             | 
             | >Not sure I follow this part.
             | 
              | Perhaps the metaphor doesn't work so well. If you can
              | detect whether something is encodable in one embedding
              | model but not another, can you then leverage that
              | detection ability to modify an embedding model so that it
              | cannot represent an idea?
        
               | eximius wrote:
                | As I read the paper, you would be able to detect it in a
                | couple of ways:
                | 
                | 1. possibly high loss where the models don't have
                | compatible embedding concepts
                | 
                | 2. given a sufficient "sample" of vectors from each
                | space, projecting them to the same backbone would show
                | clusters where they have mismatched concepts
               | 
               | It's not obvious to me how you'd use either of those to
               | tweak the vector space of one to not represent some
               | concept, though.
               | 
                | But if you just wanted to make an embedding that is
                | unable to represent some concept, presumably you could
                | already do that by training it to map disjoint
                | "unrepresentable concepts" to a single point.
        
           | oofbey wrote:
           | I read a lot of AI papers on arxiv, and it's been a while
           | since I read one where the first line of the abstract had me
           | scoffing and done.
           | 
           | > We introduce the FIRST method for translating text
           | embeddings from one vector space to another without any
           | paired data
           | 
           | (emphasis mine)
           | 
            | Nope. I'm not gonna do a literature search for you right now
            | and find the references, but this is certainly not the first
            | attempt to do unsupervised alignment of embeddings, text or
            | otherwise. People were doing this back in ~2016.
        
         | srean wrote:
          | Doesn't the space of embeddings have some symmetries that,
          | when applied, do not change the output sequence?
          | 
          | For example, a global rotation that does not change embedded-
          | vector x embedded-vector dot products and changes query-
          | vector x embedded-vector dot products in an equivariant way.
        
           | jxmorris12 wrote:
           | Yes. So the idea was that an orthogonal rotation will
           | 'encrypt' the embeddings without affecting performance, since
           | orthogonality preserves cosine similarity. It's a good idea,
           | but we can un-rotate the embeddings using our GAN.
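            | 
            | As a quick numpy sketch (illustrative, not our actual
            | pipeline) of why rotation alone doesn't degrade retrieval:
            | 
            |   import numpy as np
            | 
            |   rng = np.random.default_rng(0)
            |   X = rng.normal(size=(1000, 256))  # stand-in embeddings
            |   # random orthogonal matrix via QR decomposition
            |   Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))
            |   Xr = X @ Q  # the "encrypted" embeddings
            | 
            |   def cos(a, b):
            |       return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            | 
            |   # cosine similarities are unchanged by the rotation
            |   assert np.isclose(cos(X[0], X[1]), cos(Xr[0], Xr[1]))
            | 
            | The rotation preserves all pairwise geometry, which is
            | exactly why it also preserves the structure the GAN can
            | exploit to un-rotate.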
        
             | srean wrote:
             | I can understand that two relatively rotated embeddings
             | from the same or similar dataset can be realigned as long
             | as they don't have internal geometric symmetries. The same
             | way we can re-align two globes -- look for matching shapes,
             | continents.
             | 
              | EDIT: Perfect symmetries, for example featureless spheres
              | or the analogues of platonic solids, would break this. If
              | the embedded space has no geometric symmetries, you would
              | be in business.
             | 
              | Re-aligning would essentially be akin to solving a graph-
              | isomorphism problem.
              | 
              | A Lie-algebraic formulation would make it less generic
              | than an arbitrary graph-isomorphism problem, essentially
              | reducing it to a high-dimensional Procrustes problem.
              | Generic graph isomorphism can be quite a challenge.
             | 
             | https://en.m.wikipedia.org/wiki/Procrustes_analysis
             | 
             | EDIT: Sinkhorn balancing over a set of points (say a
             | d-dimensional tetrahedron, essentially a simplex) furthest
             | from each other might be a good first cut to try. You might
             | have already done so, I haven't read your paper yet.
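                | 
                | For concreteness, a minimal scipy sketch of the
                | Procrustes step, assuming the point correspondences are
                | already known -- which is exactly the part the unpaired
                | setting has to solve first:
                | 
                |   import numpy as np
                |   from scipy.linalg import orthogonal_procrustes
                | 
                |   rng = np.random.default_rng(0)
                |   A = rng.normal(size=(500, 64))  # points in space one
                |   R, _ = np.linalg.qr(rng.normal(size=(64, 64)))
                |   B = A @ R  # the same points, rotated
                | 
                |   # recover the rotation from paired rows
                |   R_est, _ = orthogonal_procrustes(A, B)
                |   assert np.allclose(A @ R_est, B)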
        
               | jxmorris12 wrote:
               | Right, that's why the baselines here come from the land
               | of Optimal Transport, which looks at the world through
               | isomorphisms, exactly as you've suggested.
               | 
               | The GAN works way better than traditional OT methods
               | though. I really don't know why, this is the part that
               | feels like magic to me.
        
               | srean wrote:
               | Got you. I can understand that this has a chance of
               | working if the embeddings have converted to their global
               | optimum. Otherwise all bets ought to be off.
               | 
               | All the best.
               | 
                | I can totally understand the professor's point: a little
                | bit of alignment data ought to significantly increase
                | the chance of success. Otherwise it will have to rely on
                | these small deviations from symmetry to anchor the
                | orientation.
        
               | jxmorris12 wrote:
                | Yeah, we didn't get around to testing what the impact
                | of having a small amount of aligned data would be. I've
                | seen other papers asserting that as few as five pairs
                | are enough to go a long way.
        
         | nimish wrote:
          | Hooray, finally we are getting the geometric analysis of
          | embedding spaces we need. Information geometry and
          | differential geometry are finally getting their moment in the
          | sun!
        
         | SubiculumCode wrote:
          | I am a curious amateur, so I may say something dumb, but:
         | Suppose you take a number of smaller embedding models, and one
         | more advanced embedding model. Suppose for a document, you
         | convert each model's embeddings to their universal embedding
         | representation and examine the universal embedding spaces.
         | 
         | On a per document basis, would the universal embeddings of the
         | smaller models (less performant) cluster around the better
         | model's universal embedding space, in a way suggestive that
         | they are each targeting the 'true' embedding space, but with
         | additional error/noise?
         | 
         | If so, can averaging the universal embeddings from a collection
         | of smaller models effectively approximate the universal
         | embedding space of the stronger model? Could you then use your
         | "averaged universal embeddings" as a target to train a new
         | embedding model?
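          | 
          | Something like this toy sketch is what I have in mind (purely
          | hypothetical; it assumes per-model translators into the
          | universal space already exist):
          | 
          |   import numpy as np
          | 
          |   def averaged_universal(universal_embs):
          |       """Average several models' universal-space embeddings
          |       of one document, hoping the noise cancels out."""
          |       U = np.stack([u / np.linalg.norm(u)
          |                     for u in universal_embs])
          |       mean = U.mean(axis=0)
          |       return mean / np.linalg.norm(mean)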
        
         | jackpirate wrote:
         | I hate to be "reviewer 2", but:
         | 
         | I used to work on what your paper calls "unsupervised
         | transport", that is machine translation between two languages
         | without alignment data. You note that this field has existed
         | since ~2016 and you provide a number of references, but you
         | only dedicate ~4 lines of text to this branch of research.
          | There's no discussion of why your technique is different from
          | this prior work or why the prior algorithms can't be applied
          | to the output of modern LLMs.
         | 
          | Naively, I would expect off-the-shelf embedding alignment
          | algorithms (like <https://github.com/artetxem/vecmap> and
          | <https://github.com/facebookresearch/fastText/tree/main/align...>,
          | neither of which are cited or compared against) to work quite
          | well on this problem. So I'm curious if they don't or why they
          | don't.
         | 
         | I can imagine there is lots of room for improvements around
         | implicit regularization in the algorithms. Specifically, these
         | algorithms were designed with word2vec output in mind
          | (typically 300-dimensional vectors with 200,000 observations),
          | but your problem has higher-dimensional vectors with fewer
         | observations and so would likely require different
         | hyperparameter tuning. IIRC, there's no explicit regularization
         | in these methods, but hyperparameters like stepsize/stepcount
         | can implicitly add L2 regularization, which you probably need
         | for your application.
         | 
         | ---
         | 
         | PS.
         | 
          | I *strongly dislike* your name of vec2vec. Yours isn't the
          | first/only algorithm that takes vectors as input and produces
          | vectors as output, and you have no right to claim such a
          | general title.
         | 
         | ---
         | 
         | PPS.
         | 
         | I believe there is a minor typo with footnote 1. The note is
         | "Our code is available on GitHub." but it is attached to the
         | sentence "In practice, it is unrealistic to expect that such a
         | database be available."
        
           | newfocogi wrote:
            | Naming things is hard. Note that the two alternative
            | approaches you referenced are called "vecmap" and
            | "alignment"; the complaint that they "aren't the first/only
            | algorithm for ... and have no right to claim such a general
            | title" could easily apply there as well.
        
           | jxmorris12 wrote:
           | Hey, I appreciate the perspective. We definitely should cite
           | both those papers, and will do so in the next version of our
           | draft. There are a lot of papers in this area, and they're
           | all a few years old now, so you might understand how we
           | missed two of them.
           | 
            | We tested all of the methods in the Python Optimal Transport
            | package (https://pythonot.github.io/) and reported the _max_
            | in most of our tables. So some of this is covered. A lot of
            | these methods also require a seed dictionary, which we don't
            | have in our case. That said, you're welcome to take any
            | number of these tools and plug them into our codebase; the
            | results would definitely be interesting, although we expect
            | the adversarial methods will still work best, as they do in
            | the problem settings you mention.
           | 
           | As for the name - the paper you recommend is called 'vecmap'
           | which seems equally general, doesn't it? Google shows me
           | there are others who have developed their own 'vec2vec'.
           | There is a lot of repetition in AI these days, so collisions
           | happen.
        
             | jackpirate wrote:
             | > We tested all of the methods in the Python Optimal
             | Transport package (https://pythonot.github.io/) and
             | reported the max in most of our tables.
             | 
             | Sorry if I'm being obtuse, but I don't see any mention of
             | the POT package in your paper or of what specific
             | algorithms you used from it to compare against. My best
             | guess is that you used the linear map similar to the
              | example at
              | <https://pythonot.github.io/auto_examples/domain-adaptation/p...>.
              | The methods I mentioned are also linear,
             | but contain a number of additional tricks that result in
             | _much_ better performance than a standard L2 loss, and so I
             | would expect those methods to outperform your OT baseline.
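              | 
              | For reference, here is roughly the baseline I'd guess at
              | (a sketch with POT's SinkhornTransport; the reg_e knob is
              | one of those implicit-regularization hyperparameters):
              | 
              |   import numpy as np
              |   import ot  # Python Optimal Transport
              | 
              |   rng = np.random.default_rng(0)
              |   Xs = rng.normal(size=(2000, 300))  # source embeddings
              |   Xt = rng.normal(size=(2000, 300))  # target embeddings
              | 
              |   # entropic OT coupling, then barycentric projection of
              |   # the source vectors into the target space
              |   mapper = ot.da.SinkhornTransport(reg_e=1e-1)
              |   mapper.fit(Xs=Xs, Xt=Xt)
              |   Xs_in_t = mapper.transform(Xs=Xs)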
             | 
             | > As for the name - the paper you recommend is called
             | 'vecmap' which seems equally general, doesn't it? Google
             | shows me there are others who have developed their own
             | 'vec2vec'. There is a lot of repetition in AI these days,
             | so collisions happen.
             | 
             | But both of those papers are about generic vector
             | alignment, so the generality of the name makes sense. Your
             | contribution here seems specifically about the LLM use
             | case, and so a name that implies the LLM use case would be
             | preferable.
             | 
             | I do agree though that in general naming is hard and I
             | don't have a better name to suggest. I also agree that
             | there's lots of related papers, and you can't cite/discuss
             | them all reasonably.
             | 
             | And I don't mean to be overly critical... the application
             | to LLMs is definitely cool. I wouldn't have read the paper
             | and written up my critiques if I didn't overall like it :)
        
           | mjburgess wrote:
           | > I _strongly dislike_ your name of vec2vec.
           | 
            | Imagine having more than a passing understanding of
            | philosophy, and then reading almost any major computer
            | science paper. By this "no right to claim" logic, I'd have
            | you all on trial.
        
         | logicchains wrote:
          | Does this result imply that if we had an LLM trained on a very
         | large volume of only English data, and one trained only on a
         | very large volume of data in another language, your technique
         | could be used to translate between the two languages? Pretty
         | cool. If we somehow came across a huge volume of text in an
         | alien language, your technique could potentially translate
         | their language into ours (although maybe the same could be
         | achieved just by training a single LLM on both languages?).
        
         | chompychop wrote:
         | Don't you mean "John" instead of "Jack"? :)
        
       | airylizard wrote:
       | The fact that embeddings from different models can be translated
       | into a shared latent space (and back) supports the notion that
       | semantic anchors or guides are not just model-specific hacks, but
       | potentially universal tools. Fantastic read, thank you.
       | 
       | Given the demonstrated risk of information leakage from
       | embeddings, have you explored any methods for hardening,
       | obfuscating, or 'watermarking' embedding spaces to resist
       | universal translation and inversion?
        
         | jxmorris12 wrote:
         | > Given the demonstrated risk of information leakage from
         | embeddings, have you explored any methods for hardening,
         | obfuscating, or 'watermarking' embedding spaces to resist
         | universal translation and inversion?
         | 
         | No, we haven't tried anything like that. There's definitely a
         | _need_ for it. People are using embeddings all over the place,
         | not to mention all of the other representations people pass
         | around (kv caches, model weights, etc.).
         | 
          | One consideration is that there's likely going to be a
          | tradeoff between embedding usefulness and invertibility. So
          | if we
         | watermark our embedding space somehow, or apply some other
         | 'defense' to make inversion difficult, we will probably
         | sacrifice some quality. It's not clear yet how much that would
         | be.
        
           | airylizard wrote:
           | Are you continuing research? Is there somewhere we can follow
           | along?
        
             | jxmorris12 wrote:
              | Yes! For now, just our Twitters:
              | 
              | - Rishi, the first author: x.com/rishi_d_jha
              | - Me: x.com/jxmnop
             | 
             | And there's obviously always ArXiv. Maybe we should make a
             | blog or something, but the updates really don't come that
             | often.
        
       | kridsdale1 wrote:
        | This seems like a catastrophe waiting in the wings for legal-
        | services-RAG companies.
        
       | SubiculumCode wrote:
       | Can this be used to allow different embedding models to
       | communicate with each other in embedding space?
        
         | jxmorris12 wrote:
         | Yes, you can definitely convert the outputs from one model to
         | the space of another, and then use them.
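          | 
          | As a sketch of what that looks like downstream (the linear
          | map here is only a placeholder for a trained vec2vec
          | translator):
          | 
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(0)
          |   W_ab = rng.normal(size=(384, 768))  # placeholder translator
          |   corpus_b = rng.normal(size=(10000, 768))  # model-B index
          | 
          |   q_a = rng.normal(size=(384,))  # query embedded by model A
          |   q_b = q_a @ W_ab               # moved into model B's space
          |   sims = (corpus_b @ q_b) / (
          |       np.linalg.norm(corpus_b, axis=1) * np.linalg.norm(q_b))
          |   print(sims.argsort()[-5:][::-1])  # top-5 hits in B's index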
        
       | mjburgess wrote:
        | I don't see how the "different data" aspect is evidenced. If the
        | "modality" of the data is the same, we're choosing a highly
        | specific subset of all possible data -- and, in practice,
        | radically more narrow than just that. Any sufficiently capable
        | LLM is going to have to be trained on a corpus not-so-dissimilar
        | to all the electronic texts that exist in the standard corpora
        | used for LLM training.
        | 
        | The idea that a dataset is "different" merely because it's some
        | subset of this maximal corpus is a distinction without a
        | difference. What isn't being proposed is, say, that training
        | just on all the works of sci-fi leads to a zero-info
        | translatable embedding space projectable into all the works of
        | horror, and the like (or, say, that English sci-fi can be
        | bridged to Japanese sci-fi by way of an English-Japanese horror
        | corpus).
        | 
        | The very objective of creating LLMs with useful capabilities
        | _entails_ an extremely similar dataset starting point. We do not
        | have so many petabytes of training data here that there is any
        | meaningful sense in which OpenAI uses "only this discrete
        | subspace" and Perplexity "yet another". All useful LLMs sample
        | roughly randomly across the maximal corpus that we have to hand.
        | 
        | Thus the hype around there being a platonic form of how word
        | tokens ought to be arranged seems wholly unevidenced. Reality
        | has a "natural arrangement" -- this does not show that our
        | highly lossy encoding of it in English has anything like a
        | unique or natural correspondence. It has a circumstantial
        | correspondence in "all recorded electronic texts", which are the
        | basis for training all generally useful LLMs.
        
       | kevmo314 wrote:
       | Very cool! I've been looking for something like this for a while
       | and couldn't find anyone doing it. I've been investigating a way
       | to translate LoRAs between models and this seems like it could be
       | a first step towards that.
        
       ___________________________________________________________________
       (page generated 2025-05-21 23:01 UTC)