[HN Gopher] Embeddings, vectors, and arithmetic
       ___________________________________________________________________
        
       Embeddings, vectors, and arithmetic
        
       Author : montyanderson
       Score  : 97 points
        Date   : 2023-12-14 18:50 UTC (1 day ago)
        
 (HTM) web link (montyanderson.net)
 (TXT) w3m dump (montyanderson.net)
        
       | throwup238 wrote:
        | _> At Prodia, we've started to investigate building safety
        | systems by checking if the input prompts are within a distance
        | threshold of known adult or illegal concepts._
       | 
       | This is why we can't have nice AI things.
       | 
        | After my experience with RAG across a dozen models and god knows
        | how many experiments against parts of Libgen's archive in topics
        | I'm familiar with, I'm not sure embeddings are actually useful
        | for anything requiring any kind of accuracy. They're great for
        | low-stakes purposes or as a step in a human-driven workflow, but
        | like LLMs they're a very fuzzy and oftentimes inaccurate tool.
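The distance-threshold check quoted from the article can be sketched in a few lines. This is a toy illustration with made-up 3-d vectors and an arbitrary threshold, not Prodia's actual system; in practice the vectors would come from a real embedding model.

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = <a, b> / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_blocked(prompt_vec, blocked_vecs, threshold=0.85):
    # Flag the prompt if it is too similar to any known-bad concept.
    return any(cosine_similarity(prompt_vec, v) >= threshold
               for v in blocked_vecs)

# Toy 3-d "embeddings" for illustration only.
blocked = [[0.9, 0.1, 0.0]]
print(is_blocked([0.88, 0.12, 0.01], blocked))  # near a blocked concept -> True
print(is_blocked([0.0, 0.2, 0.95], blocked))    # unrelated -> False
```

The fuzziness the commenter complains about shows up here as the choice of `threshold`: too low and benign prompts get flagged, too high and paraphrases slip through.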
        
         | batch12 wrote:
         | I've had some luck with using embeddings for categorization and
          | for fingerprinting authors. But yeah, for retrieval alone the
          | results have been mixed. I get better results there from
          | comparisons of smaller chunks of text; not terrible, just not
          | perfect.
        
           | montyanderson wrote:
           | That's really interesting. Was it hard to separate the
           | semantic meaning from the author's style?
        
         | petra wrote:
          | Can you expand more on what processes you've used? Because
          | tools like elicit.org's embedding-based search seem pretty
          | accurate.
        
         | montyanderson wrote:
         | I don't see why AI integrations shouldn't cater to a user's
         | wants: there are a few things I don't want to see when
         | generating images day-to-day. Using embeddings for safety
         | filters lets you have uncensored models available in multiple
         | modes depending on audience.
        
       | atticora wrote:
       | "illegal concepts"
       | 
        | Embeddings mean that when we have thought police, they can be
        | more targeted and effective than before. Any thought you express
        | can be objectively measured "using the euclidean distance or
        | cosine similarity" against illegal concepts, and censored,
        | corrected, or punished accordingly. I imagine that this will come
        | early for comment sections on the web.
        
         | RockyMcNuts wrote:
         | if I invest a lot of money to make a nice forum like HN or a
         | social media site, don't I get to determine the right policies
         | to keep a nice private space?
         | 
         | doesn't it infringe my rights if people use my site and my
         | money to harass people or to spread stuff that is against my
         | economic interest, religion, values, etc. or that I just don't
         | like and didn't intend?
         | 
         | if people are going to run around using AI to spread deepfake
         | ragebait memes, shouldn't I get to enforce policies using the
         | same technology they use to pollute the space?
        
         | robrenaud wrote:
          | Embeddings aren't objective; their point of view is some kind
          | of complicated aggregate of the training data, which is itself
          | (as of now) mostly human-written text.
         | 
         | I'd honestly love to see AI fending off the eternal winter on
         | message boards.
         | 
          | Is a question redundant and basic? Direct the user to a
          | specialized AI that can explain the topic well, and save the
          | regulars from having the same discussion as two days ago.
         | 
          | But it is certainly the case that stronger machine text
          | understanding can be used for censorship and oppression, as
          | pretty much any powerful, general tool can. It can also be
          | used for a wide range of great purposes.
        
         | ducttapecrown wrote:
         | Gonna frame this comment so I can point to it in the year it
         | happens next decade.
        
       | waynecochran wrote:
       | So the question is: do embeddings form a linear space? I.e. does
       | scaling and addition make any sense?
        
         | Der_Einzige wrote:
            | Yes, that's what the GloVe paper showed: linear substructure
            | in PCA space.
        
           | RockyMcNuts wrote:
           | does that apply for ada-002 embeddings, if you don't use
           | GloVe? I would think it only applies if you create embeddings
           | using a linear model?
        
             | nightski wrote:
             | I'm not a mathematician but on their site it says - "Cosine
             | similarity and Euclidean distance will result in the
             | identical rankings" so I am assuming that would have to be
             | the case?
        
               | yorwba wrote:
               | When you have unit vectors, cosine similarity and
               | euclidean distance always result in identical rankings,
               | even when those unit vectors are assigned at random and
               | have no semantic structure at all. That's because the
               | euclidean distance |a - b|2 = <a - b, a - b>  = <a, a>  -
               | 2<a, b>  + <b, b>  = |a|2 - 2|a||b|cos F + |b|2 for unit
               | vectors with |a| = |b| = 1 only depends on the cosine of
               | the angle F between the two vectors.
        
               | kridsdale3 wrote:
               | Pythagoras strikes again.
        
         | binarymax wrote:
          | Not always. Another comment mentions it is true for GloVe, but
          | that doesn't mean it is true for all models' vector spaces.
        
         | samus wrote:
         | In a "good" embedding, they do.
        
           | tonyarkles wrote:
           | When I was first reading about Word2vec I thought it was
           | absolutely wonderful that the vector delta (king - queen) was
           | similar to the vector delta (man - woman). That captures
           | relationships in such a fascinating way!
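The analogy arithmetic described above can be reproduced with a hand-made toy vocabulary. Real word2vec/GloVe vectors are learned and have hundreds of dimensions; these 2-d vectors (one axis for royalty, one for gender) only illustrate the arithmetic.

```python
import math

# Hand-crafted toy vectors, NOT learned embeddings.
vocab = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
}

def nearest(vec, exclude=()):
    # Return the vocabulary word whose vector is most cosine-similar.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vec, vocab[w]))

# king - man + woman ~= queen
v = [k - m + w for k, m, w in
     zip(vocab["king"], vocab["man"], vocab["woman"])]
print(nearest(v, exclude={"king", "man", "woman"}))  # -> queen
```

Excluding the query words from the search mirrors the convention used when evaluating word2vec analogies.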
        
             | esafak wrote:
             | It is not obvious to me why word2vec's training objective
             | yields this. word2vec ensures similarity of related words,
             | but why can you then perform linear algebra on unrelated
             | words?
        
               | blackbear_ wrote:
                | The training objective does not have much to do with it;
                | the bigger reason is that these neural networks
                | themselves use linear algebra (dot products/projections
                | and whatnot) to manipulate embeddings.
        
       | loisaidasam wrote:
       | I found the missing link to the emoji page on Barney Hill's
       | Github in case anyone else was looking for it:
       | 
       | https://www.barneyhill.com/pages/emoji/index.html
        
         | YPCrumble wrote:
         | I'd love to see the code on how the vector embedding works.
        
           | amne wrote:
            | If you search for word2vec (paper published in the early
            | 2010s IIRC) I believe you will find very good material.
        
       | nuz wrote:
       | Really inspirational project. Does anyone know if there's an
       | analog of this to the LLM space? I.e. can you take the embedding
       | of two sentences and get a 'combined' sentence? Either by
       | searching a corpus for the closest match or by feeding that
       | combined embedding into an LLM that generates it.
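The first option the commenter describes (average two embeddings, then search a corpus for the closest match) can be sketched directly. The "sentence embeddings" here are hypothetical 3-d toy vectors; a real system would get them from an embedding model.

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "sentence embeddings" for illustration only.
corpus = {
    "The cat sat on the mat.":        [0.9, 0.1, 0.0],
    "The dog chased the ball.":       [0.1, 0.9, 0.0],
    "The cat chased the dog.":        [0.5, 0.5, 0.1],
    "Stocks fell sharply on Monday.": [0.0, 0.1, 0.9],
}

def combine_and_search(sent_a, sent_b):
    # Average the two embeddings, then return the corpus sentence
    # closest to the blend.
    va, vb = corpus[sent_a], corpus[sent_b]
    blend = [(x + y) / 2 for x, y in zip(va, vb)]
    return max(corpus, key=lambda s: cos(blend, corpus[s]))

print(combine_and_search("The cat sat on the mat.",
                         "The dog chased the ball."))
# -> The cat chased the dog.
```

The second option, decoding the combined embedding back into fresh text, is harder; the vec2text paper linked in the reply below this comment is one approach.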
        
         | wint3rmute wrote:
         | A recent paper [1] shows that what you're describing is
         | possible to some degree - you can reproduce text from its
         | embeddings. The paper provides the code implementing the
         | reversal process, so you could quickly hack together a
         | prototype :)
         | 
         | I also recommend a video by Yannic Kilcher [2], explaining the
         | paper.
         | 
         | [1] https://arxiv.org/abs/2310.06816
         | 
         | [2] https://www.youtube.com/watch?v=FY5j3P9tCeA
        
         | kridsdale3 wrote:
          | It seems to me like a lot of what LLMs and image generators do
          | is find the interpolation between two points in concept-space.
          | So that's not vector addition but vector averaging. It's still
          | arithmetic.
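The interpolation view can be written down directly: vector averaging is the t = 0.5 case of linear interpolation (toy 2-d vectors for illustration).

```python
def lerp(a, b, t):
    # Linear interpolation between two embedding vectors:
    # t=0 returns a, t=1 returns b, t=0.5 is plain vector averaging.
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

a, b = [1.0, 0.0], [0.0, 1.0]
print(lerp(a, b, 0.5))  # midpoint of the two "concepts": [0.5, 0.5]
```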
        
       | pjs_ wrote:
       | Perhaps I'm missing something but this looks like a heavy case of
       | what's old is new again - the original "king - man + woman =
       | queen" paper is nearly a decade old:
       | 
       | https://arxiv.org/abs/1509.01692
        
       | kridsdale3 wrote:
       | I find it so fascinating that at the end of the article the
       | author alludes to something I've started becoming aware of:
       | 
       | There is a zone of illegal thoughts, that becomes definable by
       | model-training. A physical boundary in n-dimensional concept-
       | space. An "aligned" or "safe" AI system knows where this boundary
       | is and does not reach inside it. Vectors (embeddings) that would
       | probe it should instead intersect the surface like a ray-trace in
       | graphics, and return the embedded concept at minimum distance to
       | the safe-idea-boundary.
       | 
       | Intuitively, we all know what this zone is. It's the difference
       | between being a wild barbarian and a gentleman. Or being chill vs
       | antisocial. Seeing it in pure math is pretty awesome.
        
       ___________________________________________________________________
       (page generated 2023-12-15 23:01 UTC)