[HN Gopher] Embeddings, vectors, and arithmetic
___________________________________________________________________
Embeddings, vectors, and arithmetic
Author : montyanderson
Score : 97 points
Date : 2023-12-14 18:50 UTC (1 day ago)
(HTM) web link (montyanderson.net)
(TXT) w3m dump (montyanderson.net)
| throwup238 wrote:
| _> At Prodia, we've started to investigate building safety
| systems by checking if the input prompts are within a distance
| threshold of known adult or illegal concepts._
|
| This is why we can't have nice AI things.
|
| After my experience with RAG across a dozen models and god knows
| how many experiments against parts of Libgen's archive in topics
| I'm familiar with, I'm not sure embeddings are actually useful
| for anything requiring any kind of accuracy. They're great for
| low-stakes purposes or as a step in a human-driven workflow,
| but, like LLMs, they're a fuzzy and oftentimes inaccurate tool.
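The distance-threshold idea from the quoted Prodia passage can be sketched in a few lines of pure Python (hypothetical function names; assumes you already have embedding vectors for the prompt and for each blocked concept):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_flagged(prompt_vec, blocked_vecs, threshold=0.85):
    """Flag a prompt whose embedding is too close to any blocked concept."""
    return any(cosine_similarity(prompt_vec, v) >= threshold
               for v in blocked_vecs)
```

In practice the threshold would need careful per-model tuning, which is exactly where the fuzziness complained about above bites.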
| batch12 wrote:
| I've had some luck with using embeddings for categorization and
| for fingerprinting authors. But yeah, for retrieval alone the
| results have been mixed. I get better results there from
| smaller-sized text comparisons; they haven't been terrible,
| just not perfect.
| montyanderson wrote:
| That's really interesting. Was it hard to separate the
| semantic meaning from the author's style?
| petra wrote:
| Can you expand on what processes you've used? Tools like
| elicit.org's embedding-based search seem pretty accurate.
| montyanderson wrote:
| I don't see why AI integrations shouldn't cater to a user's
| wants: there are a few things I don't want to see when
| generating images day-to-day. Using embeddings for safety
| filters lets you have uncensored models available in multiple
| modes depending on audience.
| atticora wrote:
| "illegal concepts"
|
| Embeddings mean that when we have thought police, they can be
| more targeted and effective than before. Any thought you
| express can be objectively measured "using the euclidean distance
| or cosine similarity" for illegal concepts, and censored,
| corrected, or punished accordingly. I imagine this will come
| early for comment sections on the web.
| RockyMcNuts wrote:
| if I invest a lot of money to make a nice forum like HN or a
| social media site, don't I get to determine the right policies
| to keep a nice private space?
|
| doesn't it infringe my rights if people use my site and my
| money to harass people or to spread stuff that is against my
| economic interest, religion, values, etc. or that I just don't
| like and didn't intend?
|
| if people are going to run around using AI to spread deepfake
| ragebait memes, shouldn't I get to enforce policies using the
| same technology they use to pollute the space?
| robrenaud wrote:
| Embeddings aren't objective; their point of view is some kind
| of complicated aggregate of the training data, which is itself
| (as of now) mostly human-written text.
|
| I'd honestly love to see AI fending off the eternal winter on
| message boards.
|
| Is the question redundant and basic? Direct the user to a
| specialized AI that can explain the topic well, and save the
| regulars from rehashing the same discussion from two days ago.
|
| But it is certainly the case that stronger machine text
| understanding can be used for censorship and oppression, as
| pretty much any powerful, general tool can be used for
| nefarious purposes. But it can also be used for a wide range
| of great purposes.
| ducttapecrown wrote:
| Gonna frame this comment so I can point to it in the year it
| happens next decade.
| waynecochran wrote:
| So the question is: do embeddings form a linear space? I.e. does
| scaling and addition make any sense?
| Der_Einzige wrote:
| Yes, that's what the GloVe paper showed: linear substructure in
| PCA space.
| RockyMcNuts wrote:
| does that apply for ada-002 embeddings, if you don't use
| GloVe? I would think it only applies if you create embeddings
| using a linear model?
| nightski wrote:
| I'm not a mathematician but on their site it says - "Cosine
| similarity and Euclidean distance will result in the
| identical rankings" so I am assuming that would have to be
| the case?
| yorwba wrote:
| When you have unit vectors, cosine similarity and
| euclidean distance always result in identical rankings,
| even when those unit vectors are assigned at random and
| have no semantic structure at all. That's because the
| euclidean distance |a - b|² = ⟨a - b, a - b⟩ = ⟨a, a⟩ -
| 2⟨a, b⟩ + ⟨b, b⟩ = |a|² - 2|a||b|·cos φ + |b|² for unit
| vectors with |a| = |b| = 1 only depends on the cosine of
| the angle φ between the two vectors.
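The identity above is easy to check numerically: for unit vectors, ranking by descending cosine similarity and by ascending euclidean distance gives the same order even for random vectors with no semantic structure (a stdlib-only sketch):

```python
import math
import random

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # dot = cos(angle) for unit vectors

random.seed(0)
query = unit([random.gauss(0, 1) for _ in range(8)])
docs = [unit([random.gauss(0, 1) for _ in range(8)]) for _ in range(20)]

# Rank by descending cosine similarity...
by_cos = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
# ...and by ascending euclidean distance: the orders coincide.
by_euc = sorted(range(len(docs)), key=lambda i: math.dist(query, docs[i]))
```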
| kridsdale3 wrote:
| Pythagoras strikes again.
| binarymax wrote:
| Not always; another comment mentions it is true for GloVe, but
| that doesn't mean it is true for all models' vector spaces.
| samus wrote:
| In a "good" embedding, they do.
| tonyarkles wrote:
| When I was first reading about Word2vec I thought it was
| absolutely wonderful that the vector delta (king - queen) was
| similar to the vector delta (man - woman). That captures
| relationships in such a fascinating way!
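The analogy can be illustrated with hand-made toy vectors (illustrative 3-d numbers, not real word2vec output; one axis loosely encodes royalty, another gender):

```python
import math

# Toy "embeddings": axis 0 ~ royalty, axis 1 ~ gender, axis 2 ~ noise.
vecs = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
}

# king - man + woman, elementwise.
analogy = [k - m + w for k, m, w in
           zip(vecs["king"], vecs["man"], vecs["woman"])]

def nearest(target, exclude):
    """Closest word by euclidean distance, skipping the query words
    (the usual convention in word-analogy evaluations)."""
    return min((w for w in vecs if w not in exclude),
               key=lambda w: math.dist(target, vecs[w]))

result = nearest(analogy, exclude={"king", "man", "woman"})
```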
| esafak wrote:
| It is not obvious to me why word2vec's training objective
| yields this. word2vec ensures similarity of related words,
| but why can you then perform linear algebra on unrelated
| words?
| blackbear_ wrote:
| The training objective does not have much to do with it; the
| bigger reason is that these neural networks themselves use
| linear algebra (dot products/projections and whatnot) to
| manipulate embeddings.
| loisaidasam wrote:
| I found the missing link to the emoji page on Barney Hill's
| Github in case anyone else was looking for it:
|
| https://www.barneyhill.com/pages/emoji/index.html
| YPCrumble wrote:
| I'd love to see the code on how the vector embedding works.
| amne wrote:
| If you search word2vec (paper published early 2010s IIRC) I
| believe you will find very good material.
| nuz wrote:
| Really inspirational project. Does anyone know if there's an
| analog of this in the LLM space? I.e., can you take the
| embeddings of two sentences and get a 'combined' sentence?
| Either by searching a corpus for the closest match or by
| feeding that combined embedding into an LLM that generates it.
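A crude version of the "search a corpus for the closest match" approach can be sketched like this (the sentences and their 2-d embeddings are made up for illustration):

```python
import math

def average(a, b):
    """A crude 'combined' embedding: the elementwise mean of two vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def closest(target, corpus):
    """Corpus entry whose (pre-computed) embedding is nearest the target."""
    return min(corpus, key=lambda item: math.dist(target, item[1]))

# Hypothetical pre-computed sentence embeddings:
corpus = [
    ("the cat sat on the mat",            [0.9, 0.1]),
    ("stocks fell sharply today",         [0.1, 0.9]),
    ("a kitten watched the market crash", [0.5, 0.5]),
]
combined = average([0.9, 0.1], [0.1, 0.9])
sentence, _ = closest(combined, corpus)
```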
| wint3rmute wrote:
| A recent paper [1] shows that what you're describing is
| possible to some degree - you can reproduce text from its
| embeddings. The paper provides the code implementing the
| reversal process, so you could quickly hack together a
| prototype :)
|
| I also recommend a video by Yannic Kilcher [2], explaining the
| paper.
|
| [1] https://arxiv.org/abs/2310.06816
|
| [2] https://www.youtube.com/watch?v=FY5j3P9tCeA
| kridsdale3 wrote:
| It seems to me like a lot of what LLMs and image generators do
| is find the interpolation between two points in concept-space.
| So that's not vector addition but vector averaging. It's still
| arithmetic.
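The averaging idea can be sketched directly: linearly interpolate between two embedding vectors and renormalize back onto the unit sphere (a minimal sketch; real pipelines sometimes use spherical interpolation instead):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def lerp(a, b, t):
    """Linear interpolation between two embedding vectors, t in [0, 1]."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

# Midpoint of two unit vectors, pushed back onto the unit sphere:
a = normalize([1.0, 0.0])
b = normalize([0.0, 1.0])
mid = normalize(lerp(a, b, 0.5))
```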
| pjs_ wrote:
| Perhaps I'm missing something but this looks like a heavy case of
| what's old is new again - the original "king - man + woman =
| queen" paper is nearly a decade old:
|
| https://arxiv.org/abs/1509.01692
| kridsdale3 wrote:
| I find it so fascinating that at the end of the article the
| author alludes to something I've started becoming aware of:
|
| There is a zone of illegal thoughts, that becomes definable by
| model-training. A physical boundary in n-dimensional concept-
| space. An "aligned" or "safe" AI system knows where this boundary
| is and does not reach inside it. Vectors (embeddings) that would
| probe it should instead intersect the surface like a ray-trace in
| graphics, and return the embedded concept at minimum distance to
| the safe-idea-boundary.
|
| Intuitively, we all know what this zone is. It's the difference
| between being a wild barbarian and a gentleman. Or being chill vs
| antisocial. Seeing it in pure math is pretty awesome.
___________________________________________________________________
(page generated 2023-12-15 23:01 UTC)