[HN Gopher] What are embeddings?
___________________________________________________________________
What are embeddings?
Author : Anon84
Score : 132 points
Date : 2023-06-25 16:27 UTC (6 hours ago)
(HTM) web link (vickiboykis.com)
(TXT) w3m dump (vickiboykis.com)
| cubefox wrote:
| An array of floats (an n-dimensional vector) which represents
| some piece of data like a text or an image. Different embeddings
| can be more or less close to each other, and this closeness
| indicates similarity.
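|
| A minimal sketch of that closeness idea in Python (the vectors
| and their values are made up; real embeddings have hundreds or
| thousands of dimensions):
|
|     import numpy as np
|
|     # Made-up embeddings for three pieces of data.
|     cat = np.array([0.8, 0.1, 0.3, 0.5])
|     kitten = np.array([0.7, 0.2, 0.4, 0.5])
|     car = np.array([-0.6, 0.9, 0.0, -0.2])
|
|     def cosine_similarity(a, b):
|         # 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite.
|         return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     print(cosine_similarity(cat, kitten))  # high: similar data
|     print(cosine_similarity(cat, car))     # low: dissimilar data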
| KRAKRISMOTT wrote:
| OP you need to make it clear that it's a book. The website is
| confusing.
| gtirloni wrote:
| There is a big green button that says "Get PDF" when I visit
| it.
| charcircuit wrote:
| A mapping whose codomain is an N dimensional space.
| moralestapia wrote:
| Don't know why you're downvoted since you're absolutely
| correct.
|
| They're just locality-sensitive hash functions.
| corobo wrote:
| I'd imagine it's because it's the title of the thing being
| linked to, not an actual question to be answered.
|
| Farting out a quick one line answer is a boring comment
| layer8 wrote:
| Those seem like two very different definitions.
| pizza wrote:
| lsh: similarity(x, y) < thresh =>
| E[P(d(lsh(x), lsh(y)) < 1)] > 1-eps, for some eps
|
| For a model that learns a good representation, and some
| suitable distance function d to measure distances between
| embeddings (e.g. Euclidean):
|
| model: similarity(x, y) < thresh =>
| E[P(d(emb(x), emb(y)) < r)] > 1-eps
|
| for some cutoff distance r in the vector space
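|
| A quick way to see the lsh side of that in code:
| random-hyperplane hashing (a standard LSH family for cosine
| distance), with made-up dimensions:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     planes = rng.normal(size=(16, 64))   # 16 random hyperplanes in 64-d space
|
|     def lsh(v):
|         # Sign of the projection onto each hyperplane -> a 16-bit hash.
|         return (planes @ v > 0).astype(int)
|
|     x = rng.normal(size=64)
|     y = x + 0.1 * rng.normal(size=64)    # close to x
|     z = rng.normal(size=64)              # unrelated to x
|
|     print((lsh(x) != lsh(y)).sum())      # few differing bits
|     print((lsh(x) != lsh(z)).sum())      # roughly half the bits differ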
| neonate wrote:
| Paper:
| https://github.com/veekaybee/what_are_embeddings/blob/main/e...
| [deleted]
| sp332 wrote:
| I'd like to find a way to start with an embedding and have the
| computer generate some text that corresponds, at least
| approximately. There are tools that do that for images, right?
| With Stable Diffusion, for example, you can put an image in, get
| an embedding, then do gradient descent in latent space to find a
| new embedding, then generate a new image from that.
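|
| The image-side loop is roughly this (a sketch only; encoder and
| decoder are hypothetical stand-ins for whatever model pair you
| use, not a specific library API):
|
|     import torch
|
|     def invert(target_embedding, encoder, decoder, latent_shape, steps=200, lr=0.1):
|         # Start from a random latent and pull it toward the target embedding.
|         z = torch.randn(latent_shape, requires_grad=True)
|         opt = torch.optim.Adam([z], lr=lr)
|         for _ in range(steps):
|             opt.zero_grad()
|             # Decode the candidate, re-encode it, compare to the target.
|             loss = torch.nn.functional.mse_loss(encoder(decoder(z)), target_embedding)
|             loss.backward()
|             opt.step()
|         return decoder(z).detach()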
| kreeben wrote:
| The most basic embedding (that I can think of) is one where the
| number of dimensions corresponds to the number of unique
| characters in your lexicon. If a component of the embedding is
| greater than 0, you know that the corresponding character
| occurs in the word. This embedding does not encode the order of
| the characters, though; it is just a "bag of characters". If
| you were to then also encode the order of the characters in,
| say, yet another embedding, you could use those two embeddings
| to recreate the original word.
|
| Combine the two embeddings into a new vector space and BAM,
| you've invented "embedding2word".
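|
| A toy version of that bag-of-characters embedding in Python:
|
|     from collections import Counter
|
|     lexicon = sorted(set("hello world"))   # unique characters define the dimensions
|
|     def bag_of_chars(word):
|         counts = Counter(word)
|         return [counts.get(ch, 0) for ch in lexicon]
|
|     print(lexicon)                 # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
|     print(bag_of_chars("hello"))   # nonzero entries say which characters occur,
|                                    # but the order is lost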
| jorlow wrote:
| GPT (and many others) just adds these embeddings together in
| the model, so you could do that and have one vector that
| encodes both things together.
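|
| Roughly what that addition looks like, sketched with made-up
| sizes (not any particular model's real dimensions):
|
|     import torch
|
|     vocab_size, max_len, d_model = 1000, 128, 64
|     tok_emb = torch.nn.Embedding(vocab_size, d_model)
|     pos_emb = torch.nn.Embedding(max_len, d_model)
|
|     token_ids = torch.tensor([[5, 42, 7]])                    # a toy sequence
|     positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2]]
|
|     # One vector per token encoding both its identity and its position.
|     x = tok_emb(token_ids) + pos_emb(positions)
|     print(x.shape)  # torch.Size([1, 3, 64])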
| freeone3000 wrote:
| You can get this fairly trivially with word and sentence
| embeddings just by running the inverse (huggingface models have
| this as a builtin). For llama, the same is possible, but the
| matrix transpose is your responsibility :)
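|
| One way to do that transpose step by hand, sketched with the
| transformers library (gpt2 is just a convenient example; the
| nearest-token lookup here is my own sketch, not a specific
| built-in):
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|
|     with torch.no_grad():
|         E = model.get_input_embeddings().weight        # (vocab_size, hidden_dim)
|         vec = E[tok.encode("hello")[0]]                # an embedding to invert
|         vec = vec + 0.01 * torch.randn_like(vec)       # slightly perturbed
|
|         # Nearest-token lookup via the transposed embedding matrix.
|         scores = vec @ E.T
|         print(tok.decode([int(scores.argmax())]))      # likely recovers "hello"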
| sp332 wrote:
| Ok, I got annoyed at oobabooga (around when it first came
| out) and have been messing with llama.cpp since then. I can't
| tell if this feature request is the same thing we're talking
| about here?
| https://github.com/ggerganov/llama.cpp/issues/1552 I guess if
| I want more features, I should move on to something that has
| more features lol.
| TeMPOraL wrote:
| Looking at that issue, I wonder how it is that everyone
| seemed to not understand the poster's question.
|
| The feature itself is something I wanted to play with too,
| as it's kind of an obvious thing to want. I mean, these
| models execute a pipeline:
|
| [text] -> [tokens] -> <[embeddings] -> [inference] ->
| [embeddings]> -> [tokens] -> [text]
|
| Where the part in < ... > may or may not be implemented as
| a single step (i.e. all three parts interleaved).
|
| Now, apparently all the magic (not the "how transformer
| works", but the "how the hell are they this good" / "GPT-4
| is uncanny valley" kind of magic) of transformer models
| sits in the latent space and is invoked by the < ... > bit.
| We also know for sure that you can make the pipeline look
| like this:
|
| [text] -> [tokens] -> <[embeddings] -> [inference]> ->
| [embeddings]
|
| So with the two things in mind, it's kind of obvious you'd
| also want a pipe that looks like:
|
| [embeddings] -> [inference] -> [embeddings] (and optionally
| -> [tokens] -> [text])
|
| for the sole purpose of messing around and exploring the
| latent space itself.
|
| I'm very much not up to date with the whole space, so I
| might be missing something, but I'd thought that poking
| around the latent space would be getting _a lot_ more
| attention than it seems to be getting.
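|
| For what it's worth, the Python transformers library does let
| you enter the pipeline at the embedding stage via its
| inputs_embeds argument; a sketch (gpt2 as a stand-in model, not
| llama.cpp):
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|
|     # [text] -> [tokens] -> [embeddings]
|     ids = tok("The capital of France is", return_tensors="pt").input_ids
|     embeds = model.get_input_embeddings()(ids)
|
|     # ...poke around in latent space here, then [embeddings] -> [inference]
|     with torch.no_grad():
|         out = model(inputs_embeds=embeds)
|
|     # [logits] -> [tokens] -> [text]
|     next_id = int(out.logits[0, -1].argmax())
|     print(tok.decode([next_id]))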
| danieldk wrote:
| This is basically how RNN encoder/decoder architectures worked.
| The encoder encoded the input as a single vector and the
| decoder decoded it into text (e.g. for machine translation)
| [1]. However, fixed-length vectors generally required too much
| 'compression' to represent variable-length text, so people
| started adding attention mechanisms so that the decoder could
| also attend to the input text. The seminal Transformer paper by
| Vaswani et al. then showed that the attention mechanism alone
| is enough and you can ditch the RNN entirely (hence the title
| 'Attention Is All You Need'), and here we are.
|
| So, this has been possible already for quite a long time.
|
| [1] https://arxiv.org/pdf/1409.3215.pdf
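|
| A toy illustration of that fixed-length bottleneck, with random
| tensors standing in for real token embeddings:
|
|     import torch
|
|     # The encoder squeezes the whole input sequence into one fixed-length vector.
|     enc = torch.nn.GRU(input_size=32, hidden_size=64, batch_first=True)
|     dec = torch.nn.GRU(input_size=32, hidden_size=64, batch_first=True)
|
|     src = torch.randn(1, 10, 32)            # 10 input steps
|     _, h = enc(src)                         # h: (1, 1, 64) -- the single vector
|     out, _ = dec(torch.randn(1, 5, 32), h)  # decoder conditions only on h
|     print(h.shape, out.shape)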
| sdenton4 wrote:
| To be sure, the seq2seq style already allowed variable-length
| embeddings before transformers were a thing. RNN vs. Transformer
| is entirely an implementation choice; you can build a seq2seq
| model with any combination of transformer, RNN, and
| conventional layers.
|
| Each of these layer types has different computational costs for
| training and inference, and encodes different inductive biases,
| which may be more or less appropriate to a given problem.
___________________________________________________________________
(page generated 2023-06-25 23:00 UTC)