[HN Gopher] You could have designed state of the art positional ...
___________________________________________________________________
You could have designed state of the art positional encoding
Author : Philpax
Score : 199 points
Date : 2024-11-17 20:31 UTC (1 days ago)
(HTM) web link (fleetwood.dev)
(TXT) w3m dump (fleetwood.dev)
| valine wrote:
  | One of the things I really love about RoPE is that it allows for
  | a lot of interesting encoding schemes at inference time without
  | retraining the model. I've had a lot of fun playing with
  | different relative positions. You can elicit a lot of
  | interesting behaviors from the model when you use different
  | rotations for keys vs queries; they don't always have to match.
  |
  | For example, exact position doesn't matter too much when tokens
  | are spaced out. Say you use token position 100 for your query:
  | you can shift all the keys around position 100, and the further
  | back in the context they are, the more freedom you have to play
  | with their exact positions.
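  |
  | Rough sketch of the shape of the trick (a toy RoPE helper and
  | made-up position offsets, purely illustrative):
  |
  |   import torch
  |
  |   def rope(x, positions, base=10000.0):
  |       # x: (seq, dim) with dim even; positions: (seq,) floats.
  |       # Rotate each consecutive channel pair by position * freq.
  |       dim = x.shape[-1]
  |       freqs = base ** (-torch.arange(0, dim, 2).float() / dim)
  |       ang = positions[:, None] * freqs[None, :]
  |       cos, sin = ang.cos(), ang.sin()
  |       x1, x2 = x[..., 0::2], x[..., 1::2]
  |       out = torch.empty_like(x)
  |       out[..., 0::2] = x1 * cos - x2 * sin
  |       out[..., 1::2] = x1 * sin + x2 * cos
  |       return out
  |
  |   seq, dim = 6, 64
  |   q, k = torch.randn(seq, dim), torch.randn(seq, dim)
  |
  |   # Usual RoPE: queries and keys share the same positions.
  |   pos = torch.arange(seq).float()
  |   q_usual, k_usual = rope(q, pos), rope(k, pos)
  |
  |   # Nothing forces them to match: e.g. keep every query at
  |   # position 100 and scatter the keys around position 100.
  |   q_rot = rope(q, torch.full((seq,), 100.0))
  |   k_rot = rope(k, torch.tensor([95., 96., 97., 98., 99., 100.]))
  |   scores = q_rot @ k_rot.T  # attention logits with custom offsets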
| bhickey wrote:
| Can you describe the behaviors that you can elicit with this
| technique?
| valine wrote:
      | One strategy I've been playing around with is to take an
      | instruction I want the model to follow, squish the positional
      | encodings for its keys down to position zero, and push the new
      | queries slightly further out in the window. The model will
      | still follow the instruction, but the behaviors are more
      | global. It behaves more like a fine-tune and less like the
      | instruction is part of the conversation.
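      |
      | Concretely, the position tensors end up looking something like
      | this (illustrative sizes and offsets only):
      |
      |   import torch
      |
      |   # Hypothetical layout: 8 instruction tokens followed by 24
      |   # conversation tokens. Instruction keys get squished to
      |   # position 0; the new queries sit a bit further out.
      |   n_instr, n_conv = 8, 24
      |
      |   key_pos = torch.cat([
      |       torch.zeros(n_instr),              # instruction keys -> 0
      |       torch.arange(1, n_conv + 1).float(),
      |   ])
      |   query_pos = torch.arange(1, n_conv + 1).float() + 4.0
      |
      |   # key_pos / query_pos would then replace the usual shared
      |   # torch.arange(seq_len) positions fed into the RoPE rotation.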
| zackangelo wrote:
        | I'm surprised this is the case! I've been working on a RoPE
        | implementation for my own project (needed to account for
        | padding in unique situations) and even an off-by-one error
        | usually causes the model to produce nonsensical output.
| valine wrote:
| You have to be careful to keep the relative positions for
| adjacent and nearby tokens intact. The relative positions of
| distant tokens are less brittle.
| rgovostes wrote:
| Thanks to the author for clarifying something that's been a
| mystery to me for a few years. The positional encoding scheme in
| the "Attention Is All You Need" paper is only given half a page
| and the construction appears to come out of nowhere.
| FL33TW00D wrote:
| Thank you! Seemed like voodoo to me too, hence this post!
| throwawaymaths wrote:
  | Maybe someone could answer this for me: it seems like encoding
  | the positional embeddings as augmentations to the "natural"
  | activations, instead of as their own inputs (concatenated onto
  | the activations), makes things like sliding a window much
  | harder... I guess the obvious drawback is that you have somewhat
  | less textually derived information.
  |
  | I recall an early transformers video where they tried both and
  | it turned out that adding the position onto the existing vectors
  | was no worse, so they went with it... No further discussion of
  | the motivation happened in that video.
  |
  | Is it worth revisiting that now that activations have a
  | gobsmackingly large dimension?
| stephantul wrote:
| They are not concatenated, but summed. I think concatenation
| wouldn't work, as you indicate.
|
    | I think you mean the line in the original paper where they say
    | they compared learned positional embeddings with the predefined
    | encoding, and it made no difference.
| throwawaymaths wrote:
| > I think concatenation wouldn't work, as you indicate.
|
| Why do you say that?
| donkeyboy wrote:
        | Concat could work too, although it's less efficient because
        | you need to make a new tensor.
        |
        | Actually, summing might learn a concat on its own. Imagine
        | the embedding learned for a token takes up the first N-20
        | dimensions and leaves the last 20 dimensions as 0, and the
        | positional encoding leaves the first N-20 dims as 0 and uses
        | the last 20 to encode the information. Then when you sum,
        | you are actually concatenating. So I think of them as
        | equivalent, except that add is more efficient and preserves
        | the dim space, while concat would grow the dim space. And
        | for something like position, which certainly does not need
        | to occupy 1000+ dimensions, it would not make sense to
        | concat all of that, since it would be wasteful.
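        |
        | A tiny sanity check of that picture (made-up sizes, supports
        | disjoint by construction):
        |
        |   import torch
        |
        |   dim, n_pos = 16, 4
        |
        |   # Token embedding that only uses the first dim - n_pos dims.
        |   tok = torch.zeros(dim)
        |   tok[: dim - n_pos] = torch.randn(dim - n_pos)
        |
        |   # Positional encoding that only uses the last n_pos dims.
        |   pos = torch.zeros(dim)
        |   pos[dim - n_pos:] = torch.randn(n_pos)
        |
        |   # Summing vectors with disjoint support is a concatenation.
        |   summed = tok + pos
        |   concat = torch.cat([tok[: dim - n_pos], pos[dim - n_pos:]])
        |   print(torch.equal(summed, concat))  # True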
| throwawaymaths wrote:
| why would you need to make a new tensor?
|
          | Suppose you had 4096-dimensional (Llama 2 sized)
          | activations. Maybe you make do with 3072 activations and
          | concatenate 1024 positional activations onto that.
          |
          | Then you pass that to Mk, Mq, Mv and generate K, Q, V.
          |
          | The only thing that would change would be the Mff-out,
          | which would now be a (big)x3072 matrix instead of
          | (big)x4096.
|
| In any case you would be retraining, so changing the dims
| of the tensors I think is not a big deal... In fact in
| this case they would be smaller (at the cost of fewer
| interlayer activations), but you would have the same
| number of tensors.
|
| > Actually summing might learn a concat on its own.
|
| But you see the point? You're forcing the model to learn
| something that maybe it didn't need to. That's like
| saying "well a fully connected network might learn
| convolution on its own". Historically breakthroughs in
| capability have accompanied one of: [more data | more
| layers | smarter constraints on activations]
|
| Unless you have some sort of argument that forcing it to
| learn position has carryover value in generating
| activations, it seems, naively, a bad idea.
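          |
          | For what it's worth, the shape bookkeeping for that concat
          | variant would look roughly like this (hypothetical 3072 +
          | 1024 split, not how Llama is actually built):
          |
          |   import torch
          |   import torch.nn as nn
          |
          |   d_model, d_tok, d_pos = 4096, 3072, 1024
          |   seq = 8
          |
          |   tok_act = torch.randn(seq, d_tok)   # textual part
          |   pos_act = torch.randn(seq, d_pos)   # positional part
          |   x = torch.cat([tok_act, pos_act], dim=-1)  # (seq, 4096)
          |
          |   # The attention projections keep their usual 4096 width.
          |   W_q = nn.Linear(d_model, d_model, bias=False)
          |   W_k = nn.Linear(d_model, d_model, bias=False)
          |   W_v = nn.Linear(d_model, d_model, bias=False)
          |   q, k, v = W_q(x), W_k(x), W_v(x)
          |
          |   # Per the idea above, only the FFN output projection
          |   # would shrink, writing just the 3072 token dims.
          |   # "(big)" is taken as 4 * d_model purely for illustration.
          |   ffn_out = nn.Linear(4 * d_model, d_tok, bias=False)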
| cperciva wrote:
| The binary coding example would have been much better with Gray
| codes.
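  |
  | For anyone unfamiliar: the binary-reflected Gray code flips
  | exactly one bit between consecutive integers, which is the
  | smoothness the plain binary example lacks. A quick illustration:
  |
  |   def gray(n: int) -> int:
  |       # Binary-reflected Gray code of n.
  |       return n ^ (n >> 1)
  |
  |   print([format(gray(i), "03b") for i in range(8)])
  |   # ['000', '001', '011', '010', '110', '111', '101', '100']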
| jcims wrote:
| I'm effectively a complete layman in this (although I do see some
| parallels to physical positional encoders, which is interesting)
| so at first read this entire thing went WAAAAY over my head. At
| first glance it seemed to be way overcomplicated just to encode
| position, so I figured I was missing something. ChatGPT was super
| helpful in explaining spiking neural networks to me so I just
| spent 20 minutes asking ChatGPT to explain this to me and I feel
| like I actually learned something.
|
| Then at the end I asked ChatGPT how this all relates to how it
| operates and it was interesting to see things like:
|
| >Tokens as Subword Units: I use a tokenization method called Byte
| Pair Encoding (BPE), which breaks text into subword units.
|
| I don't know if it's accurate or not, but it's wild seeing it
| talk about how it works.
| refulgentis wrote:
| 100% accurate
| gloflo wrote:
    | The context includes that "it" is ChatGPT. The fact that
    | ChatGPT uses Byte Pair Encoding is widely published. It is to
    | be expected that an LLM can regurgitate this kind of
    | information; nothing wild about that.
| astrange wrote:
| Note if you don't have a good system prompt, other LLMs will
| also tell you they're ChatGPT or Claude.
| im3w1l wrote:
| That's kind of interesting. Like they will know they are an
| AI? Just not which one?
| Der_Einzige wrote:
| Similarly, "you" could have designed state of the art LLM
| sampling:
| https://openreview.net/forum?id=FBkpCyujtS&referrer=%5BTasks...
| imjonse wrote:
  | I don't think the first code example should work (it indeed
  | prints False here).
  |
  | When given a permuted sequence, the attention output will also
  | be permuted, not identical. The need for positional encodings
  | comes from the fact that a pair of tokens yields the same
  | attention score regardless of the tokens' absolute and relative
  | positions; that alone is enough to miss a lot of meaning.
| FL33TW00D wrote:
    | The first code example says False because of the high
    | precision; I've updated the example.
| jmmcd wrote:
| But u/imjonse's reasoning seems right. I haven't run either
| version of the code, but when reading it I expected that to
| be False. The output is still a list with an order.
|
| the dog chased the cat: position 1 in the output is
| attention(dog, everything)
|
| the cat chased the dog: position 1 in the output is
| attention(cat, everything)
| FL33TW00D wrote:
| Run the code and look at the values!
| jmmcd wrote:
| Well, yes, I deserved that reply! And yes the code is
| printing True. It's not that I disbelieved you... but
| something is wrong here. Investigation below, thanks to
          | Claude.ai for walking me through it!
          |
          |   In [10]: o1[0, :, :3]
          |   Out[10]:
          |   tensor([[ 0.0053,  0.0017, -0.0012],
          |           [ 0.0053,  0.0017, -0.0012],
          |           [ 0.0053,  0.0017, -0.0012],
          |           [ 0.0053,  0.0017, -0.0012],
          |           [ 0.0053,  0.0017, -0.0012],
          |           [ 0.0053,  0.0017, -0.0012]],
          |          grad_fn=<SliceBackward0>)
          |
          | Every token has the same attention values. I expect
          | attention(cat, everything) to differ from attention(dog,
          | everything), even without positional encoding.
          |
          | Further, the attention weights are uniform and identical
          | for both sentences:
          |
          |   In [46]: o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
          |   In [47]: o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))
          |   In [48]: aw1.shape
          |   Out[48]: torch.Size([1, 6, 6])
          |   In [49]: aw2.shape
          |   Out[49]: torch.Size([1, 6, 6])
          |   In [50]: aw1
          |   Out[50]:
          |   tensor([[[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
          |            [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
          |            [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
          |            [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
          |            [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
          |            [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]]],
          |          grad_fn=<MeanBackward1>)
          |   In [51]: aw2
          |   Out[51]:
          |   tensor([[[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
          |            [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
          |            [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
          |            [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
          |            [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
          |            [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]]],
          |          grad_fn=<MeanBackward1>)
          |
          | That is not expected. It's because the Linear layers are
          | initialised with such small values. And the softmax causes
          | a collapse.
          |
          | Trying random weights on a larger scale:
          |
          |   In [52]: W_q.weight.data *= 100
          |      ...: W_k.weight.data *= 100
          |      ...: W_v.weight.data *= 100
          |   In [55]: o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
          |   In [56]: o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))
          |   In [57]: aw1
          |   Out[57]:
          |   tensor([[[0.2049, 0.1606, 0.1256, 0.1095, 0.1723, 0.2270],
          |            [0.0883, 0.2047, 0.1544, 0.2776, 0.1405, 0.1345],
          |            [0.1196, 0.1719, 0.1831, 0.1541, 0.1374, 0.2339],
          |            [0.1413, 0.2399, 0.1617, 0.2056, 0.1634, 0.0880],
          |            [0.1455, 0.1432, 0.2432, 0.1239, 0.1494, 0.1948],
          |            [0.1897, 0.1817, 0.1920, 0.1478, 0.1618, 0.1270]]],
          |          grad_fn=<MeanBackward1>)
          |   In [58]: aw2
          |   Out[58]:
          |   tensor([[[0.2049, 0.1606, 0.2270, 0.1095, 0.1723, 0.1256],
          |            [0.0883, 0.2047, 0.1345, 0.2776, 0.1405, 0.1544],
          |            [0.1897, 0.1817, 0.1270, 0.1478, 0.1618, 0.1920],
          |            [0.1413, 0.2399, 0.0880, 0.2056, 0.1634, 0.1617],
          |            [0.1455, 0.1432, 0.1948, 0.1239, 0.1494, 0.2432],
          |            [0.1196, 0.1719, 0.2339, 0.1541, 0.1374, 0.1831]]],
          |          grad_fn=<MeanBackward1>)
          |   In [60]: o1[:, :, :5]
          |   Out[60]:
          |   tensor([[[ 0.0145,  0.3128, -0.3659, -0.1884,  0.1724],
          |            [-0.2319,  0.1407, -0.6010, -0.4064,  0.4259],
          |            [-0.3231,  0.1622, -0.6351, -0.1711,  0.4014],
          |            [-0.0596,  0.2610, -0.7388, -0.2987,  0.3214],
          |            [-0.2750,  0.0676, -0.4140, -0.2024,  0.3383],
          |            [-0.1434,  0.0871, -0.3154, -0.0755,  0.3314]]],
          |          grad_fn=<SliceBackward0>)
          |   In [61]: o2[:, :, :5]
          |   Out[61]:
          |   tensor([[[ 0.0145,  0.3128, -0.3659, -0.1884,  0.1724],
          |            [-0.2319,  0.1407, -0.6010, -0.4064,  0.4259],
          |            [-0.1434,  0.0871, -0.3154, -0.0755,  0.3314],
          |            [-0.0596,  0.2610, -0.7388, -0.2987,  0.3214],
          |            [-0.2750,  0.0676, -0.4140, -0.2024,  0.3383],
          |            [-0.3231,  0.1622, -0.6351, -0.1711,  0.4014]]],
          |          grad_fn=<SliceBackward0>)
          |   In [62]: print("Matches: ", torch.allclose(o1, o2, atol=1e-6))
          |   Matches:  False
| FL33TW00D wrote:
| Hm! Very interesting! Thank you for taking the time to
| debug that.
|
| I'm going to have to think hard about how to rewrite the
| motivating example to explain this best.
|
| Edit: updated the post, thanks for pointing out the
| pernicious init values!
| aconz2 wrote:
    | To add on, since this took me a while to understand: for a
    | single token, self attention is permutation invariant, because
    | we take the qK-weighted (one query dotted with all the keys)
    | sum of all the values; that sum is what gives the invariance,
    | because + is commutative. But taken over all the tokens, the
    | MHA output matrix will not be invariant but rather equivariant:
    | permuting the input tokens applies the same permutation to the
    | rows of the output matrix. A more useful example might be to
    | take one position, like the last one, and compute its MHA
    | output for every permutation of the previous tokens; those
    | will/should all be the same.
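    |
    | Both claims are easy to check numerically (illustrative sketch
    | using PyTorch's stock MultiheadAttention, no positional
    | encoding anywhere):
    |
    |   import torch
    |   import torch.nn as nn
    |
    |   torch.manual_seed(0)
    |   mha = nn.MultiheadAttention(embed_dim=16, num_heads=1,
    |                               batch_first=True)
    |   x = torch.randn(1, 6, 16)            # (batch, seq, dim)
    |   perm = torch.tensor([3, 1, 0, 5, 4, 2])
    |   x_p = x[:, perm, :]
    |
    |   out, _ = mha(x, x, x)
    |   out_p, _ = mha(x_p, x_p, x_p)
    |
    |   # Equivariance: permuting the inputs permutes the output rows
    |   # in exactly the same way.
    |   print(torch.allclose(out[:, perm, :], out_p, atol=1e-6))
    |
    |   # Invariance for one query: attending from the last token
    |   # over permuted keys/values gives the same result.
    |   q = x[:, -1:, :]
    |   a1, _ = mha(q, x, x)
    |   a2, _ = mha(q, x_p, x_p)
    |   print(torch.allclose(a1, a2, atol=1e-6))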
| elieb44 wrote:
  | How about context encoding more generally? Are there techniques
  | to do that? I.e., during training, I want the string "Dubito
  | ergo cogito, cogito ergo sum, sum ergo Deus est." to have Rene
  | Descartes embedded as the main author, the year 1637 as the date
  | of writing, and "Discours de la methode" as the global context
  | of writing.
  |
  | So that when trained on another part of the same book, the model
  | can learn they were from the same context.
| jmmcd wrote:
    | This is a good idea! The answer, to my knowledge, is that no
    | one does this; we just use the simplest, stupidest possible
    | method, which is to concatenate all the text in the world. That
    | is during training, of course. At runtime, there is the system
    | prompt.
|
| The second simplest method might indeed use something like a
| system prompt with metadata like that, injected before the
| current window of text. But what would happen at runtime, when
| that metadata is not present? Probably performance would be
| much worse.
| logicchains wrote:
  | Does anyone know why 2D RoPE implementations apply two separate
  | 1D rotations to pairs, instead of applying a 2D rotation to
  | triplets?
| rini17 wrote:
    | No, they apply many rotations: one for each pair of dimensions
    | of the embedding space, each with its own frequency.
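    |
    | One common way the 2D case is implemented (an illustrative
    | axial sketch, not any specific library's code): half the
    | channels are rotated by the x coordinate and the other half by
    | the y coordinate, each with the ordinary pairwise 1D rotation.
    |
    |   import torch
    |
    |   def rope_1d(x, pos, base=10000.0):
    |       # Standard 1D RoPE: rotate channel pair i by pos * freq_i.
    |       dim = x.shape[-1]
    |       freqs = base ** (-torch.arange(0, dim, 2).float() / dim)
    |       ang = pos[:, None] * freqs[None, :]
    |       cos, sin = ang.cos(), ang.sin()
    |       x1, x2 = x[..., 0::2], x[..., 1::2]
    |       out = torch.empty_like(x)
    |       out[..., 0::2] = x1 * cos - x2 * sin
    |       out[..., 1::2] = x1 * sin + x2 * cos
    |       return out
    |
    |   def rope_2d(x, pos_x, pos_y):
    |       half = x.shape[-1] // 2
    |       return torch.cat([rope_1d(x[..., :half], pos_x),
    |                         rope_1d(x[..., half:], pos_y)], dim=-1)
    |
    |   patches = torch.randn(4, 64)          # 4 image patches
    |   pos_x = torch.tensor([0., 1., 0., 1.])
    |   pos_y = torch.tensor([0., 0., 1., 1.])
    |   rotated = rope_2d(patches, pos_x, pos_y)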
| espadrine wrote:
| > _Furthermore, by rotating the vector, we have absolutely zero
| impact on the norm of the vector, which encodes the semantic
| information of our token._
|
| Doesn't the angle encode semantic information? Cosine similarity
| works for embeddings after all.
| Scene_Cast2 wrote:
| If you're interested in positional embeddings for Transformers,
| check out this repo - https://github.com/gazelle93/Attention-
| Various-Positional-En... - it implements various popular ones.
| breadislove wrote:
  | There is a really interesting blog post (by the main author of
  | the RoPE paper) about making RoPE multimodal, as used by
  | Qwen2-VL. It's in Chinese, but Google Translate does a pretty
  | good job: https://spaces.ac.cn/archives/10040
| 1024core wrote:
| I didn't get the sudden leap from "position encodings" to "QKV"
| magic.
|
| What is the connection between the two? Where does "Q" come from?
| What are "K" and "V"? (I know they stand for "Query", "Key",
| "Value"; but what do they have to do with position embeddings?)
| flebron wrote:
    | All of them are vectors of embedded representations of tokens.
    | In a transformer, you want to compute the inner product between
    | a query (the token that is doing the attending) and a key (the
    | token that is being attended to). An inductive bias we have is
    | that the network will perform better if this inner product
    | depends on the relative distance between the query token's
    | position and the key token's position. We thus encode each one
    | with positional information in such a way that (for RoPE at
    | least) the inner product depends only on the distance between
    | these tokens, and not on their absolute positions in the input
    | sentence.
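    |
    | A tiny numerical illustration of that last property, with a
    | single 2D rotation standing in for one RoPE frequency pair
    | (toy values):
    |
    |   import torch
    |
    |   def rot(v, theta):
    |       # Rotate a 2D vector by angle theta.
    |       c, s = torch.cos(theta), torch.sin(theta)
    |       return torch.stack([c * v[0] - s * v[1],
    |                           s * v[0] + c * v[1]])
    |
    |   q, k = torch.randn(2), torch.randn(2)
    |   theta = torch.tensor(0.3)   # arbitrary frequency
    |
    |   # Same relative offset (7) at two different absolute positions.
    |   a = rot(q, 3 * theta) @ rot(k, 10 * theta)
    |   b = rot(q, 100 * theta) @ rot(k, 107 * theta)
    |   print(torch.allclose(a, b, atol=1e-5))  # True: only n - m matters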
| FL33TW00D wrote:
| "This post intends to limit the mathematical knowledge required
| to follow along, but some basic linear algebra, trigonometry
| and understanding of self attention is expected."
|
    | If you're not sure about self attention, the post will be a
    | little unclear.
| alok-g wrote:
| On a related note, one thing I still do not understand is why are
| positional encodings 'added' to the token embeddings as opposed
| to (having a smaller position encoding vector that is)
| 'concatenated'. It would be great if someone could explain.
___________________________________________________________________
(page generated 2024-11-18 23:02 UTC)