[HN Gopher] You could have designed state of the art positional ...
       ___________________________________________________________________
        
       You could have designed state of the art positional encoding
        
       Author : Philpax
       Score  : 199 points
       Date   : 2024-11-17 20:31 UTC (1 days ago)
        
 (HTM) web link (fleetwood.dev)
 (TXT) w3m dump (fleetwood.dev)
        
       | valine wrote:
       | One of the things I really love about rope is that it allows for
       | a lot of interesting encoding schemes at inference time without
       | model retraining. I've had a lot of fun playing with different
       | relative positions. You can elicit a lot of interesting behaviors
        | from the model when you use different rotations for keys vs
        | queries; they don't always have to match.
        | 
        | For example, exact position doesn't matter too much when tokens
        | are spaced out. Say you use token position 100 for your query:
        | you can shift all the keys around position 100, and the further
        | back they are in the context, the more freedom you have to play
        | with the value.
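        | 
        | Roughly the kind of thing I mean, as a minimal sketch (the
        | rope_rotate helper here is made up for illustration, not my
        | actual code):
        | 
        |     import torch
        | 
        |     def rope_rotate(x, positions, base=10000.0):
        |         # x: (seq, d) with d even; positions: (seq,) positions to encode.
        |         d = x.shape[-1]
        |         inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
        |         angles = positions[:, None] * inv_freq[None, :]  # (seq, d/2)
        |         cos, sin = angles.cos(), angles.sin()
        |         x1, x2 = x[..., 0::2], x[..., 1::2]  # rotate dims in pairs
        |         out = torch.empty_like(x)
        |         out[..., 0::2] = x1 * cos - x2 * sin
        |         out[..., 1::2] = x1 * sin + x2 * cos
        |         return out
        | 
        |     seq, d = 8, 16
        |     q, k = torch.randn(seq, d), torch.randn(seq, d)
        |     pos = torch.arange(seq, dtype=torch.float32)
        | 
        |     # Standard RoPE: queries and keys share the same positions.
        |     q_rot = rope_rotate(q, pos)
        | 
        |     # Inference-time experiment: rotate the keys with different
        |     # positions, e.g. squash the far-back keys toward position 0.
        |     k_pos = pos.clone()
        |     k_pos[:4] = pos[:4] * 0.25
        |     k_rot = rope_rotate(k, k_pos)
        | 
        |     scores = (q_rot @ k_rot.T) / d ** 0.5  # logits with mismatched rotations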
        
         | bhickey wrote:
         | Can you describe the behaviors that you can elicit with this
         | technique?
        
           | valine wrote:
           | One strategy I've been playing around with is to take an
           | instruction I want the model to follow and squish the
            | positional encodings for the keys down to position zero, and
            | push the new queries out slightly further in the window. The
            | model will still follow the instruction, but the behaviors are
            | more global. It behaves more like a fine-tune and less like the
            | instruction is part of the conversation.
        
         | zackangelo wrote:
         | I'm surprised this is the case! I've been working on a rope
         | implementation for my own project (needed to account for
          | padding in unique situations), and even an off-by-one error
          | usually causes the model to produce nonsensical output.
        
           | valine wrote:
           | You have to be careful to keep the relative positions for
           | adjacent and nearby tokens intact. The relative positions of
           | distant tokens are less brittle.
        
       | rgovostes wrote:
       | Thanks to the author for clarifying something that's been a
       | mystery to me for a few years. The positional encoding scheme in
       | the "Attention Is All You Need" paper is only given half a page
       | and the construction appears to come out of nowhere.
        
         | FL33TW00D wrote:
         | Thank you! Seemed like voodoo to me too, hence this post!
        
       | throwawaymaths wrote:
        | Maybe someone could answer this for me: it seems like encoding
        | the positional embeddings as augmentations to the "natural"
        | activations, instead of as their own inputs (concatenated onto
        | the activations), makes things like sliding a window much
        | harder... I guess the obvious drawback is that you'd have
        | somewhat less textually derived information.
       | 
        | I recall an early transformers video where they tried both and
        | it turned out that adding the position onto the existing vectors
        | was no worse, so they went with it... No further discussion of
        | the motivations happened in that video.
       | 
       | Is it worth revisiting that maybe now that activations have a
       | gobsmackingly large dimension?
        
         | stephantul wrote:
         | They are not concatenated, but summed. I think concatenation
         | wouldn't work, as you indicate.
         | 
          | I think you mean the line in the original paper where they say
          | they compared learned positional embeddings with the predefined
          | encoding, and it made no difference.
        
           | throwawaymaths wrote:
           | > I think concatenation wouldn't work, as you indicate.
           | 
           | Why do you say that?
        
             | donkeyboy wrote:
              | Concat could work too, although it's less efficient because
              | you need to make a new tensor.
              | 
              | Actually, summing might learn a concat on its own. Imagine
              | the embedding learned for a token takes up the first N-20
              | dimensions and leaves the last 20 dimensions as 0, and the
              | positional encoding leaves the first N-20 dims as 0 and uses
              | the last 20 to encode the information. Then when you sum,
              | you are actually concatenating. So I think of them as
              | equivalent, except add is more efficient and preserves the
              | dim space, while concat would grow the dim space. And for
              | something like position, which certainly does not need to
              | occupy 1000+ dimensions, it would not make sense to concat
              | all of that since it would be wasteful.
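              | 
              | A toy illustration of that equivalence (made-up sizes, 8
              | content dims plus 2 position dims):
              | 
              |     import torch
              | 
              |     d, d_pos = 8, 2
              |     tok = torch.randn(d + d_pos)
              |     tok[-d_pos:] = 0  # token embedding leaves the last dims empty
              |     pos = torch.zeros(d + d_pos)
              |     pos[-d_pos:] = torch.randn(d_pos)  # position only uses the last dims
              | 
              |     summed = tok + pos  # what transformers do
              |     concat = torch.cat([tok[:d], pos[-d_pos:]])  # explicit concat
              |     print(torch.allclose(summed, concat))  # True: same vector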
        
               | throwawaymaths wrote:
               | why would you need to make a new tensor?
               | 
                | Suppose you had 4096-dimensional (llama-2 sized)
                | activations. Maybe you make do with 3072 activations and
                | concatenate 1024 positional activations onto that.
                | 
                | Then you pass that to Mk, Mq, Mv and generate K, Q, V.
                | 
                | The only thing that would change would be the Mff-out,
                | which would now be a (big)x3072 matrix instead of
                | (big)x4096.
               | 
               | In any case you would be retraining, so changing the dims
               | of the tensors I think is not a big deal... In fact in
               | this case they would be smaller (at the cost of fewer
               | interlayer activations), but you would have the same
               | number of tensors.
               | 
               | > Actually summing might learn a concat on its own.
               | 
               | But you see the point? You're forcing the model to learn
               | something that maybe it didn't need to. That's like
               | saying "well a fully connected network might learn
               | convolution on its own". Historically breakthroughs in
               | capability have accompanied one of: [more data | more
               | layers | smarter constraints on activations]
               | 
               | Unless you have some sort of argument that forcing it to
               | learn position has carryover value in generating
               | activations, it seems, naively, a bad idea.
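                | 
                | A shape-only sketch of the split I'm describing (toy code,
                | with the hypothetical sizes from above):
                | 
                |     import torch
                |     import torch.nn as nn
                | 
                |     d_content, d_pos, d_model = 3072, 1024, 4096
                |     seq = 10
                |     h = torch.randn(seq, d_content)  # "textual" activations
                |     p = torch.randn(seq, d_pos)      # positional activations
                |     x = torch.cat([h, p], dim=-1)    # (seq, 4096) into the projections
                | 
                |     W_q = nn.Linear(d_model, d_model, bias=False)
                |     W_k = nn.Linear(d_model, d_model, bias=False)
                |     W_v = nn.Linear(d_model, d_model, bias=False)
                |     q, k, v = W_q(x), W_k(x), W_v(x)  # attention proceeds as usual
                | 
                |     # Only the FFN output projection changes: it maps back to
                |     # the 3072 content dims, and the positional part is
                |     # re-appended fresh at the next layer.
                |     d_ff = 4 * d_model  # hypothetical "big" FFN width
                |     W_ff_out = nn.Linear(d_ff, d_content, bias=False)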
        
       | cperciva wrote:
       | The binary coding example would have been much better with Gray
       | codes.
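        | 
        | For reference, a sketch of the property in question: the standard
        | reflected binary Gray code is g = n ^ (n >> 1), and neighbouring
        | positions then differ in exactly one bit, unlike plain binary:
        | 
        |     def gray(n: int) -> int:
        |         # Reflected binary Gray code.
        |         return n ^ (n >> 1)
        | 
        |     for n in range(8):
        |         flips = bin(gray(n) ^ gray(n + 1)).count("1")
        |         print(f"{n}: bin {n:03b}  gray {gray(n):03b}  flips to next: {flips}")
        | 
        |     # Plain binary flips up to 3 bits between neighbours (011 -> 100),
        |     # while the Gray code always flips exactly one.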
        
       | jcims wrote:
       | I'm effectively a complete layman in this (although I do see some
       | parallels to physical positional encoders, which is interesting)
       | so at first read this entire thing went WAAAAY over my head. At
       | first glance it seemed to be way overcomplicated just to encode
       | position, so I figured I was missing something. ChatGPT was super
       | helpful in explaining spiking neural networks to me so I just
       | spent 20 minutes asking ChatGPT to explain this to me and I feel
       | like I actually learned something.
       | 
       | Then at the end I asked ChatGPT how this all relates to how it
       | operates and it was interesting to see things like:
       | 
       | >Tokens as Subword Units: I use a tokenization method called Byte
       | Pair Encoding (BPE), which breaks text into subword units.
       | 
       | I don't know if it's accurate or not, but it's wild seeing it
       | talk about how it works.
        
         | refulgentis wrote:
         | 100% accurate
        
         | gloflo wrote:
         | The context includes that "it" is ChatGPT. The fact that
          | ChatGPT uses Byte Pair Encoding is widely published. It is to
          | be expected that an LLM can regurgitate this kind of
          | information; nothing wild about that.
        
           | astrange wrote:
           | Note if you don't have a good system prompt, other LLMs will
           | also tell you they're ChatGPT or Claude.
        
             | im3w1l wrote:
             | That's kind of interesting. Like they will know they are an
             | AI? Just not which one?
        
       | Der_Einzige wrote:
       | Similarly, "you" could have designed state of the art LLM
       | sampling:
       | https://openreview.net/forum?id=FBkpCyujtS&referrer=%5BTasks...
        
       | imjonse wrote:
        | I don't think the first code example should work (it indeed
        | prints False here).
        | 
        | When given a permuted sequence, the attention output will also be
        | permuted, not identical. The need for positional encodings is due
        | to two tokens producing the same values in the final attention
        | matrix regardless of the tokens' absolute and relative positions;
        | that is enough to miss a lot of meaning.
        
         | FL33TW00D wrote:
          | The first code example says False because of the high-precision
          | check; I've updated the example.
        
           | jmmcd wrote:
           | But u/imjonse's reasoning seems right. I haven't run either
           | version of the code, but when reading it I expected that to
           | be False. The output is still a list with an order.
           | 
           | the dog chased the cat: position 1 in the output is
           | attention(dog, everything)
           | 
           | the cat chased the dog: position 1 in the output is
           | attention(cat, everything)
        
             | FL33TW00D wrote:
             | Run the code and look at the values!
        
               | jmmcd wrote:
               | Well, yes, I deserved that reply! And yes the code is
               | printing True. It's not that I disbelieved you... but
               | something is wrong here. Investigation below, thanks to
                | Claude.ai for walking me through it!
                | 
                |     In [10]: o1[0, :, :3]
                |     Out[10]:
                |     tensor([[ 0.0053,  0.0017, -0.0012],
                |             [ 0.0053,  0.0017, -0.0012],
                |             [ 0.0053,  0.0017, -0.0012],
                |             [ 0.0053,  0.0017, -0.0012],
                |             [ 0.0053,  0.0017, -0.0012],
                |             [ 0.0053,  0.0017, -0.0012]], grad_fn=<SliceBackward0>)
                | 
                | Every token has the same attention values. I expect
                | attention(cat, everything) to differ from attention(dog,
                | everything), even without positional encoding.
                | 
                | Further, the attention weights are uniform and identical
                | for both sentences:
                | 
                |     In [46]: o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
                |     In [47]: o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))
                |     In [48]: aw1.shape
                |     Out[48]: torch.Size([1, 6, 6])
                |     In [49]: aw2.shape
                |     Out[49]: torch.Size([1, 6, 6])
                |     In [50]: aw1
                |     Out[50]:
                |     tensor([[[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
                |              [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
                |              [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
                |              [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
                |              [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
                |              [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]]],
                |            grad_fn=<MeanBackward1>)
                |     In [51]: aw2
                |     Out[51]:
                |     tensor([[[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
                |              [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
                |              [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
                |              [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
                |              [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667],
                |              [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667]]],
                |            grad_fn=<MeanBackward1>)
                | 
                | That is not expected. It's because the Linear layers are
                | initialised with such small values. And the softmax
                | causes a collapse.
                | 
                | Trying random weights on a larger scale:
                | 
                |     In [52]: W_q.weight.data *= 100
                |              W_k.weight.data *= 100
                |              W_v.weight.data *= 100
                |     In [55]: o1, aw1 = mha(W_q(e1), W_k(e1), W_v(e1))
                |     In [56]: o2, aw2 = mha(W_q(e2), W_k(e2), W_v(e2))
                |     In [57]: aw1
                |     Out[57]:
                |     tensor([[[0.2049, 0.1606, 0.1256, 0.1095, 0.1723, 0.2270],
                |              [0.0883, 0.2047, 0.1544, 0.2776, 0.1405, 0.1345],
                |              [0.1196, 0.1719, 0.1831, 0.1541, 0.1374, 0.2339],
                |              [0.1413, 0.2399, 0.1617, 0.2056, 0.1634, 0.0880],
                |              [0.1455, 0.1432, 0.2432, 0.1239, 0.1494, 0.1948],
                |              [0.1897, 0.1817, 0.1920, 0.1478, 0.1618, 0.1270]]],
                |            grad_fn=<MeanBackward1>)
                |     In [58]: aw2
                |     Out[58]:
                |     tensor([[[0.2049, 0.1606, 0.2270, 0.1095, 0.1723, 0.1256],
                |              [0.0883, 0.2047, 0.1345, 0.2776, 0.1405, 0.1544],
                |              [0.1897, 0.1817, 0.1270, 0.1478, 0.1618, 0.1920],
                |              [0.1413, 0.2399, 0.0880, 0.2056, 0.1634, 0.1617],
                |              [0.1455, 0.1432, 0.1948, 0.1239, 0.1494, 0.2432],
                |              [0.1196, 0.1719, 0.2339, 0.1541, 0.1374, 0.1831]]],
                |            grad_fn=<MeanBackward1>)
                |     In [60]: o1[:, :, :5]
                |     Out[60]:
                |     tensor([[[ 0.0145,  0.3128, -0.3659, -0.1884,  0.1724],
                |              [-0.2319,  0.1407, -0.6010, -0.4064,  0.4259],
                |              [-0.3231,  0.1622, -0.6351, -0.1711,  0.4014],
                |              [-0.0596,  0.2610, -0.7388, -0.2987,  0.3214],
                |              [-0.2750,  0.0676, -0.4140, -0.2024,  0.3383],
                |              [-0.1434,  0.0871, -0.3154, -0.0755,  0.3314]]],
                |            grad_fn=<SliceBackward0>)
                |     In [61]: o2[:, :, :5]
                |     Out[61]:
                |     tensor([[[ 0.0145,  0.3128, -0.3659, -0.1884,  0.1724],
                |              [-0.2319,  0.1407, -0.6010, -0.4064,  0.4259],
                |              [-0.1434,  0.0871, -0.3154, -0.0755,  0.3314],
                |              [-0.0596,  0.2610, -0.7388, -0.2987,  0.3214],
                |              [-0.2750,  0.0676, -0.4140, -0.2024,  0.3383],
                |              [-0.3231,  0.1622, -0.6351, -0.1711,  0.4014]]],
                |            grad_fn=<SliceBackward0>)
                |     In [62]: print("Matches: ", torch.allclose(o1, o2, atol=1e-6))
                |     Matches:  False
        
               | FL33TW00D wrote:
               | Hm! Very interesting! Thank you for taking the time to
               | debug that.
               | 
               | I'm going to have to think hard about how to rewrite the
               | motivating example to explain this best.
               | 
               | Edit: updated the post, thanks for pointing out the
               | pernicious init values!
        
         | aconz2 wrote:
          | To add on, since this took me a while to understand: for a
          | single token, self attention is permutation invariant because
          | we take the qK-weighted sum (one query dotted with all the keys)
          | of all the values; that sum is what gives the invariance,
          | because + is commutative. But for all the tokens, the mha
          | output matrix will not be invariant, but rather equivariant:
          | you apply the same permutation to the output matrix as you did
          | to the input tokens. What might be a more useful example is to
          | take one position, like the last one, and compute its mha
          | output for every permutation of the previous tokens; those
          | will/should all be the same.
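          | 
          | A quick check of both claims (a toy nn.MultiheadAttention with
          | random weights and no positional encoding; both prints should
          | be True):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     torch.manual_seed(0)
          |     d_model, n_heads, seq_len = 16, 2, 6
          |     mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          | 
          |     x = torch.randn(1, seq_len, d_model)
          |     perm = torch.tensor([3, 1, 4, 0, 2, 5])  # shuffle all but the last token
          |     x_perm = x[:, perm, :]
          | 
          |     out, _ = mha(x, x, x)
          |     out_perm, _ = mha(x_perm, x_perm, x_perm)
          | 
          |     # Equivariance: the output is permuted the same way as the input.
          |     print(torch.allclose(out[:, perm, :], out_perm, atol=1e-5))
          |     # Invariance for one query: the last token sees the same set of
          |     # keys/values, so its output is unchanged.
          |     print(torch.allclose(out[:, -1, :], out_perm[:, -1, :], atol=1e-5))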
        
       | elieb44 wrote:
        | How about context encoding more generally? Are there techniques
        | to do that? I.e., during training, I want the string "Dubito ergo
        | cogito, cogito ergo sum, sum ergo Deus est." to have Rene
        | Descartes embedded as main author, the year 1637 as the date of
        | writing, and "Discours de la methode" as the global context of
        | writing.
        | 
        | So that when trained on another part of the same book, the model
        | can learn they came from the same context.
        
         | jmmcd wrote:
          | This is a good idea! The answer, to my knowledge, is no one
          | does this; we just use the simplest, stupidest possible method,
          | which is to concatenate all the text in the world. That is
          | during training, of course. At runtime, there is the system
          | prompt.
         | 
         | The second simplest method might indeed use something like a
         | system prompt with metadata like that, injected before the
         | current window of text. But what would happen at runtime, when
         | that metadata is not present? Probably performance would be
         | much worse.
        
       | logicchains wrote:
       | Does anyone know why 2D rope implementations apply two separate
       | 1D rotations to pairs, instead of applying a 2d rotation to
       | triplets?
        
         | rini17 wrote:
          | No, they apply many rotations: one 2D rotation per pair of
          | embedding dimensions (d/2 of them), each with its own frequency.
        
       | espadrine wrote:
       | > _Furthermore, by rotating the vector, we have absolutely zero
       | impact on the norm of the vector, which encodes the semantic
       | information of our token._
       | 
       | Doesn't the angle encode semantic information? Cosine similarity
       | works for embeddings after all.
        
       | Scene_Cast2 wrote:
       | If you're interested in positional embeddings for Transformers,
       | check out this repo - https://github.com/gazelle93/Attention-
       | Various-Positional-En... - it implements various popular ones.
        
       | breadislove wrote:
       | There is this really interesting blog post about making rope (by
       | the main author of the paper) multimodal as used by qwen2 vl.
       | it's in chinese but google translate does a pretty good job:
       | https://spaces.ac.cn/archives/10040
        
       | 1024core wrote:
       | I didn't get the sudden leap from "position encodings" to "QKV"
       | magic.
       | 
       | What is the connection between the two? Where does "Q" come from?
       | What are "K" and "V"? (I know they stand for "Query", "Key",
       | "Value"; but what do they have to do with position embeddings?)
        
         | flebron wrote:
         | All of them are vectors of embedded representations of tokens.
         | In a transformer, you want to compute the inner product between
          | a query (the token that is doing the attending) and the key
          | (the token that is being attended to). An inductive bias we have
          | is that the neural network's performance will be better if this
          | inner product depends on the relative distance between the
          | query token's position and the key token's position. We thus
         | encode each one with positional information, in such a way that
         | (for RoPE at least) the inner product depends only on the
         | distance between these tokens, and not their absolute positions
         | in the input sentence.
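          | 
          | A numerical check of that property, as a minimal sketch of the
          | RoPE rotation (toy dimensions, single vectors):
          | 
          |     import torch
          | 
          |     def rope(x, pos, base=10000.0):
          |         # Rotate consecutive dim pairs of x by angles proportional to pos.
          |         d = x.shape[-1]
          |         inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
          |         ang = pos * inv_freq
          |         cos, sin = ang.cos(), ang.sin()
          |         out = torch.empty_like(x)
          |         out[0::2] = x[0::2] * cos - x[1::2] * sin
          |         out[1::2] = x[0::2] * sin + x[1::2] * cos
          |         return out
          | 
          |     q, k = torch.randn(16), torch.randn(16)
          |     # Same relative offset (7) at two different absolute positions
          |     # gives the same query-key inner product.
          |     a = torch.dot(rope(q, 10.0), rope(k, 3.0))
          |     b = torch.dot(rope(q, 107.0), rope(k, 100.0))
          |     print(torch.allclose(a, b, atol=1e-4))  # True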
        
         | FL33TW00D wrote:
         | "This post intends to limit the mathematical knowledge required
         | to follow along, but some basic linear algebra, trigonometry
         | and understanding of self attention is expected."
         | 
         | If you're not sure on self attention, the post will be a little
         | unclear
        
       | alok-g wrote:
        | On a related note, one thing I still do not understand is why
        | positional encodings are 'added' to the token embeddings, as
        | opposed to having a smaller position encoding vector that is
        | 'concatenated'. It would be great if someone could explain.
        
       ___________________________________________________________________
       (page generated 2024-11-18 23:02 UTC)