[HN Gopher] Why the original transformer figure is wrong, and so...
___________________________________________________________________
Why the original transformer figure is wrong, and some other
tidbits about LLMs
Author : rasbt
Score : 182 points
Date : 2023-05-24 12:39 UTC (10 hours ago)
(HTM) web link (magazine.sebastianraschka.com)
(TXT) w3m dump (magazine.sebastianraschka.com)
| andreyk wrote:
| The actual title "Why the Original Transformer Figure Is Wrong,
| and Some Other Interesting Historical Tidbits About LLMs" is way
| more representative of what this post is about...
|
| As to the figure being wrong, it's kind of a nit-pick: "While the
| original transformer figure above (from Attention Is All Your
| Need, https://arxiv.org/abs/1706.03762) is a helpful summary of
| the original encoder-decoder architecture, there is a slight
| discrepancy in this figure.
|
| For instance, it places the layer normalization between the
| residual blocks, which doesn't match the official (updated) code
| implementation accompanying the original transformer paper. The
| variant shown in the Attention Is All Your Need figure is known
| as Post-LN Transformer."
| rasbt wrote:
| So weird, I posted it with almost the original title (only
| slightly abbreviated to make it fit): "Why the Original
| Transformer Figure Is Wrong, and Some Interesting Tidbits About
| LLMs".
|
| Not sure what happened there. Someone must have changed it! So
| weird! And I agree that the current title is a bit awkward and
| less representative.
| ijidak wrote:
| Has anyone bought his book: "Machine Learning Q and AI"?
|
| Is it a helpful read as a cliff notes for the latest in
| Generative AI?
| trivialmath wrote:
| I wonder if, for example, a function is an example of a
| transformer. Say argument one is "cat", argument two is "dog",
| and the operation is "join", so the result is the word "catdog",
| computed by the transformer as the function concat(cat, dog).
| Here the query is the function, the keys are the arguments to
| the function, and the value is a mapping from words to words.
| visarga wrote:
| They can intelligently parse the unstructured input into a
| structured internal form, apply a transform, and then format
| the result back into unstructured text. Even the transform
| itself can be an argument.
| amelius wrote:
| Are there any still-human-readable pictures where the entire
| transformer is shown in expanded form?
| minihat wrote:
| Try this: https://jalammar.github.io/illustrated-transformer/
|
| Attention is explained separately. I have not seen an all-in-
| one diagram and cannot imagine one being helpful, since there's
| too much going on.
| aptitude_moo wrote:
| I find this one the most complete [1] [2]
|
| [1]: https://github.com/ajhalthor/Transformer-Neural-
| Network/blob...
|
| [2]:
| https://www.youtube.com/watch?v=Nw_PJdmydZY&list=PLTl9hO2Oob...
| visarga wrote:
| As someone who started reading ML papers 10 years ago, I find
| transformers pretty simple compared to most architectures. For
| example Google's Inception was a complex mesh of various sized
| convolutions, other models have an internal algorithm like Non
| Maximum Suppression for object detection, or Connectionist
| Temporal Classification for OCR, GANs use complicated
| probability theory for the loss function. Even LSTM is more
| complicated.
|
| If anything, we have been abandoning exotic neural nets in
| favour of a single architecture and that one is pretty simple,
| just linear layers (vector-matrix product), key-value products
| (matrix-matrix product), softmax (fancy normalisation),
| weighted averaging (a sum of products) and skip connections (an
| addition). Maybe it's become hard for me to see what is
| complicated about it, I'd be curious to know what part is
| difficult. Is it the embeddings, masking, multiple heads,
| gradient descent, ...? Embeddings have been famous for 10
| years, ever since the king - man + woman = queen paper. You
| don't need to be able to derive the gradients for the network
| by hand to understand it.
|
| In short, a transformer is mixing information between tokens in
| a sequence and computing updates. The mixing part is the "self
| attention" or "cross attention". The updating part is the feed-
| forward sublayer. It has skip connections (adds the input to
| the output) in order to keep training stable.
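|
| A minimal NumPy sketch of that description (single attention
| head, layer normalization left out, names purely illustrative):
|
|     import numpy as np
|
|     def softmax(x, axis=-1):
|         x = x - x.max(axis=axis, keepdims=True)  # for stability
|         e = np.exp(x)
|         return e / e.sum(axis=axis, keepdims=True)
|
|     def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
|         # x: (seq_len, d_model) token embeddings
|         # self-attention: mix information between tokens
|         Q, K, V = x @ Wq, x @ Wk, x @ Wv         # linear layers
|         scores = Q @ K.T / np.sqrt(K.shape[-1])  # query-key products
|         attn = softmax(scores, axis=-1)          # fancy normalisation
|         x = x + (attn @ V) @ Wo                  # weighted average + skip
|         # feed-forward sublayer: per-token updates
|         x = x + np.maximum(0, x @ W1) @ W2       # MLP + skip connection
|         return x
|
| (The real model also has layer normalization; whether it sits
| before or after each of these skip connections is exactly the
| Pre-LN vs. Post-LN point the article is about.)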
| chaxor wrote:
| Even when someone understands the architecture very well,
| there's still great utility in having a full graph
| representation of the entire NN architecture. This could be for
| teaching students, or for doing analyses on the structure of
| the full network, etc.
| Buttons840 wrote:
| Simple is good. Especially in machine learning where a bug
| usually means that it kinda works, but not as well as it
| could. Also, when an off-the-shelf algorithm half works, it's
| good to be able to add your own tweaks to it, and again, this
| requires simplicity.
|
| For a complicated architecture to succeed, it's going to need
| to reliably achieve state of the art performance on
| everything without requiring any adjustment or tweaks.
| kk58 wrote:
| In CNNs the early layers seem to learn geometric primitives and
| deeper layers seem to learn more complex geometric patterns,
| loosely speaking.
|
| In transformer what do query key matrices learn? How are
| their weights somehow working to extract context no matter
| which word appears in which position?
| visarga wrote:
| The transformer doesn't have the nice pyramid shape of
| CNNs, but it still needs multiple layers. There have been
| papers showing non-trivial interactions between successive
| layers, forming more complex circuits.
|
| https://transformer-circuits.pub/2021/framework/index.html
| (warning, advanced difficulty)
|
| The Q and K matrices learn how to relate tokens. Each of the
| heads will learn to extract a different relation. For
| example, one will link to the next token, another will link
| pronouns to their references, another would be matching
| brackets, etc. Check out the cute diagrams here:
|
| https://www.arxiv-vanity.com/papers/1904.02679/
|
| So each head (Q and K pair) is like a program doing a
| specific pattern of lookup.
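|
| For the "one program per head" picture, a rough sketch (NumPy,
| names illustrative): each head has its own Q/K projections and
| therefore its own seq_len x seq_len attention map, which is
| what those circuits/visualisation papers inspect.
|
|     import numpy as np
|
|     def head_attention_maps(x, Wq_heads, Wk_heads):
|         # x: (seq_len, d_model)
|         # Wq_heads, Wk_heads: (n_heads, d_model, d_head)
|         maps = []
|         for Wq, Wk in zip(Wq_heads, Wk_heads):
|             Q, K = x @ Wq, x @ Wk
|             s = Q @ K.T / np.sqrt(K.shape[-1])
|             s = np.exp(s - s.max(axis=-1, keepdims=True))
|             maps.append(s / s.sum(axis=-1, keepdims=True))
|         # one (seq_len, seq_len) lookup pattern per head
|         return np.stack(maps)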
| lalaithion wrote:
| The things that still confuse me about transformers is:
|
| 1. Why do we _add_ the positional embedding to the semantic
| embedding? It seems like it means certain semantic directions
| are irreversibly entangled with certain positions.
|
| 2. I don't understand why the attention head (which I can
| implement and follow the math of) is described as "key query
| value lookup". Specifically, the Q and K matrices aren't
| structurally distinct - the projections into them will learn
| different weights, but one doesn't start out biased key-ward
| and the other query-ward.
| necroforest wrote:
| To answer (2): You are token i. In order to see how much of
| a token j's value v_j you update yourself with, you compare
| your query q_i with token j's key k_j. This gives you the
| asymmetry between queries and keys.
|
| This is even more apparent in a cross-attention setting,
| where one stream of tokens will have only queries
| associated with it and the other will have only
| keys/values.
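|
| A toy sketch of that asymmetry (illustrative only):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d = 8
|     q = rng.normal(size=(4, d))  # one query per token
|     k = rng.normal(size=(4, d))  # one key per token
|
|     s_13 = q[1] @ k[3] / np.sqrt(d)  # how much token 1 reads token 3
|     s_31 = q[3] @ k[1] / np.sqrt(d)  # how much token 3 reads token 1
|     # s_13 != s_31 in general: queries and keys play different roles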
| visarga wrote:
| Good questions.
|
| The first one: transformers are "permutation invariant" by
| nature, so if you permute the input and apply the opposite
| permutation to the output you get the exact same thing. The
| transformer itself has no positional information. RNNs by
| comparison have positional information by design, they go
| token by token, but the transformer is parallel and all
| tokens are just independent "channels". So what can be
| done? You put positional embeddings in it - either by
| adding them to the tokens (concatenation was also ok, but
| less efficient) or by inserting relative distance biases in
| the attention matrix. It's a fix to make it understand
| time. It's still puzzling this works, because mixing text
| tokens with position tokens seems to cause a conflict, but
| it doesn't in practice. The model will learn to use the
| embedding vector for both, maybe specialising a part for
| semantics and another for position.
|
| The second question. Neural nets find a way to
| differentiate the keys from queries by simply doing
| gradient descent. If we tell the model it should generate a
| specific token here, then it needs to adjust the keys and
| queries to make it happen. The architecture is pretty dumb,
| the secret is the training data - everything the
| transformer learns comes from the training set. We should
| think about the training data when we marvel at what
| transformers can do. The architecture doesn't tell us why
| they work so well.
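|
| As a concrete sketch of the "adding them to the tokens" option,
| the sinusoidal encoding from the original paper (NumPy, assumes
| an even d_model):
|
|     import numpy as np
|
|     def sinusoidal_positions(seq_len, d_model):
|         pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
|         i = np.arange(d_model // 2)[None, :]    # (1, d_model/2)
|         angles = pos / 10000 ** (2 * i / d_model)
|         pe = np.zeros((seq_len, d_model))
|         pe[:, 0::2] = np.sin(angles)            # even dimensions
|         pe[:, 1::2] = np.cos(angles)            # odd dimensions
|         return pe
|
|     # token content and position share one vector:
|     # x = token_embeddings + sinusoidal_positions(seq_len, d_model)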
| amelius wrote:
| Regarding the positional encoding, why not include a
| scalar in the range (0..1) with every token where the
| scalar encodes the position of the token? This adds a
| small amount of complexity to the network, but it could
| aid comprehensibility which to me seems preferable if
| you're still doing research on these networks.
| uh_uh wrote:
| I'm still not clear on the second question. If
| lalaithion's original statement "the Q and K matrices
| aren't structurally distinct" is true, then once the
| neural network is trained, how can we look at the two
| matrices and confidently say that one is the query matrix
| instead of it being the key matrix (or vice versa)? To
| put it another way: is the distinction between query and
| key roles "real" or is it just an analogy for humans?
| ntonozzi wrote:
| I am not an expert, but I think that they are
| structurally identical only in decoder-only transformers
| like GPT. The original transformer was used for
| translation, and so the encoder-decoder layers use Q from
| the decoder layer and K from the encoder layer. The
| Attention Is All You Need paper has an explanation:
|
| > In "encoder-decoder attention" layers, the queries come
| from the previous decoder layer, and the memory keys and
| values come from the output of the encoder. This allows
| every position in the decoder to attend over all
| positions in the input sequence. This mimics the typical
| encoder-decoder attention mechanisms in sequence-to-
| sequence models such as...
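|
| A rough sketch of that encoder-decoder ("cross") attention,
| with illustrative names:
|
|     import numpy as np
|
|     def cross_attention(dec_x, enc_out, Wq, Wk, Wv):
|         # queries from the decoder stream,
|         # keys/values from the encoder output
|         Q = dec_x @ Wq                      # (dec_len, d_head)
|         K = enc_out @ Wk                    # (enc_len, d_head)
|         V = enc_out @ Wv                    # (enc_len, d_head)
|         s = Q @ K.T / np.sqrt(K.shape[-1])  # (dec_len, enc_len)
|         A = np.exp(s - s.max(axis=-1, keepdims=True))
|         A = A / A.sum(axis=-1, keepdims=True)
|         # each decoder position is a weighted mix of encoder values
|         return A @ V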
| lostmsu wrote:
| > transformers are "permutation invariant" by nature
|
| Surely that does not apply to GPT models which use causal
| masking.
| metanonsense wrote:
| With regards to the "It's still puzzling this works" wrt
| positional encoding, I have developed an intuition (that
| may be very wrong ;-). If you take the Fourier transform
| of a linear or sawtooth function (akin to the
| progress of time), I think you get something that
| resembles the positional encoding in the original
| transformer. EDIT: fixed typo
| anonymousDan wrote:
| Would this not imply that if I encrypt the input and then
| decrypt the output I would get the correct result (i.e.
| what I would have gotten if I used the plaintext input)?
| FartyMcFarter wrote:
| > The architecture is pretty dumb, the secret is the
| training data
|
| If this were true, we could throw the same training data
| at any other "dumb" architecture and it would learn
| language at least as well/fast as transformers do. But we
| don't see that happening, so the architecture must be
| smartly designed for this purpose.
| whimsicalism wrote:
| Other dumb architectures don't parallelize as well. Other
| architectures that parallelize at similar levels (RNN-
| RWKV, H3, S4, etc.) do perform well at similar parameter
| counts and data sizes.
| homarp wrote:
| RNN-RWKV - https://news.ycombinator.com/item?id=36038868
|
| H3 - https://news.ycombinator.com/item?id=34673535
|
| S4 - https://srush.github.io/annotated-s4/
| visarga wrote:
| Actually there are alternatives by the hundreds, with
| similar results. Reformer, Linformer, Performer,
| Longformer... none is better than vanilla overall, they
| all have an edge in some use case.
|
| And then we have MLP-mixer which just doesn't do
| "attention" at all, MLP is all you need. A good solution
| for edge models.
| homarp wrote:
| MLP-mixer: https://news.ycombinator.com/item?id=28581570
|
| and https://towardsdatascience.com/mlp-mixer-in-a-
| nutshell-eccff...
| samvher wrote:
| I recently had the same questions and here is how I
| understand it:
|
| 1. You could concatenate the positional embedding and the
| semantic embedding and that way isolate them from each
| other. But if that separation is necessary, the model can
| learn the separation itself as well (it can make positional
| embeddings and semantic embeddings orthogonal to each
| other), so using addition is strictly more general.
|
| 2. My sense is that you could merge the Q and K matrices
| and everything would work mostly the same, but with multi-
| headed attention this will typically result in a larger
| matrix than the combined sizes of Q and K. It's basically a
| more efficient matrix factorization.
|
| Curious to see if I got this right and if there is more to
| it.
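|
| A quick check of point 2 (a toy sketch): the attention scores
| only ever see the product Wq @ Wk.T, so the two matrices act as
| a low-rank factorization of one merged d_model x d_model
| matrix.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d_model, d_head, seq = 64, 8, 5
|     x = rng.normal(size=(seq, d_model))
|     Wq = rng.normal(size=(d_model, d_head))
|     Wk = rng.normal(size=(d_model, d_head))
|
|     scores_factored = (x @ Wq) @ (x @ Wk).T  # the usual Q K^T
|     scores_merged = x @ (Wq @ Wk.T) @ x.T    # one merged matrix
|     assert np.allclose(scores_factored, scores_merged)
|
|     # per head: 2 * 64 * 8 = 1024 parameters for Wq and Wk,
|     # versus 64 * 64 = 4096 for the merged matrix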
| tlb wrote:
| Yes, that's my understanding.
|
| One advantage of summing is that the lower frequency
| terms hardly change for a small text, so effectively
| there is more capacity for embeddings with short texts,
| while still encoding order in long texts.
| arugulum wrote:
| 1. It works; the direct alternative (concatenation) leaves a
| smaller dimension for the initial embedding. Also, added
| positional embeddings are no longer commonly used in newer
| Transformers. Schemes like RoPE and ALiBi are more common.
|
| 2. I'm not 100% sure I understand your question. The Ks
| correspond to the Vs, and so are used to compute the
| weighted sum over Vs. This is easiest to understand when
| you think of an encoder-decoder model (Qs come from the
| decoder, KVs come from the encoder), or of decoding in a
| decoder (there is one Q and multiple KVs).
| dontwearitout wrote:
| 1. High dimensional embedding space is way more vast than
| you'd think, so adding two vectors together doesn't really
| destroy information in the same way as addition does in low-
| dimensional Cartesian space - the semantic and position
| information remains separable.
|
| 2. I find the QKV nomenclature unintuitive too. Cross
| attention explains it a bit, where the Q and K come from
| different places. For self attention they are the same, but
| the terminology stuck.
| rasbt wrote:
| Agreed, compared to other architectures, transformers are
| actually quite straightforward. The complicated part comes
| more from training them in distributed setups: making the
| data loading and tensor parallelism work at that scale,
| etc. The vanilla architecture is simple, but the practical
| implementation for large-scale training can be a bit
| complicated.
| qumpis wrote:
| King-man got me thinking about the VAE paper for a second :)
| chaxor wrote:
| Do you mean a graph that contains all neurons from the network
| in one structure? Similar to these:
| https://gfycat.com/BonyTotalArthropods
| https://gfycat.com/BitesizedWeeBlacklemur
|
| That would be wonderful and I have been trying to do this.
| However, unfortunately some 'assumptions'/shortcuts have to be
| made. For example, the attention matrix is not known without
| input, so if just the structure of the network (weighted by the
| weights) is wanted, you have to put in some value 'p' ('1',
| '-1', w/e) to these edges. Also skip connections have to be
| dealt with explicitly instead of just adding them to a block
| diagonal matrix as one would with an MLP.
|
| I am very interested if someone has a good solution that
| already has done these things though.
| amelius wrote:
| Those are nice diagrams. Yes, well actually I'm interested in
| anything that's between the abstracted form in the paper and
| the fully expanded form where you see all neurons.
| chaxor wrote:
| https://miro.medium.com/v2/resize:fit:1100/format:webp/0*Y4
| G...
|
| Does that work?
| amelius wrote:
| Another cool diagram. But one thing it misses is that you
| can't follow the arrows from input to output. For
| example, you might be tempted to think that the Keys or
| Queries are inputs to the neural net since there is no
| arrow going into them.
| dnw wrote:
| Not quite what you're asking for but perhaps in the direction
| https://openai.com/research/language-models-can-explain-neur...
| canjobear wrote:
| The original Transformer wasn't used in an LLM.
| YetAnotherNick wrote:
| This is the commit that changed it:
| https://github.com/tensorflow/tensor2tensor/commit/d5bdfcc85...
| fn-mote wrote:
| This note contains four papers for "historical perspective"...
| which would usually mean "no longer directly relevant", although
| I'm not sure that's really what the author means.
|
| You might be looking for the author's "Understanding Large
| Language Models" post [1] instead.
|
| Misspelling "Attention is All Your Need" twice in one paragraph
| makes for a rough start to the linked post.
|
| [1] https://magazine.sebastianraschka.com/p/understanding-
| large-...
| rasbt wrote:
| > Misspelling "Attention is All Your Need" twice in one
| paragraph makes for a rough start to the linked post.
|
| 100%! LOL. I was traveling and typing this on a mobile device.
| Must have been some weird autocorrect/autocomplete. Strange.
| And I didn't even notice. Thanks!
| visarga wrote:
| > which would usually mean "no longer directly relevant"
|
| Or it could mean the lessons from these papers have been
| assimilated and spread far and wide, so they are no longer
| "news". Pre-layer normalization is one such lesson.
| homarp wrote:
| also one of these papers is from Schmidhuber
|
| and https://news.ycombinator.com/item?id=23649542 gives some
| context to the "For instance, in 1991, which is about two-
| and-a-half decades before the original transformer paper
| above ("Attention Is All You Need")"
___________________________________________________________________
(page generated 2023-05-24 23:00 UTC)