[HN Gopher] Ask HN: Can someone ELI5 transformers and the "Atten...
___________________________________________________________________
Ask HN: Can someone ELI5 transformers and the "Attention is all we
need" paper?
I have zero AI/ML knowledge but Steve Yegge on Medium thinks that
the team behind Transformers deserves a Nobel. Makes me want to
better understand this tech. Edit: thank you for some amazing top
level responses and links to valuable content on this subject.
Author : Sai_
Score : 293 points
Date : 2023-05-17 16:43 UTC (6 hours ago)
| olaulaja wrote:
| Not quite ELI5 and there's a few partially overlapping answers
| around already but here goes.
|
| The key part is the attention mechanism, as the title of the paper
| may have spoiled. It works more or less like this:
|
| - Start with an input sequence X1, X2 ... Xn. These are all
| vectors.
|
| - Map the input sequence X into 3 new sequences of vectors: query
| (Q), key (K), and value (V), all of the same length as the input
| X. This is done using learnable mappings for each of the
| sequences (so one for X->Q, another for X->K and one for X->V).
|
| - Compare similarity of every query with every key. This gives
| you a weight for each query/key pair. Call them W(Q1, K2) and so
| forth.
|
| - Compute output Z as sum of every _value_ weighted by the weight
| for the respective query/key pair (so Z1 = V1*W(Q1,K1) +
| V2*W(Q1,K2) + ... + Vn*W(Q1,Kn), Z2 = V1*W(Q2,K1) +
| V2*W(Q2,K2) + ...)
|
| - and that's about it!
|
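| To make those steps concrete, here is a minimal sketch of a single
| attention head in PyTorch. The sizes and the random W matrices are
| made up for illustration (a real model learns them), and the
| softmax plus 1/sqrt(d) scaling are details from the paper that the
| bullet points above gloss over:
|
|     import torch
|     import torch.nn.functional as F
|
|     n, d = 4, 8                        # sequence length, vector size
|     X = torch.randn(n, d)              # input sequence X1 ... Xn
|
|     # learnable mappings X->Q, X->K, X->V (random here)
|     Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
|     Q, K, V = X @ Wq, X @ Wk, X @ Wv
|
|     # similarity of every query with every key -> weights W(Qi, Kj)
|     W = F.softmax(Q @ K.T / d ** 0.5, dim=-1)
|
|     # every output Zi is the sum of all values weighted by W(Qi, Kj)
|     Z = W @ V                          # shape (n, d)
|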
| As throwawaymaths mentions, this is quite similar to a learnable
| hash table with the notable difference that the value fetched is
| also changed, so that it doesn't fetch "input at an index like i"
| but "whatever is important at an index like i".
|
| Now a few implementation details on top of this:
|
| - The description is for a single "attention head". Normally
| several, each with their own mappings for Q/K/V, are used, so the
| transformer can look at different "things" simultaneously. 8
| attention heads seems pretty common.
|
| - The description doesn't take the position in the sequence into
| account (W(Q1,K1) and W(Q1,Kn) are treated perfectly equally). To
| account for ordering, "positional encoding" is normally used.
| Usually this is just adding a bunch of scaled sine/cosine waves
| to the input. Works surprisingly well (see the sketch at the end
| of this comment).
|
| - The transformer architecture has a number of these "attention
| layers" stacked one after the other and also 2 different stacks
| (encoder, decoder). The paper is about machine translation, so
| the encoder is for the input text and the decoder for the output.
| Attention layers work just fine in other configurations as well.
|
| The rest of the architecture is fairly standard stuff.
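|
| And to make the positional encoding point above concrete, here is
| a minimal sketch of the sine/cosine scheme (sizes are arbitrary
| example values, and this is just the flavour of it, not a drop-in
| implementation):
|
|     import torch
|
|     def positional_encoding(seq_len, d):
|         pos = torch.arange(seq_len).unsqueeze(1).float()   # 0 .. seq_len-1
|         i = torch.arange(0, d, 2).float()                  # even dimensions
|         freq = 1.0 / (10000 ** (i / d))                    # one frequency per pair
|         pe = torch.zeros(seq_len, d)
|         pe[:, 0::2] = torch.sin(pos * freq)
|         pe[:, 1::2] = torch.cos(pos * freq)
|         return pe
|
|     X = torch.randn(10, 8)                  # 10 tokens, 8-dim vectors
|     X = X + positional_encoding(10, 8)      # just added to the input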
| vikp wrote:
| Transformers are about converting some input data (usually text)
| to numeric representations, then modifying those representations
| through several layers to generate a target representation.
|
| In LLMs, this means going from prompt to answer. I'll cover
| inference only, not training.
|
| I can't quite ELI5, but the process is roughly:
|
| - Write a prompt.
|
| - Convert each token in the prompt (roughly a word) into a
| number. So "the" might map to the number 45.
|
| - Get a vector representation of each word - go from 45 to [.1,
| -1, -2, ...]. These vector representations are how a transformer
| understands words.
|
| - Combine the vectors into a matrix, so the transformer can "see"
| the whole prompt at once.
|
| - Repeat the following several times (once for each layer):
|
|   - Multiply the vectors by the other vectors. This is attention
|   - it's the magic of transformers, which enables combining
|   information from multiple tokens together. This generates a new
|   matrix.
|
|   - Feed the matrix into a linear regression. Basically multiply
|   each number in each vector by another number, then add them all
|   together. This will generate a new matrix, but with "projected"
|   values.
|
|   - Apply a nonlinear transformation like relu. This helps model
|   more complex functions (like text input -> output!)
|
| Note that I really oversimplified the last few steps, and the
| ordering.
|
| At the end, you'll have a matrix. You then convert this back into
| numbers, then into text.
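|
| A very rough sketch of one such layer in PyTorch (dimensions are
| arbitrary, the same weights are reused for every layer to keep it
| short, and real models add residual connections, layer norm, and a
| lot more):
|
|     import torch
|     import torch.nn as nn
|     import torch.nn.functional as F
|
|     n_tokens, d = 6, 16
|     x = torch.randn(1, n_tokens, d)   # the prompt as a matrix of vectors
|
|     attn = nn.MultiheadAttention(embed_dim=d, num_heads=1,
|                                  batch_first=True)
|     linear = nn.Linear(d, d)
|
|     for _ in range(4):                # "once for each layer"
|         x, _ = attn(x, x, x)          # attention: mix info across tokens
|         x = F.relu(linear(x))         # "linear regression", then relu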
| throwawaymaths wrote:
| I don't think this description of attention is correct.
| vikp wrote:
| You mean "Multiply the vectors by the other vectors. This is
| attention - it's the magic of transformers, that enables
| combining information from multiple tokens together. This
| generates a new matrix."?
|
| It's really oversimplified, as I mentioned. A more granular
| look is:
|
| - Project the vectors with a linear regression. In decoder-only
| attention (what we usually use), we project the same vectors
| twice with different coefficients. We call the first projection
| queries, and the second keys. This transforms the vectors
| linearly.
|
| - Find the dot product of each query vector against the key
| vectors (multiply them).
|
| - (training only) Mask out future vectors, so a token can't look
| at tokens that come after it.
|
| - At this point, you will have a matrix indicating how important
| each query vector considers each other vector (how important
| each token considers the other tokens).
|
| - Take the softmax, which both ensures all of the attention
| values for a vector sum to 1, and penalizes small attention
| values.
|
| - Use the softmax values to get a weighted sum of tokens
| according to the attention calc.
|
| - This will turn one vector into the weighted sum of the other
| vectors it considers important.
|
| The goal of this is to incorporate information from multiple
| tokens into a single representation.
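|
| A minimal sketch of those steps in PyTorch, with toy sizes, random
| projection weights, and the mask from the "mask out future
| vectors" step:
|
|     import torch
|     import torch.nn.functional as F
|
|     n, d = 4, 8
|     x = torch.randn(n, d)                    # one vector per token
|     Wq, Wk = torch.randn(d, d), torch.randn(d, d)
|     q, k = x @ Wq, x @ Wk                    # two projections of the same vectors
|
|     scores = q @ k.T                         # dot products: query vs keys
|     future = torch.triu(torch.ones(n, n), diagonal=1).bool()
|     scores = scores.masked_fill(future, float("-inf"))  # hide future tokens
|
|     attn = F.softmax(scores, dim=-1)         # each row sums to 1
|     out = attn @ x                           # weighted sum of token vectors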
| cma wrote:
| From the Yegge post:
|
| > LoRA makes LLMs composable, piecewise, mathematically, so that
| if there are 10,000 LLMs in the wild, they will all eventually
| converge on having the same knowledge. This is what Geoffrey
| Hinton was referring to on his SkyNet tour.
|
| I don't think that's right at all. LoRA freezes most of the large
| model and wouldn't let you simply combine large models. Instead,
| I'm pretty sure Hinton is referring to data-parallel training
| with batching:
|
| > DataParallel (DP) - the same setup is replicated multiple
| times, and each being fed a slice of the data. The processing is
| done in parallel and all setups are synchronized at the end of
| each training step.
|
| https://huggingface.co/docs/transformers/v4.15.0/parallelism
|
| You can have many instances of the model training on different
| bits of data, and then just average the modified weights back
| together at the end. This combining of weights is what Hinton
| means when he says parallel copies of brains can learn things
| independently and then recombine them later at huge bandwidth
| speeds, whereas humans are far more limited to sharing separate
| experiences verbally or with like a multimedia presentation or
| something.
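|
| A toy sketch of the "average the weights" part in PyTorch (this is
| not a real data-parallel setup, and the tiny model here is made up
| purely for illustration):
|
|     import copy
|     import torch
|     import torch.nn as nn
|
|     base = nn.Linear(4, 2)                   # stand-in for "the model"
|     a, b = copy.deepcopy(base), copy.deepcopy(base)
|     # ... imagine a and b each train on different slices of data ...
|
|     merged = copy.deepcopy(base)
|     with torch.no_grad():
|         for p, pa, pb in zip(merged.parameters(),
|                              a.parameters(), b.parameters()):
|             p.copy_((pa + pb) / 2)           # average the modified weights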
| EuropeOverlords wrote:
| [flagged]
| schizo89 wrote:
| [flagged]
| dumbfounder wrote:
| GPT-4 Prompt: Can you ELI5 Transformers and the "Attention is
| all we need" paper
|
| Sure! Imagine you're in school, and your teacher asks you and
| your friends to work on a group project. Each of you has a
| different set of skills, and you need to work together to
| complete the project successfully. In this analogy, the group
| project is like a sentence or a piece of text that a language
| model is trying to understand.
|
| Transformers are a type of model that helps computers
| understand and generate text, like in our group project. The
| key idea behind transformers is something called "attention."
| Attention helps the model figure out which words in a sentence
| are the most important to focus on, just like you and your
| friends pay attention to each other's skills to complete your
| project.
|
| The "Attention is all we need" paper introduced transformers
| and showed that by using attention, the model can learn how to
| understand and generate text more effectively. Instead of
| focusing on each word one at a time, transformers can look at
| all the words together, decide which ones are the most
| important, and use that information to understand the text
| better.
|
| So, in our group project example, transformers help the model
| work like a team, where everyone pays attention to each other's
| strengths and helps each other out. This makes the model more
| powerful and able to do a better job at understanding and
| generating text.
| mitthrowaway2 wrote:
| GPT-4 seems to take "explain like I'm five" quite seriously.
| samspenc wrote:
| I just asked ChatGPT and this is what I got:
|
| "Imagine you have a really smart computer friend that can
| understand and talk to you just like a person. This computer
| friend is called a language transformer.
|
| The language transformer is like a super good reader. It reads
| lots and lots of books, stories, and articles to learn how
| people talk and write. It pays attention to how words go
| together and what they mean in different situations. It becomes
| really good at understanding what people are saying or asking.
|
| Once the language transformer has learned so much, it can help
| with many things. For example, if you want to write a story, it
| can help you come up with ideas and write sentences that make
| sense. If you have a question, it can try to find the answer
| for you. It can also help translate words from one language to
| another, like when you want to understand what someone is
| saying in a different language.
|
| The language transformer is like a big brain that knows a lot
| about words and how they fit together. It uses all that
| knowledge to talk and help you with different things."
| ly3xqhl8g9 wrote:
| Asked to provide some code samples:
|
| """ Imagine you have a sentence: "The cat sat on the mat." In
| traditional models, each word is processed one by one, and the
| model might miss important relationships between words that are
| far apart. Transformers solve this problem using self-
| attention.
|
| In self-attention, the model pays attention to all the words in
| the sentence simultaneously, giving more importance to
| different words depending on their relevance to each other.
| Here's a code snippet to illustrate self-attention:
|     import torch
|     import torch.nn as nn
|
|     # Assume we have an input sentence with 5 words
|     input_sentence = torch.tensor([[0.1, 0.2, 0.3, 0.4, 0.5]])
|
|     # Define the self-attention layer
|     self_attention = nn.MultiheadAttention(embed_dim=5, num_heads=1)
|
|     # Apply self-attention to the input sentence
|     attention_output, _ = self_attention(input_sentence,
|                                          input_sentence,
|                                          input_sentence)
|
|     # The attention_output represents the transformed sentence
|     # with attention
|     print(attention_output)
|
| In this code, we use a simple example with one self-attention
| head. The input sentence is represented as a tensor with shape
| (1, 5), where each number represents the word's features. We
| pass the input sentence to the self-attention layer twice,
| using it as the query, key, and value inputs.
|
| The self-attention layer computes the attention scores between
| each word in the sentence, determining how much attention each
| word should receive from the other words. It then combines the
| input sentence with these attention weights to produce the
| attention_output.
|
| The attention_output contains the transformed sentence, where
| each word now carries information from other words based on
| their importance. This allows the model to capture
| relationships and dependencies between words effectively.
|
| The "Attention is All You Need" paper introduced this self-
| attention mechanism as a fundamental building block in the
| Transformer model. The Transformer architecture then
| incorporates multiple layers of self-attention, along with
| other components like positional encoding, to process sequences
| and achieve impressive results in various NLP tasks.
|
| Remember, this code snippet is just a simplified illustration
| to help you understand the concept of self-attention. The
| actual implementation of Transformers is more complex,
| involving multiple layers, feed-forward networks, and
| additional optimizations. """
|
| The nn.MultiheadAttention documentation is at [1]. The output of
| print(attention_output) is:
|
|     tensor([[ 0.1756, -0.2273, -0.0787,  0.0383, -0.0779]],
|            grad_fn=<SqueezeBackward1>)
|
| If you badger ChatGPT it will give you an example with
| different query, key, and value inputs:
|
|     # Define distinct query, key, and value inputs
|     query = torch.tensor([[0.2, 0.3, 0.4, 0.5, 0.6]])
|     key = torch.tensor([[0.3, 0.4, 0.5, 0.6, 0.7]])
|     value = torch.tensor([[0.4, 0.5, 0.6, 0.7, 0.8]])
|
|     # Apply self-attention to the input sentence using distinct inputs
|     attention_output, _ = self_attention(query, key, value)
|
| outputting:
|
|     tensor([[-0.2294,  0.1336, -0.1445, -0.7275, -0.1232]],
|            grad_fn=<SqueezeBackward1>)
|
| I then pushed it further to explain how nn.MultiheadAttention
| works:
|
| """ In PyTorch, nn.MultiheadAttention is a module that
| implements the multi-head attention mechanism. Multi-head
| attention is a key component of the Transformer architecture,
| which is widely used in various natural language processing
| (NLP) tasks.
|
| The nn.MultiheadAttention module allows you to apply multi-head
| attention to a set of queries, keys, and values. Here's a
| simplified explanation of what it does:
|
| Inputs: The module takes three inputs: queries, keys, and
| values. These inputs are typically tensors that represent the
| encoded representations of the input sequence.
|
| Linear Projections: Before applying attention, the module
| performs linear projections on the queries, keys, and values.
| This step projects the input tensors into different subspaces
| to capture different aspects of the data. These projections are
| learnable parameters.
|
| Attention Scores: Next, the module computes attention scores
| between the queries and the keys. The attention scores measure
| the relevance or importance of each key with respect to each
| query. Higher attention scores indicate a stronger relationship
| between the query and the corresponding key.
|
| Attention Weights: The attention scores are transformed into
| attention weights by applying a softmax function. The softmax
| operation normalizes the attention scores, ensuring that the
| weights sum up to 1 and represent a valid distribution.
|
| Weighted Sum: The attention weights are used to compute a
| weighted sum of the values. This step aggregates the values
| based on their importance, as determined by the attention
| weights. The weighted sum produces the final output of the
| attention mechanism.
|
| Multi-Head Attention: The module performs the above steps
| multiple times in parallel, each time using different sets of
| learned linear projections. These parallel attention operations
| are called "attention heads." The outputs of the attention
| heads are concatenated and linearly transformed to obtain the
| final output of the nn.MultiheadAttention module.
|
| The multi-head attention mechanism allows the model to capture
| different types of relationships and dependencies in the input
| sequence. By using multiple attention heads with different
| linear projections, the model can attend to different parts of
| the input space and learn diverse representations.
|
| Overall, nn.MultiheadAttention in PyTorch provides a convenient
| way to incorporate the multi-head attention mechanism into your
| neural network architecture, facilitating effective modeling of
| complex dependencies in sequence data. """
|
| [1]
| https://pytorch.org/docs/stable/generated/torch.nn.Multihead...
| PaulHoule wrote:
| There is no Nobel prize for math or computer science, though...
|
| I worked on a few projects that were trying to develop foundation
| models for health care, aviation, and other domains. In
| particular I trained an LSTM model to write fake abstracts for
| clinical case reports.
|
| We ran into many problems, but maybe the worst one with the LSTM
| is that a real document repeats itself a lot. For instance,
| somebody's name might turn up multiple times and the LSTM was
| very bad at that kind of copying. The LSTM community was arguing
| about solutions to this problem, but the attention mechanism in
| transformers makes it easy.
| eternalban wrote:
| Good overview of the two approaches:
|
| https://medium.com/analytics-vidhya/why-are-lstms-struggling...
| abudabi123 wrote:
| One prize for math is the Fields Medal, for cs is the Turing
| Award... perhaps there should be a JMC prize for advances in
| A.I.
|
| , https://en.wikipedia.org/wiki/Fields_Medal
|
| , https://en.wikipedia.org/wiki/Turing_Award
|
| ,
| https://en.wikipedia.org/wiki/John_McCarthy_(computer_scient...
| data_maan wrote:
| What is ELI5?
|
| The idea behind the Transformer is nice - but by far not Nobel
| prize deserving.
|
| Don't believe the hype or people like Yegge, whoever that is - in
| a few years a new architecture will be the "Nobel candidate".
|
| Also, the original Transformer paper, if you read it, is horribly
| written.
| inconceivable wrote:
| explain like i'm 5 (years old)
| kenjackson wrote:
| Nobel prize for the neural network or perceptron maybe?
| garbagecoder wrote:
| Thanks to everyone in this thread for all the great links and
| explanations!!
| goy wrote:
| I'd like to read the article. What are the required maths?
| QuantumGood wrote:
| The "Attention is All You Need" paper introduced a new way for AI
| to read and understand language, much like how we might read a
| comic book.
|
| As you read each panel of a comic book, you don't just look at
| the words in the speech bubbles, but you also pay attention to
| who's talking, what they're doing, and what happened in the
| previous panels. You might pay more attention to some parts than
| others. This is sort of like what the Transformer model does with
| text.
|
| When the Transformer reads a sentence, it doesn't just look at
| one word at a time. It looks at all the words at once, and
| figures out which ones are most important to understand each
| other. This is called "attention." For example, in the sentence
| "The cat, which is black, sat on the mat," the Transformer model
| would understand that "cat" is connected to "black" and "sat on
| the mat."
|
| The "attention" part is very helpful because, like in a comic
| book, understanding one part of a sentence often depends on
| understanding other parts. This makes the Transformer model
| really good at understanding and generating language.
|
| Also, because the Transformer pays attention to all parts of the
| sentence at the same time, it can be faster than other models
| that read one word at a time. This is like being able to read a
| whole page of your comic book at once, instead of having to read
| each panel one by one.
| QuantumGood wrote:
| Explaining it for a slightly older audience, a transformer is a
| type of artificial neural network designed for processing
| sequences, like sentences in a text. It's especially known for
| its use in natural language processing (NLP), which is the
| field of AI that deals with understanding and generating human
| language.
|
| The Transformer is unique because it uses a mechanism called
| "attention" to understand the relationships between words in a
| sentence, which works like this:
|
| (1) Encoding: First, the Transformer turns each word in a
| sentence into a list of numbers, called a vector. These vectors
| capture information about the word's meaning.
|
| (2) Self-Attention: Next, for each word, the Transformer
| calculates a score for every other word in the sentence. These
| scores determine how much each word should contribute to the
| understanding of the current word. This is the "attention"
| part. For example, in the sentence "The cat, which is black,
| sat on the mat," the words "cat" and "black" would get high
| scores when trying to understand the word "black" because they
| are closely related.
|
| (3) Aggregation: The Transformer then combines the vectors of
| all the words, weighted by their attention scores, to create a
| new vector for each word. This new vector captures both the
| meaning of the word itself and the context provided by the
| other words in the sentence.
|
| (4) Decoding: Finally, in a task like translation, the
| Transformer uses the vectors from the encoding phase to
| generate a sentence in the target language. It again uses
| attention to decide which words in the original sentence are
| most relevant for each word it's trying to generate in the new
| sentence.
|
| One key advantage of the Transformer is that it can calculate
| the attention scores for all pairs of words at the same time,
| rather than one at a time like previous models. This allows it
| to process sentences more quickly, which is important for large
| tasks like translating a whole book.
| rosebay wrote:
| [dead]
| maxdoop wrote:
| You guys are cracking me up lol most comments here must have the
| smartest damn 5-year olds in mind!
|
| "It's simple ! Just tokenize the context and allow differentiated
| hash maps to best map the embeddings matrix. Duh!"
| detrites wrote:
| Explain Like I'm Five... _Years Into A Doctorate_
| amelius wrote:
| Simplest explanation: you start with a fully connected network of
| N layers. This already gives you the functionality. Then you
| remove a lot of unnecessary connections. This gives you the
| performance.
| hackandthink wrote:
| Cosma Shalizi explains it very well.
|
| "Again: Calling this "attention" at best a joke."
|
| http://bactra.org/notebooks/nn-attention-and-transformers.ht...
| MichaelRazum wrote:
| Thanks a lot for sharing. I liked the explanation, although it
| seems there is not a perfect match between attention and the
| kernel (as far as I understand), since as I understand it, x_o
| = Qx and x_i = Qx and y_i = Vx, but that doesn't map to Wu dot
| Wv.
|
| Anyway just curious do you or someone else have more sources in
| this format?
|
| At least this way it is very helpful to think about the
| methods. I kind of agree that the formula does look very, very
| similar to well-known methods. But on the other hand, the
| author doesn't explain transformers in a similarly simple way,
| since it is not obvious why you could stack kernels and get
| better results.
| Kiro wrote:
| Has ELI5 lost its literal meaning? What five-year-old would
| understand anything in this thread?
| krapp wrote:
| It was never meant to be taken literally, it's just an
| idiomatic way of asking that something be explained as simply
| as possible.
|
| An answer to OP's question targeted at a five year old level
| would likely be impossible given the subject matter, or else
| need to be so general and simplistic as to be useless on this
| forum of not actual five year olds.
| Izkata wrote:
| The current top comment managed it in a way that is actually
| useful: https://news.ycombinator.com/item?id=35981106
| krapp wrote:
| I don't know. It's simple and straightforward, but still
| seems like it would go over the head of an actual five year
| old.
| runetech wrote:
| Not sure I can give an ELI5 explanation, but for me these two
| short videos are the closest to what you are after on the topic.
| At least they helped me get up to speed fast.
|
| Hope they will do the same for you ;-)
|
| Large Language Models from scratch
| https://www.youtube.com/watch?v=lnA9DMvHtfI
|
| Large Language Models: Part 2
| https://www.youtube.com/watch?v=YDiSFS-yHwk
| probably_wrong wrote:
| I'll throw my hat in the ring.
|
| A transformer is a type of neural network that, like many
| networks before, is composed of two parts: the "encoder" that
| receives a text and builds an internal representation of what the
| text "means"[1], and the "decoder" that uses the internal
| representation built by the encoder to generate an output text.
| Let's say you want to translate the sentence "The train is
| arriving" to Spanish.
|
| Both the encoder and decoder are built like Lego, with identical
| layers stacked on top of each other. The lowest layer of the
| encoder looks at the input text and identifies the role of
| individual words and how they interact with each other. This is
| passed to the layer above, which does the same but at a higher
| level. In our example it would be as if the first layer
| identified that "train" and "arrive" are important, then the
| second one identifies that "the train" and "is arriving" are core
| concepts, the third one links both concepts together, and so on.
|
| All of these internal representations are then passed to the
| decoder (all of them, not just the last ones) which uses them to
| generate a single word, in this case "El". This word is then fed
| back to the decoder, which now needs to generate an appropriate
| continuation for "El", which in this case would be "tren". You
| repeat this procedure over and over until the transformer says
| "I'm done", hopefully having generated "El tren esta llegando" in
| the process.
|
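| In pseudo-Python, the generation loop looks roughly like this. The
| encode/decode functions below are fake stand-ins just to show the
| shape of the loop; a real system plugs the trained encoder and
| decoder in there:
|
|     # Toy stand-ins so the loop actually runs
|     def encode(text):
|         return text.split()                       # pretend "representations"
|
|     def decode(states, generated):
|         fake_output = ["El", "tren", "esta", "llegando", "<end>"]
|         return fake_output[len(generated) - 1]    # next word, one at a time
|
|     source = "The train is arriving"
|     states = encode(source)
|
|     generated = ["<start>"]
|     while generated[-1] != "<end>":
|         generated.append(decode(states, generated))  # feed output back in
|
|     print(" ".join(generated[1:-1]))                 # El tren esta llegando
|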
| The attention mechanism already existed before transformers,
| typically coupled with an RNN. The key concept of the transformer
| was building an architecture that removed the RNN completely. The
| negative side is that it is a computationally inefficient
| architecture as there are plenty of n^2 operations on the length
| of the input [2]. Luckily for us, a bunch of companies started
| releasing for free giant models trained on lots of data,
| researchers learned how to "fine tune" them to specific tasks
| using way less data than what it would have taken to train from
| scratch, and transformers exploded in popularity.
|
| [1] I use "mean" in quotes here because the transformer can only
| learn from word co-occurrences. It knows that "grass" and "green"
| go well together, but it doesn't have the data to properly say
| why. The paper "Climbing towards NLU" is a nice read if you care
| about the topic, but be aware that some people disagree with this
| point of view.
|
| [2] The transformer is less efficient than an LSTM in the total
| number of operations but, simultaneously, it is easier to
| parallelize. If you are Google this is the kind of problem you
| can easily solve by throwing a data center or two at the problem.
| jimbokun wrote:
| > The negative side is that it is a computationally inefficient
| architecture as there are plenty of n^2 operations on the
| length of the input
|
| Is this the reason for the limited token windows?
| AlphaWeaver wrote:
| This link was posted here recently, and was the most
| understandable explanation I've found so far:
| https://e2eml.school/transformers.html
| dang wrote:
| Discussed here:
|
| _Transformers from Scratch (2021)_ -
| https://news.ycombinator.com/item?id=35697627 - April 2023 (46
| comments)
|
| _Transformers from Scratch_ -
| https://news.ycombinator.com/item?id=29315107 - Nov 2021 (17
| comments)
| Analemma_ wrote:
| It helps to start with recurrent neural networks first, since
| those were the previous standard way of doing next-token-
| prediction. They worked, but training them was extremely slow
| because it couldn't be parallelized. Transformers are a way of
| getting mostly the same capabilities as RNNs but with a
| parallelizable architecture so you can actually train it with
| huge parameter numbers in a reasonable amount of time.
| danieldk wrote:
| I agree it's very useful to look at RNNs. Even more because
| attention mechanisms were already getting popular in LSTM/GRU
| models before the transformers paper. One of the main insights
| of the transformer paper is that you can do with just attention
| and you do not need the recurrent part of the network anymore
| (hence the title "Attention is all you need").
| hprotagonist wrote:
| https://jalammar.github.io/illustrated-transformer/
| Imnimo wrote:
| Here is an attempt that sacrifices technical precision for
| intuition.
|
| We're going to represent words as vectors (a sequence of
| numbers). We would like it to be the case that the value of the
| numbers reflects the meaning of the words. Words that mean
| similar things should be near each other. We also want to
| represent higher level ideas, ideas that take multiple words to
| express, in the same way. You can think of all the possible
| vectors as the entire space of ideas.
|
| To begin with, though, we just have a vector for each word. This
| is insufficient - does the word "bank" mean the edge of a river
| or a place to store money? Is it a noun or a verb? In order to
| figure out the correct vector for a particular instance of this
| word, we need to take into account its context.
|
| A natural idea might be to look at the words next to it. This
| works okay, but it's not the best. In the sentence "I needed some
| money so I got in my car and took a drive down to the bank", the
| word that really tells me the most about "bank" is "money", even
| though it's far away in the sentence. What I really want is to
| find informative words based on their meaning.
|
| This is what transformers and attention are for. The process
| works like this: For each word, I compose a "query" - in hand-
| wavy terms, this says "I'm looking for any other words out there
| that are X". X could be "related to money" or "near the end of
| the sentence" or "are adjectives". Next, for each word I also
| compute a "key", this is the counterpart of the query, and says
| "I have Y". For each query, I compare it to all the keys, and
| find which ones are most similar. This tells me which words
| (queries) should pay attention to which other words (keys).
| Finally, for each word I compute a "value". Whereas the "key" was
| sort of an advertisement saying what sort of information the word
| has, the "value" is the information itself. Under the hood, the
| "query", "key" and "value" are all just vectors. A query and a
| key match if their vectors are similar.
|
| So, as an example, suppose that my sentence is "Steve has a green
| thumb". We want to understand the meaning of the word "thumb".
| Perhaps a useful step for understanding any noun would be to look
| for adjectives that modify it. We compute a "query" that says
| "I'm looking for words near the end of the sentence that are
| adjectives". When computing a "key" for the word green, maybe we
| compute "I'm near the end of the sentence, I'm a color, I'm an
| adjective or a noun". These match pretty well, so "thumb" attends
| to "green". We then compute a "value" for "green" that
| communicates its meaning.
|
| By combining the information we got from the word "green" with
| the information for the word "thumb", we can have a better
| understanding of what it means in this particular sentence. If we
| repeat this process many times, we can build up stronger
| understanding of the whole sentence. We could also have a special
| empty word at the end that represents "what might come next?",
| and use that to generate more text.
|
| But how did we know which queries, keys and values to compute?
| How did we know how to represent a word's meaning as numbers at
| all? These seemingly impossible questions are what is being
| "learned". How exactly that happens would require an equally big
| explanation of its own.
|
| Keep in mind that this explanation is very fuzzy, and is only
| intended to convey the loose intuition of what is going on. It
| leaves out many technical details and even gets some details
| intentionally wrong to avoid confusion.
| jerpint wrote:
| Let the model decide which words to focus on when producing a
| response
| senko wrote:
| Not really an ELI5, but it you really want to understand the
| topic, I've gathered the best resources I could find to help me
| do the same: https://blog.senko.net/learn-ai
|
| If you're a programmer, start with Karpathy's video series. For a
| somewhat gentler intro, take a look at the MIT intro lectures
| first to build up on the fundamentals.
|
| Then you're ready for The Illustrated Transformer, and
| afterwards, if you're brave enough, the Annotated Transformer.
|
| It's a fascinating subject, more so when you have a solid grasp!
| And you'll be able to quickly spot people who kinda stumble their
| way through but have big gaps in understanding.
| throwawaymaths wrote:
| The Yannic Kilcher review is quite good.
|
| https://youtu.be/iDulhoQ2pro
|
| I can't ELI5 but I can ELI-junior-dev. Tl;dw:
|
| Transformers work by basically being a differentiable lookup/hash
| table. First your input is tokenized and (N) tokens (this
| constitutes the attention frame) are encoded both based on token
| identity _and position_ in the attention frame.
|
| Then there is an NxN matrix that is applied to your attention
| frame "performing the lookup query" over all other tokens in the
| attention frame, so every token gets a "contextual semantic
| understanding" that takes in both all the other stuff in the
| attention frame and its relative position.
|
| GPT is impressive because the N is really huge _and_ it has many
| layers. A big N means you can potentially access information
| farther away. Each layer gives more opportunities to summarize
| and integrate long range information in a fractal process.
|
| Two key takeaways:
|
| - differentiable hash tables
|
| - encoding relative position using periodic functions
|
| NB: the attention frame tokens are actually K-vectors (so the
| frame is a KxN matrix) and the query matrix is an NxNxK tensor
| IIRC but it's easier to describe it this way
| sva_ wrote:
| One thing that might be worth pointing out is that the
| transformer architecture owes a great deal of its success to
| the fact that it can be implemented in a way that it can be
| massively parallelized in a very efficient manner.
| throwawaymaths wrote:
| Compared to rnns... maybe? The big nxn is really a killer.
|
| I don't know how to judge parallelizability of different DNN
| models, you're comparing apples to oranges
| Salgat wrote:
| When you train a transformer, you're training what the next
| expected token is. You can train all positions of the
| sequence each in parallel rather than having to
| sequentially build up the memory state as you generate the
| sequence with an LSTM. Mind you the inference portion of a
| transformer is still sequentially bottlenecked since you
| don't know what the output sequence is supposed to be.
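|
| A tiny sketch of what "all positions in parallel" means for the
| training targets (toy token ids, no actual model here):
|
|     import torch
|
|     tokens = torch.tensor([5, 9, 2, 7, 3])   # a tokenized sentence
|
|     inputs = tokens[:-1]                     # [5, 9, 2, 7]
|     targets = tokens[1:]                     # [9, 2, 7, 3]
|
|     # One forward pass of a transformer scores every position at
|     # once, so the loss "did you predict targets[i] from
|     # inputs[:i+1]?" is computed for all i simultaneously. With an
|     # LSTM you'd have to step through the sequence to build up the
|     # hidden state first.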
| saynay wrote:
| I believe the other big thing is it allowed you to
| parallelize the training as well, so you could split
| those colossal training sets across many machines in a
| way you could not do with LSTMs.
| throwawaymaths wrote:
| I don't really believe this because in theory it's
| possible to split up training of arbitrary neural nets
| using decoupled neural interfaces (not that it's easy
| mind you) but at the point where you have openAI scale
| resources the bottleneck to achieving that is the
| difficulty of Python.
| rekttrader wrote:
| I came here to post this video. It's a great primer on the
| topic and it gives you ideas to prompt gpt and have it output
| more.
|
| It's how I got an understanding of beam search, a technique
| employed in some of the response building.
| zorr wrote:
| How does N relate to the number of parameters that is
| frequently mentioned?
| throwawaymaths wrote:
| In my screed, N is the attention width (how many tokens it
| looks at at a time). The number of parameters is O(KxNxNxL),
| where K is the vector size of your tokens and L is the # of
| layers. There are other parameters floating around, like in
| the encoder and decoder matrices, but the NxN matrix
| dominates.
| 8thcross wrote:
| This is an awesome explanation. You guys are the real
| heroes
| legalizemoney wrote:
| Having read the paper myself, I'm impressed with the quality of
| your explanation. Well done!
| ogoparootbbo2 wrote:
| what is a differentiable hash table? I understand
| differentiation, I don't understand differential hash table ...
| does that mean for every tiny gradient of a key, a
| corresponding gradient in value can be expected?
| kenjackson wrote:
| What does it mean for a lookup/hash table to be differentiable?
| theGnuMe wrote:
| The lookup/hashtable can be viewed as a matrix which is
| accessed by multiplying it with a vector.
| meowkit wrote:
| Going to go out on a limb and say they are probably referring
| to the gradient calculus required for updating the model.
|
| https://en.wikipedia.org/wiki/Differentiable_programming
|
| See automatic differentiation.
| throwawaymaths wrote:
| Correct, but note that if you subject a standard hash table
| algo to AD it won't magically become a transformer. (Hashes
| in the "normal construction" are discrete functions and
| thus aren't really continuous or differentiable, neither
| are lookup tables)
| tomp wrote:
| I'm not a ML expert but I know a bit about math.
|
| It's "differentiable" in the same way that e.g. the "jump
| function" ( _Heaviside step function_ ) is differentiable
| (not as a _function_ from real numbers to real numbers, but
| as a _distribution_ ). Its derivative is the "point impulse
| function" ( _Dirac delta function_ ), which, again, is a
| _distribution_ , not a real function.
|
| Distributions are nicely defined in math, but can't really be
| operated with numerically (at least not in the same way as
| real/float functions), but you can approximate them using
| continuous functions. So instead of having a function jump
| from 0 to 1, you "spread" the jump and implement it as a
| continuous transition from e.g. `0-epsilon` to `0+epsilon`
| for some tiny epsilon. Then you can differentiate it as
| usual, even numerically.
|
| Similarly, hash table lookup is a dis-continuous function -
| the result of `hash.get(lookup)` is just `value` (or `null`).
| To make it continuous, you "spread" the value, so that nearby
| keys (for some definition of "nearby") will return nearby
| values.
|
| One way to do this, is to use the scalar product between
| `lookup` and _all_ keys in the hashtable (normalized, the
| scalar product is close to 1 if the arguments are "nearby"),
| and use the result as weights to multiply with _all_ values
| in the hashtable. That's what the transformer does.
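|
| A tiny numerical sketch of that "soft lookup" (made-up keys and
| values, torch just for the math):
|
|     import torch
|     import torch.nn.functional as F
|
|     keys = torch.randn(5, 8)            # 5 entries, 8-dim keys
|     values = torch.randn(5, 8)          # 5 entries, 8-dim values
|     lookup = torch.randn(8)             # the query
|
|     # hard hash table: exactly one value comes back
|     # soft version: scalar products -> weights over *all* entries
|     weights = F.softmax(keys @ lookup, dim=0)
|     result = weights @ values           # weighted mix of *all* values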
| visarga wrote:
| It's not really doing hashing, which is random, but instead
| it makes meaningful connections, like connecting an adjective
| to its noun, or a pronoun to a previous reference of the
| name. That is why it is called "Attention is all you need"
| and not "Hashing is all you need". Attention includes some
| meaning, it is attentive to some aspect or another.
| throwawaymaths wrote:
| Hashing does not have to be random, it just has to compress
| the domain. Modulo number is a perfectly valid hash for
| some use cases.
|
| Edit: just looked it up, doesn't even have to compress the
| domain, identity hash is a thing.
| taylorius wrote:
| Differentiable in this context, means that it can be learned
| by gradient descent, which uses derivatives to adjust a
| neural network model's parameters to minimise some error
| measure. As for how that applies to a hash function, I think
| the lookup gives some sort of weighting for each possible
| output, for each input, with the largest weightings
| corresponding to a "match". But tbh I'm not certain on that
| last part...
| throwawaymaths wrote:
| > As for how that applies to a hash function, I think the
| lookup gives some sort of weighting for each possible
| output, for each input, with the largest weightings
| corresponding to a "match"
|
| Note that in general one "matches" more than one thing in
| which case (IIRC) the result will be a weighted linear
| combination of the query results.
|
| I do think that this collapses to a datastructure
| equivalent in function to the "normal hash table" (but not
| in performance, lol) in the degenerate case where every
| input yields strictly a single result with zero activation
| in all other results, so it's not invalid to call it a
| differentiable lookup table.
| throwawaymaths wrote:
| You can take a (calculus) derivative of the mathematical
| construct that represents the lookup table. And yeah, it is
| very not obvious how a "lookup/hash table" could be
| differentiable, based on how it's implemented say, in a
| leetcode exercise. That's part of the genius of the
| transformer.
| contravariant wrote:
| Well, a hash table _is_ basically a function.
| srcreigh wrote:
| Not a continuous function
| weinzierl wrote:
| I wanted to ask the same and especially I've always been
| wondering: How is the meaning of aforementioned
| 'differentiable' related to the same term in math?
| kccqzy wrote:
| Same thing. You need everything to be differentiable in
| order to run gradient descent. To first approximation,
| training a neural network is just gradient descent.
| dcre wrote:
| Not an expert (so this could well be slightly off), but
| here is a physics analogy:
|
| You're rolling a ball down a plank of wood and you want it
| to roll off the plank and continue to a certain spot on the
| ground. You know that if the ball rolls past the target,
| you have to angle the ramp lower. If it fails to reach the
| target, you angle the ramp higher. If it goes past by a
| lot, you make a big change to the ramp angle. If it goes
| past by a little, you make a small change to the ramp
| angle. In this way your error tells you something about the
| adjustment you need to make to the system to produce the
| desired output.
|
| Think of a function from the ramp angle (network weights)
| to how far the ball lands from the target (the error). The
| derivative of this function tells you what kind of change
| in angle results in what kind of change in the error, and
| you can use that to update the ramp until you hit the
| target. To say that the model is differentiable is to say
| that it's possible to infer from the error what kind of
| changes you need to make to the weights.
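|
| A tiny numerical version of the ramp story (the "physics" below is
| completely made up, it's just there to show the adjust-from-error
| loop):
|
|     import torch
|
|     target = 3.0                                   # where the ball should land
|     angle = torch.tensor(0.5, requires_grad=True)  # the "ramp angle" (our weight)
|
|     for _ in range(100):
|         landing = 4.0 * angle                # pretend physics: angle -> distance
|         error = (landing - target) ** 2      # how far off we are
|         error.backward()                     # derivative of error w.r.t. angle
|         with torch.no_grad():
|             angle -= 0.05 * angle.grad       # nudge the angle the right way
|             angle.grad.zero_()
|
|     print(angle.item())                      # converges to about 0.75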
| visarga wrote:
| The neural net is just a math function, continuous even,
| fully differentiable in all input points. In order to
| "learn" anything we compute gradients towards the function
| parameters. They get "nudged" slightly towards a better
| response, and we do this billions of times. It's like
| carving a raw stone block into a complex scene. If you put
| your data into the system it flows towards the desired
| output because the right path has been engraved during
| training.
|
| <rant> This explains a bit how neural nets work but from
| this to chatGPT is another whole leap. You'd have to assign
| some of the merits of the AI to the training data itself,
| it's not just the algorithm for learning, but what is being
| learned that matters. The neural net is the same, but using
| 1T tokens of text is making it smart. What's so magic about
| this data that it can turn a random init into a language
| programmable system? And the same language information makes
| a baby into a modern human, instead of just another animal.
| </>
| bjourne wrote:
| No, it's the _loss function_ we differentiate. The inputs
| to the loss function are the network weights. The inputs
| to the network are the samples, and those we do _not_
| differentiate.
| lhnz wrote:
| While it's true that we don't differentiate the input
| samples, we do differentiate the loss function's output
| with respect to each of the network weights. We use the
| chain rule to calculate each of these "gradients" and
| that process is known as backpropagation.
| dpcx wrote:
| I appreciate the explanation, but I don't know what junior-dev
| would understand most of this. I may be just a web developer,
| but I couldn't understand most of this. I'd still have to read
| for 30m to grok it all.
| throwawaymaths wrote:
| Yeah sorry, it still requires math and probably some exposure
| to ML basics.
| shadowgovt wrote:
| I think one hole in the description (for simplicity) is that
| "differentiable" is not an adjective that applies to hash
| tables.
|
| Differentiable relative to what? What is (x) in the
| d(hashtable)/d(x) equation?
| throwawaymaths wrote:
| That's exactly the point, though! It's surprising. A
| hashtable is a map from keys to values. Making it
| differentiable means that a small change in the key also
| makes a small change in the value!
| lhnz wrote:
| I think it applies because lookups can be done by
| multiplying one-hot vectors (or masks) with matrices;
| this is roughly analogous to what happens when we
| multiply Q with K^T in a self-attention head.
|
| Read this:
| https://e2eml.school/transformers.html#table_lookup
|
| And then read:
| https://e2eml.school/transformers.html#attention
|
| Matrix multiplication is differentiable as it is
| continuous. Therefore you can calculate partial
| derivatives of these operations. The ability to do that
| is what allows gradient descent optimization via the
| chain rule.
|
| > Differentiable relative to what? What is (x) in the
| > d(hashtable)/d(x) equation?
|
| I think the derivative we actually care about is
| `∂L/∂(hashtable-lookup)` but here the `hashtable-
| lookup` is actually the matrix multiplication mentioned
| above. We want to know how much the "loss" (or error)
| changes with respect to each of our "lookups". Knowing
| how each of our "lookups" causes our model to err from
| the output we expect, allows us to optimise it.
|
| Note: it's not really a "lookup" in the normal sense of
| the word. Hashtable lookups are discontinuous since we
| either have a value for a particular key or we do not.
| Our matrix multiplication is sort of like a
| probabilistic, continuous form of lookup in which nearby
| keys get partially activated based on their similarity to
| the query.
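|
| A tiny sketch of that lookup-by-matrix-multiplication point (toy
| numbers):
|
|     import torch
|
|     table = torch.tensor([[1., 2.],          # "value" for key 0
|                           [3., 4.],          # "value" for key 1
|                           [5., 6.]])         # "value" for key 2
|
|     one_hot = torch.tensor([0., 1., 0.])     # exact lookup of key 1
|     print(one_hot @ table)                   # tensor([3., 4.])
|
|     softish = torch.tensor([0.1, 0.7, 0.2])  # the kind of thing softmax(QK^T) gives
|     print(softish @ table)                   # mostly row 1, plus a bit of its neighbours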
| KyeRussell wrote:
| [dead]
| jimkleiber wrote:
| I had to look up "tl;dw" and realized it meant "too long;
| didn't watch" and not my first AI-laden instinct of "too long,
| didn't write" :-D
| [deleted]
| ftxbro wrote:
| I have some background but I'm probably not the best person in
| the world to explain.
|
| The important thing about the transformers model is that it's the
| first one we have found which keeps unlocking more and more
| powerful and general cognitive abilities the more resources we
| throw at it (parameters, exaflops, datasets). I saw some
| interview with Ilya Sutskever where he says this; it almost
| certainly won't be the last or best one, but it was the first
| one.
|
| --
|
| Why was it the first one? How were these guys so clever and other
| ones couldn't figure it out?
|
| OK so first you need some context. There is a lot of 'Newton
| standing on the shoulders of giants' going on here. If all of
| these giants were around in the 1970s, it probably would have
| been invented then. Heck for all we know something as good _was_
| invented in the 1970s but our computers were too smol to benefit
| from it. This is what John Carmack is currently looking into.
|
| To really notice the scaling benefits of the transformer
| architecture, they needed to run billion parameter transformer
| models on linear-algebra-accelerating GPU chips using
| differentiable programming frameworks. These are some of the
| giants we are standing on. The research and development pipeline
| for these amazing GPUs like [thousands of tech companies -> ASML
| -> TSMC -> NVIDIA] didn't exist until not so long ago. The
| special properties of transformers wouldn't have been discovered
| so soon without this hardware stack.
|
| Another giant we are standing on is the differentiable
| programming linear algebra libraries and frameworks similar to
| theano or tensorflow or pytorch or jax. They have had things like
| this under the name 'mathematical programming' like CPLEX but it
| wasn't as accessible. 'Differentiable programming' is a newish
| terminology for what used to be called 'automatic
| differentiation' where 'differentiation' means essentially the
| same as calculus derivative. Informally it means that these
| libraries can predict any tiny output effect of any tiny input
| change as a computationally cheap side-effect of computing the
| given output, even for complicated calculations. This capability
| makes optimization easier, in particular it generalizes the
| 'backpropagation' algorithm of traditional artificial neural
| networks.
|
| --
|
| What is the transformer model in more nerdy terms.
|
| At one level, it's just a complicatedly parameterized function,
| where you can fit the parameters by training on data. This
| viewpoint puts the importance on the computational power applied
| to training the model with the advantage of differentiable
| programming. Some will probably guess that the details of the
| model architecture don't really matter as long as it has
| sickening amount of parameters and exaflops and dataset. Some
| version of this viewpoint is probably true in my opinion.
|
| More specifically, the transformer architecture is like a chain
| of black box differentiable 'soft' lookup tables. The soft
| queries and keys and values are each lists of floating point
| numbers (for example a single soft query is a list of numbers,
| called a vector) and these vectors are stacked into matrices and
| the soft lookup is processed quickly with fast matrix
| multiplication tricks. Importantly, all of this is happening
| inside of a differentiable programming framework which lets you
| cheaply answer questions about how any small change to the input
| will affect the output. This capability is used for training, by
| making trillions of billions of tiny changes to the floating
| point numbers in the multiplication matrices in the boxes. At the
| end, the fully trained chain of black box functions can be used
| to compute a probability distribution over the next token in the
| message, which lets you generate messages or translate between
| languages or whatever.
| FranklinMaillot wrote:
| Those Computerphile videos[0] by Rob Miles helped me understand
| transformers. He specifically references the "Attention is all
| you need" paper.
|
| And for a deeper dive, Andrej Karpathy has this hands-on video[1]
| where he builds a transformer from scratch. You can check out his
| other videos on NLP as well; they are all excellent.
|
| [0] https://youtu.be/rURRYI66E54, https://youtu.be/89A4jGvaaKk
|
| [1] https://youtu.be/kCc8FmEb1nY
| WithinReason wrote:
| Well here is my (a bit cynical) take on it.
|
| In the beginning, there was the matrix multiply. A simple neural
| network is a chain of matrix multiplies. Let's say you have your
| data A1 and weights W1 in a matrix. You produce A2 as A1xW1. Then
| you produce A3 as A2xW2, and so on. There are other operations in
| there like non-linearities (so that you can actually learn
| something interesting) and fancy batch norms, but let's forget
| about those for now.
|
| The problem with this is, it's not very expressive. Let's say your
| A1 matrix has just 2 values, and you want the output to be their
| product. Can you learn a weight matrix that performs
| multiplication of these inputs? No you can't. Multiplication must
| be simulated by piecing together piecewise linear functions. To
| perform multiplication, the weight matrix W would also need to be
| produced by the network.
|
| Transformers do basically that. In the product A*W you replace A
| with (AxW1), W with (AxW2), and multiply those together:
| (AxW1)x(AxW2). And then do it once more for good measure:
| (AxW1)x(AxW2)x(AxW3). Boom, Nobel prize. Now your network can
| multiply, not just add. OK, it's actually a bit more complicated;
| there is for example a softmax in the middle to perform
| normalisation, which in general helps during numerical
| optimisation: softmax((AxW1)x(AxW2))x(AxW3).
|
| There are then fancy explanations that try to retrospectively
| justify this as a "differentiable lookup table" or somesuch
| nonsense, calling the 3 parts "key", "query" and "value", which
| help make your paper more popular. But the basic idea is not so
| complicated. A Transformer then uses this operation as a building
| block (running them in parallel and in sequence) to build giant
| networks that can do really cool things. Maybe you can teach
| networks to divide next and then you get the next Nobel prize.
| jvm___ wrote:
| So, if you got a time machine back to the year 2000 and knew how
| to code Transformers - could you do it? Your CPU/cloud/parallel
| processing hardware would be limited, but as I understand it
| there's nothing revolutionary being done here, just a lot of
| matrix math that produces results (and we're still not fully
| understanding why that all happens).
| Izkata wrote:
| "Matrix math" itself is very old, but I'm guessing what you're
| referring to here is probably neural networks or Markov chains,
| both of which are... a half-century to a century old, if I
| remember right?
|
| Transformers on the other hand are new, less than a decade old.
| jgotti92 wrote:
| Check this explanation from Karpathy which provides really good
| intuition https://youtu.be/9uw3F6rndnA
|
| Basically tokens "talk" to each other and say this is what i have
| and this is what i look for.
| saaaaaam wrote:
| Your summary of this is very good. It's a really simple and
| intuitive way to express something very complex.
| codedokode wrote:
| I cannot do ELI5, but can do ELI14 for you.
|
| Transformer is a building block (a part) of a language model.
| "Language model" is an algorithm that can predict words following
| given words. For example, you can give a text to a model and get
| a summary of this text, or an answer to the question in the text,
| or a translation of the text.
|
| Language models are often made of two parts - encoder and
| decoder. The encoder reads input text (each word is encoded as a
| bunch of numbers, for example, as list of 512 floating-point
| numbers) and produces a "state" (also a large list of numbers)
| which is expected to encode the meaning of the text. Then the
| decoder reads the state and produces the output as words (to be
| exact, as probabilities for every possible word in the dictionary
| to be at a certain position in the output).
|
| Before Transformers, people tended to use the so-called "recurrent
| neural network" architecture. With this approach, the encoder
| processes the text word by word and updates the state after every
| word:
|
|     state = some initial state
|     for word in text:
|         state = model(state, word)
|
| model(...) here is a complicated mathematical function, often
| with millions of operations and parameters.
|
| As I have written above, after reading the text, the state should
| encode the meaning of the text.
|
| But it turned out that this approach doesn't scale well with long
| or complicated texts because the information from beginning of
| the text gets lost. The model tends to "forget" what it had read
| before. So a new architecture, "Transformers", was proposed. The
| difference is that now we give entire text (each word encoded as
| bunch of numbers) to the model: state =
| model(input text)
|
| Now the model processes the text at once. But implementing this
| naively would result in a very large model with too many
| parameters that would require too much memory and computing time.
| So developers used a trick here - most of the time each input
| word is processed separately from others (as in recurrent model),
| but there are stages, called "attention" where the words are
| processed together (and those stages are relatively light), so it
| looks like this:
|
|     # stage where all text is processed at once
|     # using a quick algorithm
|     state1 = attention(input text)
|     # stage where each part of the state is processed independently
|     # with lots of heavy calculations
|     state2 = map(some function, state1)
|     state3 = attention(state2)
|     state4 = map(some function, state3)
|     ...
|
| To summarize, in Transformers the model processes the text at
| once, but we have to employ tricks and split processing into
| stages to make calculation feasible. Probably that is why some
| people believe the authors should receive a reward for their
| work.
|
| I think this explanation is as far as one can get without
| learning ML.
| codedokode wrote:
| Also I think this thread is a good place to complain about the
| paper. The model is not described clearly. For example, try to
| find the size of the input data vector for the model in the paper
| - it is not specified. There is also a misleading phrase:
|
|     All sub-layers in the model, as well as the embedding layers,
|     produce outputs of dimension d_model = 512.
|
| which makes the reader think that each block (Transformer) gets
| a 512-dimensional vector as input and produces 512 numbers at the
| output. But this is wrong. 512 numbers is just a single word,
| not entire text or internal state. I could not understand this
| from reading just the original paper.
|
| Also it is not written where do keys, queries and values for
| attention come from.
| s-xyz wrote:
| Technically speaking, the breakthrough was also that it allowed
| the computation to be parallelized. Instead of going word by word
| through a sequence and optimizing for the next word, the approach
| shifted to processing the words independently and then applying
| the same statistical approach of predicting the next word
| relative to each word or the sequence. The final outcome is then
| a weighted sum of these independent pieces.
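|
| A toy NumPy illustration of that sequential-vs-parallel
| difference (the sizes and weights here are made up):
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       words = rng.normal(size=(6, 8))   # 6 words, 8 numbers each
|       W = rng.normal(size=(8, 8))
|
|       # recurrent style: one word at a time, each step must wait
|       # for the previous one
|       state = np.zeros(8)
|       for w in words:
|           state = np.tanh(W @ state + w)
|
|       # transformer style: batched matrix products handle every
|       # word at once, so the work spreads across many GPU cores
|       scores = words @ words.T          # all pairwise comparisons
|       weights = np.exp(scores)
|       weights /= weights.sum(axis=-1, keepdims=True)
|       outputs = weights @ words         # weighted sums of pieces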
| benjismith wrote:
| Okay, here's my attempt!
|
| First, we take a sequence of words and represent it as a grid of
| numbers: each column of the grid is a separate word, and each row
| of the grid is a measurement of some property of that word. Words
| with similar meanings are likely to have similar numerical values
| on a row-by-row basis.
|
| (During the training process, we create a dictionary of all
| possible words, with a column of numbers for each of those words.
| More on this later!)
|
| This grid is called the "context". Typical systems will have a
| context that spans several thousand columns and several thousand
| rows. Right now, context length (column count) is rapidly
| expanding (1k to 2k to 8k to 32k to 100k+!!) while the
| dimensionality of each word in the dictionary (row count) is
| pretty static at around 4k to 8k...
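|
| If it helps, here is what that grid might look like in toy NumPy
| code (a made-up 3-word dictionary with only 4 "properties" per
| word instead of thousands):
|
|       import numpy as np
|
|       dictionary = {
|           "the": np.array([ 0.1, -0.3,  0.0,  0.7]),
|           "cat": np.array([ 0.9,  0.2, -0.5,  0.1]),
|           "sat": np.array([-0.2,  0.8,  0.4, -0.6]),
|       }
|
|       # the "context": one column per word, one row per property
|       context = np.column_stack(
|           [dictionary[w] for w in ["the", "cat", "sat"]])
|       print(context.shape)   # (4, 3): 4 properties x 3 words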
|
| Anyhow, the Transformer architecture takes that grid and passes
| it through a multi-layer transformation algorithm. The
| functionality of each layer is identical: receive the grid of
| numbers as input, then perform a mathematical transformation on
| the grid of numbers, and pass it along to the next layer.
|
| Most systems these days have around 64 or 96 layers.
|
| After the grid of numbers has passed through all the layers, we
| can use it to generate a new column of numbers that predicts the
| properties of some word that would maximize the coherence of the
| sequence if we add it to the end of the grid. We take that new
| column of numbers and comb through our dictionary to find the
| actual word that most-closely matches the properties we're
| looking for.
|
| That word is the winner! We add it to the sequence as a new
| column, remove the first column, and run the whole process again!
| That's how we generate long text-completions one word at a time
| :D
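|
| As a rough sketch of that loop (run_layers stands in for the
| whole stack of layers described below, and real systems sample
| from a probability distribution rather than always taking the
| single closest word):
|
|       import numpy as np
|
|       def nearest_word(predicted_column, dictionary):
|           # pick the dictionary word whose properties best match
|           return max(dictionary, key=lambda w:
|                      float(dictionary[w] @ predicted_column))
|
|       def generate(words, dictionary, run_layers, n_new=10):
|           words = list(words)
|           for _ in range(n_new):
|               grid = np.column_stack(
|                   [dictionary[w] for w in words])
|               out = run_layers(grid)      # the 64-96 layers
|               new_column = out[:, -1]     # predicted properties
|               words.append(nearest_word(new_column, dictionary))
|           return words
|
|       # plumbing check with an identity "model":
|       # generate(["the", "cat"], dictionary, lambda g: g)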
|
| So the interesting bits are located within that stack of layers.
| This is why it's called "deep learning".
|
| The mathematical transformation in each layer is called "self-
| attention", and it involves a lot of matrix multiplications and
| dot-product calculations with a learned set of "Query, Key and
| Value" matrixes.
|
| It can be hard to understand what these layers are doing
| linguistically, but we can use image-processing and computer-
| vision as a good metaphor, since images are also grids of
| numbers, and we've all seen how photo-filters can transform that
| entire grid in lots of useful ways...
|
| You can think of each layer in the transformer as being like a
| "mask" or "filter" that selects various interesting features from
| the grid, and then tweaks the image with respect to those masks
| and filters.
|
| In image processing, you might apply a color-channel mask (chroma
| key) to select all the green pixels in the background, so that
| you can erase the background and replace it with other footage.
| Or you might apply a "gaussian blur" that mixes each pixel with
| its nearest neighbors, to create a blurring effect. Or you might
| do the inverse of a gaussian blur, to create a "sharpening"
| operation that helps you find edges...
|
| But the basic idea is that you have a library of operations that
| you can apply to a grid of pixels, in order to transform the
| image (or part of the image) for a desired effect. And you can
| stack these transforms to create arbitrarily-complex effects.
|
| The same thing is true in a linguistic transformer, where a text
| sequence is modeled as a matrix.
|
| The language-model has a library of "Query, Key and Value"
| matrixes (which were learned during training) that are roughly
| analogous to the "Masks and Filters" we use on images.
|
| Each layer in the Transformer architecture attempts to identify
| some features of the incoming linguistic data, and then, having
| identified those features, it folds its transformation of them
| back into the matrix, so that the next layer works with the
| transformed representation rather than the raw original.
|
| We don't know exactly what each of these layers is doing in a
| linguistic model, but we can imagine it's probably doing things
| like: performing part-of-speech identification (in this context,
| is the word "ring" a noun or a verb?), reference resolution (who
| does the word "he" refer to in this sentence?), etc, etc.
|
| And the "dot-product" calculations in each attention layer are
| there to make each word "entangled" with its neighbors, so that
| we can discover all the ways that each word is connected to all
| the other words in its context.
|
| So... that's how we generate word-predictions (aka "inference")
| at runtime!
|
| But why does it work?
|
| To understand why it's so effective, you have to understand a bit
| about the training process.
|
| During inference, data always flows in the same direction. That's
| why it's called a "feed-forward" network.
|
| But during training, there's another step called "back-
| propagation".
|
| For each document in our training corpus, we go through all the
| steps I described above, passing each word into our feed-forward
| neural network and making word-predictions. We start out with a
| completely randomized set of QKV matrixes, so the results are
| often really bad!
|
| During training, when we make a prediction, we KNOW what word is
| supposed to come next. And we have a numerical representation of
| each word (4096 numbers in a column!) so we can measure the error
| between our predictions and the actual next word. Those "error"
| measurements are also represented as columns of 4096 numbers
| (because we measure the error in every dimension).
|
| So we take that error vector and pass it backward through the
| whole system! Each layer needs to take the back-propagated error
| matrix and perform tiny adjustments to its Query, Key, and Value
| matrixes. Having compensated for those errors, it reverses its
| calculations based on the new QKV, and passes the resultant
| matrix backward to the previous layer. So we make tiny
| corrections on all 96 layers, and eventually to the word-vectors
| in the dictionary itself!
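|
| A deliberately tiny stand-in for that error-and-adjust step, with
| the entire stack of layers collapsed into one matrix W just to
| show the shape of the update (real training uses automatic
| differentiation through every layer):
|
|       import numpy as np
|
|       def training_step(W, x, actual_next_column, lr=1e-3):
|           predicted_column = W @ x
|           # the error, measured in every dimension
|           error = predicted_column - actual_next_column
|           loss = float(np.mean(error ** 2))
|           # how much each entry of W contributed to the error
|           grad_W = np.outer(error, x) * (2.0 / error.size)
|           W -= lr * grad_W     # the tiny adjustment
|           return loss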
|
| Like I said earlier, we don't know exactly what those layers are
| doing. But we know that they're performing a hierarchical
| decomposition of concepts.
|
| Hope that helps!
| Me1000 wrote:
| This was a very helpful visualization, thank you!
|
| The "entanglement" part intuitively makes sense to me, but one
| bit I always get caught up on the key, query, and value
| matrices. In every self-attention explanation I've read/watched
| they tend to get thrown out there and similar to what you did
| here but leave their usage/purpose a little vague.
|
| Would you mind trying to explain those in more detail? I've
| heard the database analogy where you start with a query to get
| a set of keys which you then use to lookup a value, but that
| doesn't really compute with my mental model of neural networks.
|
| Is it accurate to say that these separate QKV matrices are
| layers in the network? That doesn't seem exactly right since I
| think the self-attention layer as a whole contains these three
| different matrices. I would assume they got their names for a
| reason that should make it somewhat easy to explain their
| individual purposes and what they try to represent in the NN.
| detrites wrote:
| That was incredible. Thank you! If you made it into an article
| with images showing the mask/filter analogy, it might be one of
| the best/most unique explanations I've seen. Love the ground-up
| approach beginning with data's shape.
|
| Reminded me of the style of a book on machine learning. If
| anyone liked this explanation, you may appreciate this book:
|
| https://www.amazon.com/Applied-Machine-Learning-Engineers-Al...
| nil-sec wrote:
| Feedforward: y=Wx
|
| Attention: y=W(x)x
|
| W is a matrix, x and y are vectors. In the second case, W is a
| function of the input.
| Sai_ wrote:
| You must be from a planet with very long years!
|
| There is no way I can even begin to digest what you have said
| in your comment.
| nil-sec wrote:
| Sorry, maybe I should have added more explanation. One way to
| think about attention, which is the main distinguishing
| element in a transformer, is as an adaptable matrix. A
| feedforward layer is a matrix with static entries that do not
| change at inference time (only during training). The
| attention mechanism offers a way to have adaptable weight
| matrices at inference time (this is implemented by using
| three different matrices, K, Q & V, called key, query and
| value, in case you want to dig deeper).
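|
| A minimal sketch of that distinction, with toy sizes and made-up
| weight names:
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       d = 4
|       x = rng.normal(size=(3, d))       # 3 input vectors
|
|       # feedforward: the weight matrix is fixed at inference time
|       W = rng.normal(size=(d, d))
|       y_ff = x @ W
|
|       # attention: the effective mixing matrix is computed from
|       # the input itself via the K, Q, V matrices
|       Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
|       scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
|       W_of_x = np.exp(scores - scores.max(-1, keepdims=True))
|       W_of_x /= W_of_x.sum(-1, keepdims=True)   # softmax rows
|       y_attn = W_of_x @ (x @ Wv)        # y = W(x) x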
| oneearedrabbit wrote:
| I think in your notation it should have been:
|
| y=Wx_0
|
| y=W(x)x_0
| nil-sec wrote:
| I guess I was more thinking about self attention, so yes. The
| more general case is covered by your notation!
| [deleted]
| devit wrote:
| It works like this:
|
| First, convert the input text to a sequence of token numbers
| (2048 tokens with 50257 possible token values in GPT-3) using a
| dictionary. For each token, create a vector with 1 at the token
| index and 0 elsewhere, transform it with a learned "embedding"
| matrix (50257x12288 in GPT-3) and sum it with a vector of sine
| and cosine functions with several different periodicities.
|
| Then, for each layer, and each attention head (96 layers and 96
| heads per layer in GPT-3), transform the input vector by query,
| key and value matrices (12288x128 in GPT-3) to obtain a query,
| key and value vector for each token. Then for each token, compute
| the dot product of its query vector with the key vectors of all
| previous tokens, scale by 1/sqrt of the vector dimension and
| normalize the results so they sum to 1 by using softmax (i.e.
| applying e^x and dividing by the sum), giving the attention
| coefficients; then, compute the attention head output by summing
| the value vectors of previous tokens weighted by the attention
| coefficients. Now, for each token, glue the outputs for all
| attention heads in the layer (each with its own key/query/value
| learned matrices), add the input and normalize (normalizing means
| that the vector values are biased and scaled so they have mean 0
| and variance 1).
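|
| For the "add the input and normalize" step, a small sketch (the
| learned per-dimension scale and bias that real models also apply
| is left out):
|
|       import numpy as np
|
|       def add_and_normalize(x, sublayer_output, eps=1e-5):
|           v = x + sublayer_output  # residual: add the input back
|           # shift and scale so the values have mean 0, variance 1
|           return (v - v.mean()) / np.sqrt(v.var() + eps)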
|
| Next, for the feedforward layer, apply a learned matrix, add a
| learned vector and apply a ReLU (which is f(x) = x for positive x
| and f(x) = 0 for negative x), then apply a second learned matrix
| and vector (the two matrices are 12288x49152 and 49152x12288 in
| GPT-3, and together they account for around 70% of its
| parameters), then add the input before the feedforward layer and
| normalize.
|
| Repeat the process for each layer, each with its own matrices,
| passing the output of the previous layer as input. Finally, apply
| the transpose of the initial embedding matrix and use softmax to
| get probabilities for the next token at each position. For
| training, train the network so that these probabilities are close
| to the actual next token in the text. For inference, sample the
| next token from the top K tokens of the probability distribution
| (above a cutoff) and repeat the whole thing to generate tokens
| until an end-of-text token is generated.
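|
| A small sketch of that last sampling step (top-K only; real
| implementations usually combine this with a probability cutoff
| and a temperature setting):
|
|       import numpy as np
|
|       def sample_next_token(logits, k=40, rng=None):
|           if rng is None:
|               rng = np.random.default_rng()
|           # softmax over the vocabulary (50257 entries in GPT-3)
|           probs = np.exp(logits - logits.max())
|           probs /= probs.sum()
|           # keep only the k most likely tokens and renormalize
|           top = np.argsort(probs)[-k:]
|           p = probs[top] / probs[top].sum()
|           return int(rng.choice(top, p=p))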
| [deleted]
| godelski wrote:
| > I have zero AI/ML knowledge
|
| This may make it difficult to explain and I already see many
| incorrect explanations here and even more lazy ones (why post the
| first Google result? You're just adding noise)
|
| > Steve Yegge on Medium thinks that the team behind Transformers
| deserves a Nobel
|
| First, Yegge needs to be able to tell me what Attention and
| Transformers are. More importantly, he needs to tell me who
| invented them.
|
| That actually gets to our important point and why there are so
| many bad answers here and elsewhere: you're missing a lot of
| context, and the definitions themselves are murky. This
| is also what makes it difficult to ELI5. I'll try, then try to
| give you resources to get an actually good answer.
|
| == Bad Answer (ELI5) ==
|
| A transformer is an algorithm that considers the relationship of
| all parts of a piece of data. It does this through 4 mechanisms
| and in two parts. The first part is composed of a normalization
| block and an attention block. The normalization block scales the
| data and ensures that the data is not too large. Then the
| attention mechanism takes all the data handed to it and considers
| how it is all related to one another. This is called "self-
| attention" when we only consider one input and it is called
| "cross-attention" when we have multiple inputs and compare. Both
| of these create a relationship that are similar to creating a
| lookup table. The second block is also composed of a
| normalization block followed by a linear layer. The linear layer
| reprocesses all the relationships it just learned and gives them
| context. But we haven't stated the 4th mechanism! This is called
| a residual layer or "skip" layer. This allows the data to pass
| right on by each of the above parts without being processed and
| this little side path is key to getting things to train
| efficiently.
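|
| A rough sketch of that block (assuming the common "pre-norm"
| arrangement; attention here is a single head and the linear layer
| is collapsed to one matrix just to keep it short):
|
|       import numpy as np
|
|       def normalize(x, eps=1e-5):
|           m = x.mean(-1, keepdims=True)
|           v = x.var(-1, keepdims=True)
|           return (x - m) / np.sqrt(v + eps)
|
|       def transformer_block(x, Wq, Wk, Wv, Wff):
|           # part 1: normalization block, then attention block
|           h = normalize(x)
|           s = (h @ Wq) @ (h @ Wk).T / np.sqrt(Wk.shape[1])
|           a = np.exp(s - s.max(-1, keepdims=True))
|           a /= a.sum(-1, keepdims=True)
|           x = x + a @ (h @ Wv)      # residual / "skip" path
|           # part 2: normalization block, then linear layer
|           x = x + normalize(x) @ Wff
|           return x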
|
| Now that doesn't really do the work justice or give a good
| explanation of why or how things actually work. ELI5 isn't a good
| way to understand things for usage, but it is an okay place to
| start and learn abstract concepts. For the next level up I
| suggest Training Compact Transformers[0]. It'll give some
| illustrations and code to help you follow along. It is focused on
| vision transformers, but it is all the same. The next level I
| suggest Karpathy's video on GPT[1], where you will build
| transformers and he goes in a bit more depth. Both these are good
| for novices and people with little mathematical background. For
| more lore and understanding why we got here and the confusion
| over the definition of attention I suggest Lilian Weng's blog[2]
| (everything she does is gold). For a lecture and more depth I
| suggest Pascal Poupart's class. Lecture 19[3] is the one on
| attention and transformers, but you need to watch at least
| Lecture 18 first; if you actually have no ML experience or
| knowledge, you should probably start from the beginning.
|
| The truth is that not everything can be explained in simple
| terms, at least not if one wants an adequate understanding. That
| misquotation of Einstein (probably originating from Nelson) is
| far from accurate, and I wouldn't expect someone who introduced a
| highly abstract concept with complex mathematics (to such a
| degree that physicists argued he was a mathematician) to say
| something so silly. There is a lot lost when distilling a concept,
| and neither the listener nor speaker should fool themselves into
| believing this makes them knowledgeable (armchair expertise is a
| frustrating point on the internet and has gotten our society in a
| lot of trouble).
|
| [0] https://medium.com/pytorch/training-compact-transformers-
| fro...
|
| [1] https://www.youtube.com/watch?v=kCc8FmEb1nY
|
| [2] https://lilianweng.github.io/posts/2018-06-24-attention/
|
| [3] https://www.youtube.com/watch?v=OyFJWRnt_AY
| Buttons840 wrote:
| I've been working through [0]. Like a lot of math, the notation
| is daunting, but once you become familiar with it, it really is a
| nice tool for thought.
|
| [0]: https://arxiv.org/abs/2207.09238
| PeterisP wrote:
| ELI5 is tricky as details have to be sacrificed, but I'll try.
|
| An attention mechanism is when you want a neural network to learn
| the function of how much attention to allocate to each item in a
| sequence, to learn which items should be looked at.
|
| A Transformer is built around a self-attention mechanism, where
| you ask the neural network to 'transform' each element by looking
| at its potential combination with every other element and using
| this (learnable, trainable) attention function to decide which
| combination(s) to apply.
|
| And it turns out that this very general mechanism, although
| compute-intensive (it considers everything linking with
| everything, so its complexity is quadratic in sequence length)
| and data-intensive (it has lots and lots of parameters, so it
| needs huge amounts of data to be useful), can actually represent
| many of the things we care about in a manner that can be trained
| with the deep learning algorithms we already had.
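|
| (The quadratic part in a few lines of toy NumPy: the score table
| is n-by-n, so doubling the sequence length quadruples the work.)
|
|       import numpy as np
|
|       n, d = 1000, 64                # sequence length, head size
|       q, k = np.ones((n, d)), np.ones((n, d))
|       scores = q @ k.T               # "everything vs everything"
|       print(scores.shape)            # (1000, 1000)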
|
| And, really, those are the two big things ML needs: a model
| structure where there exists some configuration of parameters
| which can actually represent the thing you want to calculate, and
| a way for that configuration to actually be determined from
| training data reasonably.
| boppo1 wrote:
| Who is Yegge and why should I care? All I found on google is that
| he wrote code for google and blogs.
| simonw wrote:
| He's been a popular tech writer for a couple of decades. He's
| written some very influential pieces in the past - here's a
| copy of one of his most famous (I don't think the original is
| available any more): https://gist.github.com/chitchcock/1281611
| sundarurfriend wrote:
| > (I don't think the original is available any more)
|
| That's because this one was not a blog post, it was a Google+
| post that he intended to share internally with Google only.
| He had other great rants on his actual blog though, about
| (vague memory here) Emacs, game development, language design,
| etc.
| wsgeorge wrote:
| He's an engineer who's been involved in (or close to) some of the
| most influential technologies/products in use in the last 10
| years, and his blog posts are quite celebrated and respected.
|
| Think of him as an essayist. Like Paul Graham (pg on this site)
| sundarurfriend wrote:
| 1 part pg and 3 parts _why maybe.
| Sai_ wrote:
| I first heard about him after an internal post/rant of his
| about Google+ got leaked and went viral.
|
| My impression is that he is a Kardashian - famous for being
| famous - in the software world.
| zh3 wrote:
| [flagged]
| blackbear_ wrote:
| Absolute BS. Transformers were literally introduced by that
| paper.
| Sai_ wrote:
| Maybe ELI5 was the wrong level for my ask. I still want to
| understand it from a technical perspective.
|
| Maybe ELI am a software engineer but not in AI/ML?
| redox99 wrote:
| *Prompt:* I'm a software engineer who has never done anything
| with AI/ML. Explain to me transformers and the "Attention is
| all you need" paper.
|
| *GPT4:* "Attention is All You Need" is a seminal paper
| published in 2017 by Vaswani et al. that introduced the
| Transformer architecture, a breakthrough in the field of AI
| and natural language processing (NLP). Transformers have
| since become the foundation for many state-of-the-art models,
| including GPT (Generative Pre-trained Transformer) series,
| BERT (Bidirectional Encoder Representations from
| Transformers), and more.
|
| At a high level, the Transformer architecture is designed to
| handle sequential data, like text, in parallel rather than
| sequentially, as was common with earlier models like
| Recurrent Neural Networks (RNNs) and Long Short-Term Memory
| (LSTM) networks. This parallelization allows for faster
| training and better scalability. The key innovation in
| Transformers is the attention mechanism, which is used to
| weigh the importance of different input tokens when
| generating an output token.
|
| Here's a simplified explanation of the main components of a
| Transformer:
|
| Input Embeddings: Transformers take input data in the form of
| text tokens. These tokens are converted into continuous vectors
| through an embedding layer.
|
| Positional Encoding: Unlike RNNs and LSTMs, Transformers don't
| have an inherent sense of sequence order. To address this,
| positional encoding is added to the input embeddings to provide
| information about the position of each token in the sequence.
|
| Encoder and Decoder: The Transformer architecture consists of an
| encoder and a decoder. The encoder is a stack of identical
| layers, each containing two primary components: a multi-head
| self-attention mechanism and a position-wise feed-forward
| network. The decoder is also composed of a stack of identical
| layers, but it has an additional multi-head attention mechanism
| that attends to the encoder's output.
|
| Multi-Head Attention: The attention mechanism is the core
| innovation in the Transformer architecture. It computes a
| weighted average of input tokens to generate context-aware
| representations. Multi-head attention allows the model to focus
| on different aspects of the input by employing multiple attention
| heads, each with its own set of learnable parameters.
|
| Feed-Forward Networks: These are simple fully connected layers
| that follow the attention mechanism in each layer of the encoder
| and decoder.
|
| Output: Finally, the decoder generates the output sequence token
| by token. This is done by applying a linear layer followed by a
| softmax activation function to get the probabilities of each
| possible output token.
|
| In summary, the Transformer architecture introduced in the
| "Attention is All You Need" paper revolutionized NLP by
| enabling more efficient and powerful models. The key
| innovation is the attention mechanism, which helps the model
| learn complex relationships between input tokens and produce
| context-aware representations. This has led to a wide array
| of applications, from machine translation and text
| summarization to sentiment analysis and language modeling.
|
| *Me:* It's okay-ish, but you're probably interested in a more
| detailed, low-level explanation. Also, transformers are widely
| used for computer vision nowadays (a bit forgivable given
| ChatGPT's 2021 cutoff, but they had already taken off by
| then).
| adamnemecek wrote:
| I argue that the attention mechanism is just Hopf algebra
| convolution. The QK mechanism selects possible candidates and the
| OV mechanism picks one of these possible candidates.
|
| I have recently written a paper on this
| https://arxiv.org/abs/2302.01834
|
| I also have a discord channel https://discord.cofunctional.ai.
|
| Ironically, it's the same mechanism as what renormalization in
| QFT does. I'm getting increasingly convinced that it's also how
| the brain works.
| detrites wrote:
| Could you ELI5 (or maybe 15, with high-school maths) this
| concept? Particularly how it may relate to how our brains might
| work? Most interested.
| raydiatian wrote:
| It's not really something you need to understand unless you're an
| ML researcher.
|
| I guess the ELI5 (with a BUNCH of details left out) is
| "Transformers: what if you didn't have to process sentences as a
| sequence of words, but rather as a picture of words."
| edgyquant wrote:
| > It's not really something you need to understand unless you're
| an ML researcher
|
| What kind of an answer is this? No it's something you should
| understand if you are at all interested.
| raydiatian wrote:
| It's like saying "I have no basis in differential equations
| or complex numbers or Newtonian mechanics or experimental
| findings on the wavelike nature of light but could somebody
| explain the Schrodinger equation derivation like I'm five?"
|
| Like, there's a reason we don't explain things to five year
| olds.
|
| Also I have a good ELI5 answer, so I don't know what else you
| want dude lol so petty
| rafi25 wrote:
| ELI5: In neural networks we could take data numbers, multiply by
| a constant, add a constant, add them to each other, and then do
| the same to the results again and again. And you could learn the
| right constants using labeled data. But we couldn't multiply data
| numbers against each other. So in transformers you can. But you
| must multiply all numbers against all numbers in a specific
| sample, so it's heavy.
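|
| In toy NumPy terms (numbers made up), the old kind of step versus
| the new one:
|
|       import numpy as np
|
|       x = np.array([1.0, 2.0, 3.0])     # "data numbers"
|       W = np.array([[0.5, -1.0, 0.0],
|                     [2.0,  0.1, 1.0],
|                     [0.0,  1.0, 0.3]])  # learned constants
|
|       classic = W @ x            # data times constants, summed
|       scores = np.outer(x, x)    # data times data: every number
|                                  # against every number, which is
|                                  # why it gets heavy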
| tworats wrote:
| The Illustrated Transfomer (
| https://jalammar.github.io/illustrated-transformer/ ) and
| Visualizing attention (
| https://towardsdatascience.com/deconstructing-bert-part-2-vi...
| ), are both really good resources. For a more ELI5 approach this
| non-technical explainer ( https://www.parand.com/a-non-technical-
| explanation-of-chatgp... ) covers it at a high level.
| kartayyar wrote:
| - You can develop a very deep understanding of a sequence by
| observing how each element interacts with each other over many
| sequences.
|
| - This understanding can be encapsulated in a "compressed", low-
| dimensional vector representation of a sequence.
|
| - You can use this understanding for many different downstream
| tasks, especially predicting the next item in a sequence.
|
| - This approach scales really well with lots of GPUs and data and
| is super applicable to generating text.
| cratermoon wrote:
| One or both of these might help
|
| Transformers, Explained: Understand the Model Behind GPT-3, BERT,
| and T5: https://daleonai.com/transformers-explained
|
| Transformers from Scratch: https://e2eml.school/transformers.html
|
| The first link covers Attention well.
| factorymoo wrote:
| "Transformers" and "Attention is All You Need" refer to an
| important development in machine learning and artificial
| intelligence, particularly in the field of natural language
| processing (NLP). I'll try to explain them in a simple way.
|
| Think of a conversation you had with a friend. While they were
| talking, you were probably not just listening to the words they
| were saying right now, but also remembering what they said a few
| minutes ago. Your brain was connecting the dots between different
| parts of the conversation to understand the full meaning. Now,
| imagine if you could only understand each word in isolation and
| couldn't remember anything from a few seconds ago. Conversations
| would be pretty hard to understand, right?
|
| In early NLP models, this was a big problem. They couldn't easily
| look at the "context" of a conversation or a sentence. They could
| only look at a few words at a time, so they were a bit like our
| forgetful person. They were good at understanding the meaning of
| individual words, but not so good at understanding how those
| words fit together to create meaning.
| Sai_ wrote:
| Enlightening example of having a conversation. Makes things
| clearer.
| tikkun wrote:
| Did you use GPT to write this? (Not a bad thing! It's a decent
| answer)
| mcbuilder wrote:
| I copy pasted the attention is all you need paper into
| ChatGPT4 and gave it the prompt "Explain like I'm 5 years
| old".
|
| _The Transformer is a new type of computer program that
| helps translate languages and understand sentences. It works
| by paying attention to different parts of a sentence at the
| same time, instead of looking at one word after another like
| older programs. This makes it faster and better at
| understanding complicated sentences. It has been tested on
| translating English to German and English to French and did a
| really good job._
|
| (Edit)
| jasonjmcghee wrote:
| Immediately thought this was gpt as well.
|
| Assuming prompt was "Explain Transformers and 'Attention is
| all you need' in a simple way"
| drexlspivey wrote:
| AFAIK Transformers and context size are orthogonal concepts.
| You could have large token contexts before. The transformer
| directs the "attention" to a specific word/token inside the
| context.
| notfried wrote:
| Prompt: "ELI5 the Attention is All We Need paper."
|
| Imagine you have a box of toys. Some toys are more interesting to
| you than others, so you pay more attention to those toys and play
| with them more. The same thing happens in the "Attention is All
| You Need" paper, but instead of toys, we have words in a
| sentence.
|
| Before this paper, when computers tried to translate one language
| to another, they would look at one word, translate it, then move
| to the next word. This works okay, but it's not great because
| sometimes the meaning of a word can depend on other words in the
| sentence.
|
| The clever thing about the "Attention is All You Need" paper is
| that it taught the computer to pay attention to all the words in
| a sentence at the same time, but give more importance
| ("attention") to the words that matter most for understanding and
| translation.
|
| This new way of translating languages using "attention" made
| computers much better at understanding and translating languages.
| It's like if you could become better at playing by understanding
| all your toys at once! This paper was a big deal in the field of
| machine learning and artificial intelligence because it improved
| how machines understand languages.
| fkarg wrote:
| Uhm nope, but here's a ~70min talk I gave about it:
| https://github.com/fkarg/things-to-talk-about/blob/master/tr...
|
| Not sure how well the slides can be understood by themselves,
| though I tried to make them accommodating for that.
___________________________________________________________________
(page generated 2023-05-17 23:00 UTC)