[HN Gopher] Ask HN: Can someone ELI5 transformers and the "Atten...
       ___________________________________________________________________
        
       Ask HN: Can someone ELI5 transformers and the "Attention is all we
       need" paper?
        
       I have zero AI/ML knowledge but Steve Yegge on Medium thinks that
       the team behind Transformers deserves a Nobel.  Makes me want to
       better understand this tech.  Edit: thank you for some amazing top
       level responses and links to valuable content on this subject.
        
       Author : Sai_
       Score  : 293 points
       Date   : 2023-05-17 16:43 UTC (6 hours ago)
        
       | olaulaja wrote:
       | Not quite ELI5, and there are a few partially overlapping
       | answers around already, but here goes.
       | 
       | The key part is the attention mechanism, as the title of the
       | paper may have spoiled. It works more or less like this:
       | 
       | - Start with an input sequence X1, X2 ... Xn. These are all
       | vectors.
       | 
       | - Map the input sequence X into 3 new sequences of vectors: query
       | (Q), key (K), and value(V), all of the same length as the input
       | X. This is done using learnable mappings for each of the
       | sequences (so one for X->Q, another for X->K and one for X->V).
       | 
       | - Compare the similarity of every query with every key. This
       | gives you a weight for each query/key pair. Call them W(Q1, K2)
       | and so forth.
       | 
       | - Compute output Z as the sum of every _value_ weighted by the
       | weight for the respective query/key pair (so Z1 = V1*W(Q1,K1) +
       | V2*W(Q1,K2) + ... + Vn*W(Q1,Kn), Z2 = V1*W(Q2,K1) +
       | V2*W(Q2,K2) + ...)
       | 
       | - and that's about it!
       | 
       | As throwawaymaths mentions, this is quite similar to a learnable
       | hash table with the notable difference that the value fetched is
       | also changed, so that it doesn't fetch "input at an index like i"
       | but "whatever is important at an index like i".
       | 
       | Now a few implementation details on top of this:
       | 
       | - The description is for a single "attention head". Normally
       | several, each with their own mappings for Q/K/V, are used, so the
       | transformer can look at different "things" simultaneously. 8
       | attention heads seems pretty common.
       | 
       | - The description doesn't take the position in the sequence into
       | account (W(Q1,K1) and W(Q1,Kn) are treated perfectly equally). To
       | account for ordering, "positional encoding" is normally used.
       | Usually this is just adding a bunch of scaled sine/cosine waves
       | to the input. Works surprisingly well.
       | 
       | - The transformer architecture has a number of these "attention
       | layers" stacked one after the other and also 2 different stacks
       | (encoder, decoder). The paper is about machine translation, so
       | the encoder is for the input text and the decoder for the output.
       | Attention layers work just fine in other configurations as well.
       | 
       | The rest of the architecture is fairly standard stuff
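       | 
       | A minimal sketch of a single attention head along those lines,
       | in PyTorch (the sqrt scaling is from the paper; the shapes and
       | random data are just for illustration):
       | 
       |     import torch
       | 
       |     def attention_head(X, Wq, Wk, Wv):
       |         # X: (n, d) input sequence of n vectors
       |         # Wq, Wk, Wv: learnable mappings X->Q, X->K, X->V
       |         Q, K, V = X @ Wq, X @ Wk, X @ Wv
       |         # similarity of every query with every key,
       |         # scaled and normalized into weights W(Qi, Kj)
       |         W = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5,
       |                           dim=-1)
       |         # each Zi is the values weighted by W(Qi, Kj)
       |         return W @ V
       | 
       |     n, d = 4, 8
       |     X = torch.randn(n, d)
       |     Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
       |     Z = attention_head(X, Wq, Wk, Wv)   # shape (4, 8)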
        
       | vikp wrote:
       | Transformers are about converting some input data (usually text)
       | to numeric representations, then modifying those representations
       | through several layers to generate a target representation.
       | 
       | In LLMs, this means going from prompt to answer. I'll cover
       | inference only, not training.
       | 
       | I can't quite ELI5, but the process is roughly:
       | 
       | - Write a prompt.
       | 
       | - Convert each token in the prompt (roughly a word) into
       | numbers. So "the" might map to the number 45.
       | 
       | - Get a vector representation of each word - go from 45 to
       | [.1, -1, -2, ...]. These vector representations are how a
       | transformer understands words.
       | 
       | - Combine the vectors into a matrix, so the transformer can
       | "see" the whole prompt at once.
       | 
       | - Repeat the following several times (once for each layer):
       | 
       |   - Multiply the vectors by the other vectors. This is
       |   attention - it's the magic of transformers that enables
       |   combining information from multiple tokens together. This
       |   generates a new matrix.
       | 
       |   - Feed the matrix into a linear regression. Basically
       |   multiply each number in each vector by another number, then
       |   add them all together. This will generate a new matrix, but
       |   with "projected" values.
       | 
       |   - Apply a nonlinear transformation like relu. This helps
       |   model more complex functions (like text input -> output!)
       | 
       | Note that I really oversimplified the last few steps, and the
       | ordering.
       | 
       | At the end, you'll have a matrix. You then convert this back into
       | numbers, then into text.
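       | 
       | A toy sketch of that pipeline in PyTorch (the vocabulary,
       | sizes, and layer internals here are made up purely for
       | illustration; a real model has separate weights per layer):
       | 
       |     import torch
       |     import torch.nn as nn
       | 
       |     vocab = {"the": 45, "cat": 7, "sat": 12}  # word -> id
       |     words = ["the", "cat", "sat"]
       |     tokens = torch.tensor([vocab[w] for w in words])
       | 
       |     # one vector per token, shape (seq=3, batch=1, emb=8)
       |     embed = nn.Embedding(num_embeddings=100, embedding_dim=8)
       |     x = embed(tokens).unsqueeze(1)
       | 
       |     attn = nn.MultiheadAttention(embed_dim=8, num_heads=1)
       |     proj = nn.Linear(8, 8)
       |     for _ in range(2):           # once per "layer"
       |         x, _ = attn(x, x, x)     # attention mixes the tokens
       |         x = torch.relu(proj(x))  # projection + nonlinearity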
        
         | throwawaymaths wrote:
         | I don't think this description of attention is correct.
        
           | vikp wrote:
           | You mean "Multiply the vectors by the other vectors. This is
           | attention - it's the magic of transformers, that enables
           | combining information from multiple tokens together. This
           | generates a new matrix."?
           | 
            | It's really oversimplified, as I mentioned. A more
            | granular look is:
            | 
            | - Project the vectors with a linear regression. In
            | decoder-only attention (what we usually use), we project
            | the same vectors twice with different coefficients. We
            | call the first projection queries, and the second keys.
            | This transforms the vectors linearly.
            | 
            | - Find the dot product of each query vector against the
            | key vectors (multiply them).
            | 
            | - (training only) Mask out future vectors, so a token
            | can't look at tokens that come after it.
            | 
            | - At this point, you will have a matrix indicating how
            | important each query vector considers each other vector
            | (how important each token considers the other tokens).
            | 
            | - Take the softmax, which both ensures all of the
            | attention values for a vector sum to 1, and penalizes
            | small attention values.
            | 
            | - Use the softmax values to get a weighted sum of tokens
            | according to the attention calc.
            | 
            | - This will turn one vector into the weighted sum of the
            | other vectors it considers important.
           | 
           | The goal of this is to incorporate information from multiple
           | tokens into a single representation.
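            | 
            | Roughly, in code (a sketch of those steps, not an exact
            | decoder implementation):
            | 
            |     import torch
            | 
            |     def causal_attention(x, Wq, Wk):
            |         # project the same vectors twice with different
            |         # coefficients: queries and keys
            |         q, k = x @ Wq, x @ Wk
            |         scores = q @ k.T            # dot products
            |         # mask out future positions
            |         n = x.shape[0]
            |         future = torch.ones(n, n, dtype=torch.bool)
            |         future = torch.triu(future, diagonal=1)
            |         scores = scores.masked_fill(future, float("-inf"))
            |         # softmax: each row sums to 1
            |         weights = torch.softmax(scores, dim=-1)
            |         return weights @ x          # weighted sum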
        
       | cma wrote:
       | From the Yegge post:
       | 
       | > LoRA makes LLMs composable, piecewise, mathematically, so that
       | if there are 10,000 LLMs in the wild, they will all eventually
       | converge on having the same knowledge. This is what Geoffrey
       | Hinton was referring to on his SkyNet tour.
       | 
       | I don't think that's right at all. LoRA freezes most of the
       | large model and wouldn't let you simply combine large models.
       | Instead, I'm pretty sure Hinton is referring to data-parallel
       | training with batching:
       | 
       | > DataParallel (DP) - the same setup is replicated multiple
       | times, and each being fed a slice of the data. The processing is
       | done in parallel and all setups are synchronized at the end of
       | each training step.
       | 
       | https://huggingface.co/docs/transformers/v4.15.0/parallelism
       | 
       | You can have many instances of the model training on different
       | bits of data, and then just average the modified weights back
       | together at the end. This combining of weights is what Hinton
       | means when he says parallel copies of brains can learn things
       | independently and then recombine them later at huge bandwidth,
       | whereas humans are far more limited to sharing separate
       | experiences verbally or with something like a multimedia
       | presentation.
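       | 
       | In code, that weight-averaging idea is roughly the following
       | (just a sketch; real data-parallel training usually averages
       | gradients every step rather than whole models occasionally):
       | 
       |     import copy
       |     import torch
       | 
       |     def average_weights(replicas):
       |         # replicas: identically-shaped models, each trained
       |         # on a different slice of the data
       |         avg = copy.deepcopy(replicas[0])
       |         with torch.no_grad():
       |             for name, p in avg.named_parameters():
       |                 others = [dict(r.named_parameters())[name]
       |                           for r in replicas]
       |                 p.copy_(torch.stack(others).mean(dim=0))
       |         return avg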
        
       | EuropeOverlords wrote:
       | [flagged]
        
       | schizo89 wrote:
       | [flagged]
        
         | dumbfounder wrote:
         | GPT-4 Prompt: Can you ELI5 Transformers and the "Attention is
         | all we need" paper
         | 
         | Sure! Imagine you're in school, and your teacher asks you and
         | your friends to work on a group project. Each of you has a
         | different set of skills, and you need to work together to
         | complete the project successfully. In this analogy, the group
         | project is like a sentence or a piece of text that a language
         | model is trying to understand.
         | 
         | Transformers are a type of model that helps computers
         | understand and generate text, like in our group project. The
         | key idea behind transformers is something called "attention."
         | Attention helps the model figure out which words in a sentence
         | are the most important to focus on, just like you and your
         | friends pay attention to each other's skills to complete your
         | project.
         | 
         | The "Attention is all we need" paper introduced transformers
         | and showed that by using attention, the model can learn how to
         | understand and generate text more effectively. Instead of
         | focusing on each word one at a time, transformers can look at
         | all the words together, decide which ones are the most
         | important, and use that information to understand the text
         | better.
         | 
         | So, in our group project example, transformers help the model
         | work like a team, where everyone pays attention to each other's
         | strengths and helps each other out. This makes the model more
         | powerful and able to do a better job at understanding and
         | generating text.
        
           | mitthrowaway2 wrote:
           | GPT-4 seems to take "explain like I'm five" quite seriously.
        
         | samspenc wrote:
         | I just asked ChatGPT and this is what I got:
         | 
         | "Imagine you have a really smart computer friend that can
         | understand and talk to you just like a person. This computer
         | friend is called a language transformer.
         | 
         | The language transformer is like a super good reader. It reads
         | lots and lots of books, stories, and articles to learn how
         | people talk and write. It pays attention to how words go
         | together and what they mean in different situations. It becomes
         | really good at understanding what people are saying or asking.
         | 
         | Once the language transformer has learned so much, it can help
         | with many things. For example, if you want to write a story, it
         | can help you come up with ideas and write sentences that make
         | sense. If you have a question, it can try to find the answer
         | for you. It can also help translate words from one language to
         | another, like when you want to understand what someone is
         | saying in a different language.
         | 
         | The language transformer is like a big brain that knows a lot
         | about words and how they fit together. It uses all that
         | knowledge to talk and help you with different things."
        
         | ly3xqhl8g9 wrote:
         | Asked to provide some code samples:
         | 
         | """ Imagine you have a sentence: "The cat sat on the mat." In
         | traditional models, each word is processed one by one, and the
         | model might miss important relationships between words that are
         | far apart. Transformers solve this problem using self-
         | attention.
         | 
         | In self-attention, the model pays attention to all the words in
         | the sentence simultaneously, giving more importance to
         | different words depending on their relevance to each other.
         | Here's a code snippet to illustrate self-attention:
          | 
          |   import torch
          |   import torch.nn as nn
          | 
          |   # Assume we have an input sentence with 5 words
          |   input_sentence = torch.tensor([[0.1, 0.2, 0.3, 0.4, 0.5]])
          | 
          |   # Define the self-attention layer
          |   self_attention = nn.MultiheadAttention(embed_dim=5,
          |                                          num_heads=1)
          | 
          |   # Apply self-attention to the input sentence
          |   attention_output, _ = self_attention(input_sentence,
          |                                        input_sentence,
          |                                        input_sentence)
          | 
          |   # The attention_output represents the transformed
          |   # sentence with attention
          |   print(attention_output)
         | 
         | In this code, we use a simple example with one self-attention
         | head. The input sentence is represented as a tensor with shape
         | (1, 5), where each number represents the word's features. We
         | pass the input sentence to the self-attention layer twice,
         | using it as the query, key, and value inputs.
         | 
         | The self-attention layer computes the attention scores between
         | each word in the sentence, determining how much attention each
         | word should receive from the other words. It then combines the
         | input sentence with these attention weights to produce the
         | attention_output.
         | 
         | The attention_output contains the transformed sentence, where
         | each word now carries information from other words based on
         | their importance. This allows the model to capture
         | relationships and dependencies between words effectively.
         | 
         | The "Attention is All You Need" paper introduced this self-
         | attention mechanism as a fundamental building block in the
         | Transformer model. The Transformer architecture then
         | incorporates multiple layers of self-attention, along with
         | other components like positional encoding, to process sequences
         | and achieve impressive results in various NLP tasks.
         | 
         | Remember, this code snippet is just a simplified illustration
         | to help you understand the concept of self-attention. The
         | actual implementation of Transformers is more complex,
         | involving multiple layers, feed-forward networks, and
         | additional optimizations. """
         | 
          | Per the nn.MultiheadAttention documentation [1], the output
          | of print(attention_output) is:
          | 
          |   tensor([[ 0.1756, -0.2273, -0.0787,  0.0383, -0.0779]],
          |          grad_fn=<SqueezeBackward1>)
         | 
          | If you badger ChatGPT it will give you an example with
          | different query, key, and value inputs:
          | 
          |   # Define distinct query, key, and value inputs
          |   query = torch.tensor([[0.2, 0.3, 0.4, 0.5, 0.6]])
          |   key = torch.tensor([[0.3, 0.4, 0.5, 0.6, 0.7]])
          |   value = torch.tensor([[0.4, 0.5, 0.6, 0.7, 0.8]])
          | 
          |   # Apply self-attention to the input sentence using
          |   # distinct inputs
          |   attention_output, _ = self_attention(query, key, value)
          | 
          | outputting:
          | 
          |   tensor([[-0.2294,  0.1336, -0.1445, -0.7275, -0.1232]],
          |          grad_fn=<SqueezeBackward1>)
         | 
          | I then pushed it further to explain how nn.MultiheadAttention
          | works:
         | 
         | """ In PyTorch, nn.MultiheadAttention is a module that
         | implements the multi-head attention mechanism. Multi-head
         | attention is a key component of the Transformer architecture,
         | which is widely used in various natural language processing
         | (NLP) tasks.
         | 
         | The nn.MultiheadAttention module allows you to apply multi-head
         | attention to a set of queries, keys, and values. Here's a
         | simplified explanation of what it does:
         | 
         | Inputs: The module takes three inputs: queries, keys, and
         | values. These inputs are typically tensors that represent the
         | encoded representations of the input sequence.
         | 
         | Linear Projections: Before applying attention, the module
         | performs linear projections on the queries, keys, and values.
         | This step projects the input tensors into different subspaces
         | to capture different aspects of the data. These projections are
         | learnable parameters.
         | 
         | Attention Scores: Next, the module computes attention scores
         | between the queries and the keys. The attention scores measure
         | the relevance or importance of each key with respect to each
         | query. Higher attention scores indicate a stronger relationship
         | between the query and the corresponding key.
         | 
         | Attention Weights: The attention scores are transformed into
         | attention weights by applying a softmax function. The softmax
         | operation normalizes the attention scores, ensuring that the
         | weights sum up to 1 and represent a valid distribution.
         | 
         | Weighted Sum: The attention weights are used to compute a
         | weighted sum of the values. This step aggregates the values
         | based on their importance, as determined by the attention
         | weights. The weighted sum produces the final output of the
         | attention mechanism.
         | 
         | Multi-Head Attention: The module performs the above steps
         | multiple times in parallel, each time using different sets of
         | learned linear projections. These parallel attention operations
         | are called "attention heads." The outputs of the attention
         | heads are concatenated and linearly transformed to obtain the
         | final output of the nn.MultiheadAttention module.
         | 
         | The multi-head attention mechanism allows the model to capture
         | different types of relationships and dependencies in the input
         | sequence. By using multiple attention heads with different
         | linear projections, the model can attend to different parts of
         | the input space and learn diverse representations.
         | 
         | Overall, nn.MultiheadAttention in PyTorch provides a convenient
         | way to incorporate the multi-head attention mechanism into your
         | neural network architecture, facilitating effective modeling of
         | complex dependencies in sequence data. """
         | 
         | [1]
         | https://pytorch.org/docs/stable/generated/torch.nn.Multihead...
        
       | PaulHoule wrote:
       | There is no Nobel prize for math or computer science, though...
       | 
       | I worked on a few projects that were trying to develop foundation
       | models for health care, aviation, and other domains. In
       | particular I trained an LSTM model to write fake abstracts for
       | clinical case reports.
       | 
       | We ran into many problems, but maybe the worst one with the LSTM
       | is that a real document repeats itself a lot. For instance,
       | somebody's name might turn up multiple times and the LSTM was
       | very bad at that kind of copying. The LSTM community was arguing
       | about solutions to this problem, but the attention mechanism in
       | transformers makes it easy.
        
         | eternalban wrote:
         | Good overview of the two approaches:
         | 
         | https://medium.com/analytics-vidhya/why-are-lstms-struggling...
        
         | abudabi123 wrote:
          | One prize for math is the Fields Medal, and for CS the
          | Turing Award... perhaps there should be a JMC prize for
          | advances in A.I.
          | 
          | https://en.wikipedia.org/wiki/Fields_Medal
          | 
          | https://en.wikipedia.org/wiki/Turing_Award
          | 
          | https://en.wikipedia.org/wiki/John_McCarthy_(computer_scient...
        
       | data_maan wrote:
       | What is ELI5?
       | 
       | The idea behind the Transformer is nice - but by far not Nobel
       | prize deserving.
       | 
       | Don't believe the hype or people like Yegge, whoever that is - in
       | a few years a new architecture will be the "Nobel candidate".
       | 
       | Also, the original Transformer paper, if you read it, is
       | horribly written.
        
         | inconceivable wrote:
         | explain like i'm 5 (years old)
        
         | kenjackson wrote:
          | Nobel prize for the neural network or perceptron maybe?
        
       | garbagecoder wrote:
       | Thanks to everyone in this thread for all the great links and
       | explanations!!
        
       | goy wrote:
       | I'd like to read the article. What are the required maths?
        
       | QuantumGood wrote:
       | The "Attention is All You Need" paper introduced a new way for AI
       | to read and understand language, much like how we might read a
       | comic book.
       | 
       | As you read each panel of a comic book, you don't just look at
       | the words in the speech bubbles, but you also pay attention to
       | who's talking, what they're doing, and what happened in the
       | previous panels. You might pay more attention to some parts than
       | others. This is sort of like what the Transformer model does with
       | text.
       | 
       | When the Transformer reads a sentence, it doesn't just look at
       | one word at a time. It looks at all the words at once, and
       | figures out which ones are most important to understand each
       | other. This is called "attention." For example, in the sentence
       | "The cat, which is black, sat on the mat," the Transformer model
       | would understand that "cat" is connected to "black" and "sat on
       | the mat."
       | 
       | The "attention" part is very helpful because, like in a comic
       | book, understanding one part of a sentence often depends on
       | understanding other parts. This makes the Transformer model
       | really good at understanding and generating language.
       | 
       | Also, because the Transformer pays attention to all parts of the
       | sentence at the same time, it can be faster than other models
       | that read one word at a time. This is like being able to read a
       | whole page of your comic book at once, instead of having to read
       | each panel one by one.
        
         | QuantumGood wrote:
         | Explaining it for a slightly older audience, a transformer is a
         | type of artificial neural network designed for processing
         | sequences, like sentences in a text. It's especially known for
         | its use in natural language processing (NLP), which is the
         | field of AI that deals with understanding and generating human
         | language.
         | 
         | The Transformer is unique because it uses a mechanism called
         | "attention" to understand the relationships between words in a
         | sentence, which works like this:
         | 
         | (1) Encoding: First, the Transformer turns each word in a
         | sentence into a list of numbers, called a vector. These vectors
         | capture information about the word's meaning.
         | 
         | (2) Self-Attention: Next, for each word, the Transformer
         | calculates a score for every other word in the sentence. These
         | scores determine how much each word should contribute to the
         | understanding of the current word. This is the "attention"
         | part. For example, in the sentence "The cat, which is black,
         | sat on the mat," the words "cat" and "black" would get high
         | scores when trying to understand the word "black" because they
         | are closely related.
         | 
         | (3) Aggregation: The Transformer then combines the vectors of
         | all the words, weighted by their attention scores, to create a
         | new vector for each word. This new vector captures both the
         | meaning of the word itself and the context provided by the
         | other words in the sentence.
         | 
         | (4) Decoding: Finally, in a task like translation, the
         | Transformer uses the vectors from the encoding phase to
         | generate a sentence in the target language. It again uses
         | attention to decide which words in the original sentence are
         | most relevant for each word it's trying to generate in the new
         | sentence.
         | 
         | One key advantage of the Transformer is that it can calculate
         | the attention scores for all pairs of words at the same time,
         | rather than one at a time like previous models. This allows it
         | to process sentences more quickly, which is important for large
         | tasks like translating a whole book.
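          | 
          | A tiny made-up numeric example of steps (2) and (3), just to
          | show the mechanics (the numbers mean nothing):
          | 
          |     import torch
          | 
          |     # made-up vectors for "cat", "black", "mat" (step 1)
          |     vectors = torch.tensor([[1.0, 0.0],
          |                             [0.8, 0.6],
          |                             [0.0, 1.0]])
          |     # made-up scores for understanding "black" (step 2)
          |     scores = torch.tensor([2.0, 3.0, 0.5])
          |     weights = torch.softmax(scores, dim=0)
          |     # step 3: blend the vectors with those weights to get
          |     # the new, context-aware vector for "black"
          |     print(weights @ vectors)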
        
           | rosebay wrote:
           | [dead]
        
       | maxdoop wrote:
       | You guys are cracking me up lol most comments here must have the
       | smartest damn 5-year olds in mind!
       | 
       | "It's simple ! Just tokenize the context and allow differentiated
       | hash maps to best map the embeddings matrix. Duh!"
        
         | detrites wrote:
         | Explain Like I'm Five... _Years Into A Doctorate_
        
       | amelius wrote:
       | Simplest explanation: you start with a fully connected network of
       | N layers. This already gives you the functionality. Then you
       | remove a lot of unnecessary connections. This gives you the
       | performance.
        
       | hackandthink wrote:
       | Cosma Shalizi explains it very well.
       | 
       | "Again: Calling this "attention" at best a joke."
       | 
       | http://bactra.org/notebooks/nn-attention-and-transformers.ht...
        
         | MichaelRazum wrote:
          | Thanks a lot for sharing. I liked the explanation, although
          | it seems there isn't a perfect match between attention and
          | kernels (as far as I understand). As I understand it, x_o =
          | Qx and x_i = Qx and y_i = Vx, but that doesn't map to Wu dot
          | Wv.
          | 
          | Anyway, just curious: do you or someone else have more
          | sources in this format?
          | 
          | At least this way it is very helpful to think about the
          | methods. I kind of agree that the formula does look very,
          | very similar to well-known methods. But on the other hand,
          | the author doesn't explain transformers in a similarly
          | simple way, since it is not obvious why you could stack
          | kernels and get better results.
        
       | Kiro wrote:
       | Has ELI5 lost its literal meaning? What five-year-old would
       | understand anything in this thread?
        
         | krapp wrote:
         | It was never meant to be taken literally, it's just an
         | idiomatic way of asking that something be explained as simply
         | as possible.
         | 
         | An answer to OP's question targeted at a five year old level
         | would likely be impossible given the subject matter, or else
         | need to be so general and simplistic as to be useless on this
         | forum of not actual five year olds.
        
           | Izkata wrote:
           | The current top comment managed it in a way that is actually
           | useful: https://news.ycombinator.com/item?id=35981106
        
             | krapp wrote:
             | I don't know. It's simple and straightforward, but still
             | seems like it would go over the head of an actual five year
             | old.
        
       | runetech wrote:
       | Not sure I can give an ELI5 explanation, but for me these two
       | short videos on the topic are the closest to what you are
       | after. At least they helped me get up to speed fast.
       | 
       | Hope they will do the same for you ;-)
       | 
       | Large Language Models from scratch
       | https://www.youtube.com/watch?v=lnA9DMvHtfI
       | 
       | Large Language Models: Part 2
       | https://www.youtube.com/watch?v=YDiSFS-yHwk
        
       | probably_wrong wrote:
       | I'll throw my hat in the ring.
       | 
       | A transformer is a type of neural network that, like many
       | networks before, is composed of two parts: the "encoder" that
       | receives a text and builds an internal representation of what the
       | text "means"[1], and the "decoder" that uses the internal
       | representation built by the encoder to generate an output text.
       | Let's say you want to translate the sentence "The train is
       | arriving" to Spanish.
       | 
       | Both the encoder and decoder are built like Lego, with identical
       | layers stacked on top of each other. The lowest level of the
       | encoder looks at the input text and identifies the role of
       | individual words and how they interact with each other. This is
       | passed to the layer above, which does the same but at a higher
       | level. In our example it would be as if the first layer
       | identified that "train" and "arrive" are important, then the
       | second one identifies that "the train" and "is arriving" are core
       | concepts, the third one links both concepts together, and so on.
       | 
       | All of these internal representations are then passed to the
       | decoder (all of them, not just the last ones) which uses them to
       | generate a single word, in this case "El". This word is then fed
       | back to the decoder, that now needs to generate an appropriate
       | continuation for "El", which in this case would be "tren". You
       | repeat this procedure over and over until the transformer says
       | "I'm done", hopefully having generated "El tren esta llegando" in
       | the process.
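       | 
       | That generate-and-feed-back loop, sketched with a hypothetical
       | model object (greedy decoding; real systems typically use beam
       | search or sampling):
       | 
       |     import torch
       | 
       |     def translate(model, source_tokens, end_id, max_len=50):
       |         # the encoder runs once over the source sentence
       |         memory = model.encode(source_tokens)
       |         output = [model.start_id]       # a "start" marker
       |         for _ in range(max_len):
       |             # the decoder sees everything generated so far
       |             # plus the encoder's representations
       |             logits = model.decode(torch.tensor(output), memory)
       |             next_id = int(logits[-1].argmax())
       |             if next_id == end_id:       # "I'm done"
       |                 break
       |             output.append(next_id)
       |         return output[1:]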
       | 
       | The attention mechanism already existed before transformers,
       | typically coupled with an RNN. The key concept of the transformer
       | was building an architecture that removed the RNN completely. The
       | negative side is that it is a computationally inefficient
       | architecture as there are plenty of n^2 operations on the length
       | of the input [2]. Luckily for us, a bunch of companies started
       | releasing for free giant models trained on lots of data,
       | researchers learned how to "fine tune" them to specific tasks
       | using way less data than what it would have taken to train from
       | scratch, and transformers exploded in popularity.
       | 
       | [1] I use "mean" in quotes here because the transformer can only
       | learn from word co-occurrences. It knows that "grass" and "green"
       | go well together, but it doesn't have the data to properly say
       | why. The paper "Climbing towards NLU" is a nice read if you care
       | about the topic, but be aware that some people disagree with this
       | point of view.
       | 
       | [2] The transformer is less efficient than an LSTM in the total
       | number of operations but, simultaneously, it is easier to
       | parallelize. If you are Google this is the kind of problem you
       | can easily solve by throwing a data center or two at the problem.
        
         | jimbokun wrote:
         | > The negative side is that it is a computationally inefficient
         | architecture as there are plenty of n^2 operations on the
         | length of the input
         | 
         | Is this the reason for the limited token windows?
        
       | AlphaWeaver wrote:
       | This link was posted here recently, and was the most
       | understandable explanation I've found so far:
       | https://e2eml.school/transformers.html
        
         | dang wrote:
         | Discussed here:
         | 
         |  _Transformers from Scratch (2021)_ -
         | https://news.ycombinator.com/item?id=35697627 - April 2023 (46
         | comments)
         | 
         |  _Transformers from Scratch_ -
         | https://news.ycombinator.com/item?id=29315107 - Nov 2021 (17
         | comments)
        
       | Analemma_ wrote:
       | It helps to start with recurrent neural networks first, since
       | those were the previous standard way of doing next-token-
       | prediction. They worked, but training them was extremely slow
       | because it couldn't be parallelized. Transformers are a way of
       | getting mostly the same capabilities as RNNs but with a
       | parallelizable architecture so you can actually train it with
       | huge parameter numbers in a reasonable amount of time.
        
         | danieldk wrote:
         | I agree it's very useful to look at RNNs. Even more because
         | attention mechanisms were already getting popular in LSTM/GRU
          | models before the transformers paper. One of the main
          | insights of the transformer paper is that you can do with
          | just attention and do not need the recurrent part of the
          | network anymore (hence the title "Attention is all you
          | need").
        
       | hprotagonist wrote:
       | https://jalammar.github.io/illustrated-transformer/
        
       | Imnimo wrote:
       | Here is an attempt that sacrifices technical precision for
       | intuition.
       | 
       | We're going to represent words as vectors (a sequence of
       | numbers). We would like it to be the case that the value of the
       | numbers reflects the meaning of the words. Words that mean
       | similar things should be near each other. We also want to
       | represent higher level ideas, ideas that take multiple words to
       | express, in the same way. You can think of all the possible
       | vectors as the entire space of ideas.
       | 
       | To begin with, though, we just have a vector for each word. This
       | is insufficient - does the word "bank" mean the edge of a river
       | or a place to store money? Is it a noun or a verb? In order to
       | figure out the correct vector for a particular instance of this
       | word, we need to take into account its context.
       | 
       | A natural idea might be to look at the words next to it. This
       | works okay, but it's not the best. In the sentence "I needed some
       | money so I got in my car and took a drive down to the bank", the
       | word that really tells me the most about "bank" is "money", even
       | though it's far away in the sentence. What I really want is to
       | find informative words based on their meaning.
       | 
       | This is what transformers and attention are for. The process
       | works like this: For each word, I compose a "query" - in hand-
       | wavy terms, this says "I'm looking for any other words out there
       | that are X". X could be "related to money" or "near the end of
       | the sentence" or "are adjectives". Next, for each word I also
       | compute a "key", this is the counterpart of the query, and says
       | "I have Y". For each query, I compare it to all the keys, and
       | find which ones are most similar. This tells me which words
       | (queries) should pay attention to which other words (keys).
       | Finally, for each word I compute a "value". Whereas the "key" was
       | sort of an advertisement saying what sort of information the word
       | has, the "value" is the information itself. Under the hood, the
       | "query", "key" and "value" are all just vectors. A query and a
       | key match if their vectors are similar.
       | 
       | So, as an example, suppose that my sentence is "Steve has a green
       | thumb". We want to understand the meaning of the word "thumb".
       | Perhaps a useful step for understanding any noun would be to look
       | for adjectives that modify it. We compute a "query" that says
       | "I'm looking for words near the end of the sentence that are
       | adjectives". When computing a "key" for the word green, maybe we
       | compute "I'm near the end of the sentence, I'm a color, I'm an
       | adjective or a noun". These match pretty well, so "thumb" attends
       | to "green". We then compute a "value" for "green" that
       | communicates its meaning.
       | 
       | By combining the information we got from the word "green" with
       | the information for the word "thumb", we can have a better
       | understanding of what it means in this particular sentence. If we
       | repeat this process many times, we can build up stronger
       | understanding of the whole sentence. We could also have a special
       | empty word at the end that represents "what might come next?",
       | and use that to generate more text.
       | 
       | But how did we know which queries, keys and values to compute?
       | How did we know how to represent a word's meaning as numbers at
       | all? These seemingly impossible questions are what is being
       | "learned". How exactly that happens would require an equally big
       | explanation of its own.
       | 
       | Keep in mind that this explanation is very fuzzy, and is only
       | intended to convey the loose intuition of what is going on. It
       | leaves out many technical details and even gets some details
       | intentionally wrong to avoid confusion.
        
       | jerpint wrote:
       | Let the model decide which words to focus on when producing a
       | response
        
       | senko wrote:
       | Not really an ELI5, but if you really want to understand the
       | topic, I've gathered the best resources I could find to help me
       | do the same: https://blog.senko.net/learn-ai
       | 
       | If you're a programmer, start with Karpathy's video series. For a
       | somewhat gentler intro, take a look at the MIT intro lectures
       | first to build up on the fundamentals.
       | 
       | Then you're ready for The Illustrated Transformer, and
       | afterwards, if you're brave enough, the Annotated Transformer.
       | 
       | It's a fascinating subject, more so when you have a solid grasp!
       | And you'll be able to quickly spot people who kinda stumble
       | their way through but have big gaps in understanding.
        
       | throwawaymaths wrote:
       | The Yannic kilcher review is quite good.
       | 
       | https://youtu.be/iDulhoQ2pro
       | 
       | I can't ELI5 but I can ELI-junior-dev. Tl;dw:
       | 
       | Transformers work by basically being a differentiable lookup/hash
       | table. First your input is tokenized and (N) tokens (this
       | constitutes the attention frame) are encoded both based on token
       | identity _and position_ in the attention frame.
       | 
       | Then there is an NxN matrix that is applied to your attention
       | frame "performing the lookup query" over all other tokens in the
       | attention frame, so every token gets a "contextual semantic
       | understanding" that takes in both all the other stuff in the
       | attention frame and its relative position.
       | 
       | GPT is impressive because the N is really huge _and_ it has many
       | layers. A big N means you can potentially access information
       | farther away. Each layer gives more opportunities to summarize
       | and integrate long range information in a fractal process.
       | 
       | Two key takeaways:
       | 
       | - differentiable hash tables
       | 
       | - encoding relative position using periodic functions
       | 
       | NB: the attention frame tokens are actually K-vectors (so the
       | frame is a KxN matrix) and the query matrix is an NxNxK tensor
       | IIRC but it's easier to describe it this way
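       | 
       | For the second takeaway, a sketch of the sinusoidal positional
       | encoding from the paper (tiny dimensions, just to show the
       | shape of it):
       | 
       |     import torch
       | 
       |     def positional_encoding(n_positions, d_model):
       |         # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
       |         # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
       |         pos = torch.arange(n_positions).unsqueeze(1).float()
       |         i = torch.arange(0, d_model, 2).float()
       |         angles = pos / (10000 ** (i / d_model))
       |         pe = torch.zeros(n_positions, d_model)
       |         pe[:, 0::2] = torch.sin(angles)
       |         pe[:, 1::2] = torch.cos(angles)
       |         return pe
       | 
       |     # added to the token encodings before the first layer
       |     print(positional_encoding(6, 8))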
        
         | sva_ wrote:
         | One thing that might be worth pointing out is that the
         | transformer architecture owes a great deal of its success to
         | the fact that it can be implemented in a way that it can be
         | massively parallelized in a very efficient manner.
        
           | throwawaymaths wrote:
            | Compared to RNNs... maybe? The big NxN is really a killer.
            | 
            | I don't know how to judge parallelizability of different
            | DNN models; you're comparing apples to oranges.
        
             | Salgat wrote:
             | When you train a transformer, you're training what the next
              | expected token is. You can train all positions of the
              | sequence in parallel rather than having to
             | sequentially build up the memory state as you generate the
             | sequence with an LSTM. Mind you the inference portion of a
             | transformer is still sequentially bottlenecked since you
             | don't know what the output sequence is supposed to be.
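              | 
              | A rough sketch of what "train every position in
              | parallel" means, assuming a generic decoder-style
              | model (the model itself is hypothetical here):
              | 
              |   import torch
              |   import torch.nn.functional as F
              | 
              |   # tokens: (batch, seq) ids; model(x) returns
              |   # (batch, seq, vocab) logits in ONE forward
              |   # pass, so every position trains at once
              |   def next_token_loss(model, tokens):
              |       logits = model(tokens[:, :-1])
              |       targets = tokens[:, 1:]   # shifted by one
              |       return F.cross_entropy(
              |           logits.reshape(-1, logits.size(-1)),
              |           targets.reshape(-1))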
        
               | saynay wrote:
               | I believe the other big thing is it allowed you to
               | parallelize the training as well, so you could split
               | those colossal training sets across many machines in a
               | way you could not do with LSTMs.
        
               | throwawaymaths wrote:
               | I don't really believe this because in theory it's
               | possible to split up training of arbitrary neural nets
               | using decoupled neural interfaces (not that it's easy
                | mind you) but at the point where you have OpenAI-scale
               | resources the bottleneck to achieving that is the
               | difficulty of Python.
        
         | rekttrader wrote:
         | I came here to post this video. It's a great primer on the
         | topic and it gives you ideas to prompt gpt and have it output
         | more.
         | 
         | It's how I got an understanding of beam search, a technique
         | employed in some of the response building.
        
         | zorr wrote:
         | How does N relate to the number of parameters that is
         | frequently mentioned?
        
           | throwawaymaths wrote:
            | In my screed, N is the attention width (how many tokens it
            | looks at at a time). The number of parameters is
            | O(KxNxNxL), where K is the vector size of your tokens and L
            | is the # of layers. There are other parameters floating
            | around, like in the encoder and decoder matrices, but the
            | NxN matrix dominates.
        
             | 8thcross wrote:
             | This is an awesome explanation. You guys are the real
             | heroes
        
         | legalizemoney wrote:
         | Having read the paper myself, I'm impressed with the quality of
         | your explanation. Well done!
        
         | ogoparootbbo2 wrote:
         | what is a differentiable hash table? I understand
         | differentiation, I don't understand differential hash table ...
         | does that mean for every tiny gradient of a key, a
         | corresponding gradient in value can be expected?
        
         | kenjackson wrote:
         | What does it mean for a lookup/hash table to be differentiable?
        
           | theGnuMe wrote:
           | The lookup/hashtable can be viewed as a matrix which is
           | accessed by multiplying it with a vector.
        
           | meowkit wrote:
           | Going to go out on a limb and say they are probably referring
           | to the gradient calculus required for updating the model.
           | 
           | https://en.wikipedia.org/wiki/Differentiable_programming
           | 
           | See automatic differentiation.
        
             | throwawaymaths wrote:
             | Correct, but note that if you subject a standard hash table
             | algo to AD it won't magically become a transformer. (Hashes
             | in the "normal construction" are discrete functions and
             | thus aren't really continuous or differentiable, neither
             | are lookup tables)
        
           | tomp wrote:
           | I'm not a ML expert but I know a bit about math.
           | 
           | It's "differentiable" in the same way that e.g. the "jump
           | function" ( _Heaviside step function_ ) is differentiable
           | (not as a _function_ from real numbers to real numbers, but
           | as a _distribution_ ). It's derivative is the "point impulse
           | function" ( _Dirac delta function_ ), which, again, is a
           | _distribution_ , not a real function.
           | 
           | Distributions are nicely defined in math, but can't really be
           | operated with numerically (at least not in the same way as
           | real/float functions), but you can approximate them using
           | continuous functions. So instead of having a function jump
           | from 0 to 1, you "spread" the jump and implement it as a
           | continuous transition from e.g. `0-epsilon` to `0+epsilon`
           | for some tiny epsilon. Then you can differentiate it as
           | usual, even numerically.
           | 
           | Similarly, hash table lookup is a dis-continuous function -
           | the result of `hash.get(lookup)` is just `value` (or `null`).
           | To make it continuous, you "spread" the value, so that nearby
           | keys (for some definition of "nearby") will return nearby
           | values.
           | 
            | One way to do this is to use the scalar product between
            | `lookup` and _all_ keys in the hashtable (normalized, the
            | scalar product is close to 1 if the arguments are "nearby"),
            | and use the result as weights to multiply with _all_ values
            | in the hashtable. That's what the transformer does.
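            | 
            | A small sketch of that "soft lookup" (plain PyTorch,
            | nothing transformer-specific):
            | 
            |     import torch
            | 
            |     def soft_lookup(query, keys, values):
            |         # keys: (n, d), values: (n, dv), query: (d,)
            |         weights = torch.softmax(keys @ query, dim=0)
            |         return weights @ values   # blend of all values
            | 
            |     keys = torch.randn(5, 4)
            |     values = torch.randn(5, 3)
            |     out = soft_lookup(torch.randn(4), keys, values)
            | 
            | Unlike a dict lookup, nearby queries now give nearby
            | outputs, which is what makes the whole thing
            | differentiable.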
        
           | visarga wrote:
           | It's not really doing hashing, which is random, but instead
           | it makes meaningful connections, like connecting an adjective
           | to its noun, or a pronoun to a previous reference of the
           | name. That is why it is called "Attention is all you need"
           | and not "Hashing is all you need". Attention includes some
           | meaning, it is attentive to some aspect or another.
        
             | throwawaymaths wrote:
             | Hashing does not have to be random, it just has to compress
             | the domain. Modulo number is a perfectly valid hash for
             | some use cases.
             | 
             | Edit: just looked it up, doesn't even have to compress the
             | domain, identity hash is a thing.
        
           | taylorius wrote:
           | Differentiable in this context, means that it can be learned
           | by gradient descent, which uses derivatives to adjust a
           | neural network model's parameters to minimise some error
           | measure. As for how that applies to a hash function, I think
           | the lookup gives some sort of weighting for each possible
           | output, for each input, with the largest weightings
           | corresponding to a "match". But tbh I'm not certain on that
           | last part...
        
             | throwawaymaths wrote:
             | > As for how that applies to a hash function, I think the
             | lookup gives some sort of weighting for each possible
             | output, for each input, with the largest weightings
             | corresponding to a "match"
             | 
             | Note that in general one "matches" more than one thing in
             | which case (IIRC) the result will be a weighted linear
             | combination of the query results.
             | 
             | I do think that this collapses to a datastructure
             | equivalent in function to the "normal hash table" (but not
             | in performance, lol) in the degenerate case where every
             | input yields strictly a single result with zero activation
             | in all other results, so it's not invalid to call it a
             | differentiable lookup table.
        
           | throwawaymaths wrote:
           | You can take a (calculus) derivative of the mathematical
           | construct that represents the lookup table. And yeah, it is
           | very not obvious how a "lookup/hash table" could be
           | differentiable, based on how it's implemented say, in a
           | leetcode exercise. That's part of the genius of the
           | transformer.
        
           | contravariant wrote:
           | Well, a hash table _is_ basically a function.
        
             | srcreigh wrote:
             | Not a continuous function
        
           | weinzierl wrote:
            | I wanted to ask the same, and especially I've always been
            | wondering: how is the meaning of the aforementioned
            | 'differentiable' related to the same term in math?
        
             | kccqzy wrote:
             | Same thing. You need everything to be differentiable in
             | order to run gradient descent. To first approximation,
             | training a neural network is just gradient descent.
        
             | dcre wrote:
             | Not an expert (so this could well be slightly off), but
             | here is a physics analogy:
             | 
             | You're rolling a ball down a plank of wood and you want it
             | to roll off the plank and continue to a certain spot on the
             | ground. You know that if the ball rolls past the target,
             | you have to angle the ramp lower. If it fails to reach the
             | target, you angle the ramp higher. If it goes past by a
             | lot, you make a big change to the ramp angle. If it goes
             | past by a little, you make a small change to the ramp
             | angle. In this way your error tells you something about the
             | adjustment you need to make to the system to produce the
             | desired output.
             | 
             | Think of a function from the ramp angle (network weights)
             | to how far the ball lands from the target (the error). The
             | derivative of this function tells you what kind of change
             | in angle results in what kind of change in the error, and
             | you can use that to update the ramp until you hit the
             | target. To say that the model is differentiable is to say
             | that it's possible to infer from the error what kind of
             | changes you need to make to the weights.
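              | 
              | The ramp analogy, written as a few lines of PyTorch
              | (a toy, not a real physics model):
              | 
              |   import torch
              | 
              |   # the ramp angle plays the role of the weights
              |   angle = torch.tensor(30.0, requires_grad=True)
              |   target = 2.0    # where we want the ball to land
              | 
              |   for _ in range(100):
              |       landing = 0.1 * angle   # toy "physics"
              |       error = (landing - target) ** 2
              |       error.backward()        # d(error)/d(angle)
              |       with torch.no_grad():
              |           # nudge the angle to shrink the error
              |           angle -= 5.0 * angle.grad
              |           angle.grad.zero_()
              |   # angle settles near the value that hits the target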
        
             | visarga wrote:
             | The neural net is just a math function, continuous even,
             | fully differentiable in all input points. In order to
             | "learn" anything we compute gradients towards the function
             | parameters. They get "nudged" slightly towards a better
             | response, and we do this billions of times. It's like
             | carving a raw stone block into a complex scene. If you put
             | your data into the system it flows towards the desired
             | output because the right path has been engraved during
             | training.
             | 
             | <rant> This explains a bit how neural nets work but from
             | this to chatGPT is another whole leap. You'd have to assign
             | some of the merits of the AI to the training data itself,
             | it's not just the algorithm for learning, but what is being
             | learned that matters. The neural net is the same, but using
             | 1T tokens of text is making it smart. What's so magic about
             | this data that it can turn a random init into a language
              | programmable system? And the same language information
              | makes a baby into a modern human, instead of just another
              | animal.
             | </>
        
               | bjourne wrote:
                | No, it's the _loss function_ we differentiate. The
                | inputs to the loss function are the network weights.
                | The inputs to the network are samples, and those we do
                | _not_ differentiate.
        
               | lhnz wrote:
               | While it's true that we don't differentiate the input
                | samples, we do differentiate the loss function's output
                | with respect to each of the network weights. We use the
               | chain rule to calculate each of these "gradients" and
               | that process is known as backpropagation.
        
         | dpcx wrote:
          | I appreciate the explanation, but I don't know which junior
          | dev would understand most of this. I may be just a web
          | developer, but I couldn't understand most of this. I'd still
          | have to read for 30m to grok it all.
        
           | throwawaymaths wrote:
           | Yeah sorry, it still requires math and probably some exposure
           | to ML basics.
        
             | shadowgovt wrote:
              | I think one hole in the simplified description is that
              | "differentiable" is not an adjective that applies to hash
              | tables.
             | 
             | Differentiable relative to what? What is (x) in the
             | d(hashtable)/d(x) equation?
        
               | throwawaymaths wrote:
               | That's exactly the point, though! It's surprising. A
               | hashtable is a map from keys to values. Making it
               | differentiable means that a small change in the key also
               | makes a small change in the value!
        
               | lhnz wrote:
               | I think it applies because lookups can be done by
               | multiplying one-hot vectors (or masks) with matrices;
               | this is roughly analogous to what happens when we
               | multiply Q with K^T in a self-attention head.
               | 
               | Read this:
               | https://e2eml.school/transformers.html#table_lookup
               | 
               | And then read:
               | https://e2eml.school/transformers.html#attention
               | 
                | Matrix multiplication is differentiable as it is
                | continuous. Therefore you can calculate partial
                | derivatives of these operations. The ability to do
                | that is what allows gradient descent optimization via
                | the chain rule.
                | 
                | > Differentiable relative to what?
                | 
                | > What is (x) in the d(hashtable)/d(x) equation?
                | 
                | I think the derivative we actually care about is
                | `∂L/∂(hashtable-lookup)`, but here the `hashtable-
                | lookup` is actually the matrix multiplication
                | mentioned above. We want to know how much the "loss"
                | (or error) changes with respect to each of our
                | "lookups". Knowing how each of our "lookups" causes
                | our model to err from the output we expect allows us
                | to optimise it.
               | 
               | Note: it's not really a "lookup" in the normal sense of
               | the word. Hashtable lookups are discontinuous since we
               | either have a value for a particular key or we do not.
               | Our matrix multiplication is sort of like a
               | probabilistic, continuous form of lookup in which nearby
                | keys get partially activated based on their similarity
                | to the query.
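                | 
                | A tiny illustration of that hard vs. softened
                | lookup (the numbers are arbitrary):
                | 
                |   import torch
                | 
                |   table = torch.tensor([[1., 2.],   # row 0
                |                         [3., 4.],   # row 1
                |                         [5., 6.]])  # row 2
                | 
                |   hard = torch.tensor([0., 1., 0.])  # one-hot
                |   print(hard @ table)   # exactly row 1: [3., 4.]
                | 
                |   scores = torch.tensor([0.1, 2.0, 0.3])
                |   soft = torch.softmax(scores, dim=0)
                |   print(soft @ table)   # mostly row 1, blended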
        
           | KyeRussell wrote:
           | [dead]
        
         | jimkleiber wrote:
         | I had to look up "tl;dw" and realized it meant "too long;
         | didn't watch" and not my first AI-laden instinct of "too long,
         | didn't write" :-D
        
       | [deleted]
        
       | ftxbro wrote:
       | I have some background but I'm probably not the best person in
       | the world to explain.
       | 
       | The important thing about the transformers model is that it's the
       | first one we have found which keeps unlocking more and more
       | powerful and general cognitive abilities the more resources we
       | throw at it (parameters, exaflops, datasets). I saw some
       | interview with Ilya Sutskever where he says this; it almost
       | certainly won't be the last or best one, but it was the first
       | one.
       | 
       | --
       | 
       | Why was it the first one? How were these guys so clever and other
       | ones couldn't figure it out?
       | 
       | OK so first you need some context. There is a lot of 'Newton
       | standing on the shoulders of giants' going on here. If all of
       | these giants were around in the 1970s, it probably would have
       | been invented then. Heck for all we know something as good _was_
       | invented in the 1970s but our computers were too smol to benefit
       | from it. This is what John Carmack is currently looking into.
       | 
       | To really notice the scaling benefits of the transformer
       | architecture, they needed to run billion parameter transformer
       | models on linear-algebra-accelerating GPU chips using
       | differentiable programming frameworks. These are some of the
       | giants we are standing on. The research and development pipeline
       | for these amazing GPUs like [thousands of tech companies -> ASML
       | -> TSMC -> NVIDIA] didn't exist until not so long ago. The
       | special properties of transformers wouldn't have been discovered
       | so soon without this hardware stack.
       | 
       | Another giant we are standing on is the differentiable
       | programming linear algebra libraries and frameworks similar to
       | theano or tensorflow or pytorch or jax. They have had things like
       | this under the name 'mathematical programming' like CPLEX but it
       | wasn't as accessible. 'Differentiable programming' is a newish
       | terminology for what used to be called 'automatic
       | differentiation' where 'differentiation' means essentially the
       | same as calculus derivative. Informally it means that these
       | libraries can predict any tiny output effect of any tiny input
       | change as a computationally cheap side-effect of computing the
       | given output, even for complicated calculations. This capability
       | makes optimization easier, in particular it generalizes the
       | 'backpropagation' algorithm of traditional artificial neural
       | networks.
       | 
       | --
       | 
       | What is the transformer model in more nerdy terms.
       | 
       | At one level, it's just a complicatedly parameterized function,
       | where you can fit the parameters by training on data. This
       | viewpoint puts the importance on the computational power applied
       | to training the model with the advantage of differentiable
       | programming. Some will probably guess that the details of the
       | model architecture don't really matter as long as it has
        | a sickening amount of parameters and exaflops and dataset. Some
       | version of this viewpoint is probably true in my opinion.
       | 
       | More specifically, the transformer architecture is like a chain
       | of black box differentiable 'soft' lookup tables. The soft
       | queries and keys and values are each lists of floating point
       | numbers (for example a single soft query is a list of numbers,
       | called a vector) and these vectors are stacked into matrices and
       | the soft lookup is processed quickly with fast matrix
       | multiplication tricks. Importantly, all of this is happening
       | inside of a differentiable programming framework which lets you
       | cheaply answer questions about how any small change to the input
       | will affect the output. This capability is used for training, by
       | making trillions of billions of tiny changes to the floating
       | point numbers in the multiplication matrices in the boxes. At the
       | end, the fully trained chain of black box functions can be used
       | to compute a probability distribution over the next token in the
       | message, which lets you generate messages or translate between
       | languages or whatever.
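        | 
        | If you want to see that "cheap derivative as a side effect"
        | idea in code, here is a minimal PyTorch sketch (the function is
        | arbitrary, purely for illustration): one backward pass gives
        | the sensitivity of the output to every input at once.
        | 
        |     import torch
        | 
        |     x = torch.randn(5, requires_grad=True)    # five inputs
        |     y = (x ** 2).sum() + torch.sin(x).prod()  # some function
        |     y.backward()                    # one cheap extra pass
        |     print(x.grad)                   # d(y)/d(x_i) for every i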
        
       | FranklinMaillot wrote:
       | Those Computerphile videos[0] by Rob Miles helped me understand
       | transformers. He specifically references the "Attention is all
       | you need" paper.
       | 
        | And for a deeper dive, Andrej Karpathy has this hands-on
        | video[1] where he builds a transformer from scratch. You can
        | check out his other videos on NLP as well; they are all
        | excellent.
       | 
       | [0] https://youtu.be/rURRYI66E54, https://youtu.be/89A4jGvaaKk
       | 
       | [1] https://youtu.be/kCc8FmEb1nY
        
       | WithinReason wrote:
       | Well here is my (a bit cynical) take on it.
       | 
       | In the beginning, there was the matrix multiply. A simple neural
       | network is a chain of matrix multiplies. Let's say you have your
       | data A1 and weights W1 in a matrix. You produce A2 as A1xW1. Then
       | you produce A3 as A2xW2, and so on. There are other operations in
       | there like non-linearities (so that you can actually learn
       | something interesting) and fancy batch norms, but let's forget
       | about those for now. The problem with this is, it's not very
       | expressive. Let's say your A1 matrix has just 2 values, and you
       | want the output to be their product. Can you learn a weight
       | matrix that performs multiplication of these inputs? No you
       | can't. Multiplication must be simulated by piecing together
       | piecewise linear functions. To perform multiplication, the weight
       | matrix W would also need to be produced by the network.
       | Transformers do basically that. In the product A*W you replace A
       | with (AxW1), W with (AxW2), and multiply those together:
       | (AxW1)x(AxW2) And then do it once more for good measure:
       | (AxW1)x(AxW2)x(AxW3). Boom, Nobel prize. Now your network can
       | multiply, not just add. OK it's actually a bit more complicated,
       | there is for example a softmax in the middle to perform
       | normalisation, which in general helps during numerical
       | optimisation: softmax((AxW1)x(AxW2))x(AxW3). There are then fancy
       | explanations that try to retrospectively justify this as a
       | "differentiable lookup table" or somesuch nonsense, calling the 3
       | parts "key", "query" and "value", which help make your paper more
       | popular. But the basic idea is not so complicated. A Transformer
       | then uses this operation as a building block (running them in
        | parallel and in sequence) to build giant networks that can do
       | really cool things. Maybe you can teach networks to divide next
       | and then you get the next Nobel prize.
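        | 
        | For what it's worth, that expression can be written out in a
        | few lines of numpy (with the transpose that real attention uses
        | so the shapes line up; everything here is illustrative):
        | 
        |     import numpy as np
        | 
        |     def softmax(x, axis=-1):
        |         e = np.exp(x - x.max(axis=axis, keepdims=True))
        |         return e / e.sum(axis=axis, keepdims=True)
        | 
        |     n, d = 4, 8                  # 4 tokens, 8 features each
        |     A = np.random.randn(n, d)    # the data
        |     W1, W2, W3 = (np.random.randn(d, d) for _ in range(3))
        | 
        |     # Plain layer: fixed weights, so the map is linear in A.
        |     plain = A @ W1
        | 
        |     # "Attention-flavoured" layer: the effective weight matrix
        |     # is itself built from A, so the output contains products
        |     # of inputs with inputs.
        |     attn = softmax((A @ W1) @ (A @ W2).T) @ (A @ W3)
        |     print(plain.shape, attn.shape)   # (4, 8) (4, 8)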
        
       | jvm___ wrote:
       | So, if you got a time machine back to the year 2000 and knew how
       | to code Transformers - could you do it? Your CPU/cloud/parallel
       | processing hardware would be limited, but as I understand it
       | there's nothing revolutionary being done here, just a lot of
          | matrix math that produces results (and we're still working to
          | fully understand why it all happens).
        
         | Izkata wrote:
         | "Matrix math" itself is very old, but I'm guessing what you're
         | referring to here is probably neural networks or Markov chains,
         | both of which are... a half-century to a century old, if I
         | remember right?
         | 
         | Transformers on the other hand are new, less than a decade old.
        
       | jgotti92 wrote:
       | Check this explanation from Karpathy which provides really good
       | intuition https://youtu.be/9uw3F6rndnA
       | 
        | Basically tokens "talk" to each other and say: this is what I
        | have, and this is what I'm looking for.
        
         | saaaaaam wrote:
         | Your summary of this is very good. It's a really simple and
         | intuitive way to express something very complex.
        
       | codedokode wrote:
       | I cannot do ELI5, but can do ELI14 for you.
       | 
       | Transformer is a building block (a part) of a language model.
       | "Language model" is an algorithm that can predict words following
       | given words. For example, you can give a text to a model and get
       | a summary of this text, or an answer to the question in the text,
       | or a translation of the text.
       | 
       | Language models are often made of two parts - encoder and
       | decoder. The encoder reads input text (each word is encoded as a
       | bunch of numbers, for example, as list of 512 floating-point
       | numbers) and produces a "state" (also a large list of numbers)
       | which is expected to encode the meaning of the text. Then the
       | decoder reads the state and produces the output as words (to be
       | exact, as probabilities for every possible word in the dictionary
       | to be at a certain position in the output).
       | 
        | Before Transformers, people tended to use the so-called
        | "recurrent neural network" architecture. With this approach, the
        | encoder processes the text word by word and updates the state
        | after every word:
        | 
        |     state = some initial state
        |     for word in text:
        |         state = model(state, word)
       | 
       | model(...) here is a complicated mathematical function, often
       | with millions of operations and parameters.
       | 
       | As I have written above, after reading the text, the state should
       | encode the meaning of the text.
       | 
       | But it turned out that this approach doesn't scale well with long
       | or complicated texts because the information from beginning of
       | the text gets lost. The model tends to "forget" what it had read
       | before. So a new architecture, "Transformers", was proposed. The
        | difference is that now we give the entire text (each word
        | encoded as a bunch of numbers) to the model:
        | 
        |     state = model(input text)
       | 
       | Now the model processes the text at once. But implementing this
       | naively would result in a very large model with too many
       | parameters that would require too much memory and computing time.
       | So developers used a trick here - most of the time each input
       | word is processed separately from others (as in recurrent model),
       | but there are stages, called "attention" where the words are
       | processed together (and those stages are relatively light), so it
        | looks like this:
        | 
        |     # stage where all text is processed at once
        |     # using a quick algorithm
        |     state1 = attention(input text)
        |     # stage where each part of the state is processed
        |     # independently, with a lot of heavy calculations
        |     state2 = map(some function, state1)
        |     state3 = attention(state2)
        |     state4 = map(some function, state3)
        |     ...
       | 
       | To summarize, in Transformers the model processes the text at
       | once, but we have to employ tricks and split processing into
       | stages to make calculation feasible. Probably that is why some
       | people believe the authors should receive a reward for their
       | work.
       | 
       | I think this explanation is as far as one can get without
       | learning ML.
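        | 
        | To expand the pseudocode above into something runnable, here is
        | a minimal numpy sketch of the same alternation (one attention
        | stage mixing all words, one stage applied to each word
        | separately; weights are reused between stages only to keep the
        | sketch short, and none of the names come from the paper):
        | 
        |     import numpy as np
        | 
        |     def softmax(x, axis=-1):
        |         e = np.exp(x - x.max(axis=axis, keepdims=True))
        |         return e / e.sum(axis=axis, keepdims=True)
        | 
        |     def attention(state, Wq, Wk, Wv):
        |         # "all words at once": every position looks at every
        |         # other position (this is the relatively light stage)
        |         q, k, v = state @ Wq, state @ Wk, state @ Wv
        |         return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
        | 
        |     def per_word(state, W):
        |         # "each word separately": the same heavy function is
        |         # applied to every row independently
        |         return np.maximum(state @ W, 0.0)
        | 
        |     d, n = 16, 5                   # word vector size, length
        |     text = np.random.randn(n, d)   # 5 "words" as numbers
        |     Wq, Wk, Wv, W = (np.random.randn(d, d) for _ in range(4))
        | 
        |     state1 = attention(text, Wq, Wk, Wv)
        |     state2 = per_word(state1, W)
        |     state3 = attention(state2, Wq, Wk, Wv)
        |     state4 = per_word(state3, W)
        |     print(state4.shape)            # (5, 16), one vector/word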
        
         | codedokode wrote:
         | Also I think this thread is a good place to complain about the
         | paper. The model is not described clearly. For example, try to
         | find the size of input data vector for the model in the paper -
          | it is not specified. There is also a misleading phrase:
          | 
          |     All sub-layers in the model, as well as the embedding
          |     layers, produce outputs of dimension d_model = 512.
         | 
         | which makes the reader think that each block (Transformer) gets
         | 512-dimensional vector as input and produces 512 numbers at the
         | output. But this is wrong. 512 numbers is just a single word,
         | not entire text or internal state. I could not understand this
         | from reading just the original paper.
         | 
          | Also, it is not written where the keys, queries and values
          | for attention come from.
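          | 
          | For the record, in self-attention all three are learned
          | linear projections of the same layer input; roughly (a sketch
          | using the paper's d_model = 512 and a per-head size of 64):
          | 
          |     import numpy as np
          | 
          |     X = np.random.randn(5, 512)   # 5 tokens, d_model = 512
          |     W_q, W_k, W_v = (np.random.randn(512, 64)
          |                      for _ in range(3))
          |     Q, K, V = X @ W_q, X @ W_k, X @ W_v
          |     # In the decoder's cross-attention, K and V are instead
          |     # projected from the encoder output.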
        
       | s-xyz wrote:
        | Technically speaking, the breakthrough was also the fact that it
        | allowed the computation to be parallelized. Instead
       | of going word by word in a sequence, and optimizing for the next
       | word, the approach shifted to looking at words independently and
       | then applying the same statistical approach of finding the next
       | word relative to that word or sequence. Then the final outcome
       | was a weighted sum of these independent pieces.
        
       | benjismith wrote:
       | Okay, here's my attempt!
       | 
       | First, we take a sequence of words and represent it as a grid of
       | numbers: each column of the grid is a separate word, and each row
       | of the grid is a measurement of some property of that word. Words
       | with similar meanings are likely to have similar numerical values
       | on a row-by-row basis.
       | 
       | (During the training process, we create a dictionary of all
       | possible words, with a column of numbers for each of those words.
       | More on this later!)
       | 
       | This grid is called the "context". Typical systems will have a
       | context that spans several thousand columns and several thousand
       | rows. Right now, context length (column count) is rapidly
       | expanding (1k to 2k to 8k to 32k to 100k+!!) while the
       | dimensionality of each word in the dictionary (row count) is
       | pretty static at around 4k to 8k...
       | 
       | Anyhow, the Transformer architecture takes that grid and passes
       | it through a multi-layer transformation algorithm. The
       | functionality of each layer is identical: receive the grid of
       | numbers as input, then perform a mathematical transformation on
       | the grid of numbers, and pass it along to the next layer.
       | 
       | Most systems these days have around 64 or 96 layers.
       | 
       | After the grid of numbers has passed through all the layers, we
       | can use it to generate a new column of numbers that predicts the
       | properties of some word that would maximize the coherence of the
       | sequence if we add it to the end of the grid. We take that new
       | column of numbers and comb through our dictionary to find the
       | actual word that most-closely matches the properties we're
       | looking for.
       | 
       | That word is the winner! We add it to the sequence as a new
       | column, remove the first-column, and run the whole process again!
        | That's how we generate long text-completions one word at a time :D
       | 
       | So the interesting bits are located within that stack of layers.
       | This is why it's called "deep learning".
       | 
       | The mathematical transformation in each layer is called "self-
       | attention", and it involves a lot of matrix multiplications and
       | dot-product calculations with a learned set of "Query, Key and
       | Value" matrixes.
       | 
       | It can be hard to understand what these layers are doing
       | linguistically, but we can use image-processing and computer-
       | vision as a good metaphor, since images are also grids of
       | numbers, and we've all seen how photo-filters can transform that
       | entire grid in lots of useful ways...
       | 
       | You can think of each layer in the transformer as being like a
       | "mask" or "filter" that selects various interesting features from
       | the grid, and then tweaks the image with respect to those masks
       | and filters.
       | 
       | In image processing, you might apply a color-channel mask (chroma
       | key) to select all the green pixels in the background, so that
       | you can erase the background and replace it with other footage.
       | Or you might apply a "gaussian blur" that mixes each pixel with
       | its nearest neighbors, to create a blurring effect. Or you might
       | do the inverse of a gaussian blur, to create a "sharpening"
       | operation that helps you find edges...
       | 
       | But the basic idea is that you have a library of operations that
       | you can apply to a grid of pixels, in order to transform the
       | image (or part of the image) for a desired effect. And you can
       | stack these transforms to create arbitrarily-complex effects.
       | 
       | The same thing is true in a linguistic transformer, where a text
       | sequence is modeled as a matrix.
       | 
       | The language-model has a library of "Query, Key and Value"
       | matrixes (which were learned during training) that are roughly
       | analogous to the "Masks and Filters" we use on images.
       | 
       | Each layer in the Transformer architecture attempts to identify
        | some features of the incoming linguistic data, and then having
       | identified those features, it can subtract those features from
       | the matrix, so that the next layer sees only the transformation,
       | rather than the original.
       | 
       | We don't know exactly what each of these layers is doing in a
       | linguistic model, but we can imagine it's probably doing things
       | like: performing part-of-speech identification (in this context,
       | is the word "ring" a noun or a verb?), reference resolution (who
       | does the word "he" refer to in this sentence?), etc, etc.
       | 
       | And the "dot-product" calculations in each attention layer are
       | there to make each word "entangled" with its neighbors, so that
       | we can discover all the ways that each word is connected to all
       | the other words in its context.
       | 
       | So... that's how we generate word-predictions (aka "inference")
       | at runtime!
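        | 
        | A rough Python sketch of that outer loop (greedy nearest-word
        | matching; `model` and `dictionary` here are stand-ins, not
        | anyone's real API):
        | 
        |     import numpy as np
        | 
        |     def generate(model, dictionary, context, n_words):
        |         # context: rows = properties, columns = words
        |         for _ in range(n_words):
        |             out = model(context)       # run all the layers
        |             pred = out[:, -1]          # predicted next column
        |             # dictionary word closest to the prediction wins
        |             word_id = int(np.argmax(dictionary.T @ pred))
        |             new_col = dictionary[:, word_id:word_id + 1]
        |             # append the winner, drop the oldest column, repeat
        |             context = np.hstack([context[:, 1:], new_col])
        |             yield word_id
        | 
        |     d, vocab, ctx_len = 8, 50, 6
        |     dictionary = np.random.randn(d, vocab)
        |     dummy_model = lambda grid: grid    # stand-in layer stack
        |     start = dictionary[:, np.random.randint(vocab,
        |                                             size=ctx_len)]
        |     print(list(generate(dummy_model, dictionary, start, 5)))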
       | 
        | But why does it work?
       | 
       | To understand why it's so effective, you have to understand a bit
       | about the training process.
       | 
       | The flow of data during inference always flows in the same
       | direction. It's called a "feed-forward" network.
       | 
       | But during training, there's another step called "back-
       | propagation".
       | 
       | For each document in our training corpus, we go through all the
       | steps I described above, passing each word into our feed-forward
       | neural network and making word-predictions. We start out with a
       | completely randomized set of QKV matrixes, so the results are
       | often really bad!
       | 
       | During training, when we make a prediction, we KNOW what word is
       | supposed to come next. And we have a numerical representation of
       | each word (4096 numbers in a column!) so we can measure the error
       | between our predictions and the actual next word. Those "error"
       | measurements are also represented as columns of 4096 numbers
       | (because we measure the error in every dimension).
       | 
       | So we take that error vector and pass it backward through the
       | whole system! Each layer needs to take the back-propagated error
       | matrix and perform tiny adjustments to its Query, Key, and Value
       | matrixes. Having compensated for those errors, it reverses its
       | calculations based on the new QKV, and passes the resultant
       | matrix backward to the previous layer. So we make tiny
       | corrections on all 96 layers, and eventually to the word-vectors
       | in the dictionary itself!
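        | 
        | For what it's worth, here is the generic PyTorch shape of that
        | "measure the error, push it backward, make tiny adjustments"
        | step (a stand-in linear layer instead of a real transformer,
        | and a plain squared-error loss, purely for illustration):
        | 
        |     import torch
        | 
        |     model = torch.nn.Linear(4096, 4096)  # stand-in layer stack
        |     opt = torch.optim.SGD(model.parameters(), lr=1e-3)
        | 
        |     x = torch.randn(10, 4096)        # 10 context vectors
        |     target = torch.randn(10, 4096)   # the words that came next
        | 
        |     pred = model(x)                              # feed-forward
        |     loss = torch.nn.functional.mse_loss(pred, target)
        |     opt.zero_grad()
        |     loss.backward()                          # back-propagation
        |     opt.step()           # tiny corrections to the parameters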
       | 
       | Like I said earlier, we don't know exactly what those layers are
       | doing. But we know that they're performing a hierarchical
       | decomposition of concepts.
       | 
       | Hope that helps!
        
         | Me1000 wrote:
         | This was a very helpful visualization, thank you!
         | 
         | The "entanglement" part intuitively makes sense to me, but one
         | bit I always get caught up on the key, query, and value
         | matrices. In every self-attention explanation I've read/watched
         | they tend to get thrown out there and similar to what you did
         | here but leave their usage/purpose a little vague.
         | 
         | Would you mind trying to explain those in more detail? I've
         | heard the database analogy where you start with a query to get
         | a set of keys which you then use to lookup a value, but that
         | doesn't really compute with my mental model of neural networks.
         | 
         | Is it accurate to say that these separate QKV matrices are
         | layers in the network? That doesn't seem exactly right since I
         | think the self-attention layer as a whole contains these three
         | different matrices. I would assume they got their names for a
         | reason that should make it somewhat easy to explain their
         | individual purposes and what they try to represent in the NN.
        
         | detrites wrote:
         | That was incredible. Thank you! If you made it into an article
         | with images showing the mask/filter analogy, it might be one of
         | the best/most unique explanations I've seen. Love the ground-up
         | approach beginning with data's shape.
         | 
         | Reminded me of the style of a book on machine learning. If
         | anyone liked this explanation, you may appreciate this book:
         | 
         | https://www.amazon.com/Applied-Machine-Learning-Engineers-Al...
        
       | nil-sec wrote:
       | Feedforward: y=Wx
       | 
       | Attention: y=W(x)x
       | 
        | W is a matrix; x and y are vectors. In the second case, W is a
        | function of the input.
        
         | Sai_ wrote:
         | You must be from a planet with very long years!
         | 
         | There is no way I can even begin to digest what you have said
         | in your comment.
        
           | nil-sec wrote:
           | Sorry maybe I should have added more explanation. One way to
           | think about attention, which is the main distinguishing
           | element in a transformer, is as an adaptable matrix. A
           | feedforward layer is a matrix with static entries that do not
           | change at inference time (only during training). The
           | attention mechanism offers a way to have adaptable weight
            | matrices at inference time (this is implemented by using
            | three different matrices, K, Q & V, called keys, queries and
            | values, in case you want to dig deeper).
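            | 
            | Spelled out as a small numpy sketch of "the weight matrix
            | is itself a function of the input" (everything here is
            | illustrative):
            | 
            |     import numpy as np
            | 
            |     d, n = 8, 4
            |     x = np.random.randn(n, d)     # n token vectors
            |     W = np.random.randn(d, d)
            | 
            |     # Feedforward: the same fixed W for every input
            |     y_ff = x @ W
            | 
            |     # Attention: the mixing matrix is computed from x
            |     def W_of(x, Wq, Wk):
            |         s = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
            |         e = np.exp(s - s.max(axis=-1, keepdims=True))
            |         return e / e.sum(axis=-1, keepdims=True)
            | 
            |     Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
            |     y_attn = W_of(x, Wq, Wk) @ (x @ Wv)
            |     print(y_ff.shape, y_attn.shape)    # (4, 8) (4, 8)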
        
         | oneearedrabbit wrote:
         | I think in your notation it should have been:
         | 
         | y=Wx_0
         | 
         | y=W(x)x_0
        
           | nil-sec wrote:
           | I guess I was more thinking about self attention, so yes. The
           | more general case is covered by your notation!
        
       | [deleted]
        
       | devit wrote:
       | It works like this:
       | 
       | First, convert the input text to a sequence of token numbers
       | (2048 tokens with 50257 possible token values in GPT-3) by using
       | a dictionary and for each token, create a vector with 1 at the
       | token index and 0 elsewhere, transform it with a learned
       | "embedding" matrix (50257x12288 in GPT-3) and sum it with a
       | vector of sine and cosine functions with several different
       | periodicities.
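        | 
        | A small numpy sketch of that first step, with toy sizes in
        | place of GPT-3's 50257 / 2048 / 12288 (the sine/cosine scheme
        | below is the one from the original paper; this is only an
        | illustration):
        | 
        |     import numpy as np
        | 
        |     vocab, n_ctx, d_model = 1000, 16, 64
        | 
        |     def positional_encoding(n_ctx, d_model):
        |         pos = np.arange(n_ctx)[:, None]
        |         i = np.arange(d_model // 2)[None, :]
        |         angle = pos / (10000 ** (2 * i / d_model))
        |         pe = np.zeros((n_ctx, d_model))
        |         pe[:, 0::2] = np.sin(angle)   # even dims: sine
        |         pe[:, 1::2] = np.cos(angle)   # odd dims: cosine
        |         return pe
        | 
        |     tokens = np.random.randint(vocab, size=n_ctx)  # token ids
        |     W_embed = np.random.randn(vocab, d_model)
        |     # Indexing a row by token id is the same as multiplying a
        |     # one-hot vector by the embedding matrix.
        |     x = W_embed[tokens] + positional_encoding(n_ctx, d_model)
        |     print(x.shape)                   # (16, 64): layer-0 input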
       | 
       | Then, for each layer, and each attention head (96 layers and 96
       | heads per layer in GPT-3), transform the input vector by query,
       | key and value matrices (12288x128 in GPT-3) to obtain a query,
       | key and value vector for each token. Then for each token, compute
       | the dot product of its query vector with the key vectors of all
       | previous tokens, scale by 1/sqrt of the vector dimension and
       | normalize the results so they sum to 1 by using softmax (i.e.
       | applying e^x and dividing by the sum), giving the attention
       | coefficients; then, compute the attention head output by summing
       | the value vectors of previous tokens weighted by the attention
       | coefficients. Now, for each token, glue the outputs for all
       | attention heads in the layer (each with its own key/query/value
       | learned matrices), add the input and normalize (normalizing means
       | that the vector values are biased and scaled so they have mean 0
       | and variance 1).
       | 
       | Next, for the feedforward layer, apply a learned matrix, add a
       | learned vector and apply a ReLU (which is f(x) = x for positive x
       | and f(x) = kx with k near 0 for negative x), and do that again
       | (12288x49152 and 49152x12288 matrices in GPT-3, these actually
       | account for around 70% of the parameters in GPT-3), then add the
       | input before the feedforward layer and normalize.
       | 
       | Repeat the process for each layer, each with their own matrices,
       | passing the output of the previous layer as input. Finally, apply
       | the inverse of the initial embedding matrix and use softmax to
       | get probabilities for the next token for each position. For
       | training, train the network so that they are close to the actual
       | next token in the text. For inference, output a next token
       | according to the top K tokens in the probability distribution
       | over a cutoff and repeat the whole thing to generate tokens until
       | an end of text token is generated.
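        | 
        | To make one of those attention heads concrete, here is a numpy
        | sketch with toy sizes standing in for GPT-3's (the masking step
        | is what restricts each token to the previous tokens):
        | 
        |     import numpy as np
        | 
        |     def softmax(x, axis=-1):
        |         e = np.exp(x - x.max(axis=axis, keepdims=True))
        |         return e / e.sum(axis=axis, keepdims=True)
        | 
        |     # toy sizes standing in for GPT-3's 2048 / 12288 / 128
        |     n, d_model, d_head = 16, 64, 16
        |     x = np.random.randn(n, d_model)   # previous layer's output
        |     Wq, Wk, Wv = (np.random.randn(d_model, d_head)
        |                   for _ in range(3))
        | 
        |     Q, K, V = x @ Wq, x @ Wk, x @ Wv
        |     scores = Q @ K.T / np.sqrt(d_head)   # scaled dot products
        |     mask = np.triu(np.ones((n, n), dtype=bool), k=1)
        |     scores[mask] = -1e9        # only look at previous tokens
        |     coeff = softmax(scores)    # attention coefficients
        |     head_out = coeff @ V       # weighted sum of value vectors
        |     print(head_out.shape)      # (16, 16): one vector per token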
        
       | [deleted]
        
       | godelski wrote:
       | > I have zero AI/ML knowledge
       | 
       | This may make it difficult to explain and I already see many
       | incorrect explanations here and even more lazy ones (why post the
       | first Google result? You're just adding noise)
       | 
       | > Steve Yegge on Medium thinks that the team behind Transformers
       | deserves a Nobel
       | 
       | First, Yegge needs to be able to tell me what Attention and
       | Transformers are. More importantly, he needs to tell me who
       | invented them.
       | 
       | That actually gets to our important point and why there are so
       | many bad answers here and elsewhere. Because you're both missing
       | a lot of context as well as there being murky definitions. This
       | is also what makes it difficult to ELI5. I'll try, then try to
       | give you resources to get an actually good answer.
       | 
       | == Bad Answer (ELI5) ==
       | 
       | A transformer is an algorithm that considers the relationship of
       | all parts of a piece of data. It does this through 4 mechanisms
       | and in two parts. The first part is composed of a normalization
       | block and an attention block. The normalization block scales the
       | data and ensures that the data is not too large. Then the
       | attention mechanism takes all the data handed to it and considers
       | how it is all related to one another. This is called "self-
       | attention" when we only consider one input and it is called
       | "cross-attention" when we have multiple inputs and compare. Both
       | of these create a relationship that are similar to creating a
       | lookup table. The second block is also composed of a
       | normalization block followed by a linear layer. The linear layer
       | reprocesses all the relationships it just learned and gives it
       | context. But we haven't stated the 4th mechanism! This is called
       | a residual layer or "skip" layer. This allows the data to pass
       | right on by each of the above parts without being processed and
       | this little side path is key to getting things to train
       | efficiently.
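        | 
        | A minimal PyTorch sketch of that block layout under the
        | pre-norm arrangement described above (sizes are arbitrary, the
        | "linear layer" part is the usual two-layer MLP, and real models
        | add dropout and other details):
        | 
        |     import torch
        |     from torch import nn
        | 
        |     class Block(nn.Module):
        |         def __init__(self, d=512, heads=8):
        |             super().__init__()
        |             self.norm1 = nn.LayerNorm(d)
        |             self.attn = nn.MultiheadAttention(
        |                 d, heads, batch_first=True)
        |             self.norm2 = nn.LayerNorm(d)
        |             self.mlp = nn.Sequential(
        |                 nn.Linear(d, 4 * d), nn.GELU(),
        |                 nn.Linear(4 * d, d))
        | 
        |         def forward(self, x):
        |             h = self.norm1(x)
        |             # residual/"skip" path around the attention part
        |             x = x + self.attn(h, h, h)[0]
        |             # residual/"skip" path around the linear part
        |             x = x + self.mlp(self.norm2(x))
        |             return x
        | 
        |     out = Block()(torch.randn(1, 10, 512))
        |     print(out.shape)          # torch.Size([1, 10, 512])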
       | 
       | Now that doesn't really do the work justice or give a good
       | explanation of why or how things actually work. ELI5 isn't a good
       | way to understand things for usage, but it is an okay place to
       | start and learn abstract concepts. For the next level up I
       | suggest Training Compact Transformers[0]. It'll give some
       | illustrations and code to help you follow along. It is focused on
       | vision transformers, but it is all the same. The next level I
       | suggest Karpathy's video on GPT[1], where you will build
       | transformers and he goes in a bit more depth. Both these are good
       | for novices and people with little mathematical background. For
       | more lore and understanding why we got here and the confusion
        | over the definition of attention I suggest Lilian Weng's blog[2]
       | (everything she does is gold). For a lecture and more depth I
        | suggest Pascal Poupart's class. Lecture 19[3] is the one on
        | attention and transformers, but you need to watch at least
        | Lecture 18 first; if you actually have no ML experience or
        | knowledge, then you should probably start from the beginning.
       | 
       | The truth is that not everything can be explained in simple
       | terms, at least not if one wants an adequate understanding. That
       | misquotation of Einstein (probably originating from Nelson) is
       | far from accurate and I wouldn't expect someone that introduced a
       | highly abstract concept with complex mathematics (to such a
       | degree that physicists argued he was a mathematician) would say
       | something so silly. There is a lot lost when distilling a concept
       | and neither the listener nor speaker should fool themselves into
       | believing this makes them knowledgeable (armchair expertise is a
       | frustrating point on the internet and has gotten our society in a
       | lot of trouble).
       | 
       | [0] https://medium.com/pytorch/training-compact-transformers-
       | fro...
       | 
       | [1] https://www.youtube.com/watch?v=kCc8FmEb1nY
       | 
       | [2] https://lilianweng.github.io/posts/2018-06-24-attention/
       | 
       | [3] https://www.youtube.com/watch?v=OyFJWRnt_AY
        
       | Buttons840 wrote:
       | I've been working through [0]. Like a lot of math, the notation
       | is daunting, but once you become familiar with it, it really is a
       | nice tool for thought.
       | 
       | [0]: https://arxiv.org/abs/2207.09238
        
       | PeterisP wrote:
       | ELI5 is tricky as details have to be sacrificed, but I'll try.
       | 
       | An attention mechanism is when you want a neural network to learn
       | the function of how much attention to allocate to each item in a
       | sequence, to learn which items should be looked at.
       | 
       | Transformers is a self-attention mechanism, where you ask the
       | neural network to 'transform' each element by looking at its
       | potential combination with every other element and using this
       | (learnable, trainable) attention function to decide which
       | combination(s) to apply.
       | 
       | And it turns out that this very general mechanism, although
       | compute-intensive (it considers everything linking with
       | everything, so complexity quadratic to sequence length) and data-
       | intensive (it has lots and lots of parameters, so needs huge
       | amounts of data to be useful) can actually represent many of
       | things we care about in a manner which can be trained with the
       | deep learning algorithms we already had.
       | 
       | And, really, that's the two big things ML needs, a model
       | structure where there exists some configuration of parameters
       | which can actually represent the thing you want to calculate, and
       | that this configuration can actually be determined from training
       | data reasonably.
        
       | boppo1 wrote:
       | Who is Yegge and why should I care? All I found on google is that
       | he wrote code for google and blogs.
        
         | simonw wrote:
         | He's been a popular tech writer for a couple of decades. He's
         | written some very influential pieces in the past - here's a
         | copy of one of his most famous (I don't think the original is
         | available any more): https://gist.github.com/chitchcock/1281611
        
           | sundarurfriend wrote:
           | > (I don't think the original is available any more)
           | 
           | That's because this one was not a blog post, it was a Google+
           | post that he intended to share internally with Google only.
           | He had other great rants on his actual blog though, about
           | (vague memory here) Emacs, game development, language design,
           | etc.
        
         | wsgeorge wrote:
          | He's an engineer who's been involved in (or close to) some of
          | the most influential technologies/products in use in the last
          | 10 years, and his blog posts are quite celebrated and
          | respected.
         | 
         | Think of him as an essayist. Like Paul Graham (pg on this site)
        
           | sundarurfriend wrote:
           | 1 part pg and 3 parts _why maybe.
        
         | Sai_ wrote:
         | I first heard about him after an internal post/rant of his
         | about Google+ got leaked and went viral.
         | 
         | My impression is that he is a Kardashian - famous for being
         | famous - in the software world.
        
       | zh3 wrote:
       | [flagged]
        
         | blackbear_ wrote:
         | Absolute BS. Transformers were literally introduced by that
         | paper.
        
         | Sai_ wrote:
         | Maybe ELI5 was the wrong level for my ask. I still want to
         | understand it from a technical perspective.
         | 
         | Maybe ELI am a software engineer but not in AI/ML?
        
           | redox99 wrote:
           | *Prompt:* I'm a software engineer who has never done anything
           | with AI/ML. Explain to me transformers and the "Attention is
           | all you need" paper.
           | 
           | *GPT4:* "Attention is All You Need" is a seminal paper
           | published in 2017 by Vaswani et al. that introduced the
           | Transformer architecture, a breakthrough in the field of AI
           | and natural language processing (NLP). Transformers have
           | since become the foundation for many state-of-the-art models,
           | including GPT (Generative Pre-trained Transformer) series,
           | BERT (Bidirectional Encoder Representations from
           | Transformers), and more.
           | 
           | At a high level, the Transformer architecture is designed to
           | handle sequential data, like text, in parallel rather than
           | sequentially, as was common with earlier models like
           | Recurrent Neural Networks (RNNs) and Long Short-Term Memory
           | (LSTM) networks. This parallelization allows for faster
           | training and better scalability. The key innovation in
           | Transformers is the attention mechanism, which is used to
           | weigh the importance of different input tokens when
           | generating an output token.
           | 
           | Here's a simplified explanation of the main components of a
            | Transformer:
            | 
            |     Input Embeddings: Transformers take input data in the
            |     form of text tokens. These tokens are converted into
            |     continuous vectors through an embedding layer.
            | 
            |     Positional Encoding: Unlike RNNs and LSTMs, Transformers
            |     don't have an inherent sense of sequence order. To
            |     address this, positional encoding is added to the input
            |     embeddings to provide information about the position of
            |     each token in the sequence.
            | 
            |     Encoder and Decoder: The Transformer architecture
            |     consists of an encoder and a decoder. The encoder is a
            |     stack of identical layers, each containing two primary
            |     components: a multi-head self-attention mechanism and a
            |     position-wise feed-forward network. The decoder is also
            |     composed of a stack of identical layers, but it has an
            |     additional multi-head attention mechanism that attends
            |     to the encoder's output.
            | 
            |     Multi-Head Attention: The attention mechanism is the
            |     core innovation in the Transformer architecture. It
            |     computes a weighted average of input tokens to generate
            |     context-aware representations. Multi-head attention
            |     allows the model to focus on different aspects of the
            |     input by employing multiple attention heads, each with
            |     its own set of learnable parameters.
            | 
            |     Feed-Forward Networks: These are simple fully connected
            |     layers that follow the attention mechanism in each layer
            |     of the encoder and decoder.
            | 
            |     Output: Finally, the decoder generates the output
            |     sequence token by token. This is done by applying a
            |     linear layer followed by a softmax activation function
            |     to get the probabilities of each possible output token.
           | 
           | In summary, the Transformer architecture introduced in the
           | "Attention is All You Need" paper revolutionized NLP by
           | enabling more efficient and powerful models. The key
           | innovation is the attention mechanism, which helps the model
           | learn complex relationships between input tokens and produce
           | context-aware representations. This has led to a wide array
           | of applications, from machine translation and text
           | summarization to sentiment analysis and language modeling.
           | 
           | *Me:* It's okayish but you're probably interested in a more
           | detailed low level explanation. Also transformers are widely
           | used for computer vision nowadays (a bit forgivable because
           | of ChatGPT 2021 cutoff, but they had already taken off by
           | then)
        
       | adamnemecek wrote:
       | I argue that the attention mechanism is just Hopf algebra
       | convolution. The QK mechanism selects possible candidates and the
       | OV mechanism picks one of these possible candidates.
       | 
       | I have recently written a paper on this
       | https://arxiv.org/abs/2302.01834
       | 
       | I also have a discord channel https://discord.cofunctional.ai.
       | 
       | Ironically, it's the same mechanism as what renormalization in
       | QFT does. I'm getting increasingly convinced that it's also how
       | the brain works.
        
         | detrites wrote:
         | Could you ELI5 (or maybe 15, with high-school maths) this
         | concept? Particularly how it may relate to how our brains might
         | work? Most interested.
        
       | raydiatian wrote:
       | It's not really something you need to understand unless you're an
       | ML researcher.
       | 
       | I guess the ELI5 (with a BUNCH of details left out) is
       | "Transformers: what if you didn't have to process sentences as a
       | sequence of words, but rather as a picture of words."
        
         | edgyquant wrote:
          | > It's not really something you need to understand unless
          | you're an ML researcher
         | 
         | What kind of an answer is this? No it's something you should
         | understand if you are at all interested.
        
           | raydiatian wrote:
           | It's like saying "I have no basis in differential equations
           | or complex numbers or Newtonian mechanics or experimental
           | findings on the wavelike nature of light but could somebody
           | explain the Schrodinger equation derivation like I'm five?"
           | 
           | Like, there's a reason we don't explain things to five year
           | olds.
           | 
           | Also I have a good ELI5 answer, so I don't know what else you
           | want dude lol so petty
        
       | rafi25 wrote:
       | ELI5: In neural networks we could take data numbers, multiply by
       | a constant, add a constant, add them to each other, and then do
       | the same to the results again and again. And you could learn the
       | right constants using labeled data. But we couldn't multiply data
       | numbers against each other. So in transformers you can. But you
       | must multiply all numbers against all numbers in a specific
       | sample, so it's heavy.
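        | 
        | A toy illustration of the "all numbers against all numbers"
        | part, and why the cost grows with the square of the sample
        | length:
        | 
        |     import numpy as np
        | 
        |     x = np.random.randn(6)     # one sample with 6 numbers
        |     pairwise = np.outer(x, x)  # every number times every other
        |     print(pairwise.shape)      # (6, 6): n inputs -> n*n products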
        
       | tworats wrote:
        | The Illustrated Transformer (
       | https://jalammar.github.io/illustrated-transformer/ ) and
       | Visualizing attention (
       | https://towardsdatascience.com/deconstructing-bert-part-2-vi...
       | ), are both really good resources. For a more ELI5 approach this
       | non-technical explainer ( https://www.parand.com/a-non-technical-
       | explanation-of-chatgp... ) covers it at a high level.
        
       | kartayyar wrote:
       | - You can develop a very deep understanding of a sequence by
       | observing how each element interacts with each other over many
       | sequences.
       | 
        | - This understanding can be encapsulated in a "compressed" low-
        | dimensional vector representation of a sequence.
       | 
       | - You can use this understanding for many different downstream
       | tasks, especially predicting the next item in a sequence.
       | 
       | - This approach scales really well with lots of GPUs and data and
       | is super applicable to generating text.
        
       | cratermoon wrote:
       | One or both of these might help
       | 
       | Transformers, Explained: Understand the Model Behind GPT-3, BERT,
       | and T5: https://daleonai.com/transformers-explained
       | 
       | Transformers from Scratch: https://e2eml.school/transformers.html
       | 
       | The first link covers Attention well.
        
       | factorymoo wrote:
       | "Transformers" and "Attention is All You Need" refer to an
       | important development in machine learning and artificial
       | intelligence, particularly in the field of natural language
       | processing (NLP). I'll try to explain them in a simple way.
       | 
       | Think of a conversation you had with a friend. While they were
       | talking, you were probably not just listening to the words they
       | were saying right now, but also remembering what they said a few
       | minutes ago. Your brain was connecting the dots between different
       | parts of the conversation to understand the full meaning. Now,
       | imagine if you could only understand each word in isolation and
       | couldn't remember anything from a few seconds ago. Conversations
       | would be pretty hard to understand, right?
       | 
       | In early NLP models, this was a big problem. They couldn't easily
       | look at the "context" of a conversation or a sentence. They could
       | only look at a few words at a time, so they were a bit like our
       | forgetful person. They were good at understanding the meaning of
       | individual words, but not so good at understanding how those
       | words fit together to create meaning.
        
         | Sai_ wrote:
          | Enlightening example of having a conversation. Makes things
          | clearer.
        
         | tikkun wrote:
         | Did you use GPT to write this? (Not a bad thing! It's a decent
         | answer)
        
           | mcbuilder wrote:
           | I copy pasted the attention is all you need paper into
           | ChatGPT4 and gave it the prompt "Explain like I'm 5 years
           | old".
           | 
           |  _The Transformer is a new type of computer program that
           | helps translate languages and understand sentences. It works
           | by paying attention to different parts of a sentence at the
           | same time, instead of looking at one word after another like
           | older programs. This makes it faster and better at
           | understanding complicated sentences. It has been tested on
           | translating English to German and English to French and did a
           | really good job._
           | 
           | (Edit)
        
           | jasonjmcghee wrote:
           | Immediately thought this was gpt as well.
           | 
           | Assuming prompt was "Explain Transformers and 'Attention is
           | all you need' in a simple way"
        
         | drexlspivey wrote:
         | AFAIK Transformers and context size are orthogonal concepts.
         | You could have large token contexts before. The transformer
         | directs the "attention" to a specific word/token inside the
         | context.
        
       | notfried wrote:
       | Prompt: "ELI5 the Attention is All We Need paper."
       | 
       | Imagine you have a box of toys. Some toys are more interesting to
       | you than others, so you pay more attention to those toys and play
       | with them more. The same thing happens in the "Attention is All
       | You Need" paper, but instead of toys, we have words in a
       | sentence.
       | 
       | Before this paper, when computers tried to translate one language
       | to another, they would look at one word, translate it, then move
       | to the next word. This works okay, but it's not great because
       | sometimes the meaning of a word can depend on other words in the
       | sentence.
       | 
       | The clever thing about the "Attention is All You Need" paper is
       | that it taught the computer to pay attention to all the words in
       | a sentence at the same time, but give more importance
       | ("attention") to the words that matter most for understanding and
       | translation.
       | 
       | This new way of translating languages using "attention" made
       | computers much better at understanding and translating languages.
       | It's like if you could become better at playing by understanding
       | all your toys at once! This paper was a big deal in the field of
       | machine learning and artificial intelligence because it improved
       | how machines understand languages.
        
       | fkarg wrote:
        | Uhm, nope, but here's a ~70min talk I gave about it:
       | https://github.com/fkarg/things-to-talk-about/blob/master/tr...
       | 
       | Not sure how well the slides can be understood by themselves,
       | though I tried to be accommodating for that
        
       ___________________________________________________________________
       (page generated 2023-05-17 23:00 UTC)