[HN Gopher] The Illustrated Transformer
       ___________________________________________________________________
        
       The Illustrated Transformer
        
       Author : auraham
       Score  : 185 points
       Date   : 2025-12-22 19:15 UTC (3 hours ago)
        
 (HTM) web link (jalammar.github.io)
 (TXT) w3m dump (jalammar.github.io)
        
       | profsummergig wrote:
       | Haven't watched it yet...
       | 
       | ...but, if you have favorite resources on understanding Q & K,
       | please drop them in comments below...
       | 
       | (I've watched the Grant Sanderson/3blue1brown videos [including
       | his excellent talk at TNG Big Tech Day '24], but Q & K still
       | escape me).
       | 
       | Thank you in advance.
        
         | red2awn wrote:
         | Implement transformers yourself (ie in Numpy). You'll never
         | truly understand it by just watching videos.
        
           | D-Machine wrote:
           | Seconding this, the terms "Query" and "Value" are largely
           | arbitrary and meaningless in practice, look at how to
           | implement this in PyTorch and you'll see these are just
           | weight matrices that implement a projection of sorts, and
           | self-attention is always just self_attention(x, x, x) or
           | self_attention(x, x, y) in some cases, where x and y are
           | outputs from previous layers.
           | 
           | Plus with different forms of attention, e.g. merged
           | attention, and the research into why / how attention
           | mechanisms might actually be working, the whole "they are
           | motivated by key-value stores" thing starts to look really
           | bogus. Really it is that the attention layer allows for
           | modeling correlations and/or multiplicative interactions
           | among a dimension-reduced representation.
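A minimal NumPy sketch of the point above, assuming a single head with no mask and no output projection: Q, K and V are just three learned projections of whatever inputs are passed in. The names `attention`, `W_q`, `W_k`, `W_v` and all sizes are illustrative and not taken from any library. The argument order here is (query source, key source, value source), so cross-attention is written `attention(x, y, y, ...)`; other codebases may order the arguments differently.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max along the axis for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, k_in, v_in, W_q, W_k, W_v):
    """Single-head scaled dot-product attention (illustrative sketch)."""
    Q = q_in @ W_q                              # (n_q, d_head)
    K = k_in @ W_k                              # (n_k, d_head)
    V = v_in @ W_v                              # (n_k, d_head)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (n_q, n_k) pairwise interactions
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_head = 8, 4                  # hypothetical sizes; d_head < d_model
x = rng.normal(size=(5, d_model))       # 5 token vectors from a previous layer
y = rng.normal(size=(3, d_model))       # 3 token vectors from another source
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

self_out, self_w = attention(x, x, x, W_q, W_k, W_v)     # self-attention
cross_out, cross_w = attention(x, y, y, W_q, W_k, W_v)   # cross-attention
```

The projection from `d_model` down to `d_head` is the dimension reduction mentioned above; nothing in the code treats "query", "key" or "value" as anything other than a matrix product.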
        
             | profsummergig wrote:
             | Do you think the dimension reduction is necessary? Or is it
             | just practical (due to current hardware scarcity)?
        
           | krat0sprakhar wrote:
           | Do you have a tutorial that I can follow?
        
             | roadside_picnic wrote:
             | The most valuable tutorial will be translating from the
             | paper itself. The more hand holding you have in the
             | process, the less you'll be learning conceptually. The pure
             | manipulation of matrices is rather boring and uninformative
             | without some context.
             | 
             | I also think the implementation is more helpful for
             | understanding the engineering work to run these models than
             | getting a deeper mathematical understanding of what the
             | model is doing.
        
             | jwitthuhn wrote:
             | If you have 20 hours to spare I highly recommend this
             | youtube playlist from Andrej Karpathy:
             | https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...
             | 
             | It starts with the fundamentals of how backpropagation
             | works then advances to building a few simple models and
             | ends with building a GPT-2 clone. It won't teach you
             | everything about AI models but it gives you a solid
             | foundation for branching out.
        
           | roadside_picnic wrote:
           | I personally don't think implementation is as enlightening as
           | far as really understanding _what_ the model is doing as this
           | statement implies. I had done that many times, but it wasn't
           | until reading about the relationship to kernel methods that
           | it really _clicked_ for me what is really happening under the
           | hood.
           | 
           | Don't get me wrong, implementing attention is still great
           | (and necessary), but even with something as simple as linear
           | regression, implementing it doesn't really give you the
           | entire conceptual model. I do think implementation helps to
           | understand the _engineering_ of these models, but it still
           | requires reflection and study to start to understand
           | conceptually why they are working and what they're really
           | doing (I would, of course, argue I'm _still_ learning about
           | linear models in that regard!)
        
         | leopd wrote:
         | I think this video does a pretty good job explaining it,
         | starting about 10:30 minutes in:
         | https://www.youtube.com/watch?v=S27pHKBEp30
        
           | oofbey wrote:
           | As the first comment says "This aged like fine wine". Six
           | years old, but the fundamentals haven't changed.
        
           | andoando wrote:
           | This wasn't any better than other explanations I've seen.
        
         | throw310822 wrote:
         | Have you tried asking e.g. Claude to explain it to you? None of
         | the usual resources worked for me, until I had a discussion
         | with Claude where I could ask questions about everything that I
         | didn't get.
        
         | bobbyschmidd wrote:
         | tldr: recursively _aggregating packing /unpacking_ 'if else if
         | (functions)/statements' as keyword arguments that (call)/take
         | them themselves as arguments, with their own position shifting
         | according to the number "(weights)" of else if
         | (functions)/statements needed to get all the other arguments
         | into (one of) THE adequate orders. the order changes based on
         | the language, input prompt and context.
         | 
         | if I understand it all correctly.
         | 
         | implemented it in html a while ago and might do it in htmx
         | sometime soon.
         | 
         | transformers are just slutty dictionaries that Papa Roach and
         | kage bunshin no jutsu right away again and again, spawning
         | clones and variations based on requirements, which is why they
         | tend to repeat themselves rather quickly and often. it's got
         | almost nothing to do with languages themselves and requirements
         | and weights amount to playbooks and DEFCON levels
        
         | roadside_picnic wrote:
         | It's just a re-invention of kernel smoothing. Cosma Shalizi has
         | an _excellent_ write up on this [0].
         | 
         | Once you recognize this it's a wonderful re-framing of what a
         | transformer is doing under the hood: you're effectively
         | learning a bunch of sophisticated kernels (through the FF part)
         | and then applying kernel smoothing in different ways through
         | the attention layers. It makes you realize that Transformers
         | are philosophically much closer to things like Gaussian
         | Processes (which are also just a bunch of kernel manipulation).
         | 
         | 0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
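The kernel-smoothing reading can be checked numerically. The sketch below assumes a single query and an exponential dot-product kernel (the names `kernel_smooth`, `exp_kernel` and `softmax_attention` are illustrative): a Nadaraya-Watson style kernel-weighted average of the values gives the same result as softmax attention over the same keys.

```python
import numpy as np

def kernel_smooth(query, keys, values, kernel):
    # Kernel smoother: weighted average of values, weights given by kernel(query, key).
    w = np.array([kernel(query, k) for k in keys])
    return (w[:, None] * values).sum(axis=0) / w.sum()

def exp_kernel(q, k):
    # Exponential of the scaled dot product, the kernel implied by softmax attention.
    return np.exp(q @ k / np.sqrt(q.shape[0]))

def softmax_attention(query, keys, values):
    scores = keys @ query / np.sqrt(query.shape[0])
    scores = scores - scores.max()          # stability shift; cancels in the ratio
    w = np.exp(scores)
    return (w / w.sum()) @ values

rng = np.random.default_rng(1)
q = rng.normal(size=4)          # one query vector
K = rng.normal(size=(6, 4))     # six keys
V = rng.normal(size=(6, 2))     # six values

smoothed = kernel_smooth(q, K, V, exp_kernel)
attended = softmax_attention(q, K, V)
```

Swapping `exp_kernel` for another positive kernel gives a different smoother with the same structure, which is one way to see attention variants as choices of kernel.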
        
         | machinationu wrote:
         | Q, K and V are a way of filtering the relevant aspects for the
         | task at hand from the token embeddings.
         | 
         | "he was red" - maybe color, maybe angry, the "red" token
         | embedding carries both, but only one aspect is relevant for
         | some particular prompt.
         | 
         | https://ngrok.com/blog/prompt-caching/
        
       | laser9 wrote:
       | Here's the comment from the author himself (jayalammar) talking
       | about other good resources on learning Transformers:
       | 
       | https://news.ycombinator.com/item?id=35990118
        
       | boltzmann_ wrote:
       | Kudos also to the Transformer Explainer team for putting together
       | some amazing visualizations:
       | https://poloclub.github.io/transformer-explainer/
       | It really clicked for me after reading these two and watching the
       | 3blue1brown videos.
        
         | gzer0 wrote:
         | This is hands down one of the best visualizations I have ever
         | come across.
        
       | gustavoaca1997 wrote:
       | I have this book. It was really a lifesaver, helping me catch up a
       | few months ago when my team decided to use LLMs in our systems.
        
         | qoez wrote:
         | Don't really see why you'd need to understand how the
         | transformer works to do LLMs at work. LLMs is just a synthetic
         | human performing reasoning with some failure modes that
         | in-depth knowledge of the transformer internals won't help you
         | predict (you just have to use experience with the output to get
         | a sense, or other people's experiments).
        
           | roadside_picnic wrote:
           | In my experience there is a substantial difference in the
           | ability to really get performance in LLM related engineering
           | work from people who really understand how LLMs work vs
           | people who think it's a magic box.
           | 
           | If your mental model of an LLM is:
           | 
           | > a synthetic human performing reasoning
           | 
           | You are _severely_ overestimating the capabilities of these
           | models and not realizing potential areas of failure (even if
           | your prompt works for now in the happy case). Understanding
           | how transformers work absolutely can help debug problems (or
           | avoid them in the first place). People without a deep
           | understanding of LLMs also tend to get fooled by them more
           | frequently. When you have internalized the fact that LLMs are
           | literally _optimized_ to trick you, you tend to be much
           | more skeptical of the initial results (which results in
           | better eval suites etc).
           | 
           | Then there's people who _actually do AI engineering_. If
           | you're working with local/open weights models or on the
           | inference end of things you can't just play around with an
           | API, you have _a lot_ more control and observability into the
           | model and should be making use of it.
           | 
           | I still hold that the best test of an AI Engineer, at any
           | level of the "AI" stack, is how well they understand
           | speculative decoding. It involves understanding quite a bit
           | about how LLMs work and can still be implemented on a cheap
           | laptop.
        
             | amelius wrote:
             | But that AI engineer who is implementing speculative
             | decoding is still just doing basic plumbing that has little
             | to do with the actual reasoning. Yes, he/she might make the
             | process faster, but they will know just as little about
             | why/how the reasoning works as when they implemented a
             | naive, slow version of the inference.
        
               | roadside_picnic wrote:
               | What "actual reasoning" are you referring to? I believe
               | you're making my point for me.
               | 
               | Speculative decoding requires the implementer to
               | understand:
               | 
               | - How the initial prompt is processed by the LLM
               | 
               | - How to retrieve all the probabilities of previously
               | observed tokens in the prompt (this also helps people
               | understand things like the probability of the entire
               | prompt itself, the entropy of the prompt etc).
               | 
               | - Details of how the logits generate the distribution of
               | next tokens
               | 
               | - Precise details of the sampling process + the rejection
               | sampling logic for comparing the two models
               | 
               | - How each step of the LLM is run under-the-hood as the
               | response is processed.
               | 
               | Hardly just plumbing, especially since, to my knowledge,
               | there are not a lot of hand-holding tutorials on this
               | topic. You need to really internalize what's going on and
               | how this is going to lead to a 2-5x speed up in
               | inference.
               | 
               | Building all of this yourself gives you a lot of
               | visibility into how the model behaves and how "reasoning"
               | emerges from the sampling process.
               | 
               | edit: Anyone who can perform speculative decoding work
               | _also_ has the ability to inspect the reasoning steps of
               | an LLM and do experiments such as _rewinding_ the thought
               | process of the LLM and substituting a reasoning step to
               | see how it impacts the results. If you're just prompt
               | hacking you're not going to be able to perform these
               | types of experiments to understand _exactly_ how the
               | model is reasoning and what's important to it.
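The pieces listed above can be sketched in a toy setting using only the Python standard library. Everything here is an assumption for illustration: `target_probs` and `draft_probs` are hypothetical stand-ins for a large and a small model over a three-token vocabulary, and the accept/reject rule is the commonly described speculative sampling scheme (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q). A real system would score all drafted positions in one batched forward pass of the target model; the loop below does it one position at a time for clarity.

```python
import random

def target_probs(context):
    # Hypothetical "large" model: toy distribution that depends on context length.
    return [0.6, 0.3, 0.1] if len(context) % 2 == 0 else [0.2, 0.5, 0.3]

def draft_probs(context):
    # Hypothetical "small" model: cheap, only roughly similar to the target.
    return [0.5, 0.25, 0.25]

def sample(weights, rng):
    # Draw a token id in proportion to the (possibly unnormalized) weights.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def speculative_step(context, k, rng):
    """Draft k tokens, then accept or reject each against the target model."""
    proposed, q_dists = [], []
    ctx = list(context)
    for _ in range(k):                      # 1. draft model proposes k tokens
        q = draft_probs(ctx)
        tok = sample(q, rng)
        proposed.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    accepted = []
    ctx = list(context)
    for tok, q in zip(proposed, q_dists):   # 2. verify drafts left to right
        p = target_probs(ctx)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q) and stop.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            accepted.append(sample(residual, rng))
            return accepted
    # 3. every draft accepted: take one extra token from the target for free
    accepted.append(sample(target_probs(ctx), rng))
    return accepted

tokens = speculative_step([], k=3, rng=random.Random(0))
```

Each call returns between 1 and k + 1 tokens, and the rejection step is what keeps the emitted tokens distributed as if they had been sampled from the target model alone.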
        
               | amelius wrote:
               | But I can make a similar argument about a simple
               | multiplication:
               | 
               | - You have to know how the inputs are processed.
               | 
               | - You have to left-shift one of the operands by 0, 1, ...
               | N-1 times.
               | 
               | - Add those together, depending on the bits in the other
               | operand.
               | 
               | - Use an addition tree to make the whole process faster.
               | 
               | That does not mean that knowing the above process gives you
               | good insight into the concept of A*B and all the related
               | math, and it certainly will not make you better at calculus.
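For reference, the shift-and-add procedure described in that list looks roughly like this in Python. It is a sketch for non-negative integers only, and it accumulates the partial products sequentially rather than with the addition tree mentioned above; the function name `shift_add_multiply` is made up for illustration.

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts and adds."""
    result = 0
    shift = 0
    while b:
        if b & 1:                  # current bit of b is set
            result += a << shift   # add a, left-shifted by the bit position
        b >>= 1                    # move on to the next bit of b
        shift += 1
    return result

product = shift_add_multiply(13, 11)
```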
        
             | machinationu wrote:
             | speculative decoding is 1+1
             | 
             | transformer attention is integrals
        
           | Koshkin wrote:
           | > _is just a synthetic human performing reasoning_
           | 
           | The future is now! (Not because of "a synthetic human" per se
           | but because of people thinking of them as something
           | unremarkable.)
        
           | bonesss wrote:
           | > LLMs is just a synthetic human
           | 
           | 1) 'human' encompasses behaviours that include revenge
           | cannibalism and recurrent sexual violence --- wish carefully.
           | 
           | 2) not even a little bit, and if you want to pretend then
           | pretend they're a deranged delusional psych patient who will
           | look you in the eye and say genuinely " _oops, I guess I was
           | lying, it won't ever happen again_ " and then lie to you
           | again, while making sure it happens again.
           | 
           | 3) don't anthropomorphize LLMs, they don't like it.
        
       | Koshkin wrote:
       | (Going on a tangent.) The number of transformer
       | explanations/tutorials is becoming overwhelming. Reminds me of
       | monads (or maybe calculus). Someone feels a spark of
       | enlightenment at some point (while, often, in fact, remaining
       | deeply confused), and an urge to share their newly acquired
       | (mis)understanding with a wide audience.
        
         | nospice wrote:
         | So?
         | 
         | There's no rule that the internet is limited to a single
         | explanation. Find the one that clicks for you, ignore the rest.
         | Whenever I'm trying to learn about concepts in mathematics,
         | computer science, physics, or electronics, I often find that
         | the first or the "canonical" explanation is hard for me to
         | parse. I'm thankful for having options 2 through 10.
        
         | kadushka wrote:
         | Maybe so, but this particular blog post was the first and is
         | still the best explanation of how transformers work.
        
       | ActorNightly wrote:
       | People need to get away from this idea of Key/Query/Value as
       | being special.
       | 
       | Whereas a standard deep layer in a network is matrix * input,
       | where each row of the matrix is the weights of the particular
       | neuron in the next layer, a transformer is basically input*
       | MatrixA, input*MatrixB, input*MatrixC (where vector*matrix is a
       | matrix), then the output is C*MatrixA*MatrixB*MatrixC. Just
       | simply more dimensions in a layer.
       | 
       | And consequently, you can represent the entire transformer
       | architecture with a set of deep layers as you unroll the
       | matrices, with a lot of zeros for the multiplication pieces that
       | are not needed.
       | 
       | This is a fairly complex blog but it shows that it's just all
       | matrix multiplication all the way down.
       | https://pytorch.org/blog/inside-the-matrix/.
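One hedged way to make the "just matrix multiplication" reading concrete, using NumPy with made-up sizes: the pre-softmax attention scores can be computed either from separate query and key projections or from a single merged matrix `W_q @ W_k.T`. This identity covers only the score computation. The softmax applied afterwards is nonlinear, so the full layer is not a single chain of matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))       # 4 tokens, model width 6 (illustrative)
W_q = rng.normal(size=(6, 3))     # query projection down to width 3
W_k = rng.normal(size=(6, 3))     # key projection down to width 3

# Usual two-step form: project to queries and keys, then take dot products.
scores_two_step = (X @ W_q) @ (X @ W_k).T

# Merged form: one 6x6 matrix of rank at most 3 between the inputs.
W_qk = W_q @ W_k.T
scores_merged = X @ W_qk @ X.T
```

The merged matrix has rank at most the head width, which is another way to see the dimension reduction discussed earlier in the thread.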
        
         | throw310822 wrote:
         | I might be completely off road, but I can't help thinking of
         | convolutions as my mental model for the K Q V mechanism.
         | Attention has the same property of a convolution kernel of
         | being trained independently of position; it learns how to
         | translate a large, rolling portion of an input to a new
         | "digested" value; and you can train multiple ones in parallel
         | so that they learn to focus on different aspects of the input
         | ("kernels" in the case of convolution, "heads" in the case of
         | attention).
        
       | zkmon wrote:
       | I think the internals of transformers will become less relevant,
       | like the internals of compilers, as programmers will only care
       | about how to "use" them instead of how to develop them.
        
         | rvz wrote:
         | Their internals are just as relevant (now even more relevant)
         | as any other technology as they always need to be improved to
         | the SOTA (state of the art) meaning that someone _has_ to
         | understand their internals.
         | 
         | It also means more jobs for the people who understand them at a
         | deeper level to advance the SOTA of specific widely used
         | technologies such as operating systems, compilers, neural
         | network architectures and hardware such as GPUs or TPU chips.
         | 
         | Someone has to maintain and improve them.
        
         | esafak wrote:
         | Practitioners already do not need to know about it. I bet most
         | don't know the fundamentals of machine learning. Hands up if
         | you know bias from variance...
        
       | libraryofbabel wrote:
       | I read this article back when I was learning the basics of
       | transformers; the visualizations were really helpful. Although in
       | retrospect knowing how a transformer works wasn't very useful at
       | all in my day job _applying_ LLMs, except as a sort of deep
       | background for reassurance that I had some idea of how the big
       | black box producing the tokens was put together, and to give me
       | the mathematical basis for things like context size limitations
       | etc.
       | 
       | I would strongly caution anyone who thinks that they will be able
       | to understand or explain LLM _behavior_ better by studying the
       | architecture closely. That is a trap. Big SotA models these days
       | exhibit so much nontrivial emergent phenomena (in part due to the
       | massive application of reinforcement learning techniques) that
       | give them capabilities very few people expected to _ever_ see
       | when this architecture first arrived. Most of us confidently
       | claimed even back in 2023 that, based on LLM architecture and
       | training algorithms, LLMs would never be able to perform well on
       | novel coding or mathematics tasks. We were wrong. That points
       | towards some caution and humility about using network
       | architecture alone to reason about how LLMs work and what they
       | can do. You'd really need to be able to poke at the weights
       | inside a big SotA model to even begin to answer those kinds of
       | questions, but unfortunately that's only really possible if
       | you're a "mechanistic interpretability" researcher at one of the
       | major labs.
       | 
       | Regardless, this is a nice article, and this stuff is worth
       | learning because it's interesting for its own sake! Right now I'm
       | actually spending some vacation time implementing a transformer
       | in PyTorch just to refresh my memory of it all. It's a lot of
       | fun! If anyone else wants to get started with that I would highly
       | recommend Sebastian Raschka's book and youtube videos as a way into
       | the subject: https://github.com/rasbt/LLMs-from-scratch .
       | 
       | Has anyone read TFA author Jay Alammar's book (published Oct
       | 2024) and would they recommend it for a more up-to-date picture?
        
         | nrhrjrjrjtntbt wrote:
         | It is almost like understanding wood at a molecular level and
         | being a carpenter. It may also help the carpentry, but you can
         | be a great one without it. And a bad one with the knowledge.
        
         | ozgung wrote:
         | I think the biggest problem is that most tutorials use words to
         | illustrate how the attention mechanism works. In reality, there
         | are no word-associated tokens inside a Transformer. Tokens !=
         | word parts. An LLM does not perform language processing inside
         | the Transformer blocks, and a Vision Transformer does not
         | perform image processing. Words and pixels are only relevant at
         | the input. I think this misunderstanding was a root cause of
         | underestimating their capabilities.
        
         | miki123211 wrote:
         | > Most of us confidently claimed even back in 2023 that, based
         | on LLM architecture and training algorithms, LLMs would never
         | be able to perform well on novel coding or mathematics tasks.
         | 
         | I feel like there are three groups of people:
         | 
         | 1. Those who think that LLMs are stupid slop-generating
         | machines which couldn't ever possibly be of any use to anybody,
         | because there's some problem that is simple for humans but hard
         | for LLMs, which makes them unintelligent by definition.
         | 
         | 2. Those who think we have already achieved AGI and don't need
         | human programmers any more.
         | 
         | 3. Those who believe LLMs will destroy the world in the next 5
         | years.
         | 
         | I feel like the composition of these three groups is pretty
         | much constant since the release of ChatGPT, and like with most
         | political fights, evidence doesn't convince people either way.
        
           | libraryofbabel wrote:
           | Those three positions are all extreme viewpoints. There are
           | certainly people who hold them, and they tend to be loud and
           | confident and have an outsize presence in HN and other places
           | online.
           | 
           | But a lot of us have a more nuanced take! It's perfectly
           | possible to believe simultaneously that 1) LLMs are more than
           | stochastic parrots 2) LLMs are useful for software
           | development 3) LLMs have all sorts of limitations and risks
           | (you _can_ produce slop with them, and many people will,
           | there are security issues, I can go on and on...) 4) We're
           | not getting AGI or world-destroying super-intelligence
           | anytime soon, if ever 5) We're in a bubble and it's going to
           | pop and cause a big mess 6) This tech is still going to be
           | transformative long term, on a similar level to the web and
           | smartphones.
           | 
           | Don't let the noise from the extreme people who formed their
           | opinions back when ChatGPT came out drown out serious
           | discussion! A lot of us try and walk a middle course with
           | this and have been and still are open to changing our minds.
        
       ___________________________________________________________________
       (page generated 2025-12-22 23:00 UTC)