[HN Gopher] The Illustrated Transformer
___________________________________________________________________
The Illustrated Transformer
Author : auraham
Score : 185 points
Date : 2025-12-22 19:15 UTC (3 hours ago)
(HTM) web link (jalammar.github.io)
(TXT) w3m dump (jalammar.github.io)
| profsummergig wrote:
| Haven't watched it yet...
|
| ...but, if you have favorite resources on understanding Q & K,
| please drop them in comments below...
|
| (I've watched the Grant Sanderson/3blue1brown videos [including
| his excellent talk at TNG Big Tech Day '24], but Q & K still
| escape me).
|
| Thank you in advance.
| red2awn wrote:
| Implement transformers yourself (ie in Numpy). You'll never
| truly understand it by just watching videos.
| D-Machine wrote:
| Seconding this, the terms "Query" and "Value" are largely
| arbitrary and meaningless in practice, look at how to
| implement this in PyTorch and you'll see these are just
| weight matrices that implement a projection of sorts, and
| self-attention is always just self_attention(x, x, x) or
| self_attention(x, x, y) in some cases, where x and y are
| outputs from previous layers.
|
| Plus with different forms of attention, e.g. merged
| attention, and the research into why / how attention
| mechanisms might actually be working, the whole "they are
| motivated by key-value stores" thing starts to look really
| bogus. Really it is that the attention layer allows for
| modeling correlations and/or multiplicative interactions
| among a dimension-reduced representation.
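The point about Q, K and V being plain weight matrices can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not any particular library's implementation: the function name `attention`, the matrix names `w_q`, `w_k`, `w_v`, and the sizes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(scores):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention(x_q, x_k, x_v, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (illustrative sketch).

    x_q, x_k, x_v: (seq_len, d_model) outputs from previous layers.
    w_q, w_k, w_v: (d_model, d_head) learned projection matrices.
    """
    q = x_q @ w_q                              # (n_q, d_head)
    k = x_k @ w_k                              # (n_k, d_head)
    v = x_v @ w_v                              # (n_k, d_head)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # pairwise similarities
    weights = softmax(scores)                  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 4, 5            # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

# Self-attention: the same x feeds all three projections.
out, weights = attention(x, x, x, w_q, w_k, w_v)
```

Here `d_head` is smaller than `d_model`, which is the dimension reduction mentioned above. Passing a second array in place of one of the `x` arguments would give a cross-attention style call.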
| profsummergig wrote:
| Do you think the dimension reduction is necessary? Or is it
| just practical (due to current hardware scarcity)?
| krat0sprakhar wrote:
| Do you have a tutorial that I can follow?
| roadside_picnic wrote:
| The most valuable tutorial will be translating from the
| paper itself. The more hand holding you have in the
| process, the less you'll be learning conceptually. The pure
| manipulation of matrices is rather boring and uninformative
| without some context.
|
| I also think the implementation is more helpful for
| understanding the engineering work to run these models than
| getting a deeper mathematical understanding of what the
| model is doing.
| jwitthuhn wrote:
| If you have 20 hours to spare I highly recommend this
| youtube playlist from Andrej Karpathy
| https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...
|
| It starts with the fundamentals of how backpropagation
| works then advances to building a few simple models and
| ends with building a GPT-2 clone. It won't teach you
| everything about AI models but it gives you a solid
| foundation for branching out.
| roadside_picnic wrote:
| I personally don't think implementation is as enlightening as
| far as really understanding _what_ the model is doing as this
| statement implies. I had done that many times, but it wasn't
| until reading about the relationship to kernel methods that
| it really _clicked_ for me what is really happening under the
| hood.
|
| Don't get me wrong, implementing attention is still great
| (and necessary), but even with something as simple as linear
| regression, implementing it doesn't really give you the
| entire conceptual model. I do think implementation helps to
| understand the _engineering_ of these models, but it still
| requires reflection and study to start to understand
| conceptually why they are working and what they're really
| doing (I would, of course, argue I'm _still_ learning about
| linear models in that regard!)
| leopd wrote:
| I think this video does a pretty good job explaining it,
| starting about 10:30 minutes in:
| https://www.youtube.com/watch?v=S27pHKBEp30
| oofbey wrote:
| As the first comment says "This aged like fine wine". Six
| years old, but the fundamentals haven't changed.
| andoando wrote:
| This wasn't any better than other explanations I've seen.
| throw310822 wrote:
| Have you tried asking e.g. Claude to explain it to you? None of
| the usual resources worked for me, until I had a discussion
| with Claude where I could ask questions about everything that I
| didn't get.
| bobbyschmidd wrote:
| tldr: recursively _aggregating packing/unpacking_ 'if else if
| (functions)/statements' as keyword arguments that (call)/take
| them themselves as arguments, with their own position shifting
| according to the number "(weights)" of else if
| (functions)/statements needed to get all the other arguments
| into (one of) THE adequate orders. the order changes based on
| the language, input prompt and context.
|
| if I understand it all correctly.
|
| implemented it in html a while ago and might do it in htmx
| sometime soon.
|
| transformers are just slutty dictionaries that Papa Roach and
| kage bunshin no jutsu right away again and again, spawning
| clones and variations based on requirements, which is why they
| tend to repeat themselves rather quickly and often. it's got
| almost nothing to do with languages themselves and requirements
| and weights amount to playbooks and DEFCON levels
| roadside_picnic wrote:
| It's just a re-invention of kernel smoothing. Cosma Shalizi has
| an _excellent_ write up on this [0].
|
| Once you recognize this it's a wonderful re-framing of what a
| transformer is doing under the hood: you're effectively
| learning a bunch of sophisticated kernels (through the FF part)
| and then applying kernel smoothing in different ways through
| the attention layers. It makes you realize that Transformers
| are philosophically much closer to things like Gaussian
| Processes (which are also just a bunch of kernel manipulation).
|
| 0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
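The kernel smoothing connection can be checked numerically. The sketch below, using only the standard library, computes a Nadaraya-Watson kernel-weighted average and the usual softmax attention for a single query and compares them. The exponentiated scaled dot product is an assumed kernel choice, and the keys, values and query are made-up numbers for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def exp_kernel(query, key):
    # Exponentiated scaled dot product: an assumed kernel choice that
    # makes the smoother coincide with softmax attention.
    return math.exp(dot(query, key) / math.sqrt(len(key)))

def kernel_smooth(query, keys, values, kernel=exp_kernel):
    """Nadaraya-Watson estimate: kernel-weighted average of the values."""
    weights = [kernel(query, k) for k in keys]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def softmax_attention(query, keys, values):
    """Attention for one query, written the usual softmax way."""
    scores = [dot(query, k) / math.sqrt(len(k)) for k in keys]
    top = max(scores)                       # stabilise the exponentials
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(dim)]

# Made-up data: three (key, value) pairs and one query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
query = [2.0, 0.5]

smoothed = kernel_smooth(query, keys, values)
attended = softmax_attention(query, keys, values)
```

The two results agree up to floating-point error, because softmax is exactly the normalisation step of the smoother when the kernel is an exponentiated dot product.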
| machinationu wrote:
| Q, K and V are a way of filtering the relevant aspects for the
| task at hand from the token embeddings.
|
| "he was red" - maybe color, maybe angry, the "red" token
| embedding carries both, but only one aspect is relevant for
| some particular prompt.
|
| https://ngrok.com/blog/prompt-caching/
| laser9 wrote:
| Here's the comment from the author himself (jayalammar) talking
| about other good resources on learning Transformers:
|
| https://news.ycombinator.com/item?id=35990118
| boltzmann_ wrote:
| Kudos also to Transformer Explainer team for putting some amazing
| visualizations https://poloclub.github.io/transformer-explainer/
| It really clicked for me after reading these two and watching
| 3blue1brown videos
| gzer0 wrote:
| This is hands down one of the best visualizations I have ever
| come across.
| gustavoaca1997 wrote:
| I have this book. Really a lifesaver in helping me catch up a
| few months ago when my team decided to use LLMs in our systems.
| qoez wrote:
| Don't really see why you'd need to understand how the
| transformer works to do LLMs at work. LLMs is just a synthetic
| human performing reasoning with some failure modes that in-depth
| knowledge of the transformer internals won't help you predict
| (you just have to use experience with the output to get a
| sense, or other people's experiments).
| roadside_picnic wrote:
| In my experience there is a substantial difference in the
| ability to really get performance in LLM related engineering
| work from people who really understand how LLMs work vs
| people who think it's a magic box.
|
| If your mental model of an LLM is:
|
| > a synthetic human performing reasoning
|
| You are _severely_ overestimating the capabilities of these
| models and not realizing potential areas of failure (even if
| your prompt works for now in the happy case). Understanding
| how transformers work absolutely can help debug problems (or
| avoid them in the first place). People without a deep
| understanding of LLMs also tend to get fooled by them more
| frequently. When you have internalized the fact that LLMs are
| literally _optimized_ to trick you, you tend to be much
| more skeptical of the initial results (which results in
| better eval suites etc).
|
| Then there's people who _actually do AI engineering_. If
| you're working with local/open weights models or on the
| inference end of things you can't just play around with an
| API, you have _a lot_ more control and observability into the
| model and should be making use of it.
|
| I still hold that the best test of an AI Engineer, at any
| level of the "AI" stack, is how well they understand
| speculative decoding. It involves understanding quite a bit
| about how LLMs work and can still be implemented on a cheap
| laptop.
| amelius wrote:
| But that AI engineer who is implementing speculative
| decoding is still just doing basic plumbing that has little
| to do with the actual reasoning. Yes, he/she might make the
| process faster, but they will know just as little about
| why/how the reasoning works as when they implemented a
| naive, slow version of the inference.
| roadside_picnic wrote:
| What "actual reasoning" are you referring to? I believe
| you're making my point for me.
|
| Speculative decoding requires the implementer to
| understand:
|
| - How the initial prompt is processed by the LLM
|
| - How to retrieve all the probabilities of previously
| observed tokens in the prompt (this also helps people
| understand things like the probability of the entire
| prompt itself, the entropy of the prompt etc).
|
| - Details of how the logits generate the distribution of
| next tokens
|
| - Precise details of the sampling process + the rejection
| sampling logic for comparing the two models
|
| - How each step of the LLM is run under-the-hood as the
| response is processed.
|
| Hardly just plumbing, especially since, to my knowledge,
| there are not a lot of hand-holding tutorials on this
| topic. You need to really internalize what's going on and
| how this is going to lead to a 2-5x speed up in
| inference.
|
| Building all of this yourself gives you a lot of
| visibility into how the model behaves and how "reasoning"
| emerges from the sampling process.
|
| edit: Anyone who can perform speculative decoding work
| _also_ has the ability to inspect the reasoning steps of
| an LLM and do experiments such as _rewinding_ the thought
| process of the LLM and substituting a reasoning step to
| see how it impacts the results. If you're just prompt
| hacking you're not going to be able to perform these
| types of experiments to understand _exactly_ how the
| model is reasoning and what's important to it.
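The rejection sampling logic mentioned in the list above can be sketched for a single proposed token. This is a minimal sketch of the standard accept/reject rule from the speculative sampling literature, not a full decoder: the two distributions are hypothetical stand-ins for the large model and the small draft model, and a real implementation would draft several tokens and score them in one forward pass of the large model.

```python
import random

def sample(dist, rng):
    """Draw a token index from a list of (relative) probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def speculative_step(target, draft, rng):
    """One accept/reject step; the returned token is distributed as `target`."""
    token = sample(draft, rng)                       # cheap model proposes
    accept_prob = min(1.0, target[token] / draft[token])
    if rng.random() < accept_prob:
        return token, True                           # proposal kept
    # Rejected: resample from the leftover mass max(0, target - draft).
    residual = [max(0.0, t - d) for t, d in zip(target, draft)]
    return sample(residual, rng), False

# Hypothetical next-token distributions over a 3-token vocabulary.
target = [0.6, 0.3, 0.1]   # large model
draft = [0.3, 0.5, 0.2]    # small draft model

rng = random.Random(0)
n = 20000
counts = [0, 0, 0]
accepted = 0
for _ in range(n):
    token, ok = speculative_step(target, draft, rng)
    counts[token] += 1
    accepted += ok

freqs = [c / n for c in counts]     # should approximate `target`
accept_rate = accepted / n          # should approximate sum(min(p, q))
```

The empirical frequencies track the large model's distribution even though every proposal came from the draft model, which is why the technique changes speed without changing the output distribution. For these made-up numbers the expected acceptance rate is 0.3 + 0.3 + 0.1 = 0.7.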
| amelius wrote:
| But I can make a similar argument about a simple
| multiplication:
|
| - You have to know how the inputs are processed.
|
| - You have to left-shift one of the operands by 0, 1, ...
| N-1 times.
|
| - Add those together, depending on the bits in the other
| operand.
|
| - Use an addition tree to make the whole process faster.
|
| Does not mean that knowing the above process gives you a
| good insight in the concept of A*B and all the related
| math and certainly will not make you better at calculus.
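For reference, the shift-and-add procedure in the list above looks roughly like this in Python. It is a sketch for non-negative integers only; the function name is made up, and the hardware addition tree is replaced by a plain sequential accumulation.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers using only shifts and adds."""
    if a < 0 or b < 0:
        raise ValueError("sketch handles non-negative integers only")
    product = 0
    shift = 0
    while b >> shift:                 # stop once no set bits remain in b
        if (b >> shift) & 1:          # bit `shift` of b is set
            product += a << shift     # add a copy of a shifted left
        shift += 1
    return product

result = shift_add_multiply(13, 11)
```

Each set bit of `b` contributes one shifted copy of `a`, so 13 * 11 becomes 13 + 26 + 104.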
| machinationu wrote:
| speculative decoding is 1+1
|
| transformer attention is integrals
| Koshkin wrote:
| > _is just a synthetic human performing reasoning_
|
| The future is now! (Not because of "a synthetic human" per se
| but because of people thinking of them as something
| unremarkable.)
| bonesss wrote:
| > LLMs is just a synthetic human
|
| 1) 'human' encompasses behaviours that include revenge
| cannibalism and recurrent sexual violence --- wish carefully.
|
| 2) not even a little bit, and if you want to pretend then
| pretend they're a deranged delusional psych patient who will
| look you in the eye and say genuinely " _oops, I guess I was
| lying, it won't ever happen again_ " and then lie to you
| again, while making sure it happens again.
|
| 3) don't anthropomorphize LLMs, they don't like it.
| Koshkin wrote:
| (Going on a tangent.) The number of transformer
| explanations/tutorials is becoming overwhelming. Reminds me of
| monads (or maybe calculus). Someone feels a spark of
| enlightenment at some point (while, often, in fact, remaining
| deeply confused), and an urge to share their newly acquired
| (mis)understanding with a wide audience.
| nospice wrote:
| So?
|
| There's no rule that the internet is limited to a single
| explanation. Find the one that clicks for you, ignore the rest.
| Whenever I'm trying to learn about concepts in mathematics,
| computer science, physics, or electronics, I often find that
| the first or the "canonical" explanation is hard for me to
| parse. I'm thankful for having options 2 through 10.
| kadushka wrote:
| Maybe so, but this particular blog post was the first and is
| still the best explanation of how transformers work.
| ActorNightly wrote:
| People need to get away from this idea of Key/Query/Value as
| being special.
|
| Whereas a standard deep layer in a network is matrix * input,
| where each row of the matrix is the weights of the particular
| neuron in the next layer, a transformer is basically input*
| MatrixA, input*MatrixB, input*MatrixC (where vector*matrix is a
| matrix), then the output is C*MatrixA*MatrixB*MatrixC. Just
| simply more dimensions in a layer.
|
| And consequently, you can represent the entire transformer
| architecture with a set of deep layers as you unroll the
| matrices, with a lot of zeros for the multiplication pieces that
| are not needed.
|
| This is a fairly complex blog but it shows that it's just all
| matrix multiplication all the way down.
| https://pytorch.org/blog/inside-the-matrix/.
| throw310822 wrote:
| I might be completely off road, but I can't help thinking of
| convolutions as my mental model for the K Q V mechanism.
| Attention has the same property of a convolution kernel of
| being trained independently of position; it learns how to
| translate a large, rolling portion of an input to a new
| "digested" value; and you can train multiple ones in parallel
| so that they learn to focus on different aspects of the input
| ("kernels" in the case of convolution, "heads" in the case of
| attention).
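The "trained independently of position" property can be illustrated with a small NumPy check. The sketch assumes single-head self-attention with no positional encoding and no causal mask; the sizes and random weights are made up. Shuffling the input rows shuffles the output rows in the same way, because the same projection matrices are applied at every position.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention without positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                      # 6 positions, d_model = 8
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))

perm = rng.permutation(6)                        # a random reordering
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Reordering the inputs only reorders the outputs.
same_up_to_order = np.allclose(out[perm], out_shuffled)
```

One place the analogy loosens: a convolution kernel still sees the relative offsets inside its window, while attention in this form sees no order at all, which is why transformers add positional information separately.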
| zkmon wrote:
| I think the internals of transformers will become less relevant,
| like the internals of compilers, as programmers will only care about
| how to "use" them instead of how to develop them.
| rvz wrote:
| Their internals are just as relevant (now even more relevant)
| as any other technology as they always need to be improved to
| the SOTA (state of the art) meaning that someone _has_ to
| understand their internals.
|
| It also means more jobs for the people who understand them at a
| deeper level to advance the SOTA of specific widely used
| technologies such as operating systems, compilers, neural
| network architectures and hardware such as GPUs or TPU chips.
|
| Someone has to maintain and improve them.
| esafak wrote:
| Practitioners already do not need to know about it. I bet most
| don't know the fundamentals of machine learning. Hands up if
| you know bias from variance...
| libraryofbabel wrote:
| I read this article back when I was learning the basics of
| transformers; the visualizations were really helpful. Although in
| retrospect knowing how a transformer works wasn't very useful at
| all in my day job _applying_ LLMs, except as a sort of deep
| background for reassurance that I had some idea of how the big
| black box producing the tokens was put together, and to give me
| the mathematical basis for things like context size limitations
| etc.
|
| I would strongly caution anyone who thinks that they will be able
| to understand or explain LLM _behavior_ better by studying the
| architecture closely. That is a trap. Big SotA models these days
| exhibit so many nontrivial emergent phenomena (in part due to the
| massive application of reinforcement learning techniques) that
| give them capabilities very few people expected to _ever_ see
| when this architecture first arrived. Most of us confidently
| claimed even back in 2023 that, based on LLM architecture and
| training algorithms, LLMs would never be able to perform well on
| novel coding or mathematics tasks. We were wrong. That points
| towards some caution and humility about using network
| architecture alone to reason about how LLMs work and what they
| can do. You'd really need to be able to poke at the weights
| inside a big SotA model to even begin to answer those kinds of
| questions, but unfortunately that's only really possible if
| you're a "mechanistic interpretability" researcher at one of the
| major labs.
|
| Regardless, this is a nice article, and this stuff is worth
| learning because it's interesting for its own sake! Right now I'm
| actually spending some vacation time implementing a transformer
| in PyTorch just to refresh my memory of it all. It's a lot of
| fun! If anyone else wants to get started with that I would highly
| recommend Sebastian Raschka's book and youtube videos as a way into
| the subject: https://github.com/rasbt/LLMs-from-scratch .
|
| Has anyone read TFA author Jay Alammar's book (published Oct
| 2024) and would they recommend it for a more up-to-date picture?
| nrhrjrjrjtntbt wrote:
| It is almost like understanding wood at a molecular level and
| being a carpenter. It also may help the carpentry, but you can
| be a great one without it. And a bad one with the knowledge.
| ozgung wrote:
| I think the biggest problem is that most tutorials use words to
| illustrate how the attention mechanism works. In reality, there
| are no word-associated tokens inside a Transformer. Tokens !=
| word parts. An LLM does not perform language processing inside
| the Transformer blocks, and a Vision Transformer does not
| perform image processing. Words and pixels are only relevant at
| the input. I think this misunderstanding was a root cause of
| underestimating their capabilities.
| miki123211 wrote:
| > Most of us confidently claimed even back in 2023 that, based
| on LLM architecture and training algorithms, LLMs would never
| be able to perform well on novel coding or mathematics tasks.
|
| I feel like there are three groups of people:
|
| 1. Those who think that LLMs are stupid slop-generating
| machines which couldn't ever possibly be of any use to anybody,
| because there's some problem that is simple for humans but hard
| for LLMs, which makes them unintelligent by definition.
|
| 2. Those who think we have already achieved AGI and don't need
| human programmers any more.
|
| 3. Those who believe LLMs will destroy the world in the next 5
| years.
|
| I feel like the composition of these three groups is pretty
| much constant since the release of ChatGPT, and like with most
| political fights, evidence doesn't convince people either way.
| libraryofbabel wrote:
| Those three positions are all extreme viewpoints. There are
| certainly people who hold them, and they tend to be loud and
| confident and have an outsize presence in HN and other places
| online.
|
| But a lot of us have a more nuanced take! It's perfectly
| possible to believe simultaneously that 1) LLMs are more than
| stochastic parrots 2) LLMs are useful for software
| development 3) LLMs have all sorts of limitations and risks
| (you _can_ produce slop with them, and many people will,
| there are security issues, I can go on and on...) 4) We're
| not getting AGI or world-destroying super-intelligence
| anytime soon, if ever 5) We're in a bubble and it's going to
| pop and cause a big mess 6) This tech is still going to be
| transformative long term, on a similar level to the web and
| smartphones.
|
| Don't let the noise from the extreme people who formed their
| opinions back when ChatGPT came out drown out serious
| discussion! A lot of us try and walk a middle course with
| this and have been and still are open to changing our minds.
___________________________________________________________________
(page generated 2025-12-22 23:00 UTC)