[HN Gopher] Writing an LLM from scratch, part 13 - attention hea...
___________________________________________________________________
Writing an LLM from scratch, part 13 - attention heads are dumb
Author : gpjt
Score : 175 points
Date : 2025-05-08 21:06 UTC (3 days ago)
(HTM) web link (www.gilesthomas.com)
(TXT) w3m dump (www.gilesthomas.com)
| theyinwhy wrote:
| There are multiple books about this topic now. What are your
| takes on the alternatives? Why did you choose this one?
| Appreciate your thoughts!
| andrehacker wrote:
| It is regarded by many as "the best" book on the topic. Like
| Giles Thomas, I found that the book focuses on the details of
| how to write the lower-level code without providing the big
| picture.
|
| I am personally not very interested in that as these details
| are likely to change rather quickly while the principles of
| LLMs and transformers will probably remain relevant for many
| years.
|
| I have looked for, but failed to find, a good resource that
| approaches it the way 3blue1brown [1] explains it but then goes
| deeper from there.
|
| The blog series from Giles seems to take the book and add the
| background to the details.
|
| [1] https://m.youtube.com/watch?v=wjZofJX0v4M
| andrehacker wrote:
| This looks very interesting. The easiest way to navigate to the
| start of this series of articles seems to be
| https://www.gilesthomas.com/til-deep-dives/page/2
|
| Now if I only could find some time...
| sitkack wrote:
| https://news.ycombinator.com/from?site=gilesthomas.com
| Tokumei-no-hito wrote:
| Maybe it renders differently on mobile, but this was the first
| entry for me. You can use the nav at the end to continue to the
| next part.
|
| https://www.gilesthomas.com/2024/12/llm-from-scratch-1
| badsectoracula wrote:
| Too bad the book seems to be using Python _and_ some external
| library like tiktoken just from chapter 2, meaning that it'll
| basically stop working next week or so, like everything Python,
| making the whole thing much harder to follow in the future.
|
| Meanwhile i learned the basics of machine learning and (mainly)
| neural networks from a book written in 1997[0] - which i read
| last year[1]. It barely had any code and that code was written
| in C, meaning it'd still more or less work (though i didn't have
| to try it since the book descriptions were fine on their own).
|
| Now, Python was supposedly designed to look kinda like
| pseudocode, so using it for a book could be fine, but at least
| it should avoid relying on external libraries that do not come
| with the language itself - and preferably stick to stuff that
| has equivalents in other languages too.
|
| [0] https://www.cs.cmu.edu/~tom/mlbook.html
|
| [1] which is why i make this comment (and to address the
| apparent downvotes): even if i get the book now i might end up
| reading it in 3-4 years. Stuff not working will be a major
| obstacle. If the book is good, it might end up being recommended
| by people 2-3 years from now and some people may end up getting
| it and/or reading it even later in time. So it is important for
| the book to be self-contained, at least when it comes to books
| that try to teach the theory/ideas behind things.
| andrehacker wrote:
| Myeah, C and C++ have the advantage that the compilers
| support compiling for old versions of the language. The
| languages are in much flux, partly because of security
| problems, partly because features are added from other
| languages. That means that linking to external libraries
| while using the older language version will fail unless you
| keep the old version around, simply because the maintainer
| of the external library DID upgrade.
|
| Python is not popular in ML because it is a great language
| but because of the ecosystem: numpy, pandas, pytorch and
| everything built on those allows you to do the higher level
| ML coding without having to reinvent efficient matrix
| operations for a given hardware infrastructure.
| badsectoracula wrote:
| (i assume with "The languages are in much flux" you meant
| python and not c/c++ because these aren't in flux)
|
| Yeah i get why Python is currently used[0] and for a
| theory-focused book Python would still work to outline the
| algorithms - worst case you boot up an old version of
| Python in Docker or a VM, but it'd still require using only
| what is available out of the box in Python. And depending
| on how the book is written, it may not even be necessary.
|
| That said there are other alternatives nowadays and when
| trying to learn the theory you may not need to use the most
| efficient stuff. Using C, C++, Go, Java, C# or whatever
| other language with a decent backwards compatibility track
| record (so that it can work in 5-10 years) should be
| possible, and all of these should have some small (if not
| necessarily uber-efficient) library for the calculations you
| may want to do that you can distribute alongside the book
| for those who want to try the code out.
|
| [0] even if i wish people would stick to using it only for
| the testing/experimentation phase and move to something
| more robust and future-proof for stuff meant to be used by
| others
| andrehacker wrote:
| "The languages are in much flux" you meant python and not
| c/c++ because these aren't in flux
|
| No I meant C++.
|
| 2011  ISO/IEC 14882:2011  (C++11)
|
| 2014  ISO/IEC 14882:2014  (C++14)
|
| 2017  ISO/IEC 14882:2017  (C++17)
|
| 2020  ISO/IEC 14882:2020  (C++20)
|
| 2024  ISO/IEC 14882:2024  (C++23)
|
| That is 4 major language changes in 10 years.
|
| As a S/W manager in an enterprise context who has to
| coordinate upgrades of multi-million-LOC codebases for
| mandated security compliance, I can say C++ is not a silver
| bullet for the version problem that exists in every
| ecosystem.
|
| As said, the compilers/linkers allow you to run in
| compatibility mode, so as long as you don't care about the
| new features (and the company you work for doesn't), then,
| yes, C/C++ is easier for managing legacy code.
| YZF wrote:
| These are new features. Many of them are part of the
| library, not the language. Generally speaking, you enable
| the new features in your compiler; you don't need to
| disable anything to compile old code. It's not a problem
| to work on legacy code and use new features for new code
| either.
| vlovich123 wrote:
| > That means that linking to external libraries using the
| older language version will fail unless you keep the old
| version around simply because the maintainer of the
| external library DID upgrade.
|
| This just isn't true. The C ABI has not seen any change with
| the updated standards, and while C++ doesn't have a stable
| ABI boundary, you shouldn't have any problem calling older
| binary interfaces from new code (or new binary interfaces
| from old code, provided you're not using some new types that
| just aren't available). That's because the standard library
| authors themselves do strive to guarantee ABI compatibility
| (or at least libc++ and libstdc++ - I'm not as confident
| about MSVC but I have to believe this is generally true
| there too). Indeed, the last ABI breakage in C++ was on
| Linux, with C++11 some 15 years ago, because of changes to
| std::string.
| og_kalu wrote:
| >Python is not popular in ML because it is a great language
| but because of the ecosystem: numpy, pandas, pytorch and
| everything built on those allows you to do the higher level
| ML coding without having to reinvent efficient matrix
| operations for a given hardware infrastructure.
|
| Ecosystems don't poof into existence. There are reasons
| people chose to write those libraries for Python in the
| first place, sometimes partly or wholly in other languages.
|
| It's not like Python was older than, or a more prominent
| language than, say, C when those libraries began.
| y42 wrote:
| Not sure if rage bait or serious, but: have you ever heard of
| conda or virtual environments?
| tonyarkles wrote:
| Those are decent options but you can still run into really
| ugly issues if you try to go back too far in time. An
| example I ran into in the last year or two was a Python
| library that linked against the system OpenSSL. A chain of
| dependencies ultimately required a super old version of
| this library and it failed to compile against the current
| system OpenSSL. Had to use virtualenv inside a Docker
| container that was based on either Ubuntu 18.04 or 20.04 to
| get it all to work.
| johnmaguire wrote:
| Wouldn't this be an issue with C too? Or anything that
| links against an external library?
| logicallee wrote:
| If you are interested in this sort of thing, you might want to
| take a look at a very simple neural network with two attention
| heads that runs right in the browser in pure JavaScript; you can
| view source on this implementation:
|
| https://taonexus.com/mini-transformer-in-js.html
|
| Even after training for a hundred epochs it really doesn't work
| very well (you can test it in the Inference tab after training
| it), but it doesn't use any libraries, so you can see the math
| itself in action in the source code.
| quantadev wrote:
| Regarding this statement about semantic space:
|
| > so long as vectors are roughly the same length, the dot product
| is an indication of how similar they are.
|
| This potential length difference is the reason "Cosine
| Similarity" is used instead of dot products for concept
| comparisons. Cosine similarity is like a 'scale-independent dot
| product', which represents a concept of similarity, independent
| of "signal strength".
|
| However, if two vectors point in the same direction, but one is
| 'longer' (higher magnitude) than the other, then what that
| indicates "semantically" is that the longer vector is indeed a
| "stronger signal" of the same concept. So if "happy" has a vector
| direction then "very happy" should be longer vector but in the
| same direction.
|
| Makes me wonder if there's a way to impose a "corrective" force
| on the evolution of model weights during training, so that words
| like "more" prefixed in front of a string can be guaranteed to
| encode as a vector multiple of said string? Not sure how that
| would work with back-propagation, but applying certain
| common-sense knowledge about how the semantic space structures
| "must be" shaped could potentially be the next frontier of LLM
| development beyond transformers (and by transformers I really
| mean the attention-head specialization).
| bornfreddy wrote:
| Off topic rant: I hate blog posts which quote the author's
| earlier posts. They should just reiterate if it is important or
| use a link if not. Otherwise it feels like they want to fill some
| space without any extra work. The old posts are not that
| groundbreaking, I assure you. /rant
| crystal_revenge wrote:
| The most clarifying post I've read on attention is from Cosma
| Shalizi[0], who points out that "Attention" is quite literally
| just a re-discovery/re-invention of kernel smoothing. Probably
| less helpful if you don't come from a quantitative background,
| but if you do, it's shockingly clarifying.
|
| Once you realize this "Multi-headed Attention" is just kernel
| smoothing with more kernels and doing some linear transformation
| on the results of these (in practice: average or add)!
|
| 0. http://bactra.org/notebooks/nn-attention-and-
| transformers.ht...
| FreakLegion wrote:
| It's a useful realization, too, since ways of approximating
| kernel functions are already well-studied. Google themselves
| have been publishing in this area for years, e.g.
| https://research.google/blog/rethinking-attention-with-perfo...
|
| _> To resolve these issues, we introduce the Performer, a
| Transformer architecture with attention mechanisms that scale
| linearly, thus enabling faster training while allowing the
| model to process longer lengths, as required for certain image
| datasets such as ImageNet64 and text datasets such as PG-19.
| The Performer uses an efficient (linear) generalized attention
| framework, which allows a broad class of attention mechanisms
| based on different similarity measures (kernels). The framework
| is implemented by our novel Fast Attention Via Positive
| Orthogonal Random Features (FAVOR+) algorithm, which provides
| scalable low-variance and unbiased estimation of attention
| mechanisms that can be expressed by random feature map
| decompositions (in particular, regular softmax-attention). We
| obtain strong accuracy guarantees for this method while
| preserving linear space and time complexity, which can also be
| applied to standalone softmax operations._
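|
| A rough sketch of the positive-random-feature idea behind that
| kind of linear attention (not the full FAVOR+ algorithm, which
| also uses orthogonal features and further refinements; shapes
| and numbers here are made up):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n, d, m = 6, 4, 256   # tokens, head dim, random features
|
|     # fold the 1/sqrt(d) scaling into Q and K up front
|     Q = rng.normal(size=(n, d)) / d**0.25
|     K = rng.normal(size=(n, d)) / d**0.25
|     V = rng.normal(size=(n, d))
|
|     def phi(X, W):
|         # positive features: exp(Wx - ||x||^2/2) / sqrt(m), so
|         # phi(q) . phi(k) is an unbiased estimate of exp(q . k)
|         sq = 0.5 * (X ** 2).sum(axis=-1, keepdims=True)
|         return np.exp(X @ W.T - sq) / np.sqrt(m)
|
|     W = rng.normal(size=(m, d))      # shared random projections
|     Qf, Kf = phi(Q, W), phi(K, W)    # shapes (n, m)
|
|     # linear-time attention: never build the n x n matrix
|     approx = (Qf @ (Kf.T @ V)) / (Qf @ Kf.sum(axis=0))[:, None]
|
|     # exact softmax attention, for comparison
|     A = np.exp(Q @ K.T)
|     exact = (A / A.sum(axis=-1, keepdims=True)) @ V
|     # approx should be close to exact, and closer as m grows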
| thomasahle wrote:
| For those who don't know the term "kernel smoothing", it just
| means sum_i y_i * K(x, x_i) / (sum_i K(x, x_i))
|
| In regular attention, we let K(x, x_i) = exp(<x, x_i>).
|
| Note that in Attention we use K(q, k_i) where the q (query)
| and k (key) vectors are not the same.
|
| Unless you define K(x, x_i) = exp(<W_q x, W_k x_i>) as you do
| in self-attention.
|
| There are also some attention mechanisms that don't use the
| normalization term, (sum_i K(x, x_i)), but most do.
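|
| A tiny sketch of that view in NumPy (toy shapes and random
| numbers; self-attention with K(q, k) = exp(<q, k>/sqrt(d)) as
| the kernel):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n, d = 5, 4                      # 5 tokens, dimension 4
|     X = rng.normal(size=(n, d))      # token embeddings
|     W_q = rng.normal(size=(d, d))
|     W_k = rng.normal(size=(d, d))
|     W_v = rng.normal(size=(d, d))
|
|     Q, K, V = X @ W_q, X @ W_k, X @ W_v
|
|     # kernel weights: K(q_i, k_j) = exp(<q_i, k_j> / sqrt(d))
|     weights = np.exp(Q @ K.T / np.sqrt(d))
|     weights /= weights.sum(axis=1, keepdims=True)  # normalization
|
|     # each output row is a kernel-smoothed (weighted) average of
|     # the value vectors -- i.e. softmax(Q K^T / sqrt(d)) V
|     out = weights @ V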
| esafak wrote:
| In kernel methods the kernel is typically given, and things
| like positional embeddings, layer normalization, causal
| masking, and so on are missing. Kernel methods did not take off
| partly due to their computational complexity (quadratic in
| sample size), and transforms did precisely because they were
| parallelizable, and thus computationally efficient, compared
| with the RNNs and LSTMs that came before them.
|
| Reductions of one architecture to another are usually more
| enlightening from a theoretical perspective than a practical
| one.
| 3abiton wrote:
| Wow, thanks for referencing that. What a very detailed and long
| read!
| westoque wrote:
| Must be my ignorance, but every time I see explainers for LLMs
| similar to the post, it's hard to believe that AGI is upon us. It
| just doesn't feel that "intelligent", but again, that might just
| be my ignorance.
| jlawson wrote:
| Neurons are pretty simple too.
|
| Any arbitrarily complex system must be made of simpler
| components, recursively down to arbitrary levels of simplicity.
| If you zoom in enough everything is dumb.
| voidspark wrote:
| Neurons are surprisingly not simple. Vastly more complex than
| the ultra simplified model in artificial neural networks.
| throwawaymaths wrote:
| Eh, transformers are universal differentiable layered hash
| tables. That's incredibly powerful. Most logic is just pulling
| symbols and matching structures with "hash"es.
|
| If intelligence is just reasonable manipulation of logic, it's
| unsurprising that an LLM could be intelligent. What _maybe_ is
| surprising is that we have ~intelligence without going up a few
| more orders of magnitude in size; what's possibly more
| surprising is that training it on the internet got it doing the
| things it's doing.
| Lerc wrote:
| I think there are two layers of the 'why' in machine learning.
|
| When you look at a model architecture it is described as a series
| of operations that produces the result.
|
| There is a lower-level why, which, while far from easy to
| show, describes why it is that these algorithms produce the
| required result. You can show why it's a good idea to use cosine
| similarity, why cross entropy was chosen to express the
| measurement. In Transformers you can show that the Q and K
| matrices transform the embeddings into spaces that allow
| different things to be closer, and using that control over the
| proportion of closeness allows you to make distinctions. This
| form of why is the explanation usually given in papers. It is
| possible to methodically show you will get the benefits described
| from the techniques proposed.
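|
| (A tiny illustration of that point, with hand-picked, purely
| hypothetical numbers: two embeddings that a raw dot product
| treats as unrelated can be made to match once they are projected
| by W_q and W_k.)
|
|     import numpy as np
|
|     # two made-up embeddings, orthogonal in embedding space
|     a = np.array([1.0, 0.0, 0.0, 0.0])
|     b = np.array([0.0, 1.0, 0.0, 0.0])
|     print(a @ b)   # 0.0 -- "unrelated" under a raw dot product
|
|     # hand-picked projections into a 2-d query/key space that
|     # send both vectors to the same direction
|     W_q = np.array([[1.0, 0.0],
|                     [0.0, 0.0],
|                     [0.0, 0.0],
|                     [0.0, 0.0]])
|     W_k = np.array([[0.0, 0.0],
|                     [1.0, 0.0],
|                     [0.0, 0.0],
|                     [0.0, 0.0]])
|
|     q, k = a @ W_q, b @ W_k
|     print(q @ k)   # 1.0 -- "close" in the projected space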
|
| The greater Why is much, much harder: harder to identify and
| harder to prove. The first why can tell you that something works,
| but it can't really tell you why it works in a way that can
| inform other techniques.
|
| In the Transformer, the intuition is that the 'Why' is something
| along the lines of: the Q transforms embeddings into an encoding
| of what information is needed in the embedding to resolve
| confusion, and the K transforms embeddings into information
| to impart. When there's a match between 'what I want to know
| about' and 'what I know about', the V can be used as 'the things
| I know' to accumulate the information where it needs to be.
|
| It's easy to see why this is the hard form. Once you get into the
| higher semantic descriptions of what is happening, it is much
| harder to prove that this is actually what is happening, or that
| it gives the benefits you think it might. Maybe Transformers
| don't work like that. Sometimes semantic relationships appear to
| be present in a process when there is an unobserved quirk of the
| mathematics that makes the result coincidentally the same.
|
| In a way I think of the maths of it as picking up a
| many-dimensional object in each hand and magically rotating and
| (linearly) squishing them differently until they look aligned
| enough to see the relationship I'm looking at and poking those
| bits towards each other. I can't really think about that and the
| semantic "what things want to know about" at the same time, even
| though they are conceptualisations of the same operation.
|
| The advantage of the lower why is that you can show that it
| works. The advantage of the upper why is that it can enable you
| to consider other mechanisms that might do the same function.
| They may be mathematically different but achieve the goal.
|
| To take a much simpler example from computer graphics: there are
| many ways to draw a circle with simple loops processing
| mathematically provable descriptions of a circle. The Bresenham
| circle-drawing algorithm does so with a why that shows why it
| makes a circle, but the "why do it that way" was informed by a
| greater understanding of what the task being performed was.
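|
| (For reference, a minimal sketch of the integer midpoint /
| Bresenham-style circle algorithm in Python -- the lower why is
| the decision variable update; the greater why is knowing you
| only need one octant and integer arithmetic.)
|
|     def circle_points(cx, cy, r):
|         pts = []
|         x, y = 0, r
|         d = 1 - r                # integer decision variable
|         while x <= y:
|             # mirror one octant into all eight
|             for px, py in [(x, y), (y, x), (-x, y), (-y, x),
|                            (x, -y), (y, -x), (-x, -y), (-y, -x)]:
|                 pts.append((cx + px, cy + py))
|             x += 1
|             if d < 0:
|                 d += 2 * x + 1
|             else:
|                 y -= 1
|                 d += 2 * (x - y) + 1
|         return pts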
___________________________________________________________________
(page generated 2025-05-11 23:00 UTC)