[HN Gopher] Writing an LLM from scratch, part 13 - attention hea...
       ___________________________________________________________________
        
       Writing an LLM from scratch, part 13 - attention heads are dumb
        
       Author : gpjt
       Score  : 175 points
       Date   : 2025-05-08 21:06 UTC (3 days ago)
        
 (HTM) web link (www.gilesthomas.com)
 (TXT) w3m dump (www.gilesthomas.com)
        
       | theyinwhy wrote:
       | There are multiple books about this topic now. What are your
       | takes on the alternatives? Why did you choose this one?
       | Appreciate your thoughts!
        
         | andrehacker wrote:
          | It is regarded by many as "the best" book on the topic. Like
          | Giles Thomas, though, I found that the book focuses on the
          | details and on how to write the lower-level code without
          | providing the big picture.
          | 
          | I am personally not very interested in that, as these details
          | are likely to change rather quickly while the principles of
          | LLMs and transformers will probably remain relevant for many
          | years.
          | 
          | I have been looking for, but failed to find, a good resource
          | that approaches it the way 3blue1brown [1] explains it but
          | then goes deeper from there.
          | 
          | The blog series from Giles seems to take the book and add the
          | background to the details.
         | 
         | [1] https://m.youtube.com/watch?v=wjZofJX0v4M
        
       | andrehacker wrote:
       | This looks very interesting. The easiest way to navigate to the
       | start of this series of articles seems to be
       | https://www.gilesthomas.com/til-deep-dives/page/2
       | 
       | Now if I only could find some time...
        
         | sitkack wrote:
         | https://news.ycombinator.com/from?site=gilesthomas.com
        
         | Tokumei-no-hito wrote:
         | maybe it renders differently on mobile but this was the first
         | entry for me. you can use the nav at the end to continue to the
         | next part
         | 
         | https://www.gilesthomas.com/2024/12/llm-from-scratch-1
        
         | badsectoracula wrote:
          | Too bad the book seems to be using Python _and_ some external
          | library like tiktoken just from chapter 2, meaning that it'll
          | basically stop working next week or so, like everything
          | Python, making the whole thing much harder to follow in the
          | future.
         | 
          | Meanwhile i learned the basics of machine learning and
          | (mainly) neural networks from a book written in 1997[0] -
          | which i read last year[1]. It barely had any code, and the
          | code it did have was written in C, meaning it'd still more or
          | less work (though i didn't have to try it since the book's
          | descriptions were fine on their own).
         | 
          | Now, Python was supposedly designed to look kinda like
          | pseudocode, so using it for a book could be fine, but at
          | least it should avoid relying on external libraries that do
          | not come with the language itself - and preferably stick to
          | stuff that has equivalents in other languages too.
         | 
         | [0] https://www.cs.cmu.edu/~tom/mlbook.html
         | 
          | [1] which is why i'm making this comment (and to address the
          | apparent downvotes): even if i get the book now i might end
          | up reading it in 3-4 years. Stuff not working will be a major
          | obstacle. If the book is good, it might end up being
          | recommended by people 2-3 years from now, and some people may
          | end up getting it and/or reading it even later. So it is
          | important for the book to be self-contained, at least when it
          | comes to books that try to teach the theory/ideas behind
          | things.
        
           | andrehacker wrote:
            | Myeah, C and C++ have the advantage that the compilers
            | support compiling for old versions of the language. The
            | languages are in much flux, partly because of security
            | problems, partly because features are added from other
            | languages. That means that linking to external libraries
            | using the older language version will fail unless you keep
            | the old version around, simply because the maintainer of
            | the external library DID upgrade.
            | 
            | Python is not popular in ML because it is a great language
            | but because of the ecosystem: numpy, pandas, pytorch and
            | everything built on those allows you to do the higher-level
            | ML coding without having to reinvent efficient matrix
            | operations for a given hardware infrastructure.
        
             | badsectoracula wrote:
             | (i assume with "The languages are in much flux" you meant
             | python and not c/c++ because these aren't in flux)
             | 
             | Yeah i get why Python is currently used[0] and for a
             | theory-focused book Python would still work to outline the
             | algorithms - worst case you boot up an old version of
             | Python in Docker or a VM, but it'd still require using only
             | what is available out of the box in Python. And depending
             | on how the book is written, it may not even be necessary.
             | 
              | That said, there are other alternatives nowadays, and
              | when trying to learn the theory you may not need the most
              | efficient stuff. Using C, C++, Go, Java, C# or whatever
              | other language with a decent backwards-compatibility
              | track record (so that it can still work in 5-10 years)
              | should be possible, and all of these should have some
              | small (if not necessarily uber-efficient) library for the
              | calculations you may want to do, which could be
              | distributed alongside the book for those who want to try
              | the code out.
             | 
              | [0] even if i wish people would stick to using it only
              | for the testing/experimentation phase and move to
              | something more robust and future-proof for stuff meant to
              | be used by others
        
               | andrehacker wrote:
               | "The languages are in much flux" you meant python and not
               | c/c++ because these aren't in flux
               | 
               | No I meant C++.
               | 
                | 2011  ISO/IEC 14882:2011  C++11
                | 2014  ISO/IEC 14882:2014  C++14
                | 2017  ISO/IEC 14882:2017  C++17
                | 2020  ISO/IEC 14882:2020  C++20
                | 2024  ISO/IEC 14882:2024  C++23
                | 
                | That is four major language revisions in ten years.
               | 
                | As a S/W manager in an enterprise context who has had
                | to coordinate upgrades of multi-million-LOC codebases
                | for mandated security compliance, I can say C++ is not
                | the silver bullet for the version problem that exists
                | in every ecosystem.
                | 
                | As said, the compilers/linkers allow you to run in
                | compatibility mode, so as long as you don't care about
                | the new features (and the company you work for
                | doesn't), then yes, C/C++ is easier for managing legacy
                | code.
        
               | YZF wrote:
                | These are new features. Many of them are part of the
                | library, not the language. Generally speaking, you
                | enable the new features in your compiler; you don't
                | need to disable them to compile old code. It's not a
                | problem to work on legacy code and use new features for
                | new code either.
        
             | vlovich123 wrote:
             | > That means that linking to external libraries using the
             | older language version will fail unless you keep the old
             | version around simply because the maintainer of the
             | external library DID upgrade.
             | 
              | This just isn't true. The C ABI has not seen any change
              | across the updated standards, and while C++ doesn't have
              | a stable ABI boundary, you shouldn't have any problem
              | calling older binary interfaces from new code (or new
              | binary interfaces from old code, provided you're not
              | using some new types that just aren't available). That's
              | because the standard library authors themselves strive
              | to guarantee ABI compatibility (or at least libc++ and
              | libstdc++ do - I'm not as confident about MSVC, but I
              | have to believe this is generally true there too).
              | Indeed, the last ABI breakage in C++ was on Linux with
              | C++11, 15 years ago, because of changes to std::string.
        
             | og_kalu wrote:
              | > Python is not popular in ML because it is a great
              | language but because of the ecosystem: numpy, pandas,
              | pytorch and everything built on those allows you to do
              | the higher-level ML coding without having to reinvent
              | efficient matrix operations for a given hardware
              | infrastructure.
              | 
              | Ecosystems don't poof into existence. There are reasons
              | people chose to write those libraries - sometimes partly
              | or wholly in other languages - for python in the first
              | place.
              | 
              | It's not like python was older or more prominent than,
              | say, C when those libraries began.
        
           | y42 wrote:
            | not sure if rage bait or serious, but: have you ever heard
            | of conda or virtual environments?
        
             | tonyarkles wrote:
             | Those are decent options but you can still run into really
             | ugly issues if you try to go back too far in time. An
             | example I ran into in the last year or two was a Python
             | library that linked against the system OpenSSL. A chain of
             | dependencies ultimately required a super old version of
             | this library and it failed to compile against the current
             | system OpenSSL. Had to use virtualenv inside a Docker
             | container that was based on either Ubuntu 18.04 or 20.04 to
             | get it all to work.
        
               | johnmaguire wrote:
               | Wouldn't this be an issue with C too? Or anything that
               | links against an external library?
        
       | logicallee wrote:
        | If you are interested in this sort of thing, you might want to
        | take a look at a very simple neural network with two attention
        | heads that runs right in the browser in pure Javascript; you
        | can view source on this implementation:
       | 
       | https://taonexus.com/mini-transformer-in-js.html
       | 
       | Even after training for a hundred epochs it really doesn't work
       | very well (you can test it in the Inference tab after training
       | it), but it doesn't use any libraries, so you can see the math
       | itself in action in the source code.
        
       | quantadev wrote:
       | Regarding this statement about semantic space:
       | 
       | > so long as vectors are roughly the same length, the dot product
       | is an indication of how similar they are.
       | 
        | This potential length difference is the reason "Cosine
        | Similarity" is used instead of raw dot products for concept
        | comparisons. Cosine similarity is like a 'scale-independent
        | dot product': it represents similarity of direction,
        | independent of "signal strength".
       | 
        | However, if two vectors point in the same direction but one is
        | 'longer' (higher magnitude) than the other, what that
        | indicates "semantically" is that the longer vector is a
        | "stronger signal" of the same concept. So if "happy" has a
        | vector direction, then "very happy" should be a longer vector
        | in the same direction.
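        | 
        | As a concrete illustration, a minimal numpy sketch with
        | made-up 2-d "embeddings" for "happy" and "very happy",
        | showing that cosine similarity ignores the magnitude
        | difference that the raw dot product picks up:
        | 
        |     import numpy as np
        | 
        |     def cosine_similarity(a, b):
        |         # dot product rescaled by both vector lengths,
        |         # so only direction matters
        |         return np.dot(a, b) / (np.linalg.norm(a) *
        |                                np.linalg.norm(b))
        | 
        |     happy = np.array([0.6, 0.8])  # made-up unit vector
        |     very_happy = 2.0 * happy      # same direction, 2x longer
        | 
        |     print(np.dot(happy, very_happy))             # 2.0
        |     print(cosine_similarity(happy, very_happy))  # 1.0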
       | 
        | Makes me wonder if there's a way to impose a "corrective"
        | force on the evolution of the model weights during training,
        | so that words like "more" prefixed in front of a string are
        | guaranteed to encode as a vector multiple of said string. Not
        | sure how that would work with back-propagation, but applying
        | certain common-sense knowledge about how the semantic space
        | structures "must be" shaped could potentially be the next
        | frontier of LLM development beyond transformers (and by
        | transformers I really mean the attention-head specialization).
        
       | bornfreddy wrote:
       | Off topic rant: I hate blog posts which quote the author's
       | earlier posts. They should just reiterate if it is important or
       | use a link if not. Otherwise it feels like they want to fill some
       | space without any extra work. The old posts are not that
       | groundbreaking, I assure you. /rant
        
       | crystal_revenge wrote:
       | The most clarifying post I've read on attention is from Cosma
       | Shalizi[0] who points out that "Attention" is quite literally
       | just a re-discovery/re-invention of Kernel smoothing. Probably
       | less helpful if you don't come from a quantitative background,
       | but if you do it makes it shockingly clarifying.
       | 
        | Once you realize this, "Multi-headed Attention" is just kernel
        | smoothing with more kernels, plus some linear transformation
        | on the results (in practice: average or add)!
       | 
       | 0. http://bactra.org/notebooks/nn-attention-and-
       | transformers.ht...
        
         | FreakLegion wrote:
         | It's a useful realization, too, since ways of approximating
         | kernel functions are already well-studied. Google themselves
         | have been publishing in this area for years, e.g.
         | https://research.google/blog/rethinking-attention-with-perfo...
         | 
         |  _> To resolve these issues, we introduce the Performer, a
         | Transformer architecture with attention mechanisms that scale
         | linearly, thus enabling faster training while allowing the
         | model to process longer lengths, as required for certain image
         | datasets such as ImageNet64 and text datasets such as PG-19.
         | The Performer uses an efficient (linear) generalized attention
         | framework, which allows a broad class of attention mechanisms
         | based on different similarity measures (kernels). The framework
         | is implemented by our novel Fast Attention Via Positive
         | Orthogonal Random Features (FAVOR+) algorithm, which provides
         | scalable low-variance and unbiased estimation of attention
         | mechanisms that can be expressed by random feature map
         | decompositions (in particular, regular softmax-attention). We
         | obtain strong accuracy guarantees for this method while
         | preserving linear space and time complexity, which can also be
         | applied to standalone softmax operations._
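          | 
          | For intuition, here is a toy numpy sketch of the
          | random-feature idea underlying this (an illustration of the
          | principle, not the actual FAVOR+ code): positive features
          | phi chosen so that phi(q)·phi(k) is an unbiased estimate of
          | the softmax kernel exp(<q, k>), which is what lets attention
          | be rewritten in linear time.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     d, m = 4, 4096               # dims, random features
          |     W = rng.normal(size=(m, d))  # random projections
          | 
          |     def phi(x):
          |         # E[phi(q) @ phi(k)] == exp(q @ k)
          |         return np.exp(W @ x - x @ x / 2) / np.sqrt(m)
          | 
          |     q = 0.5 * rng.normal(size=d)
          |     k = 0.5 * rng.normal(size=d)
          |     print(np.exp(q @ k))    # exact softmax kernel
          |     print(phi(q) @ phi(k))  # random-feature estimate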
        
         | thomasahle wrote:
          | For those who don't know the term "kernel smoothing", it
          | just means
          | 
          |     ∑ᵢ yᵢ * K(x, xᵢ) / (∑ᵢ K(x, xᵢ))
          | 
          | In regular attention, we let K(x, xᵢ) = exp(<x, xᵢ>).
          | 
          | Note that in Attention we use K(q, kᵢ), where the q (query)
          | and k (key) vectors are not the same.
          | 
          | Unless you define K(x, xᵢ) = exp(<W_q x, W_k xᵢ>), as you do
          | in self-attention.
          | 
          | There are also some attention mechanisms that don't use the
          | normalization term, (∑ᵢ K(x, xᵢ)), but most do.
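          | 
          | To make the correspondence concrete, a minimal numpy sketch
          | (illustrative only; the 1/√d scaling of real attention is
          | omitted) of single-query softmax attention written as
          | kernel smoothing with K(q, kᵢ) = exp(<q, kᵢ>):
          | 
          |     import numpy as np
          | 
          |     def kernel_attention(q, K, V):
          |         # weights are the kernel values K(q, k_i)
          |         weights = np.exp(K @ q)
          |         # the normalization term from the formula above
          |         weights /= weights.sum()
          |         # kernel-smoothed average of the values
          |         return weights @ V
          | 
          |     rng = np.random.default_rng(0)
          |     q = rng.normal(size=4)       # one query vector
          |     K = rng.normal(size=(5, 4))  # five key vectors
          |     V = rng.normal(size=(5, 3))  # five value vectors
          |     print(kernel_attention(q, K, V))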
        
         | esafak wrote:
          | In kernel methods the kernel is typically given, and things
          | like positional embeddings, layer normalization, causal
          | masking, and so on are missing. Kernel methods did not take
          | off partly due to their computational complexity (quadratic
          | in sample size); transformers took off precisely because
          | they were parallelizable, and thus computationally
          | efficient, compared with the RNNs and LSTMs that came before
          | them.
         | 
         | Reductions of one architecture to another are usually more
         | enlightening from a theoretical perspective than a practical
         | one.
        
         | 3abiton wrote:
         | Wow, thanks for referencing that. What a very detailed and long
         | read!
        
       | westoque wrote:
        | Must be my ignorance, but every time I see explainers for LLMs
        | similar to this post, it's hard to believe that AGI is upon
        | us. It just doesn't feel that "intelligent" - but again, that
        | might just be my ignorance.
        
         | jlawson wrote:
         | Neurons are pretty simple too.
         | 
         | Any arbitrarily complex system must be made of simpler
         | components, recursively down to arbitrary levels of simplicity.
         | If you zoom in enough everything is dumb.
        
           | voidspark wrote:
           | Neurons are surprisingly not simple. Vastly more complex than
           | the ultra simplified model in artificial neural networks.
        
         | throwawaymaths wrote:
          | eh, transformers are universal differentiable layered hash
          | tables. that's incredibly powerful. most logic is just
          | pulling symbols and matching structures with "hash"es.
          | 
          | if intelligence is just reasonable manipulation of logic,
          | it's unsurprising that an LLM could be intelligent. what
          | _maybe_ is surprising is that we have ~intelligence without
          | going up a few more orders of magnitude in size; what's
          | possibly more surprising is that training it on the internet
          | got it doing the things it's doing
        
       | Lerc wrote:
       | I think there are two layers of the 'why' in machine learning.
       | 
       | When you look at a model architecture it is described as a series
       | of operations that produces the result.
       | 
        | There is a lower-level why which, while far from easy to show,
        | describes why these algorithms produce the required result.
        | You can show why it's a good idea to use cosine similarity,
        | and why cross-entropy was chosen as the measurement. In
        | Transformers you can show that the Q and K matrices transform
        | the embeddings into spaces that allow different things to be
        | closer, and that control over the proportion of closeness
        | allows you to make distinctions. This form of why is the
        | explanation usually given in papers. It is possible to
        | methodically show that you will get the benefits described
        | from the techniques proposed.
       | 
        | The greater Why is much, much harder - harder to identify and
        | harder to prove. The first why can tell you that something
        | works, but it can't really tell you why it works in a way that
        | can inform other techniques.
       | 
        | In the Transformer, the intuition is that the 'Why' is
        | something along the lines of: the Q transforms embeddings into
        | an encoding of what information the embedding needs in order
        | to resolve confusion, and the K transforms embeddings into the
        | information to impart. When there's a match between 'what I
        | want to know about' and 'what I know about', the V can be used
        | as 'the things I know' to accumulate the information where it
        | needs to be.
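        | 
        | As a minimal numpy sketch of that intuition (illustrative
        | only: a single head, no masking, no 1/√d scaling, and made-up
        | random weight matrices Wq, Wk, Wv):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     n_tokens, d_model, d_head = 5, 8, 4
        |     X = rng.normal(size=(n_tokens, d_model))  # embeddings
        | 
        |     Wq = rng.normal(size=(d_model, d_head))  # what I want
        |     Wk = rng.normal(size=(d_model, d_head))  # what I offer
        |     Wv = rng.normal(size=(d_model, d_head))  # what I know
        | 
        |     Q, K, V = X @ Wq, X @ Wk, X @ Wv
        |     scores = Q @ K.T           # closeness in the new spaces
        |     weights = np.exp(scores)
        |     weights /= weights.sum(axis=1, keepdims=True)  # softmax
        |     out = weights @ V  # accumulated info, one row per token
        |     print(out.shape)   # (5, 4)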
       | 
        | It's easy to see why this is the hard form. Once you get into
        | the higher semantic descriptions of what is happening, it is
        | much harder to prove that this is actually what is happening,
        | or that it gives the benefits you think it might. Maybe
        | Transformers don't work like that. Sometimes semantic
        | relationships appear to be present in a process when there is
        | an unobserved quirk of the mathematics that makes the result
        | coincidentally the same.
       | 
        | In a way, I think of the maths of it as picking up a
        | many-dimensional object in each hand and magically rotating
        | and (linearly) squishing them differently until they look
        | aligned enough to see the relationship I'm looking for, and
        | poking those bits towards each other. I can't really think
        | about that and the semantic "what things want to know about"
        | view at the same time, even though they are conceptualisations
        | of the same operation.
       | 
        | The advantage of the lower why is that you can show that it
        | works. The advantage of the upper why is that it can enable
        | you to consider other mechanisms that might perform the same
        | function. They may be mathematically different but achieve the
        | same goal.
       | 
        | To take a much simpler example from computer graphics: there
        | are many ways to draw a circle with simple loops processing
        | mathematically provable descriptions of a circle. The
        | Bresenham circle-drawing algorithm comes with a why that shows
        | it makes a circle, but the "why do it that way" was informed
        | by a greater understanding of the task being performed.
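        | 
        | For reference, a minimal midpoint-circle sketch in Python (an
        | illustration of the Bresenham-style idea: integer arithmetic
        | plus a decision variable, walking one octant and mirroring):
        | 
        |     def circle_points(r):
        |         points = []
        |         x, y, d = 0, r, 1 - r  # d: decision variable
        |         while x <= y:
        |             # mirror one octant into all eight
        |             points += [(x, y), (y, x), (-x, y), (-y, x),
        |                        (x, -y), (y, -x), (-x, -y), (-y, -x)]
        |             x += 1
        |             if d < 0:      # midpoint inside: keep y
        |                 d += 2 * x + 1
        |             else:          # midpoint outside: step y down
        |                 y -= 1
        |                 d += 2 * (x - y) + 1
        |         return sorted(set(points))
        | 
        |     print(circle_points(3))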
        
       ___________________________________________________________________
       (page generated 2025-05-11 23:00 UTC)