[HN Gopher] Attention Wasn't All We Needed
       ___________________________________________________________________
        
       Attention Wasn't All We Needed
        
       Author : mooreds
       Score  : 87 points
       Date   : 2025-05-23 18:14 UTC (4 hours ago)
        
 (HTM) web link (www.stephendiehl.com)
 (TXT) w3m dump (www.stephendiehl.com)
        
       | andrewmcwatters wrote:
       | I know this probably seems like such a small detail to a lot of
       | people, but I really love that the author adds comments.
       | 
       | I can't stand reading PyTorch or other neural network code and
       | asking myself, "What architecture am I looking at here?" or "What
       | the hell are these operations for?"
       | 
        | It always reads like a mash-up of published paper code with deep
        | effort behind it and all the worst programming practices of
        | complete unreadability.
        
         | imranq wrote:
          | Could you pop your code into an LLM and ask it to write
          | comments for you? I'm not sure how accurate it would be,
          | though.
        
           | andrewmcwatters wrote:
            | I've noticed leading models fail to understand what's
            | happening in undocumented neural network code as well, so
            | not yet, it seems.
        
             | CamperBob2 wrote:
             | It may be a reasonable approach if you give the model a lot
             | of clues to start with. Basically tell it everything you do
             | know about the code.
             | 
             | I wouldn't expect miracles from just uploading a big .py
             | file and asking it to add comments.
        
       | flebron wrote:
       | This is an excellent summary of these techniques :) I like that
       | every single one comes with an example implementation, with shape
       | comments on the tensors. Thanks Stephen!
        
       | kouteiheika wrote:
       | > Let's look at some of the most important ones that have been
       | developed over the years and try to implement the basic ideas as
       | succinctly as possible.
       | 
       | One big architectural tweak that comes to mind and isn't in the
       | article is QK norm: https://arxiv.org/pdf/2010.04245
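        | 
        | A minimal sketch of the idea (my reading of the paper, not code
        | from the article): normalize queries and keys per head before
        | the dot product so the attention logits can't blow up.
        | 
        |     import torch.nn.functional as F
        | 
        |     def qk_norm_attention(q, k, v, scale=1.0, eps=1e-6):
        |         # q, k, v: (batch, heads, seq, head_dim)
        |         # L2-normalize q and k along head_dim; the paper swaps
        |         # the usual 1/sqrt(d) factor for a learnable scale.
        |         q = F.normalize(q, dim=-1, eps=eps)
        |         k = F.normalize(k, dim=-1, eps=eps)
        |         scores = scale * (q @ k.transpose(-2, -1))
        |         return F.softmax(scores, dim=-1) @ v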
       | 
       | > Cosine Schedule
       | 
        | A lot (most?) of new training runs actually don't use a cosine
       | schedule anymore; instead they keep the learning rate constant
       | and only decay it at the very end, which gives equivalent or
       | better results. See:
       | 
       | https://arxiv.org/pdf/2405.18392 https://arxiv.org/pdf/2404.06395
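        | 
        | A rough sketch of that kind of schedule (sometimes called
        | warmup-stable-decay; the warmup/decay fractions here are just
        | placeholders):
        | 
        |     from torch.optim.lr_scheduler import LambdaLR
        | 
        |     def wsd(step, total_steps, warmup=0.01, decay=0.1):
        |         warmup_steps = int(total_steps * warmup)
        |         decay_start = int(total_steps * (1 - decay))
        |         if step < warmup_steps:
        |             return step / max(1, warmup_steps)
        |         if step < decay_start:
        |             return 1.0  # hold the LR constant
        |         # linear decay to zero over the final stretch
        |         return max(0.0, (total_steps - step)
        |                    / (total_steps - decay_start))
        | 
        |     # scheduler = LambdaLR(optimizer,
        |     #                      lambda s: wsd(s, total_steps))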
       | 
       | > There is a highly optimized implementation of AdamW in PyTorch.
       | 
        | A fun tidbit - it's actually not highly optimized, in my
       | experience. Imagine my surprise when I reimplemented it in Triton
       | (because I needed to tweak a few things) and I got better
       | performance than the built-in PyTorch implementation.
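        | 
        | For the curious, the core of such a kernel ends up surprisingly
        | small. This is a simplified sketch (single tensor, no fused or
        | foreach tricks), not the code I actually use:
        | 
        |     import triton
        |     import triton.language as tl
        | 
        |     @triton.jit
        |     def adamw_step(p_ptr, g_ptr, m_ptr, v_ptr, n,
        |                    lr, beta1, beta2, eps, wd,
        |                    bc1, bc2,  # bias corrections from host
        |                    BLOCK: tl.constexpr):
        |         offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        |         mask = offs < n
        |         p = tl.load(p_ptr + offs, mask=mask)
        |         g = tl.load(g_ptr + offs, mask=mask)
        |         m = tl.load(m_ptr + offs, mask=mask)
        |         v = tl.load(v_ptr + offs, mask=mask)
        |         m = beta1 * m + (1.0 - beta1) * g
        |         v = beta2 * v + (1.0 - beta2) * g * g
        |         # decoupled weight decay, then the Adam update
        |         p = p * (1.0 - lr * wd)
        |         p = p - lr * (m / bc1) / (tl.sqrt(v / bc2) + eps)
        |         tl.store(p_ptr + offs, p, mask=mask)
        |         tl.store(m_ptr + offs, m, mask=mask)
        |         tl.store(v_ptr + offs, v, mask=mask)
        | 
        | with bc1 = 1 - beta1**step and bc2 = 1 - beta2**step computed in
        | Python, launched over a 1D grid of triton.cdiv(n, BLOCK)
        | programs.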
        
         | Scene_Cast2 wrote:
         | RE: optimizer performance - any thoughts on heavyball?
        
           | kouteiheika wrote:
           | ...oh, I didn't know about this library, thanks!
           | 
           | I still probably wouldn't be able to use it because I need a
           | bunch of custom functionality for my optimizers (like for
           | example custom quantization support and incremental gradient
           | accumulation directly in optimizers' state), but I might
           | borrow some of their techniques if they make things even
           | faster.
        
       | yorwba wrote:
       | The explanation for Multi-head Latent Attention
       | https://www.stephendiehl.com/posts/post_transformers/#multi-...
       | does _not_ match the definition in the DeepSeek-V2 paper
       | https://arxiv.org/pdf/2405.04434#subsection.2.1
       | 
       | MLA as developed by DeepSeek is a technique to reduce the memory
       | footprint of the KV cache by storing only two vectors of size
       | _latent_dim_ and _rope_dim_ per token and layer, instead of 2 *
       | _num_heads_ vectors of size _head_dim_. (DeepSeek-V3 has
       | _num_heads_ = 128 and _head_dim_ = 128 vs _latent_dim_ = 512 and
       | _rope_dim_ = 64, so a significant reduction
       | https://arxiv.org/pdf/2412.19437#subsection.4.2 )
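        | 
        | Back-of-the-envelope with those numbers (element counts per
        | token and layer; dtype and implementation details ignored):
        | 
        |     num_heads, head_dim = 128, 128
        |     latent_dim, rope_dim = 512, 64
        | 
        |     standard_kv = 2 * num_heads * head_dim  # 32768 values
        |     mla_kv = latent_dim + rope_dim          # 576 values
        |     print(standard_kv / mla_kv)             # ~56.9x smaller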
       | 
       | What this article describes instead is some kind of two-step
       | attention scheme I haven't seen before and that I think wouldn't
       | work with causal masking (despite _mask_ appearing in the example
       | code) because either you allow an earlier token to attend to a
       | latent that attended to a later token (creating backwards
       | information flow) or the latents can only attend to a limited
        | prefix of the sequence, after which they're frozen and useless.
       | I wonder whether the author dreamed it up himself or whether
       | someone else is actually using this somewhere.
        
       | jdeaton wrote:
        | The first four things on the list are attention.
        
         | alanbernstein wrote:
          | The title is a cute shortening of "'Attention Is All You
          | Need' wasn't all we needed."
        
       | empiko wrote:
        | Nice writeup, but regarding the title -- I find it fascinating
        | how powerful attention really is. There were some tweaks
        | developed, sure, but if I open the Llama 4 code on HuggingFace,
        | it is more or less the same code that I saw there 5 years ago.
        | Despite all the AI hype, we are still just exploiting tech
        | developed in 2015-2020. And despite NeurIPS brandishing 25k
        | papers this year, the rate of innovation in deep learning seems
        | to be stagnating.
        
         | kjkjadksj wrote:
         | Too many horseriders, not enough horse breeders.
        
         | kouteiheika wrote:
          | > There were some tweaks developed, sure, but if I open the
          | Llama 4 code on HuggingFace, it is more or less the same code
          | that I saw there 5 years ago.
         | 
         | This is very much true. It's essentially the very same
         | architecture, just tweaked slightly.
         | 
          | I can take the code I've written that implements the original
          | GPT-2, tweak it very minimally (I don't know, maybe 30~40 lines
          | of code changed?), and get Qwen3, which is a state-of-the-art
          | model released ~3 weeks ago.
         | 
          | Don't read too much into e.g. the HuggingFace code, where every
          | new architecture gets its own multi-thousand-line file - that's
          | just the result of an insane amount of copy-pasting and
          | technical debt (although they've started to clean it up a
          | little bit lately). I have my own custom implementation that
          | can load weights for ~19 different architectures straight off
          | HuggingFace in like ~2k lines of code. They aren't really all
          | that different.
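          | 
          | As one concrete example of the kind of swap involved (my
          | illustration, not an exhaustive diff): GPT-2's nn.LayerNorm
          | becomes an RMSNorm, which drops the mean subtraction and the
          | bias term.
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     class RMSNorm(nn.Module):
          |         def __init__(self, dim, eps=1e-6):
          |             super().__init__()
          |             self.eps = eps
          |             self.weight = nn.Parameter(torch.ones(dim))
          | 
          |         def forward(self, x):
          |             # reciprocal root-mean-square scaling,
          |             # no mean subtraction, no bias
          |             rms = x.pow(2).mean(-1, keepdim=True)
          |             x = x * torch.rsqrt(rms + self.eps)
          |             return x * self.weight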
        
       ___________________________________________________________________
       (page generated 2025-05-23 23:01 UTC)