[HN Gopher] Attention Wasn't All We Needed
___________________________________________________________________
Attention Wasn't All We Needed
Author : mooreds
Score : 87 points
Date : 2025-05-23 18:14 UTC (4 hours ago)
(HTM) web link (www.stephendiehl.com)
(TXT) w3m dump (www.stephendiehl.com)
| andrewmcwatters wrote:
| I know this probably seems like such a small detail to a lot of
| people, but I really love that the author adds comments.
|
| I can't stand reading PyTorch or other neural network code and
| asking myself, "What architecture am I looking at here?" or "What
| the hell are these operations for?"
|
| It's always like a mash-up of published paper code with deep
| effort behind it and all the worst programming practices of
| complete unreadability.
| imranq wrote:
| Could you pop your code into an LLM and ask it to write
| comments for you? I'm not sure how accurate it would be, though.
| andrewmcwatters wrote:
| I've noticed leading models fail to understand what's
| happening in undocumented neural network code as well, so not
| yet, it seems.
| CamperBob2 wrote:
| It may be a reasonable approach if you give the model a lot
| of clues to start with. Basically tell it everything you do
| know about the code.
|
| I wouldn't expect miracles from just uploading a big .py
| file and asking it to add comments.
| flebron wrote:
| This is an excellent summary of these techniques :) I like that
| every single one comes with an example implementation, with shape
| comments on the tensors. Thanks Stephen!
| kouteiheika wrote:
| > Let's look at some of the most important ones that have been
| developed over the years and try to implement the basic ideas as
| succinctly as possible.
|
| One big architectural tweak that comes to mind and isn't in the
| article is QK norm: https://arxiv.org/pdf/2010.04245
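| 
| For anyone who hasn't seen it, a minimal sketch of the idea (my
| own illustrative code, not from the article or the paper):
| normalize the per-head queries and keys right before the dot
| product. I use an RMSNorm-style normalization here; variants in
| the literature use LayerNorm or an L2 norm with a learned scale
| instead.
| 
|     import torch
| 
|     def rms_norm(x, eps: float = 1e-6):
|         # Plain RMSNorm without a learnable gain, over the last dim.
|         return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
| 
|     def attention_with_qk_norm(q, k, v):
|         # q, k, v: (batch, heads, seq_len, head_dim)
|         # Normalizing q and k bounds the attention logits, which
|         # tends to stabilize training at scale.
|         q, k = rms_norm(q), rms_norm(k)
|         scale = q.size(-1) ** -0.5
|         scores = (q @ k.transpose(-2, -1)) * scale  # (b, h, s, s)
|         return torch.softmax(scores, dim=-1) @ v    # (b, h, s, d)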
|
| > Cosine Schedule
|
| A lot (most?) of new training runs actually don't use a cosine
| schedule anymore; instead they keep the learning rate constant
| and only decay it at the very end, which gives equivalent or
| better results. See:
|
| https://arxiv.org/pdf/2405.18392 https://arxiv.org/pdf/2404.06395
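| 
| As a sketch of what that looks like (the warmup/decay fractions
| below are illustrative assumptions, not numbers from those
| papers): linear warmup, a long constant phase, then a short
| decay to zero at the very end.
| 
|     import torch
| 
|     def warmup_stable_decay(step, total_steps, warmup=0.01, decay=0.1):
|         # Returns a multiplier for the base learning rate.
|         warmup_steps = int(total_steps * warmup)
|         decay_start = int(total_steps * (1 - decay))
|         if step < warmup_steps:                 # linear warmup
|             return step / max(1, warmup_steps)
|         if step < decay_start:                  # constant phase
|             return 1.0
|         return max(0.0, (total_steps - step)    # linear decay
|                    / max(1, total_steps - decay_start))
| 
|     # e.g. torch.optim.lr_scheduler.LambdaLR(
|     #          opt, lambda s: warmup_stable_decay(s, 100_000))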
|
| > There is a highly optimized implementation of AdamW in PyTorch.
|
| A fun tidbit - it's actually not highly optimized, in my
| experience. Imagine my surprise when I reimplemented it in Triton
| (because I needed to tweak a few things) and I got better
| performance than the built-in PyTorch implementation.
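| 
| For context, the whole per-parameter update is just a handful of
| elementwise ops, which is part of why a hand-fused kernel can
| come out ahead. The standard AdamW math, in plain PyTorch rather
| than Triton (reference sketch, not my actual kernel):
| 
|     import torch
| 
|     @torch.no_grad()
|     def adamw_step(p, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999),
|                    eps=1e-8, weight_decay=0.01):
|         b1, b2 = betas
|         m.mul_(b1).add_(grad, alpha=1 - b1)            # 1st moment
|         v.mul_(b2).addcmul_(grad, grad, value=1 - b2)  # 2nd moment
|         m_hat = m / (1 - b1 ** step)                   # bias correction
|         v_hat = v / (1 - b2 ** step)
|         p.mul_(1 - lr * weight_decay)                  # decoupled decay
|         p.addcdiv_(m_hat, v_hat.sqrt_().add_(eps), value=-lr)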
| Scene_Cast2 wrote:
| RE: optimizer performance - any thoughts on heavyball?
| kouteiheika wrote:
| ...oh, I didn't know about this library, thanks!
|
| I still probably wouldn't be able to use it because I need a
| bunch of custom functionality for my optimizers (like for
| example custom quantization support and incremental gradient
| accumulation directly in optimizers' state), but I might
| borrow some of their techniques if they make things even
| faster.
| yorwba wrote:
| The explanation for Multi-head Latent Attention
| https://www.stephendiehl.com/posts/post_transformers/#multi-...
| does _not_ match the definition in the DeepSeek-V2 paper
| https://arxiv.org/pdf/2405.04434#subsection.2.1
|
| MLA as developed by DeepSeek is a technique to reduce the memory
| footprint of the KV cache by storing only two vectors of size
| _latent_dim_ and _rope_dim_ per token and layer, instead of 2 *
| _num_heads_ vectors of size _head_dim_. (DeepSeek-V3 has
| _num_heads_ = 128 and _head_dim_ = 128 vs _latent_dim_ = 512 and
| _rope_dim_ = 64, so a significant reduction
| https://arxiv.org/pdf/2412.19437#subsection.4.2 )
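| 
| Back-of-the-envelope with those numbers, per token and per layer
| (counting values, ignoring the bytes-per-element factor):
| 
|     num_heads, head_dim = 128, 128
|     latent_dim, rope_dim = 512, 64
| 
|     mha_cache = 2 * num_heads * head_dim  # keys + values = 32768
|     mla_cache = latent_dim + rope_dim     # latent + RoPE part = 576
|     print(mha_cache / mla_cache)          # ~57x smaller KV cache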
|
| What this article describes instead is some kind of two-step
| attention scheme I haven't seen before and that I think wouldn't
| work with causal masking (despite _mask_ appearing in the example
| code) because either you allow an earlier token to attend to a
| latent that attended to a later token (creating backwards
| information flow) or the latents can only attend to a limited
| prefix of the sequence, after which they're frozen and useless.
| I wonder whether the author dreamed it up himself or whether
| someone else is actually using this somewhere.
| jdeaton wrote:
| The first four things on the list are attention.
| alanbernstein wrote:
| The title is a cute shortening of "'Attention Is All You Need'
| wasn't all we needed."
| empiko wrote:
| Nice writeup, but regarding the title -- I find it fascinating
| how powerful attention really is. There were some tweaks
| developed, sure, but if I open the Llama 4 code on Hugging Face,
| it is more or less the same code that I saw there 5 years ago.
| Despite all the AI hype, we are still just exploiting tech
| developed in 2015-2020. And despite NeurIPS brandishing 25k
| papers this year, the innovation rate in deep learning seems to
| stagnate.
| kjkjadksj wrote:
| Too many horseriders, not enough horse breeders.
| kouteiheika wrote:
| > There were some tweaks developed, sure, but if I open the
| Llama 4 code on Hugging Face, it is more or less the same code
| that I saw there 5 years ago.
|
| This is very much true. It's essentially the very same
| architecture, just tweaked slightly.
|
| I can take the code I've written that implements the original
| GPT-2, tweak it very minimally (I don't know, maybe 30~40 lines
| of code changed?), and get Qwen3, which is a state-of-the-art
| model released ~3 weeks ago.
|
| Contrary to what you might see when looking at e.g. HuggingFace
| code, where every new architecture gets its own file thousands
| of lines long - that's just the result of an insane amount of
| copy-pasting and technical debt (although they have started to
| clean it up a little bit lately). I have my own custom
| implementation which can load weights for ~19 different
| architectures straight off HuggingFace in like ~2k lines of code.
| They aren't really all that different.
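| 
| To make that concrete, the deltas are mostly local swaps of the
| kind below (an illustrative sketch of one of them, not the actual
| Qwen3 source): LayerNorm -> RMSNorm, learned positional
| embeddings -> RoPE, the GELU feed-forward -> a gated SwiGLU one,
| plus grouped-query attention and QK norm.
| 
|     import torch
|     import torch.nn as nn
|     import torch.nn.functional as F
| 
|     class GPT2MLP(nn.Module):
|         # GPT-2 style feed-forward: Linear -> GELU -> Linear.
|         def __init__(self, d_model, d_ff):
|             super().__init__()
|             self.up = nn.Linear(d_model, d_ff)
|             self.down = nn.Linear(d_ff, d_model)
|         def forward(self, x):
|             return self.down(F.gelu(self.up(x)))
| 
|     class SwiGLUMLP(nn.Module):
|         # The replacement used by most recent models: a gated
|         # SiLU feed-forward, typically without biases.
|         def __init__(self, d_model, d_ff):
|             super().__init__()
|             self.gate = nn.Linear(d_model, d_ff, bias=False)
|             self.up = nn.Linear(d_model, d_ff, bias=False)
|             self.down = nn.Linear(d_ff, d_model, bias=False)
|         def forward(self, x):
|             return self.down(F.silu(self.gate(x)) * self.up(x))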
___________________________________________________________________
(page generated 2025-05-23 23:01 UTC)