[HN Gopher] Breaking Quadratic Barriers: A Non-Attention LLM for...
___________________________________________________________________
Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long
Context Horizons
Author : PaulHoule
Score : 33 points
Date : 2025-06-16 19:19 UTC (3 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| zoklet-enjoyer wrote:
| I don't know what those words mean, but I am excited for the
| possibilities.
| PaulHoule wrote:
| LLMs can look back over a certain number (N) of tokens, which
| roughly correspond to words. For instance, if you want to
| summarize or answer questions about a document accurately, the
| length of the document has to be less than N.
|
| Conventionally they use an attention mechanism that compares
| every token to every other token, which has a cost of N*N, i.e.
| N squared - quadratic in the context length. If you want LLMs
| to chew over a huge amount of context (all the source code for
| your project, say), that's a problem, so people are looking for
| ways around it.
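|
| A minimal numpy sketch of where the N*N cost comes from (this is
| just standard scaled dot-product attention, nothing specific to
| this paper):
|
|     import numpy as np
|
|     def self_attention(x, Wq, Wk, Wv):
|         # x: (N, d) token embeddings; Wq/Wk/Wv: (d, d) projections
|         Q, K, V = x @ Wq, x @ Wk, x @ Wv
|         # every token is compared with every other token, giving
|         # an (N, N) score matrix: time and memory grow as N^2
|         scores = Q @ K.T / np.sqrt(K.shape[-1])
|         w = np.exp(scores - scores.max(axis=-1, keepdims=True))
|         w /= w.sum(axis=-1, keepdims=True)
|         return w @ V  # (N, d)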
| zoklet-enjoyer wrote:
| Thank you for that explanation
| rybosome wrote:
| Adding to that excellent high-level explanation of what the
| attention mechanism is, I'd add (from my reading of the
| abstract of this paper):
|
| This work builds a model that has the ability to "remember"
| parts of its previous input when generating and processing
| new input, and has part of its intelligence devoted to
| determining what is relevant to remember.
|
| This is in lieu of, in effect, saying "I need to keep re-
| reading what I've already read and said to keep going".
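|
| A toy sketch of that general idea - a running summary vector plus
| a learned gate deciding how much to keep - purely illustrative,
| not the paper's actual mechanism (Wg and Wc are hypothetical
| learned weights):
|
|     import numpy as np
|
|     def memory_update(memory, chunk_summary, Wg, Wc):
|         # memory: (d,) summary of earlier input
|         # chunk_summary: (d,) encoding of the newly processed chunk
|         both = np.concatenate([memory, chunk_summary])  # (2d,)
|         gate = 1 / (1 + np.exp(-(Wg @ both)))   # Wg: (d, 2d)
|         candidate = np.tanh(Wc @ chunk_summary)  # Wc: (d, d)
|         # keep a gated mix of the old memory and the new content
|         return gate * memory + (1 - gate) * candidate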
|
| I'd welcome better explanations. :)
| Icko_ wrote:
| Not even that. With KV caching, each generated token is linear
| in the size of the context; and if someone figured out a way to
| get e.g. N log N complexity, I imagine with KV caching it might
| go down to log N per token. (If the new algorithm permits
| that.)
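|
| A toy decode step showing the KV-cache point (a sketch, not from
| the paper; kv_cache here is just a dict of lists):
|
|     import numpy as np
|
|     def generate_step(x_new, kv_cache, Wq, Wk, Wv):
|         # the cache keeps K and V for all previous tokens, so each
|         # new token only attends over the cache: O(N) work per
|         # step instead of redoing the full N x N score matrix
|         q = x_new @ Wq
|         kv_cache["K"].append(x_new @ Wk)
|         kv_cache["V"].append(x_new @ Wv)
|         K, V = np.stack(kv_cache["K"]), np.stack(kv_cache["V"])
|         w = np.exp(q @ K.T / np.sqrt(K.shape[-1]))
|         return (w / w.sum()) @ V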
| imranq wrote:
| I like the idea of removing quadratic scaling for attention, but
| this paper has thin experimental support. No real tasks are
| tested beyond perplexity: nothing on reasoning, retrieval QA, or
| summarization quality. Even on perplexity the gains are marginal.
|
| However, it does remove attention, so I think it's worth watching
| that space of non-attention models.
| yorwba wrote:
| This paper seems rather unfocused, explaining their architecture
| three times with slight variations while managing to omit crucial
| details like how exactly they compute gradients for their
| "External Retrieval Memory."
|
| Also, the section on DeepSeek is really weird: "While the precise
| architectural details of DeepSeek LLM are still emerging, early
| discussions suggest that it relies on an extended Transformer
| backbone or a "hybrid" approach that likely incorporates some
| form of attention-based mechanism, potentially at specific layers
| or across chunk boundaries, to facilitate information flow across
| large contexts." It makes it sound like a mystery, even though
| there have been multiple papers published on it (they cite the R1
| one), so there's really no need to guess whether attention is
| involved.
|
| Overall I'm not convinced the authors know what they're doing.
| roxolotl wrote:
| Would you say they aren't paying attention?
| cubefox wrote:
| I think it's fair to say they are explicitly avoiding
| attention.
| albertzeyer wrote:
| "hundreds of thousands to potentially millions of tokens" -
| that's the same order as current commercial LLMs.
|
| Also note that if the sequence length is not really much larger
| than the model dimension (at least two orders of magnitude
| larger), the quadratic complexity of self-attention is not such
| a big issue - the matrix multiplications in the feed-forward
| layers cost about 8x the model dimension squared per token, and
| thus that part will usually dominate.
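|
| Back-of-envelope version of that comparison (rough multiply-add
| counts per token per layer, ignoring the attention projections):
|
|     def per_token_cost(n_ctx, d_model):
|         ffn = 8 * d_model ** 2      # two matmuls: d -> 4d -> d
|         attn = 2 * n_ctx * d_model  # QK^T scores + weighted sum of V
|         return ffn, attn
|
|     # e.g. d_model = 4096: the attention term only overtakes the
|     # feed-forward cost once the context is several times the
|     # model dimension
|     print(per_token_cost(8_192, 4_096))    # FFN still dominates
|     print(per_token_cost(128_000, 4_096))  # attention term larger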
|
| Also note that there has been a lot of research on this already.
| While this particular approach might be novel, there have been
| attempts to avoid the O(n^2) complexity of self-attention almost
| since the original transformer paper came out in 2017. I'm a bit
| surprised that this paper does not cite xLSTM or Block-Recurrent
| Transformers.
|
| Also, this paper falls very short on experiments. There is
| basically only Table 2. There is no study of length extrapolation
| (which is very relevant for the topic), no needle-in-a-haystack
| experiments, no scaling studies, no larger-scale experiments,
| etc. Also, even in this main Table 2, I see a couple of typos.
| And looking at the results in Table 2, the improvements seem to
| be quite minor.
|
| So I would conclude that this needs a lot more work.
| cubefox wrote:
| > "hundreds of thousands to potentially millions of tokens" -
| that's the same order as current commercial LLMs.
|
| Yes, but those all rely on proprietary company secrets, while
| this is an open research paper. Besides, only Gemini so far has
| a context window of more than a million tokens.
| littlestymaar wrote:
| Llama 4 Scout also has it, and it is an open-weight LLM;
| unfortunately it is also disappointing at pretty much any
| context length...
| 3abiton wrote:
| > Unlike traditional Transformer designs, which suffer from
| quadratic memory and computation overload due to the nature of
| the self attention mechanism, our model avoids token to token
| attention entirely.
|
| I skimmed the paper; unlike transformers, they can basically
| scale much more efficiently to longer contexts. While it's
| possible to fit 1M tokens with a transformer, you need a
| significant amount of memory. They only benchmark against GPT-2,
| though, so I would say it's quite preliminary work so far,
| although a promising architecture.
| boroboro4 wrote:
| > Also note that if the sequence length is not really much
| larger than the model dimension (at least two orders of
| magnitude larger), the quadratic complexity of self-attention
| is not such a big issue - the matrix multiplications in the
| feed-forward layers cost about 8x the model dimension squared
| per token, and thus that part will usually dominate.
|
| This is incorrect in the case of batched inference. There are
| two bottlenecks at play, compute and memory, and your reasoning
| applies to compute. In the case of memory it gets trickier: for
| the MLP layers you read the same set of weights once for all
| elements of your batch, while the attention KV cache is
| different for every element. That's why in practice the real
| length where attention dominates is closer to model dimension /
| batch size, rather than just the model dimension. And that
| number isn't as high anymore.
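|
| A rough memory-traffic version of the same comparison (bytes read
| per layer, assuming fp16 weights/cache and ignoring the attention
| projection weights):
|
|     def bytes_read_per_layer(n_ctx, d_model, batch, dtype_bytes=2):
|         # weights are shared across the batch, the KV cache is not
|         mlp_weights = 8 * d_model ** 2 * dtype_bytes
|         kv_cache = 2 * n_ctx * d_model * batch * dtype_bytes
|         return mlp_weights, kv_cache
|
|     # d_model = 4096, batch of 64: the KV cache already outweighs
|     # the MLP weights at a few hundred tokens of context, i.e.
|     # roughly at n_ctx ~ d_model / batch (times a small constant)
|     print(bytes_read_per_layer(512, 4_096, 64))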
| daxfohl wrote:
| Partially related: is charging by token sustainable for LLM
| shops? If the compute requirements go up quadratically, doesn't
| that mean the cost should as well?
| sakras wrote:
| Typically requests are binned by context length so that they
| can be batched together. So you might have a 10k bin and a 50k
| bin and a 500k bin, and then you drop context past 500k. So the
| costs are fixed per-bin.
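|
| A minimal sketch of that kind of routing (the bin sizes here are
| hypothetical):
|
|     BINS = (10_000, 50_000, 500_000)  # context-length tiers
|
|     def pick_bin(n_tokens):
|         # route to the smallest bin the request fits in; past the
|         # largest bin the context gets truncated, so the per-bin
|         # cost stays fixed
|         for cap in BINS:
|             if n_tokens <= cap:
|                 return cap
|         return BINS[-1]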
| maxrmk wrote:
| > While the specific internal workings of DeepSeek LLM are still
| being elucidated, it appears to maintain or approximate the self-
| attention paradigm to some extent.
|
| Totally nonsensical. DeepSeek's architecture is well documented,
| and multiple implementations are available online.
___________________________________________________________________
(page generated 2025-06-16 23:00 UTC)