[HN Gopher] Breaking Quadratic Barriers: A Non-Attention LLM for...
       ___________________________________________________________________
        
       Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long
       Context Horizons
        
       Author : PaulHoule
       Score  : 33 points
       Date   : 2025-06-16 19:19 UTC (3 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | zoklet-enjoyer wrote:
       | I don't know what those words mean, but I am excited for the
       | possibilities.
        
         | PaulHoule wrote:
         | LLMs can look back over a certain number (N) of tokens, which
         | roughly correspond to words. For instance, if you want to
         | summarize or answer questions about a document accurately,
         | the length of the document has to be less than N.
         | 
         | Conventionally they use an attention mechanism that compares
         | every token to every other token, which has a cost of N*N, or
         | N squared, which is quadratic. If you want LLMs to chew over
         | a huge amount of context (all the source code for your
         | project), it's a problem, so people are looking for ways
         | around this.
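         | 
         | A minimal sketch of where that N*N shows up (a toy numpy
         | single-head attention, not the paper's architecture):
         | 
         |     import numpy as np
         | 
         |     N, d = 1024, 64            # context length, head dim
         |     q = np.random.randn(N, d)  # one query vector per token
         |     k = np.random.randn(N, d)  # one key vector per token
         |     v = np.random.randn(N, d)  # one value vector per token
         | 
         |     # (N, N) score matrix: every token vs. every other token
         |     scores = q @ k.T / np.sqrt(d)
         |     scores -= scores.max(axis=-1, keepdims=True)
         |     weights = np.exp(scores)
         |     weights /= weights.sum(axis=-1, keepdims=True)
         |     out = weights @ v          # time and memory grow as N*N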
        
           | zoklet-enjoyer wrote:
           | Thank you for that explanation
        
             | rybosome wrote:
             | Adding to that excellent high-level explanation of what
             | the attention mechanism is, I'd add (from my reading of
             | the abstract of this paper):
             | 
             | This work builds a model that has the ability to "remember"
             | parts of its previous input when generating and processing
             | new input, and has part of its intelligence devoted to
             | determining what is relevant to remember.
             | 
             | This is instead of, in effect, saying "I need to keep
             | re-reading what I've already read and said to keep going".
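             | 
             | A toy illustration of that idea (purely illustrative,
             | not the paper's actual design): keep a small running
             | memory of summaries of past chunks, and pull back only
             | the ones that look relevant to the current input.
             | 
             |     import numpy as np
             | 
             |     def summarize(chunk):
             |         # crude summary: mean of the chunk's vectors
             |         return chunk.mean(axis=0)
             | 
             |     def recall(memory, query, top_k=2):
             |         # keep only what looks relevant right now
             |         scores = [m @ query for m in memory]
             |         best = np.argsort(scores)[-top_k:]
             |         return [memory[i] for i in best]
             | 
             |     memory = []
             |     chunks = np.random.randn(10, 32, 16)
             |     for chunk in chunks:      # stream of inputs
             |         query = chunk.mean(axis=0)
             |         past = recall(memory, query) if memory else []
             |         # ... process chunk together with `past` ...
             |         memory.append(summarize(chunk))
             |         memory = memory[-4:]  # bounded memory size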
             | 
             | I'd welcome better explanations. :)
        
           | Icko_ wrote:
           | Not even that. With KV-caching, it's linear in the size of
           | the context; and if someone figured out a way to have e.g.
           | N log N complexity, I imagine with KV-caching it may go
           | down to log N complexity. (If the new algorithm permits
           | that.)
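           | 
           | A sketch of that point (a toy single-head decode loop, my
           | own illustration): with cached keys and values, each new
           | token does one dot product per previous token, so a step
           | costs O(t) rather than O(t^2).
           | 
           |     import numpy as np
           | 
           |     d = 64
           |     k_cache, v_cache = [], []  # grow by one per token
           | 
           |     def decode_step(x):
           |         # toy step: identity "projections" for brevity
           |         q, k, v = x, x, x
           |         k_cache.append(k)
           |         v_cache.append(v)
           |         K = np.stack(k_cache)   # (t, d) cached keys
           |         V = np.stack(v_cache)   # (t, d) cached values
           |         s = K @ q / np.sqrt(d)  # O(t) work per new token
           |         w = np.exp(s - s.max())
           |         w /= w.sum()
           |         return w @ V
           | 
           |     for _ in range(16):         # generate 16 tokens
           |         out = decode_step(np.random.randn(d))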
        
       | imranq wrote:
       | I like the idea of removing quadratic scaling for attention, but
       | this paper has thin experimental support. No real tasks are
       | tested beyond perplexity. Nothing on reasoning, retrieval QA, or
       | summarization quality. Even in perplexity the gains are marginal.
       | 
       | However, it removes attention, so I think it's worth watching
       | that space of non-attention models.
        
       | yorwba wrote:
       | This paper seems rather unfocused, explaining their architecture
       | three times with slight variations while managing to omit crucial
       | details like how exactly they compute gradients for their
       | "External Retrieval Memory."
       | 
       | Also, the section on DeepSeek is really weird: "While the precise
       | architectural details of DeepSeek LLM are still emerging, early
       | discussions suggest that it relies on an extended Transformer
       | backbone or a "hybrid" approach that likely incorporates some
       | form of attention-based mechanism, potentially at specific layers
       | or across chunk boundaries, to facilitate information flow across
       | large contexts." It makes it sound like a mystery, even though
       | there have been multiple papers published on it (they cite the R1
       | one) so that there's really no need to guess whether attention is
       | involved.
       | 
       | Overall I'm not convinced the authors know what they're doing.
        
         | roxolotl wrote:
         | Would you say they aren't paying attention?
        
           | cubefox wrote:
           | I think it's fair to say they are explicitly avoiding
           | attention.
        
       | albertzeyer wrote:
       | "hundreds of thousands to potentially millions of tokens" -
       | that's the same order as current commercial LLMs.
       | 
       | Also note, if the sequence length is not really much larger than
       | the model dimension (at least two orders of magnitude more), the
       | quadratic complexity of self-attention is really not such a big
       | issue - the matrix multiplication in the feed-forward layers
       | will usually be 8x the model dimension squared, and thus that
       | part will usually dominate.
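       | 
       | A back-of-the-envelope version of that comparison (rough
       | per-token multiply counts; my own numbers, not from the paper):
       | 
       |     # rough multiplies per token, per transformer layer
       |     d = 4096               # model dimension
       |     N = 32_000             # context length
       | 
       |     ffn = 8 * d * d        # feed-forward: d -> 4d -> d
       |     attn_proj = 4 * d * d  # Q, K, V, output projections
       |     attn_nn = 2 * N * d    # the part that grows with N
       | 
       |     print(ffn, attn_proj, attn_nn)
       |     # attn_nn only matches ffn + attn_proj around N = 6 * d,
       |     # and clearly dominates only well past that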
       | 
       | Also note that there has been so much research on this already.
       | While this particular approach might be novel, there have been
       | attempts to avoid the O(n^2) complexity in self-attention
       | basically since the original transformer paper came out in 2017.
       | I'm a bit surprised that this paper does not cite xLSTM or
       | Block-Recurrent Transformers.
       | 
       | Also, this paper comes up very short on experiments. There is
       | basically only table 2. There is no study on length
       | extrapolation (which is very relevant for the topic), no
       | needle-in-haystack experiments, no scaling studies, no larger
       | scale experiments, etc. Also, even in this main table 2, I see
       | a couple of typos. And looking at the results in table 2, the
       | improvements seem to be quite minor.
       | 
       | So I would conclude that this needs a lot more work.
        
         | cubefox wrote:
         | > "hundreds of thousands to potentially millions of tokens" -
         | that's the same order as current commercial LLMs.
         | 
         | Yes, but those are all relying on proprietary company secrets,
         | while this is an open research paper. Besides, only Gemini so
         | far has a context window of more than a million tokens.
        
           | littlestymaar wrote:
           | Llama 4 Scout also has it, and is an open-weight LLM;
           | unfortunately it is also disappointing at pretty much any
           | context length...
        
         | 3abiton wrote:
         | > Unlike traditional Transformer designs, which suffer from
         | quadratic memory and computation overload due to the nature of
         | the self attention mechanism, our model avoids token to token
         | attention entirely.
         | 
         | I skimmed the paper, and unlike transformers they can
         | basically scale much more efficiently with longer context.
         | While it's possible to fit 1M tokens, you need a significant
         | amount of memory. They benchmark against GPT-2, though, so I
         | would say it's quite preliminary work so far, although a
         | promising architecture.
        
         | boroboro4 wrote:
         | > Also note, if the sequence length is not really much larger
         | than the model dimension (at least two orders of magnitude
         | more), the quadratic complexity of self-attention is really
         | not such a big issue - the matrix multiplication in the feed-
         | forward layers will usually be 8x the model dimension squared,
         | and thus that part will usually dominate.
         | 
         | This is incorrect in the case of batched inference. There are
         | two bottlenecks at play: compute and memory, and your
         | reasoning applies to compute. In the case of memory it gets
         | trickier: for MLP layers you need to read the same set of
         | weights for all elements of your batch, while the KV cache
         | for attention is different for each element. That's why in
         | practice the real length where attention dominates would be
         | closer to model dimension / batch size, rather than just the
         | model dimension. And this number isn't as high anymore.
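         | 
         | A rough bytes-moved version of that argument (my own numbers,
         | assuming fp16 weights and KV cache, one layer):
         | 
         |     # bytes read per decode step for one layer, in fp16
         |     d, N, batch = 4096, 32_000, 64
         | 
         |     weights = 2 * 12 * d * d     # MLP + attn projections,
         |                                  # read once per batch
         |     kv_per_seq = 2 * 2 * N * d   # keys + values, per sequence
         | 
         |     print(weights / batch, kv_per_seq)
         |     # per sequence these match around N = 6 * d / batch,
         |     # i.e. roughly model dimension / batch size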
        
       | daxfohl wrote:
       | Partially related, is charging by token sustainable for LLM
       | shops? If the compute requirements go up quadratically, doesn't
       | that mean cost should as well?
        
         | sakras wrote:
         | Typically requests are binned by context length so that they
         | can be batched together. So you might have a 10k bin and a 50k
         | bin and a 500k bin, and then you drop context past 500k. So the
         | costs are fixed per-bin.
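         | 
         | A minimal sketch of that kind of binning (a hypothetical
         | helper, using the bin sizes from the example above):
         | 
         |     BINS = [10_000, 50_000, 500_000]  # example bin sizes
         | 
         |     def assign_bin(num_tokens):
         |         for limit in BINS:
         |             if num_tokens <= limit:
         |                 return limit
         |         return None  # past the last bin: drop context
         | 
         |     # requests in a bin are padded to its size and batched,
         |     # so the serving cost is roughly fixed per bin
         |     print(assign_bin(12_345))  # -> 50000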
        
       | maxrmk wrote:
       | > While the specific internal workings of DeepSeek LLM are still
       | being elucidated, it appears to maintain or approximate the self-
       | attention paradigm to some extent.
       | 
       | Totally nonsensical. DeepSeek's architecture is well documented,
       | and multiple implementations are available online.
        
       ___________________________________________________________________
       (page generated 2025-06-16 23:00 UTC)