[HN Gopher] Attention Residuals
___________________________________________________________________
Attention Residuals
Author : GaggiX
Score : 228 points
Date : 2026-03-20 18:23 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jszymborski wrote:
| This reminds me of the input gates of an LSTM.
| jjcm wrote:
| Two things stand out to me with this:
|
| 1. Drops compute required for training by ~20%. This approach
| won't just help the ever-escalating model sizes larger companies
| are pushing for; it means things like autoresearch can iterate on
| new model architectures faster.
|
| 2. WAY lower bandwidth requirements for inference. With
| approaches like this, it should run far better on consumer
| hardware. It apparently requires 1/6th the memory bandwidth of a
| traditional approach for better results.
|
| This is a big improvement if it can be generalized. They're
| claiming it's a drop-in replacement, so it seems like it can be.
| dvt wrote:
| > Drops compute required for training by ~20%.
|
| This is not true. The authors claim that w.r.t. training, their
| method adds negligible overhead for AttnRes with no memory
| impact (but is way more complicated for Block AttnRes, since we
| need to use pipelining for larger models, hence the O(Ld) &
| O(Nd) figures, with N ≪ L).
|
| > WAY lower bandwidth requirements for inference.
|
| Also not true. The paper has nothing to do with inference, apart
| from the benchmarks. If you're looking at the graph about
| "compute advantage," it's about training compute. They do some
| interpolation to get to the 1.25x number, basically answering
| the question "if a non-AttnRes architecture were trained, how
| much compute would it take to get to the same loss as AttnRes?"
| (The answer being ~25% more compute, i.e. AttnRes reaching the
| same loss with ~20% less.) It's an interesting claim, but
| there's all kinds of weird and unexpected convergence that can
| happen, so take it with a grain of salt.
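| To spell out the arithmetic behind those two numbers (my own
| back-of-the-envelope reading, not from the paper):
|
|     attnres = 1.00                   # compute AttnRes spends to reach loss X
|     baseline = 1.25 * attnres        # interpolated baseline compute for loss X
|     extra = baseline / attnres - 1   # 0.25: baseline needs ~25% more compute
|     saving = 1 - attnres / baseline  # 0.20: AttnRes uses ~20% less compute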
| observationist wrote:
| I think what they're getting at is that for a given unit of
| compute, this method achieves 125% performance.
|
| If model A reaches performance level 100 using 100 units of
| compute using old methods, and you train model B using
| AttnRes, aiming at performance level 100, it costs you 80
| units of compute.
|
| It probably doesn't map precisely, but that's where people
| are diverging from the claim - it doesn't explicitly say
| anything about reduced inference or training time, but that's
| the implicit value of these sorts of things. Less compute to
| equivalent performance can be a huge win for platforms at
| scale as well as for local models.
| dvt wrote:
| > I think what they're getting at is that for a given unit
| of compute, this method achieves 125% performance.
|
| This is not what they're getting at; I explained exactly
| what they're getting at. I mean, your equivalence of "loss"
| (what the authors _actually_ measured) and "performance" is
| just bizarre. We use benchmarks to measure performance, and
| the numbers there were like 1-5% better (apart from the
| GPQA-Diamond outlier).
|
| Do people even read these papers?
| jszymborski wrote:
| > Do people even read these papers?
|
| Overwhelmingly, no. You may have mistaken this for a
| lab's reading group, but most people here just skim the
| README, maybe read the abstract or figures. Expecting
| them to do more is uh... a bit strange?
|
| But also you can forgive people for equating loss with
| performance, which are admittedly different but related
| ideas.
| com2kid wrote:
| > 2. WAY lower bandwidth requirements for inference. Means with
| approaches like this it should run on consumer hardware far
| better. It apparently requires 1/6th the memory bandwidth of a
| traditional approach for better results.
|
| That should be the headline right there. Giant size-60-font
| headline.
|
| Some people have PhDs in burying the lede!
| talloaktrees wrote:
| except it's not true
| observationist wrote:
| It's not _not_ true, it's just that things are getting
| lost in the excitement. There are some specific cases where
| there's a big boost, it's just not exactly what people are
| hoping for.
|
| >>>The "1/6th" specifically appears in community
| comparisons to DeepSeek's mHC (multi-lane highway
| connections, a prior technique for better depth-wise
| information flow in deep models). Several Chinese-language
| sources and downstream discussions (e.g., translated
| articles, YouTube breakdowns, and blogs like houdao.com)
| state that Block AttnRes achieves comparable (or better)
| performance to mHC while using only one-sixth of the data
| read/write volume (or memory bandwidth pressure) during
| inference/engineering deployment.
|
| There are specific cases where that speedup does occur;
| it's not going to translate exactly into local models or
| other architectures or hardware.
| djsjajah wrote:
| No. It seems to me that the comment is objectively
| incorrect. The original comment was talking about
| inference, and from what I can tell, the model is strictly
| going to run slower than one trained to the same loss
| without this approach (it has "minimal overhead"). The
| main point is that you won't need to train that model for
| as long.
| westurner wrote:
| ScholarlyArticle: "Attention Residuals" (2026)
| https://arxiv.org/abs/2603.15031 :
|
| > Abstract: _Residual connections with PreNorm are standard in
| modern LLMs, yet they accumulate all layer outputs with fixed
| unit weights. This uniform aggregation causes uncontrolled
| hidden-state growth with depth, progressively diluting each
| layer's contribution. We propose Attention Residuals (AttnRes),
| which replaces this fixed accumulation with softmax attention
| over preceding layer outputs, allowing each layer to selectively
| aggregate earlier representations with learned, input-dependent
| weights. To address the memory and communication overhead of
| attending over all preceding layer outputs for large-scale model
| training, we introduce Block AttnRes, which partitions layers
| into blocks and attends over block-level representations,
| reducing the memory footprint while preserving most of the gains
| of full AttnRes._ [...]
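| A minimal PyTorch sketch of what that depth-wise attention could
| look like, going off the abstract alone (shapes, names, and the
| scoring scheme are my assumptions, not the authors' code):
|
|     import torch
|     import torch.nn as nn
|
|     class AttnResLayer(nn.Module):
|         """Wraps a sublayer so its input is a softmax-weighted mix
|         of all preceding layer outputs (learned, input-dependent
|         weights) instead of their fixed unit-weight sum."""
|
|         def __init__(self, d_model, sublayer):
|             super().__init__()
|             self.sublayer = sublayer  # any (B, T, d) -> (B, T, d) module
|             self.q = nn.Linear(d_model, d_model, bias=False)
|             self.k = nn.Linear(d_model, d_model, bias=False)
|
|         def forward(self, history):
|             # history: list of prior outputs, each (B, T, d);
|             # history[0] is the token embedding
|             h = torch.stack(history)                 # (L, B, T, d)
|             q = self.q(history[-1])                  # (B, T, d)
|             k = self.k(h)                            # (L, B, T, d)
|             # attention over depth (preceding layers), not over tokens
|             scores = torch.einsum('btd,lbtd->lbt', q, k) / h.shape[-1] ** 0.5
|             w = torch.softmax(scores, dim=0)         # (L, B, T)
|             x = torch.einsum('lbt,lbtd->btd', w, h)  # selective aggregation
|             return self.sublayer(x)
|
| Each layer's output would be appended back onto history, which is
| where the O(Ld) memory cost mentioned below comes from.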
| czbond wrote:
| Ah - now I understand how this has 2k+ (supposedly legitimate)
| GitHub stars in less than a week. Thank you - I was growing more
| skeptical.
| jryio wrote:
| This is the key piece
|
| > Full AttnRes is straightforward but requires O(Ld) memory at
| scale. Block AttnRes partitions layers into N blocks, accumulates
| within each block via standard residuals, and applies attention
| only over block-level representations. With ~8 blocks, it
| recovers most of Full AttnRes's gains while serving as a
| practical drop-in replacement with marginal overhead.
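| A rough sketch of how that partitioning could work, based only on
| this description (the helper names are hypothetical, not the
| repo's API):
|
|     def block_attnres_forward(x, blocks, attend_over):
|         """blocks: the model's L layers partitioned into N lists;
|         attend_over: a module that softmax-mixes a list of
|         block-level representations into one tensor."""
|         block_reps = [x]                 # O(Nd) memory, not O(Ld)
|         for block in blocks:
|             h = attend_over(block_reps)  # attention over block reps only
|             for layer in block:
|                 h = h + layer(h)         # standard residuals within a block
|             block_reps.append(h)
|         return block_reps[-1]
|
| With ~8 blocks the attention input stays small regardless of
| total depth, which is presumably where the "marginal overhead"
| claim comes from.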
| Murfalo wrote:
| Amazingly, the first author is a high school student!
| https://nathanchen.me/public/About%20me.html
| brcmthrowaway wrote:
| We're about to get an onslaught of young Chinese geniuses
| (raised in China). It's pure statistics.
|
| Sadly, the same can't be said about India (infrastructure/food
| security lags China).
| jldugger wrote:
| > It's pure statistics
|
| I'm not so sure about that:
| https://www.populationpyramid.net/china/2026/ suggests peak
| high school in China was years ago.
| yorwba wrote:
| Of ≈17 million Chinese students who graduated junior high
| school after grade 9 in 2024, ≈10 million were admitted
| to a high school, ≈4 million to a vocational school and
| the remaining ≈3 million disappear from education
| statistics, presumably directly entering the workforce.
| http://www.moe.gov.cn/jyb_sjzl/sjzl_fztjgb/202506/t20250611_...
|
| So at least in theory there's still lots of room to
| increase high school enrollment, though I doubt this would
| lead to noticeably more geniuses. The testing system is
| pretty good at sorting the best students into good schools,
| I think.
| jldugger wrote:
| Unfortunately it's hard to take China's population /
| enrollment demographics at face value. There are many
| incentives in the system to overstate growth, and cross-
| checks between different reports that _should_ be
| correlated suggest they're quite overstated.
|
| It's bad enough they passed some legislation a few years
| ago[1], but the damage has in many senses already been
| done. And it's unclear how effective the changes will be.
| So it's entirely possible those 3 million missing high
| schoolers never existed.
|
| [1]: https://www.reuters.com/world/china/chinas-top-legislative-b...
| AnotherGoodName wrote:
| At this point I've witnessed over 30 years of "stats about
| China aren't real" type posts, while China continues to
| demonstrate impressive economic and social results, so I'm
| far more inclined to believe the potentially flawed Chinese
| data than posts that basically claim all data out of China
| is fake.
| advael wrote:
| Isolated demands for rigor, really. China does have a lot
| of incentives to publish misleading statistics. Also, so
| does everyone else. In most places we bake skepticism of
| official lines from government and industry alike into
| our epistemic weights and move on, but when China does it
| we're supposed to treat it as a big deal. Propaganda at
| its finest
| yorwba wrote:
| The figures for students graduating primary school (18.57
| million) and entering junior high school (18.49 million)
| match up quite well, though. Do you think primary schools
| and junior high schools manage to coordinate massive
| student number inflation to the tune of 3 million non-
| existent students, but then at the transition to senior
| high schools that suddenly breaks down? If anything, I'd
| expect it to break down when those non-existent students
| are supposed to take the Zhongkao exam in order to
| graduate, not at the senior high school admissions stage.
|
| _Some_ statistics reported in China are unreliable
| because the person doing the reporting also has their
| performance evaluated by the numbers they report and
| there are few external checks on validity, but I don't
| think that's the case for student numbers in particular.
|
| Also, it seems like you're the same 'jldugger who cited
| Chinese population statistics upthread, but when somebody
| else does it, they're suddenly unreliable???
| jldugger wrote:
| > If anything, I'd expect it to break down when those
| non-existent students are supposed to take the Zhongkao
| exam in order to graduate, not at the senior high school
| admissions stage.
|
| Reasonable. If I were more conspiratorial, I might
| suggest that it's precisely because people are watching
| college exam numbers that 9th grade -> high school is
| where the break is. Or could just be the result of
| compounding growth from two competing officials making
| different exaggerated claims decades ago.
|
| But really, the high school enrollment gap is not super
| germane to my main point: we may have seen peak China
| population, stemming largely from a smaller incoming
| cohort. The sidebar about offsetting that decline with
| increased enrollment percentages is interesting, I'm just
| default skeptical.
|
| > cited Chinese population statistics upthread, but when
| somebody else does it, they're suddenly unreliable???
|
| My cite appears to use UN data, not the PRC's official
| stats (at least not directly). But I'm pretty sure the
| official stats are also showing the same trend, just at a
| slower rate of decline. I mean, it's the entire reason
| for loosening the one-child policy to two, then to three.
| eru wrote:
| I don't think you can blame food security here.
|
| Even if food security holds back 10% of Indians (which would
| still be a huge tragedy), that would still leave the other
| 90% for the 'onslaught'. 10% is just a made up number. But
| even with 50% you'd get an 'onslaught'.
|
| So if we are seeing less than that, it's probably down to
| other factors.
| srean wrote:
| > Sadly, same can't be said about India
|
| And quality of leadership.
|
| They (barring a few exceptions) are happy to gloat over
| imagined past glories of Vedic aeroplanes and inter-species
| head transplants apparently performed in a Hindu golden age,
| and to hand out loyalty-based funding that produces
| institutions like Galgotia University.
| bawana wrote:
| Long, long ago, I remember the first toy I ever got that was
| made in China. It was a wooden cube puzzle: various
| interlocking, differently shaped pieces that, when assembled,
| formed a cube. It was so different from all the other toys made
| in America by Hasbro, Mattel, Tonka, etc. Back then I felt like
| I was holding a toy made by the ancient Greeks, a puzzle to
| teach geometry, analysis, pattern recognition. So abstract, so
| removed from daily life, it transported me into a different
| world. Like chess, it was an engaging abstraction. But unlike
| chess, it was not about conflict but rather about interrelating
| pieces to make a greater whole.
|
| So this is what really unsettles me. Not that China graduates
| more engineers every year than we have employed in the entire
| US, but rather that these individuals are not about delegating
| work, but actually doing it. Whereas the western credo is to
| get someone else to do the work (or, in the words of Patton, to
| get someone else to die for his country), I get the feeling
| that China will get robots and AI to do the work. I am reminded
| of the joke about Chinese factories having only one security
| guard and one dog. The guard is there to feed the dog.
| youknownothing wrote:
| "the western credo is to get someone else to do the work"
|
| This. One of the things that most shocked me when I moved to
| London was how bad English people were at hard skills, but
| also how easily giving orders and "projecting gravitas" came
| to them. Everyone wants to be a "leader", which sadly has
| become code for reaping benefits of other people's work.
| caderosche wrote:
| Very cool!
| imtringued wrote:
| I don't know why all the posts fail to summarize the results
| properly.
|
| I had a similar idea in the back of my head, but here is a
| layman's explanation:
|
| Standard attention threads the previous layer's output to the
| next layer's input. By adding residual connections to each
| layer, the layers learn an update rule.
|
| There is an obvious limitation here: only the first layer gets
| to see the original input, and all subsequent layers only get
| to see the previous layer's output.
|
| With attention residuals, the idea is that you have a tiny
| attention operator that decides between using the original
| input and any of the previous layer outputs.
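| In code terms, the contrast is roughly this (my sketch of the
| idea, not the paper's notation; `score` is a hypothetical module
| mapping (L, B, T, d) to depth-wise logits):
|
|     import torch
|
|     def standard_residual_step(history, layer):
|         # fixed unit weights: the stream is the plain sum so far
|         x = history[-1]
|         return x + layer(x)
|
|     def attn_residual_step(history, layer, score):
|         # learned weights over the original input and every prior output
|         h = torch.stack(history)              # (L, B, T, d)
|         w = torch.softmax(score(h), dim=0)    # (L, B, T)
|         x = (w.unsqueeze(-1) * h).sum(dim=0)  # (B, T, d)
|         return layer(x)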
| antirez wrote:
| Exactly. I was reading all the other comments and wondering why
| many looked like they were talking about something else.
| ilreb wrote:
| [dupe] https://news.ycombinator.com/item?id=47401111
| roger_ wrote:
| Great idea and seems quite obvious in hindsight.
|
| Is it guaranteed to have the same effect on vanishing gradients
| though? What if it put weight 1 on a layer that had a tiny
| gradient?
___________________________________________________________________
(page generated 2026-03-21 23:01 UTC)