[HN Gopher] Attention Residuals
___________________________________________________________________
Attention Residuals
Author : GaggiX
Score : 228 points
Date : 2026-03-20 18:23 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jszymborski wrote:
| This reminds me of the input gates of an LSTM.
| jjcm wrote:
| Two things stand out to me with this:
|
| 1. Drops compute required for training by ~20%. This approach
| won't just help the ever-escalating model sizes larger companies
| are pushing for; it means things like autoresearch can iterate on
| new model architectures faster.
|
| 2. WAY lower bandwidth requirements for inference. With
| approaches like this, it should run far better on consumer
| hardware. It apparently requires 1/6th the memory bandwidth of a
| traditional approach for better results.
|
| This is a big improvement if it can be generalized. They're
| claiming it's a drop-in replacement, so it seems like it can be.
| dvt wrote:
| > Drops compute required for training by ~20%.
|
| This is not true. The authors claim that w.r.t. training, their
| method adds negligible overhead for AttnRes with no memory
| impact (but is way more complicated for Block AttnRes, since we
| need to use pipelining for larger models, hence the O(Ld) &
| O(Nd) figures, with N ≪ L).
|
| > WAY lower bandwidth requirements for inference.
|
| Also not true. The paper has nothing to do with inference, apart
| from the benchmarks. If you're looking at the graph about
| "compute advantage," it's about training compute. They do some
| interpolation to get to the 1.25x number, basically answering
| the question "if a non-AttnRes architecture were trained, how
| much compute would it take to get to the same loss as AttnRes?"
| (The answer being ~25% more compute, i.e. AttnRes reaching the
| same loss with ~20% less.) It's an interesting claim, but
| there's all kinds of weird and unexpected convergence that can
| happen, so take it with a grain of salt.
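| To spell out the arithmetic behind those two numbers (my own
| back-of-the-envelope reading, not from the paper):
|
|     attnres = 1.00                   # compute AttnRes spends to reach loss X
|     baseline = 1.25 * attnres        # interpolated baseline compute for loss X
|     extra = baseline / attnres - 1   # 0.25: baseline needs ~25% more compute
|     saving = 1 - attnres / baseline  # 0.20: AttnRes uses ~20% less compute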
| observationist wrote:
| I think what they're getting at is that for a given unit of
| compute, this method achieves 125% performance.
|
| If model A reaches performance level 100 using 100 units of
| compute using old methods, and you train model B using
| AttnRes, aiming at performance level 100, it costs you 80
| units of compute.
|
| It probably doesn't map precisely, but that's where people
| are diverging from the claim - it doesn't explicitly say
| anything about reduced inference or training time, but that's
| the implicit value of these sorts of things. Less compute to
| equivalent performance can be a huge win for platforms at
| scale as well as for local models.
| dvt wrote:
| > I think what they're getting at is that for a given unit
| of compute, this method achieves 125% performance.
|
| This is not what they're getting at; I explained exactly
| what they're getting at. I mean, your equivalence of "loss"
| (what the authors _actually_ measured) and "performance" is
| just bizarre. We use benchmarks to measure performance, and
| the numbers there were like 1-5% better (apart from the
| GPQA-Diamond outlier).
|
| Do people even read these papers?
| jszymborski wrote:
| > Do people even read these papers?
|
| Overwhelmingly, no. You may have mistaken this for a
| lab's reading group, but most people here just skim the
| README, maybe read the abstract or figures. Expecting
| them to do more is uh... a bit strange?
|
| But also you can forgive people for equating loss with
| performance, which are admittedly different but related
| ideas.
| com2kid wrote:
| > 2. WAY lower bandwidth requirements for inference. Means with
| approaches like this it should run on consumer hardware far
| better. It apparently requires 1/6th the memory bandwidth of a
| traditional approach for better results.
|
| That should be the headline right there. Giant size-60-font
| headline.
|
| Some people have PhDs in burying the lede!
| talloaktrees wrote:
| except it's not true
| observationist wrote:
| It's not _not_ true, it's just that things are getting
| lost in the excitement. There are some specific cases where
| there's a big boost, it's just not exactly what people are
| hoping for.
|
| >>>The "1/6th" specifically appears in community
| comparisons to DeepSeek's mHC (multi-lane highway
| connections, a prior technique for better depth-wise
| information flow in deep models). Several Chinese-language
| sources and downstream discussions (e.g., translated
| articles, YouTube breakdowns, and blogs like houdao.com)
| state that Block AttnRes achieves comparable (or better)
| performance to mHC while using only one-sixth of the data
| read/write volume (or memory bandwidth pressure) during
| inference/engineering deployment.
|
| There are specific cases where that speedup does occur;
| it's not going to translate exactly into local models or
| other architectures or hardware.
| djsjajah wrote:
| No. It seems to me that the comment is objectively
| incorrect. The original comment was talking about
| inference, and from what I can tell, the model is strictly
| going to run slower than one trained to the same loss
| without this approach (it has "minimal overhead"). The
| main point is that you won't need to train that model for
| as long.
| westurner wrote:
| ScholarlyArticle: "Attention Residuals" (2026)
| https://arxiv.org/abs/2603.15031 :
|
| > Abstract: _Residual connections with PreNorm are standard in
| modern LLMs, yet they accumulate all layer outputs with fixed
| unit weights. This uniform aggregation causes uncontrolled
| hidden-state growth with depth, progressively diluting each
| layer's contribution. We propose Attention Residuals (AttnRes),
| which replaces this fixed accumulation with softmax attention
| over preceding layer outputs, allowing each layer to selectively
| aggregate earlier representations with learned, input-dependent
| weights. To address the memory and communication overhead of
| attending over all preceding layer outputs for large-scale model
| training, we introduce Block AttnRes, which partitions layers
| into blocks and attends over block-level representations,
| reducing the memory footprint while preserving most of the gains
| of full AttnRes._ [...]
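| A minimal PyTorch sketch of what that depth-wise attention could
| look like, going off the abstract alone (shapes, names, and the
| scoring scheme are my assumptions, not the authors' code):
|
|     import torch
|     import torch.nn as nn
|
|     class AttnResLayer(nn.Module):
|         """Wraps a sublayer so its input is a softmax-weighted mix
|         of all preceding layer outputs (learned, input-dependent
|         weights) instead of their fixed unit-weight sum."""
|
|         def __init__(self, d_model, sublayer):
|             super().__init__()
|             self.sublayer = sublayer  # any (B, T, d) -> (B, T, d) module
|             self.q = nn.Linear(d_model, d_model, bias=False)
|             self.k = nn.Linear(d_model, d_model, bias=False)
|
|         def forward(self, history):
|             # history: list of prior outputs, each (B, T, d);
|             # history[0] is the token embedding
|             h = torch.stack(history)                 # (L, B, T, d)
|             q = self.q(history[-1])                  # (B, T, d)
|             k = self.k(h)                            # (L, B, T, d)
|             # attention over depth (preceding layers), not over tokens
|             scores = torch.einsum('btd,lbtd->lbt', q, k) / h.shape[-1] ** 0.5
|             w = torch.softmax(scores, dim=0)         # (L, B, T)
|             x = torch.einsum('lbt,lbtd->btd', w, h)  # selective aggregation
|             return self.sublayer(x)
|
| Each layer's output would be appended back onto history, which is
| where the O(Ld) memory cost mentioned below comes from.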
| czbond wrote:
| Ah - now I understand how this has 2k+ (supposedly legitimate)
| GitHub stars in less than a week. Thank you - I was growing more
| skeptical.
| jryio wrote:
| This is the key piece
|
| > Full AttnRes is straightforward but requires O(Ld) memory at
| scale. Block AttnRes partitions layers into N blocks, accumulates
| within each block via standard residuals, and applies attention
| only over block-level representations. With ~8 blocks, it
| recovers most of Full AttnRes's gains while serving as a
| practical drop-in replacement with marginal overhead.
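| A rough sketch of how that partitioning could work, based only on
| this description (the helper names are hypothetical, not the
| repo's API):
|
|     def block_attnres_forward(x, blocks, attend_over):
|         """blocks: the model's L layers partitioned into N lists;
|         attend_over: a module that softmax-mixes a list of
|         block-level representations into one tensor."""
|         block_reps = [x]                 # O(Nd) memory, not O(Ld)
|         for block in blocks:
|             h = attend_over(block_reps)  # attention over block reps only
|             for layer in block:
|                 h = h + layer(h)         # standard residuals within a block
|             block_reps.append(h)
|         return block_reps[-1]
|
| With ~8 blocks the attention input stays small regardless of
| total depth, which is presumably where the "marginal overhead"
| claim comes from.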
| Murfalo wrote:
| Amazingly, the first author is a high school student!
| https://nathanchen.me/public/About%20me.html
| brcmthrowaway wrote:
| We're about to get an onslaught of young Chinese geniuses
| (raised in China). It's pure statistics.
|
| Sadly, the same can't be said about India (infrastructure/food
| security lags China).
| jldugger wrote:
| > It's pure statistics
|
| I'm not so sure about that:
| https://www.populationpyramid.net/china/2026/ suggests peak
| high school in China was years ago.
| yorwba wrote:
| Of ≈17 million Chinese students who graduated junior high
| school after grade 9 in 2024, ≈10 million were admitted
| to a high school, ≈4 million to a vocational school and
| the remaining ≈3 million disappear from education
| statistics, presumably directly entering the workforce.
| http://www.moe.gov.cn/jyb_sjzl/sjzl_fztjgb/202506/t20250611_...
|
| So at least in theory there's still lots of room to
| increase high school enrollment, though I doubt this would
| lead to noticeably more geniuses. The testing system is
| pretty good at sorting the best students into good schools,
| I think.
| jldugger wrote:
| Unfortunately it's hard to take China's population /
| enrollment demographics at face value. There are many
| incentives in the system to overstate growth, and cross-
| checks between different reports that _should_ be
| correlated suggest they're quite overstated.
|
| It's bad enough they passed some legislation a few years
| ago[1], but the damage has in many senses already been
| done. And it's unclear how effective the changes will be.
| So it's entirely possible those 3 million missing high
| schoolers never existed.
|
| [1]: https://www.reuters.com/world/china/chinas-top-legislative-b...
| AnotherGoodName wrote:
| At this point I've witnessed over 30 years of "stats about
| China aren't real" type posts, while China continues to
| demonstrate impressive economic and social results, so I'm
| far more inclined to believe the potentially flawed Chinese
| data than posts that basically claim all data out of China
| is fake.
| advael wrote:
| Isolated demands for rigor, really. China does have a lot
| of incentives to publish misleading statistics. Also, so
| does everyone else. In most places we bake skepticism of
| official lines from government and industry alike into
| our epistemic weights and move on, but when China does it
| we're supposed to treat it as a big deal. Propaganda at
| its finest
| yorwba wrote:
| The figures for students graduating primary school (18.57
| million) and entering junior high school (18.49 million)
| match up quite well, though. Do you think primary schools
| and junior high schools manage to coordinate massive
| student number inflation to the tune of 3 million non-
| existent students, but then at the transition to senior
| high schools that suddenly breaks down? If anything, I'd
| expect it to break down when those non-existent students
| are supposed to take the Zhongkao exam in order to
| graduate, not at the senior high school admissions stage.
|
| _Some_ statistics reported in China are unreliable
| because the person doing the reporting also has their
| performance evaluated by the numbers they report and
| there are few external checks on validity, but I don't
| think that's the case for student numbers in particular.
|
| Also, it seems like you're the same 'jldugger who cited
| Chinese population statistics upthread, but when somebody
| else does it, they're suddenly unreliable???
| jldugger wrote:
| > If anything, I'd expect it to break down when those
| non-existent students are supposed to take the Zhongkao
| exam in order to graduate, not at the senior high school
| admissions stage.
|
| Reasonable. If I were more conspiratorial, I might
| suggest that it's precisely because people are watching
| college exam numbers that 9th grade -> high school is
| where the break is. Or could just be the result of
| compounding growth from two competing officials making
| different exaggerated claims decades ago.
|
| But really, the high school enrollment gap is not super
| germane to my main point: we may have seen peak China
| population, stemming largely from a smaller incoming
| cohort. The sidebar about offsetting that decline with
| increased enrollment percentages is interesting, I'm just
| default skeptical.
|
| > cited Chinese population statistics upthread, but when
| somebody else does it, they're suddenly unreliable???
|
| My cite appears to use UN data, not the PRC's official
| stats (at least not directly). But I'm pretty sure the
| official stats are also showing the same trend, just at a
| slower rate of decline. I mean, it's the entire reason
| for loosening the one-child policy to two, then to three.
| eru wrote:
| I don't think you can blame food security here.
|
| Even if food security holds back 10% of Indians (which would
| still be a huge tragedy), that would still leave the other
| 90% for the 'onslaught'. 10% is just a made up number. But
| even with 50% you'd get an 'onslaught'.
|
| So if we are seeing less than that, it's probably down to
| other factors.
| srean wrote:
| > Sadly, same can't be said about India
|
| And quality of leadership.
|
| They (barring a few exceptions) are happy to gloat over
| imagined past glories of Vedic aeroplanes and inter-species
| head transplants apparently performed in a Hindu golden age,
| and to hand out loyalty-based funding that produces
| institutions like Galgotia University.
| bawana wrote:
| Long, long ago, I remember the first toy I ever got that was
| made in China. It was a wooden cube puzzle: various
| interlocking, differently shaped pieces that, when assembled,
| formed a cube. It was so different from all the other toys made
| in America by Hasbro, Mattel, Tonka, etc. Back then I felt like
| I was holding a toy made by the ancient Greeks, a puzzle to
| teach geometry, analysis, pattern recognition. So abstract, so
| removed from daily life, it transported me into a different
| world. Like chess, it was an engaging abstraction. But unlike
| chess, it was not about conflict but rather about interrelating
| pieces to make a greater whole.
|
| So this is what really unsettles me. Not that China graduates
| more engineers every year than we have employed in the entire
| US, but rather that these individuals are not about delegating
| work, but actually doing it. Whereas the western credo is to
| get someone else to do the work (or, in the words of Patton, to
| get someone else to die for his country), I get the feeling
| that China will get robots and AI to do the work. I am reminded
| of the joke about Chinese factories having only one security
| guard and one dog. The guard is there to feed the dog.
| youknownothing wrote:
| "the western credo is to get someone else to do the work"
|
| This. One of the things that most shocked me when I moved to
| London was how bad English people were at hard skills, but
| also how easily giving orders and "projecting gravitas" came
| to them. Everyone wants to be a "leader", which sadly has
| become code for reaping benefits of other people's work.
| caderosche wrote:
| Very cool!
| imtringued wrote:
| I don't know why all the posts fail to summarize the results
| properly.
|
| I had a similar idea in the back of my head, but here is a
| layman's explanation:
|
| Standard attention threads the previous layer's output to the
| next layer's input. By adding residual connections to each
| layer, the layers learn an update rule.
|
| There is an obvious limitation here: only the first layer gets
| to see the original input, and all subsequent layers only get
| to see the previous layer's output.
|
| With attention residuals, the idea is that you have a tiny
| attention operator that decides between using the original
| input and any of the previous layer outputs.
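| In code terms, the contrast is roughly this (my sketch of the
| idea, not the paper's notation; `score` is a hypothetical module
| mapping (L, B, T, d) to depth-wise logits):
|
|     import torch
|
|     def standard_residual_step(history, layer):
|         # fixed unit weights: the stream is the plain sum so far
|         x = history[-1]
|         return x + layer(x)
|
|     def attn_residual_step(history, layer, score):
|         # learned weights over the original input and every prior output
|         h = torch.stack(history)              # (L, B, T, d)
|         w = torch.softmax(score(h), dim=0)    # (L, B, T)
|         x = (w.unsqueeze(-1) * h).sum(dim=0)  # (B, T, d)
|         return layer(x)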
| antirez wrote:
| Exactly. I was reading all the other comments and wondering why
| many looked like they were talking about something else.
| ilreb wrote:
| [dupe] https://news.ycombinator.com/item?id=47401111
| roger_ wrote:
| Great idea and seems quite obvious in hindsight.
|
| Is it guaranteed to have the same effect on vanishing gradients
| though? What if it put weight 1 on a layer that had a tiny
| gradient?
___________________________________________________________________
(page generated 2026-03-21 23:01 UTC)