[HN Gopher] Transformers without normalization
___________________________________________________________________
Transformers without normalization
Author : kaycebasques
Score : 35 points
Date : 2025-07-24 14:48 UTC (8 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| gnabgib wrote:
| Discussion (260 points, 4 months ago, 32 comments)
| https://news.ycombinator.com/item?id=43369633
| godelski wrote:
| Other than the title being a bit misleading, I think the paper
| is good. I say misleading because they replace Layer
| Normalization with a tanh function, which still bounds the range
| to [-1, 1]. Plenty of people would call that normalization (an
| unfortunately overloaded term).
|
| While the result isn't too surprising, the paper has a good
| ablation study and helps build confidence in the mechanism. It's
| simple and quick to implement, but I don't find that a
| disadvantage. Arguably this is not novel, but sometimes it is
| worth revisiting things when the rest of the environment has
| changed, and I think the study's thoroughness makes it useful to
| the community.
|
| The project page is here[0], which gives a very quick overview
| of the paper.
|
| [0] https://jiachenzhu.github.io/DyT/
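|
| For concreteness, here is a minimal sketch of the DyT layer as I
| understand it from the project page, written as a PyTorch
| module. The learnable scalar alpha plus a per-channel affine is
| what the paper describes; treat the init values and naming as
| illustrative assumptions rather than the reference
| implementation.
|
|   import torch
|   import torch.nn as nn
|
|   class DyT(nn.Module):
|       """Dynamic Tanh: a drop-in replacement for LayerNorm."""
|       def __init__(self, dim: int, init_alpha: float = 0.5):
|           super().__init__()
|           # Learnable scalar controlling how hard tanh squashes.
|           self.alpha = nn.Parameter(torch.full((1,), init_alpha))
|           # Per-channel affine, like LayerNorm's elementwise affine.
|           self.weight = nn.Parameter(torch.ones(dim))
|           self.bias = nn.Parameter(torch.zeros(dim))
|
|       def forward(self, x: torch.Tensor) -> torch.Tensor:
|           # No mean/variance statistics are computed; tanh alone
|           # bounds the activations to (-1, 1).
|           return torch.tanh(self.alpha * x) * self.weight + self.bias
|
| Swapping this in wherever a transformer block would apply
| nn.LayerNorm(dim) is essentially the change the paper studies.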
| giancarlostoro wrote:
| > (an unfortunately overloaded term)
|
| I mentioned normalization in an interview, and they had no idea
| what I was talking about given my context. They were thinking of
| database normalization; I was thinking of DATA normalization,
| where you uppercase all inputs for, e.g., an email address, so
| that casing doesn't matter at login because you uppercase it
| again when you check against the database. I'm sure there are a
| zillion other normalization methods for different things.
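|
| A tiny illustration of that last kind of normalization (the
| helper name is made up):
|
|   def normalize_email(raw: str) -> str:
|       # One canonical casing so comparisons are case-insensitive.
|       return raw.strip().upper()
|
|   # Apply the same normalization at signup and at login, so
|   # "Alice@Example.com" and "alice@example.com" hit the same row.
|   assert normalize_email("Alice@Example.com ") == \
|          normalize_email("ALICE@example.COM")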
___________________________________________________________________
(page generated 2025-07-24 23:01 UTC)