[HN Gopher] Cautious Optimizers: Improving Training with One Lin...
___________________________________________________________________
Cautious Optimizers: Improving Training with One Line of Code
Author : tosh
Score : 38 points
Date : 2025-03-03 17:59 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| mkaic wrote:
| Damn, this is a strikingly simple modification. Basically, modern
| deep learning optimizers typically calculate the update to the
| weights at each step using some kind of momentum and/or LR
| scaling based on the running variance of the gradients. This
| means that, in theory, the "instantaneous" gradients from a
| particular backward pass might point in a different direction
| than the update the optimizer actually applies. The change the
| authors propose is to simply zero out any components of the
| optimizer's proposed update that have the opposite sign of the
| current gradient from the most recent backward pass. They're
| essentially saying "only apply the long-term stabilized update
| where it _agrees_ with the current 'instantaneous' gradient."
| They show that this simple change significantly speeds up model
| training.
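|
| In code, the masking is roughly this (my own PyTorch sketch of
| the idea as I understand it, not the paper's implementation;
| tensor names are mine):
|
|     import torch
|
|     # Stand-ins for one parameter tensor:
|     #   update = step the optimizer would normally take
|     #            (momentum / Adam-style)
|     #   grad   = instantaneous gradient from the latest
|     #            backward pass
|     p      = torch.randn(10)
|     update = torch.randn(10)
|     grad   = torch.randn(10)
|     lr     = 1e-3
|
|     # Keep only the components where the proposed update and
|     # the current gradient agree in sign; zero out the rest.
|     mask = (update * grad > 0).to(update.dtype)
|     # (IIRC the paper also rescales by the fraction of kept
|     #  entries so the average step size is preserved.)
|     p -= lr * update * mask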
|
| I'm pretty intrigued by this, but will, as usual, wait for
| independent replications to come out before I fully believe it.
| That said, because of how simple this is, I'd expect such
| replications to happen within 24 hours. Exciting work!
| shoubidouwah wrote:
| I wonder if there might not be an opportunity for a warmup-based
| mask inversion: for the first few epochs, only apply the momentum
| components agreeing with the instantaneous gradient - after
| that, invert the mask, since the momentum would technically have
| more info?
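|
| Roughly what I have in mind (toy sketch, my own naming, nothing
| from the paper):
|
|     import torch
|
|     def masked_update(update, grad, step, warmup_steps=1000):
|         # Sign-agreement mask between the optimizer's proposed
|         # update and the instantaneous gradient.
|         agree = (update * grad > 0).to(update.dtype)
|         # During warmup trust the agreeing components; after
|         # warmup invert the mask and trust the momentum where
|         # it disagrees with the noisy instantaneous gradient.
|         mask = agree if step < warmup_steps else 1.0 - agree
|         return update * mask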
|
| In any case, good idea - reminds me of the "apply the same
| gradient multiple times" trick from a few years ago. This may
| have weird behaviours at low batch sizes though...
___________________________________________________________________
(page generated 2025-03-03 23:01 UTC)