[HN Gopher] Cautious Optimizers: Improving Training with One Lin...
___________________________________________________________________
Cautious Optimizers: Improving Training with One Line of Code
Author : tosh
Score : 38 points
Date : 2025-03-03 17:59 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| mkaic wrote:
| Damn, this is a strikingly simple modification. Basically, modern
| deep learning optimizers typically calculate the update to the
| weights at each step using some kind of momentum and/or LR
| scaling based on the running variance of the gradients. This
| means that, in theory, the "instantaneous" gradients from a
| particular backward pass might point in a different direction
| than the update the optimizer actually applies. The change the
| authors propose is to simply zero out any components of the
| optimizer's proposed update that have the opposite sign of the
| current gradient from the most recent backward pass. They're
| essentially saying "only apply the long-term stabilized update
| where it _agrees_ with the current 'instantaneous' gradient."
| They show that this simple change significantly speeds up model
| training.
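|
| In code, the masking is roughly this (my own PyTorch sketch of
| the idea as I understand it, not the paper's implementation;
| tensor names are mine):
|
|     import torch
|
|     # Stand-ins for one parameter tensor:
|     #   update = step the optimizer would normally take
|     #            (momentum / Adam-style)
|     #   grad   = instantaneous gradient from the latest
|     #            backward pass
|     p      = torch.randn(10)
|     update = torch.randn(10)
|     grad   = torch.randn(10)
|     lr     = 1e-3
|
|     # Keep only the components where the proposed update and
|     # the current gradient agree in sign; zero out the rest.
|     mask = (update * grad > 0).to(update.dtype)
|     # (IIRC the paper also rescales by the fraction of kept
|     #  entries so the average step size is preserved.)
|     p -= lr * update * mask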
|
| I'm pretty intrigued by this, but will, as usual, wait for
| independent replications to come out before I fully believe it.
| That said, because of how simple this is, I'd expect such
| replications to happen within 24 hours. Exciting work!
| shoubidouwah wrote:
| I wonder if there might not be an opportunity for a warmup-based
| mask inversion: for the first few epochs, only apply the momentum
| components agreeing with the instantaneous gradient - after
| that, invert the mask, since the momentum would technically have
| more info?
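|
| Roughly what I have in mind (toy sketch, my own naming, nothing
| from the paper):
|
|     import torch
|
|     def masked_update(update, grad, step, warmup_steps=1000):
|         # Sign-agreement mask between the optimizer's proposed
|         # update and the instantaneous gradient.
|         agree = (update * grad > 0).to(update.dtype)
|         # During warmup trust the agreeing components; after
|         # warmup invert the mask and trust the momentum where
|         # it disagrees with the noisy instantaneous gradient.
|         mask = agree if step < warmup_steps else 1.0 - agree
|         return update * mask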
|
| In any case, good idea - reminds me of the "apply the same
| gradient multiple times" trick from a few years ago. This may
| have weird behaviours at low batch sizes though...
___________________________________________________________________
(page generated 2025-03-03 23:01 UTC)