[HN Gopher] Schedule-Free Learning - A New Way to Train
       ___________________________________________________________________
        
       Schedule-Free Learning - A New Way to Train
        
       Author : ironbound
       Score  : 66 points
       Date   : 2024-04-06 14:29 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | rand0mwalk wrote:
       | Is there an accompanying paper out there?
        
         | blt wrote:
         | +1, curious to see if the paper has a convergence rate proof
         | for convex objectives.
        
           | cutkosky wrote:
           | (an) author here: paper will likely be coming out in
           | O(month). But, yes it turns out that the method is minimax
           | optimal for stochastic convex optimization for a wide variety
           | of parameter settings. Of course, minimax optimality alone
           | does not fully explain empirical success - we've had minimax
           | optimal algorithms for decades!
        
             | xpe wrote:
             | This would have been witty IMO: "the paper will be out in
             | O(negative_peer_reviews)"
             | 
             | > (an) author here: paper will likely be coming out in
             | O(month)
             | 
              | Ugh. I'm adding "O(month)" to my list of bootless metaphors.
             | 
              | Why? (1) Because in Big-O notation, O(month) would equal
              | O(day), which is not the intended meaning in the comment
              | above; (2) it is nonsensical; one would never say e.g. "the
              | run-time of an algorithm is O(seconds)" -- we write some
              | kind of _input_ inside the parens, not the _output_.
             | 
              | Anyhow, we already have the words _roughly_ and _about_;
              | e.g. "about a month".
             | 
             | Feel free to call me pedantic, but words matter.
        
         | miven wrote:
          | I think the author of this method said it's coming in a month
          | or so.
        
         | JieJie wrote:
         | From the Related Work section (best guess):
         | 
          | Stochastic Weight Averaging (Izmailov et al., 2018)
          | https://arxiv.org/abs/1803.05407
          | 
          | Latest Weight Averaging (Kaddour, 2022)
          | https://arxiv.org/abs/2209.14981
          | 
          | Latest Weight Averaging? (Sanyal et al., 2023)
          | https://arxiv.org/abs/2311.16294
          | 
          | Cyclic Learning Rates (Portes et al., 2022)
          | https://arxiv.org/abs/2206.00832
          | 
          | Exponential Moving Average? (Zhanghan? et al., 2019)
          | https://arxiv.org/abs/1909.01804
        
       | yinser wrote:
        | I'm continually impressed by Meta/FAIR's contributions to the
        | open AI space. Never thought I'd say that.
        
       | shostack wrote:
        | And here I was hoping for something about how to approach
        | self-driven learning and education when you have a hectic,
        | unpredictable schedule and are trying to fit learning in
        | between other things with the fragments of time you have.
        
         | p1esk wrote:
         | Models learn so you don't have to :)
        
       | tysam_and wrote:
        | This is a pretty hyped-up optimizer that seems to have okay-ish
        | performance in practice, but there are a number of major red
        | flags here. For one, the baselines are decently sandbagged, yet
        | the Twitter posts sharing them (which are pretty hype-y)
        | directly say that the baselines are "highly tuned" and that
        | there is no benchmark trickery, which is flat-out wrong. To
        | someone who has not worked with said benchmarks that is a
        | plausible statement, but having worked with some of these
        | datasets very closely, I can say some of the baselines are
        | simply terrible, and I don't know where they came from.
       | 
        | Additionally, the optimizer does appear to have a kind of
        | momentum, despite claims directly to the contrary, and uses it
        | in a Nesterov-like step (line 2 of 3 in the inner loop).
        | Finally, it is 'schedule-free' only because the schedule is
        | hardcoded into the algorithm itself -- 1./steps_taken, which is
        | not a rare learning rate schedule (sketched below). This is a
        | decently robust but sometimes suboptimal schedule, and I find
        | it sketchy to claim that it is 'schedule-free'. It also ties
        | the optimizer's behavior to the number of steps taken, which is
        | potentially a problem if you are using any batch-size and
        | learning-rate scaling strategies, as I understand it.
       | 
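        | To make that concrete, here is a rough paraphrase of the
        | three-line inner loop as I read it from the repo's description
        | (variable and function names are mine, not the authors' code,
        | and details may differ from the actual implementation):
        | 
        |   import numpy as np
        | 
        |   def schedule_free_sgd(w0, grad_fn, lr=0.1, beta=0.9, steps=1000):
        |       """Sketch only; grad_fn(w) returns the gradient at w."""
        |       x = np.array(w0, dtype=float)  # averaged iterate (eval)
        |       z = x.copy()  # base SGD iterate
        |       for t in range(1, steps + 1):
        |           y = (1 - beta) * z + beta * x  # momentum-like mix
        |           z = z - lr * grad_fn(y)  # plain SGD step taken at y
        |           c = 1.0 / t  # the 1/steps_taken weight
        |           x = (1 - c) * x + c * z  # running average of z
        |       return x
        | 
        |   # e.g. minimizing f(w) = w^2 (gradient 2w), starting at 5
        |   print(schedule_free_sgd([5.0], lambda w: 2.0 * w))
        | 
        | The 1/t weight in the last update is the hardcoded schedule I
        | mean: x is an unweighted running average of the z iterates,
        | and it is x that you evaluate with.
        | 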
        | There is a mixture of hype and substance here, and I wish the
        | author were more straightforward with their approach and
        | claims. I think there is the potential for a good
        | "bolts-included" optimizer in some of the ideas presented here,
        | but the amount of overhyping and deception makes me not want to
        | trust any of the follow-up work to come.
       | 
        | Unfortunately, hype is what sells best on Twitter, and some of
        | the claims being made here appear to be, at the very best,
        | deceptive and, at the very worst, untrue. I could be wrong --
        | these are just my personal opinions from my own experience --
        | but I do occasionally find myself distraught about the things
        | that tend to catch wind in the technical news cycle.
       | 
       | -Fern
        
       | johndough wrote:
       | I did a quick comparison on MNIST with a small ConvNet, comparing
        | this AdamWScheduleFree optimizer against a few other optimizers
       | (RAdam, NAdam, AdamW, SGD, Adam, Adafactor, SophiaG). The
       | validation accuracy seems to be okay and the train loss decreases
       | remarkably quickly.
       | 
       | Validation accuracy: https://i.imgur.com/8ZtX7Rd.png
       | 
       | Train loss: https://i.imgur.com/o5XdQ29.png
       | 
        | Code: https://bpa.st/NVJQ (currently it only runs on my
        | computer; I did not have time to clean it up)
       | 
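        | If anyone wants to try the optimizer without my script, the
        | basic drop-in usage looks roughly like this (going from the
        | repo README as I remember it; the model is a placeholder, not
        | the ConvNet from the benchmark). The notable difference from a
        | normal optimizer is the optimizer.train()/.eval() calls, which
        | the schedule-free optimizers need in addition to the model's
        | own mode switches:
        | 
        |   import torch
        |   import torch.nn.functional as F
        |   import schedulefree  # pip install schedulefree
        | 
        |   model = torch.nn.Sequential(
        |       torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
        |   opt = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)
        | 
        |   def train_step(images, labels):
        |       model.train()
        |       opt.train()  # put the optimizer into training mode
        |       opt.zero_grad()
        |       loss = F.cross_entropy(model(images), labels)
        |       loss.backward()
        |       opt.step()
        |       return loss.item()
        | 
        |   @torch.no_grad()
        |   def accuracy(images, labels):
        |       model.eval()
        |       opt.eval()  # evaluate with the averaged weights
        |       preds = model(images).argmax(1)
        |       return (preds == labels).float().mean().item()
        | 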
       | Note that this is just a toy benchmark with very little
       | hyperparameter tuning. You could probably get similar results
       | with most optimizers and an appropriate schedule. Nevertheless, I
       | appreciate every hyperparameter that I do not have to set
       | manually.
       | 
       | In summary, this seems to be a promising optimizer. I'll add it
       | to my list of optimizers to try for new deep learning projects.
        
       ___________________________________________________________________
       (page generated 2024-04-06 23:00 UTC)