[HN Gopher] Schedule-Free Learning - A New Way to Train
___________________________________________________________________
Schedule-Free Learning - A New Way to Train
Author : ironbound
Score : 66 points
Date : 2024-04-06 14:29 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| rand0mwalk wrote:
| Is there an accompanying paper out there?
| blt wrote:
| +1, curious to see if the paper has a convergence rate proof
| for convex objectives.
| cutkosky wrote:
| (an) author here: paper will likely be coming out in
| O(month). But yes, it turns out that the method is minimax
| optimal for stochastic convex optimization for a wide variety
| of parameter settings. Of course, minimax optimality alone
| does not fully explain empirical success - we've had minimax
| optimal algorithms for decades!
| xpe wrote:
| This would have been witty IMO: "the paper will be out in
| O(negative_peer_reviews)"
|
| > (an) author here: paper will likely be coming out in
| O(month)
|
| Ugh. I'm adding "O(month)" to my list of bootless metaphors.
|
| Why? (1) Because in Big-O notation, O(month) would equal
| O(day), which is not the intended meaning in the comment
| above; (2) It is nonsensical; one would never say, e.g.,
| "the run-time of an algorithm is O(seconds)" -- we write
| some kind of _input_ inside the parens, not the _output_.
|
| Anyhow, we already have the words _roughly_ and _about_;
| e.g. "about a month".
|
| Feel free to call me pedantic, but words matter.
| miven wrote:
| I think the author of this method said it's coming in a month
| or so
| JieJie wrote:
| From the Related Work section (best guess):
|
| Stochastic Weight Averaging (Izmailov et al 2018)
| https://arxiv.org/abs/1803.05407
|
| Latest Weight Averaging (Kaddour 2022)
| https://arxiv.org/abs/2209.14981
|
| Latest Weight Averaging? (Sanyal et al 2023)
| https://arxiv.org/abs/2311.16294
|
| Cyclic Learning Rates (Portes et al 2022)
| https://arxiv.org/abs/2206.00832
|
| Exponential Moving Average? (Zhanghan? et al 2019)
| https://arxiv.org/abs/1909.01804
| yinser wrote:
| I'm continually impressed by Meta/FAIR's contributions to the
| open AI space. Never thought I'd say that.
| shostack wrote:
| And here I was hoping for something related to how to approach
| self-driven learning and education when you have a hectic and
| unpredictable schedule and are trying to fit learning in-between
| things with the fragments of time you have.
| p1esk wrote:
| Models learn so you don't have to :)
| tysam_and wrote:
| This is a pretty hyped-up optimizer that seems to have okay-ish
| performance in practice, but there are a number of major red
| flags here. For one, the baselines are decently sandbagged, yet
| the twitter posts sharing them (which are pretty hype-y)
| directly say that the baselines are "highly tuned" and that
| there's no benchmark trickery, which is flat-out wrong. To
| someone without experience with these benchmarks that may sound
| plausible, but having worked with some of these datasets very
| closely, I can say some of the baselines are simply terrible; I
| don't know where they came from.
|
| Additionally, the optimizer does actually appear to have a kind
| of momentum, despite claims directly to the contrary, and uses
| it with a nesterov-like step (line 2 of 3 in the inner loop).
| Finally, it is 'schedule-free' only because the schedule is
| hardcoded into the algorithm itself -- a 1./steps_taken
| weighting, which is not a particularly rare learning rate
| schedule. That is a decently robust but sometimes suboptimal
| schedule, and I find it sketchy to claim the method is
| 'schedule-free'. It also ties the optimizer's performance to
| the number of steps taken, which is potentially a problem if
| you are using any batchsize+lr scaling strategies, as I
| understand it.
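|
| For concreteness, here is my rough sketch of the three-line
| inner loop as I read it (plain schedule-free SGD; variable
| names are mine, and the real implementation also handles
| warmup and weight decay):
|
|     import numpy as np
|
|     # One run of schedule-free SGD, as I read the method:
|     # z is the base iterate, x the running average that gets
|     # evaluated, y the point the gradient is taken at.
|     def schedule_free_sgd(grad_f, x0, lr=0.1, beta=0.9, steps=100):
|         z = np.array(x0, dtype=float)
|         x = np.array(x0, dtype=float)
|         for t in range(1, steps + 1):
|             y = (1 - beta) * z + beta * x  # nesterov-like mix
|             z = z - lr * grad_f(y)         # gradient step at y
|             c = 1.0 / t                    # hardcoded 1/steps_taken
|             x = (1 - c) * x + c * z        # running average of z
|         return x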
|
| There is a mixture of hype and substance here, and I wish the
| author were more straightforward with their approach and
| claims. I think there is the potential for a good
| "bolts-included" optimizer in some of the ideas being presented
| here, but the amount of overhyping and deception makes me not
| want to trust any of the work that follows.
|
| Unfortunately, hype is what sells best on Twitter, and some of
| the claims being made here appear to be at the very best
| deceptive, and at the very worst, untrue. I could be wrong --
| these are just my personal opinions from my own experience, but I
| do occasionally find myself distraught about the things that tend
| to catch wind in the technical news cycle.
|
| -Fern
| johndough wrote:
| I did a quick comparison on MNIST with a small ConvNet, comparing
| this AdamWScheduleFree optimizer against a few other optimizers
| (RAdam, NAdam, AdamW, SGD, Adam, Adafactor, SophiaG). The
| validation accuracy seems to be okay and the train loss decreases
| remarkably quickly.
|
| Validation accuracy: https://i.imgur.com/8ZtX7Rd.png
|
| Train loss: https://i.imgur.com/o5XdQ29.png
|
| Code: https://bpa.st/NVJQ (currently only runs on my computer;
| I haven't had time to clean it up)
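|
| For reference, the wiring itself is minimal; roughly this,
| going by the repo README (note the extra optimizer.train() /
| optimizer.eval() calls -- the optimizer evaluates with averaged
| weights that differ from the weights it trains with; the lr and
| batch here are placeholders, not tuned values):
|
|     import torch
|     import torch.nn.functional as F
|     import schedulefree  # pip install schedulefree
|
|     model = torch.nn.Linear(784, 10)
|     opt = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)
|
|     model.train()
|     opt.train()  # the optimizer itself has a train mode
|     for _ in range(100):
|         xb = torch.randn(32, 784)         # dummy batch
|         yb = torch.randint(0, 10, (32,))
|         opt.zero_grad()
|         F.cross_entropy(model(xb), yb).backward()
|         opt.step()
|
|     model.eval()
|     opt.eval()  # switch to the averaged weights for eval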
|
| Note that this is just a toy benchmark with very little
| hyperparameter tuning. You could probably get similar results
| with most optimizers and an appropriate schedule. Nevertheless, I
| appreciate every hyperparameter that I do not have to set
| manually.
|
| In summary, this seems to be a promising optimizer. I'll add it
| to my list of optimizers to try for new deep learning projects.
___________________________________________________________________
(page generated 2024-04-06 23:00 UTC)