https://github.com/facebookresearch/schedule_free

# Schedule-Free Learning - A New Way to Train

Schedule-Free Optimizers in PyTorch.

Authors: Aaron Defazio, Xingyu Yang, Konstantin Mishchenko, Ashok Cutkosky, Harsh Mehta, Ahmed Khaled

**TLDR** Faster training without schedules - no need to specify the stopping time/steps in advance!

```
pip install schedulefree
```

Primary implementations are `SGDScheduleFree` and `AdamWScheduleFree`.

## Approach

Schedule-Free learning replaces the momentum of an underlying optimizer with a combination of interpolation and averaging. In the case of gradient descent, the Schedule-Free update is:

$$
\begin{align*}
y_{t} & = (1-\beta)z_{t} + \beta x_{t},\\
z_{t+1} & = z_{t}-\gamma\nabla f(y_{t}),\\
x_{t+1} & = \left(1-\frac{1}{t}\right) x_{t}+\frac{1}{t}z_{t+1},
\end{align*}
$$

Here $x$ is the sequence at which test/val loss should be evaluated, which differs from the primary iterates $z$ and the gradient evaluation locations $y$. The updates to $z$ correspond to the underlying optimizer, in this case a simple gradient step.

As the name suggests, Schedule-Free learning does not require a decreasing learning rate schedule, yet typically out-performs, or at worst matches, SOTA schedules such as cosine decay and linear decay.

Only two sequences need to be stored at a time (the third can be computed from the other two on the fly), so this method has the same memory requirements as the base optimizer (parameter buffer + momentum).

We provide both AdamW and SGD versions in this repo.
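To make the update above concrete, here is a short, self-contained sketch of the gradient-descent case in plain Python/NumPy. This is illustrative only and is not the `schedulefree` package implementation; the objective `grad_f`, the step size, and the step count are placeholder choices for the example.

```python
# Illustrative sketch of the schedule-free SGD update above (not the
# schedulefree package implementation). grad_f, lr, beta, and steps are
# placeholder choices for demonstration.
import numpy as np

def schedule_free_sgd(grad_f, x0, lr=0.5, beta=0.9, steps=100):
    x = np.asarray(x0, dtype=float)  # averaged sequence: evaluate test/val loss at x
    z = x.copy()                     # primary iterates updated by the base optimizer
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x          # interpolation point: gradients are taken at y
        z = z - lr * grad_f(y)                 # base optimizer step (plain SGD here)
        x = (1 - 1.0 / t) * x + (1.0 / t) * z  # running average of the z iterates
    return x

# Example: minimize f(w) = 0.5 * ||w||^2, whose gradient is w.
w_star = schedule_free_sgd(lambda w: w, x0=[5.0, -3.0])
print(w_star)  # close to [0, 0]
```

Note that only two of the three sequences are kept between iterations; $y$ is recomputed from $x$ and $z$ each step, matching the memory note above.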
## How to Use

Since our optimizer uses two different points for gradient calls and test/val loss calculations, it's necessary to switch the param buffer between the two during training. This is done by calling `optimizer.train()` at the same place you call `model.train()` and `optimizer.eval()` at the same place you call `model.eval()`. If your code supports PyTorch optimizer step closures, you can use the closure forms of the optimizers, which do not require the `.train()` and `.eval()` calls. (A minimal training-loop sketch illustrating this pattern follows the caveats below.)

## Examples

Examples of using the `schedulefree` package can be found in the examples folder. These include:

* Image classification (MNIST) using convnets*
* More examples to be added

*Example is modified from the PyTorch Examples repo.

## Caveats

* If your model uses BatchNorm, additional modifications are required for test/val evaluations to work correctly. Right before eval, run something like the following:

  ```python
  import itertools

  model.train()
  optimizer.eval()
  with torch.no_grad():  # no gradients needed; forward passes only refresh the BN statistics
      for batch in itertools.islice(train_loader, 50):
          _ = model(batch)
  model.eval()
  ```

  This replaces the BatchNorm `running_mean`/`running_var` buffers (which are updated on each forward pass in `model.train()` mode) with values calculated at $x$ instead of $y$. Using PreciseBN will also avoid this issue.
* Many code bases use additional features that may not be compatible without additional changes. For instance, if the parameters are cached in fp16, the cached versions will need to be updated manually to ensure the correct $x$ sequence is used for evaluation, not the $y$ sequence. Some GradScalers do this.
* Training is more sensitive to the choice of $\beta$ than you may expect from standard momentum. Our default of $0.9$ works on most problems, but it may be necessary to increase the value to $0.95$ or $0.98$, particularly for very long training runs.
* There is no need to use a learning rate scheduler; however, the code is compatible with one.
* Using learning rate warmup is recommended. This is supported through the `warmup_steps` parameter.
* This method does require tuning - it won't necessarily out-perform a schedule-based approach without also tuning regularization and learning rate parameters.
* For SGD, a learning rate 10x-50x larger than classical rates seems to be a good starting point.
* For AdamW, learning rates in the range 1x-10x larger than with schedule-based approaches seem to work.
* Our method can also be implemented as a wrapper around a base optimizer, with the momentum of the base optimizer disabled. We didn't do that, as PyTorch's Adam implementation would still allocate memory for its momentum buffer `exp_avg` even if we don't use it.
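Putting the usage pattern and caveats together, here is a minimal training-loop sketch. `AdamWScheduleFree`, `optimizer.train()`/`optimizer.eval()`, and `warmup_steps` come from this package; the model, synthetic data, and hyperparameter values (learning rate, warmup length, epoch count) are placeholder assumptions for illustration.

```python
# Sketch of the train/eval switching pattern described above. The model,
# synthetic data, and hyperparameter values are placeholders; only the
# optimizer class and its .train()/.eval() calls come from this package.
import torch
import torch.nn.functional as F
from schedulefree import AdamWScheduleFree

model = torch.nn.Linear(784, 10)
optimizer = AdamWScheduleFree(model.parameters(), lr=3e-3, warmup_steps=100)

# Placeholder data: random batches standing in for real loaders.
train_loader = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(50)]
val_loader = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(5)]

for epoch in range(3):
    model.train()
    optimizer.train()   # switch the param buffer to the gradient-evaluation point y
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()

    model.eval()
    optimizer.eval()    # switch the param buffer to the averaged point x for evaluation
    with torch.no_grad():
        val_loss = sum(F.cross_entropy(model(xb), yb).item() for xb, yb in val_loader)
    print(f"epoch {epoch}: val loss {val_loss / len(val_loader):.4f}")
```

If your model uses BatchNorm, the recalibration step from the first caveat above should be run after `optimizer.eval()` and before the validation pass.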
## License

See the License file.

## Related Work

Schedule-Free learning can be seen as an interpolation between primal averaging ($\beta=1$) and Polyak-Ruppert averaging ($\beta=0$). The advantage of this interpolation is that it allows us to get the best of both worlds. We can achieve the fast early-stage convergence of Polyak-Ruppert averaging (since the $z$ sequence moves quicker than the $x$ sequence), without the $x$ sequence straying too far from the $z$ sequence, which causes instability.

Our method is also related to Nesterov's accelerated method (Nesterov, 1983), which can be written in the following form:

$$
\begin{align*}
y_{t} & = \left(1-\tfrac{2}{t+1}\right)x_{t} + \tfrac{2}{t+1}z_{t},\\
z_{t+1} & = z_{t}-\frac{t}{2L}\nabla f(y_{t}),\\
x_{t+1} & = \left(1-\tfrac{2}{t+1}\right)x_{t}+\tfrac{2}{t+1}z_{t+1}
\end{align*}
$$

Our approach has the same three sequences, but uses very different weights, and crucially, does not include an increasing learning rate over time, which is essential for accelerated rates with Nesterov's method. We also use different weight sequences for the interpolation operation versus the averaging operation.

Tail-averaging approaches such as Stochastic Weight Averaging (Izmailov et al., 2018) and LAtest Weight Averaging (Kaddour, 2022; Sanyal et al., 2023) combine averaging with large or cyclic learning rates. They still require the use of a schedule, introduce additional hyper-parameters to tune, and require additional memory compared to our technique. It is also possible to use SWA and LAWA on top of our approach, potentially giving further gains.

Portes et al. (2022) use cyclic learning rate schedules with increasing cycle periods to give a method that explores multiple points along the Pareto frontier of training time vs. eval performance. Each point at the end of a cycle is an approximation to the model from a tuned schedule ending at that time. Our method gives the entire frontier, rather than just a few points along the path.

Exponential moving averages (EMA) of the iterate sequence are used in the popular Lookahead optimizer (Zhang et al., 2019). The Lookahead method can be seen as the EMA version of primal averaging, just as exponential weight averaging is the EMA version of Polyak-Ruppert averaging. Our extra interpolation step can potentially be used in combination with the Lookahead optimizer as well.
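As a closing remark on the interpolation view above, the two endpoints follow directly from the Schedule-Free update in the Approach section (a restatement of that update, not a new result):

$$
\begin{align*}
\beta=0:\quad & y_{t}=z_{t}, && \text{gradients are taken at the base iterate and } x \text{ is its running average (Polyak-Ruppert averaging)};\\
\beta=1:\quad & y_{t}=x_{t}, && \text{gradients are taken at the averaged point itself (primal averaging)}.
\end{align*}
$$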