[HN Gopher] An overview of gradient descent optimization algorit...
___________________________________________________________________
An overview of gradient descent optimization algorithms (2016)
Author : skidrow
Score : 113 points
Date : 2025-01-23 13:28 UTC (2 days ago)
(HTM) web link (www.ruder.io)
(TXT) w3m dump (www.ruder.io)
| sk11001 wrote:
| It's a great summary for ML interview prep.
| janalsncm wrote:
| I disagree, it is old and most of those algorithms aren't used
| anymore.
| sk11001 wrote:
| That's how interviews go though; it's not like I've ever had
| to use Bayes' rule at work, but for a few years everyone loved
| asking about it in screening rounds.
| esafak wrote:
| I'd still expect an MLE to know it though.
| mike-the-mikado wrote:
| In my experience a lot of people "know" maths, but fail to
| recognise the opportunities to use it. Some of my
| colleagues were pleased when I showed them that their ad
| hoc algorithm was equivalent to an application of Bayes'
| rule. It gave them insights into the meaning of constants
| that had formerly been chosen by trial and error.
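|
| As a hypothetical illustration (numbers made up): a hand-tuned
| blending constant is often secretly a posterior weight.
|
|     # "Ad hoc" blend x = w*a + (1-w)*b of two noisy estimates
|     # of the same quantity. With w chosen as below, this is the
|     # Bayesian (precision-weighted) posterior mean, so w has a
|     # meaning instead of being picked by trial and error.
|     def fuse(a, var_a, b, var_b):
|         w = var_b / (var_a + var_b)
|         mean = w * a + (1 - w) * b
|         var = (var_a * var_b) / (var_a + var_b)
|         return mean, var
|
|     print(fuse(2.0, 1.0, 3.0, 4.0))  # -> (2.2, 0.8)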
| janalsncm wrote:
| Everyone's experience is different but I've been in dozens
| of MLE interviews (some of which I passed!) and have never
| once been asked to explain the internals of an optimizer.
| The interviews were all post 2020, though.
|
| Unless someone had a _very_ good reason I would consider it
| weird to use anything other than AdamW. The compute you
| could save with a slightly better optimizer pales in
| comparison to the time you will spend debugging an opaque
| training bug.
| sega_sai wrote:
| Interesting, but it does not seem to be an overview of gradient
| optimisers in general, but rather of gradient optimisers in ML,
| as I see no mention of BFGS and the like.
| mike-the-mikado wrote:
| I think the big difference is dimensionality. If the
| dimensionality is low, then taking account of the 2nd
| derivatives becomes practical and worthwhile.
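|
| A rough sketch of why (toy example, not from the article): a
| full Newton step needs the d x d Hessian, so memory grows like
| d^2 and the linear solve like d^3, which is fine for d in the
| hundreds but hopeless for modern network sizes.
|
|     import numpy as np
|
|     def newton_step(grad, hess):
|         # Solving H s = -g is O(d^3) time and O(d^2) memory,
|         # which is why 2nd-order methods suit low dimensions.
|         return np.linalg.solve(hess, -grad)
|
|     # Quadratic f(x) = 0.5 x'Ax - b'x: gradient is Ax - b,
|     # Hessian is A, and one Newton step lands on the minimizer.
|     A = np.array([[3.0, 1.0], [1.0, 2.0]])
|     b = np.array([1.0, 1.0])
|     x = np.zeros(2)
|     x = x + newton_step(A @ x - b, A)
|     print(x)  # equals np.linalg.solve(A, b)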
| VHRanger wrote:
| I'm also curious about gradient-less algorithms
|
| For non-deep-learning applications, Nelder-Mead saved my butt a
| few times
| amelius wrote:
| ChatGPT also advised me to use NM a couple of times, which
| was neat.
| imurray wrote:
| Nelder-Mead has often not worked well for me in moderate to
| high dimensions. I'd recommend trying Powell's method if you
| want to quickly converge to a local optimum. If you're using
| scipy's wrappers, it's easy to swap between the two:
|
| https://docs.scipy.org/doc/scipy/reference/optimize.html#loc.
| ..
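|
| For what it's worth, the swap really is one keyword argument.
| A minimal sketch using scipy's Rosenbrock test function (not
| the parent's actual problem):
|
|     from scipy.optimize import minimize, rosen
|
|     x0 = [1.3, 0.7, 0.8, 1.9, 1.2]
|     # Same call, two different derivative-free local methods.
|     res_nm = minimize(rosen, x0, method="Nelder-Mead")
|     res_pw = minimize(rosen, x0, method="Powell")
|     print(res_nm.nfev, res_pw.nfev)  # function evaluations used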
|
| For nastier optimization problems there are lots of other
| options, including evolutionary algorithms and Bayesian
| optimization:
|
| https://facebookresearch.github.io/nevergrad/
|
| https://github.com/facebook/Ax
| woadwarrior01 wrote:
| Look into zeroth-order optimizers and CMA-ES.
| analog31 wrote:
| It's with the utmost humility that I confess to falling back
| on "just use Nelder-Mead" in *scipy.optimize* when something
| is ill-behaved. I consider it to be a sign that I'm doing
| something wrong, but I certainly respect its use.
| janalsncm wrote:
| Article is from 2016. It only mentions AdamW at the very end in
| passing. These days I rarely see much besides AdamW in
| production.
|
| Messing with optimizers is one of the ways to enter
| hyperparameter hell: it's like legacy code but on steroids
| because changing it only breaks your training code
| stochastically. Much better to stop worrying and love AdamW.
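|
| Concretely (assuming PyTorch; the numbers here are just common
| defaults, not a recommendation):
|
|     import torch
|
|     model = torch.nn.Linear(10, 1)  # stand-in model
|     opt = torch.optim.AdamW(model.parameters(),
|                             lr=3e-4, weight_decay=0.01)
|     loss = model(torch.randn(8, 10)).pow(2).mean()
|     loss.backward()
|     opt.step()       # one AdamW update
|     opt.zero_grad()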
| nkurz wrote:
| The mention of AdamW is brief, but in his defense he includes a
| link that gives a gloss of it: "An updated overview of recent
| gradient descent algorithms"
| [https://johnchenresearch.github.io/demon/].
| pizza wrote:
| Luckily we have Shampoo, SOAP, Modula, Schedule-free variants,
| and many more being researched these days! I am very, very
| excited by the heavyball library in particular.
| ImageXav wrote:
| Something that stuck out to me in the updated blog [0] is that
| Demon Adam performed much better than even AdamW, with very
| interesting learning curves. I'm wondering now why it didn't
| become the standard. Anyone here have insights into this?
|
| [0] https://johnchenresearch.github.io/demon/
| gzer0 wrote:
| Demon Adam didn't become standard largely for the same reason
| many "better" optimizers never see wide adoption: it's a
| newer tweak, not clearly superior on every problem, is less
| familiar to most engineers, and isn't always bundled in major
| frameworks. By contrast, AdamW is now the "safe default" that
| nearly everyone supports and knows how to tune, so teams
| stick with it unless they have a strong reason not to.
|
| Edit: Demon involves decaying the momentum parameter over
| time, which introduces a new schedule or formula for how
| momentum should be reduced during training. That can feel
| like additional complexity or a potential hyperparameter
| rabbit hole. Teams trying to ship products quickly often
| avoid adding new hyperparameters unless the gains are
| decisive.
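|
| For reference, the schedule in the linked post is roughly the
| following (my paraphrase, not the authors' code); this is the
| extra moving part a team would have to own:
|
|     def demon_momentum(beta_init, t, total_steps):
|         # Momentum decays from beta_init at t=0 to 0 at the
|         # final step, instead of staying fixed as in Adam/SGDM.
|         frac = 1.0 - t / total_steps
|         return (beta_init * frac
|                 / ((1.0 - beta_init) + beta_init * frac))
|
|     print(demon_momentum(0.9, 0, 1000))     # 0.9
|     print(demon_momentum(0.9, 1000, 1000))  # 0.0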
| ipunchghosts wrote:
| Example of the bitter lesson. None of these nuances matter 8
| years later, when everyone uses SGD or AdamW.
___________________________________________________________________
(page generated 2025-01-25 23:01 UTC)