[HN Gopher] An overview of gradient descent optimization algorit...
       ___________________________________________________________________
        
       An overview of gradient descent optimization algorithms (2016)
        
       Author : skidrow
       Score  : 113 points
       Date   : 2025-01-23 13:28 UTC (2 days ago)
        
 (HTM) web link (www.ruder.io)
 (TXT) w3m dump (www.ruder.io)
        
       | sk11001 wrote:
       | It's a great summary for ML interview prep.
        
         | janalsncm wrote:
         | I disagree, it is old and most of those algorithms aren't used
         | anymore.
        
           | sk11001 wrote:
           | That's how interviews go though, it's not like I've ever had
           | to use Bayes rule at work but for a few years everyone loved
           | asking about it in screening rounds.
        
             | esafak wrote:
             | I'd still expect an MLE to know it though.
        
             | mike-the-mikado wrote:
             | In my experience a lot of people "know" maths, but fail to
             | recognise the opportunities to use it. Some of my
             | colleagues were pleased when I showed them that their ad
             | hoc algorithm was equivalent to an application of Bayes'
             | rule. It gave them insights into the meaning of constants
             | that had formerly been chosen by trial and error.
        
             | janalsncm wrote:
             | Everyone's experience is different but I've been in dozens
             | of MLE interviews (some of which I passed!) and have never
             | once been asked to explain the internals of an optimizer.
             | The interviews were all post 2020, though.
             | 
              | Unless someone had a _very_ good reason I would consider it
              | weird to use anything other than AdamW. The compute you
              | could save with a slightly better optimizer pales in
              | comparison to the time you will spend debugging an opaque
              | training bug.
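              | 
              | For reference, the boring-but-safe setup looks roughly like
              | this (a minimal PyTorch sketch; the model, data, and
              | hyperparameters are placeholders):
              | 
              |     import torch
              |     
              |     # Placeholder model and data; the point is the optimizer setup.
              |     model = torch.nn.Linear(128, 10)
              |     opt = torch.optim.AdamW(
              |         model.parameters(),
              |         lr=3e-4,             # common safe starting point
              |         betas=(0.9, 0.999),  # defaults; rarely worth touching
              |         weight_decay=0.01,   # decoupled weight decay (the "W")
              |     )
              |     
              |     for step in range(1000):
              |         x = torch.randn(32, 128)
              |         y = torch.randint(0, 10, (32,))
              |         loss = torch.nn.functional.cross_entropy(model(x), y)
              |         opt.zero_grad()
              |         loss.backward()
              |         opt.step()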
        
       | sega_sai wrote:
        | Interesting, but it does not seem to be an overview of gradient
        | optimisers in general, but rather of gradient optimisers in ML,
        | as I see no mention of BFGS and the like.
        
         | mike-the-mikado wrote:
         | I think the big difference is dimensionality. If the
         | dimensionality is low, then taking account of the 2nd
         | derivatives becomes practical and worthwhile.
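          | 
          | For example, in a couple of dimensions a quasi-Newton method
          | like BFGS is cheap and converges quickly; a rough scipy sketch
          | (the Rosenbrock function is just a stand-in objective):
          | 
          |     import numpy as np
          |     from scipy.optimize import minimize
          |     
          |     # Stand-in low-dimensional objective: the Rosenbrock function.
          |     def f(x):
          |         return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
          |     
          |     # BFGS maintains an approximation to the (inverse) Hessian,
          |     # which is fine in 2 dimensions but becomes a dense d x d
          |     # matrix, i.e. impractical at deep-learning scale.
          |     res = minimize(f, x0=np.zeros(2), method="BFGS")
          |     print(res.x)  # should be close to [1, 1]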
        
         | VHRanger wrote:
         | I'm also curious about gradient-less algorithms
         | 
          | For non-deep-learning applications, Nelder-Mead saved my butt
          | a few times
        
           | amelius wrote:
           | ChatGPT also advised me to use NM a couple of times, which
           | was neat.
        
           | imurray wrote:
           | Nelder-Mead has often not worked well for me in moderate to
           | high dimensions. I'd recommend trying Powell's method if you
           | want to quickly converge to a local optimum. If you're using
           | scipy's wrappers, it's easy to swap between the two:
           | 
           | https://docs.scipy.org/doc/scipy/reference/optimize.html#loc.
           | ..
           | 
           | For nastier optimization problems there are lots of other
           | options, including evolutionary algorithms and Bayesian
           | optimization:
           | 
           | https://facebookresearch.github.io/nevergrad/
           | 
           | https://github.com/facebook/Ax
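            | 
            | In scipy the swap is a one-word change (sketch with a
            | placeholder objective):
            | 
            |     import numpy as np
            |     from scipy.optimize import minimize
            |     
            |     def f(x):  # placeholder objective
            |         return np.sum((x - 1.0) ** 2)
            |     
            |     x0 = np.zeros(10)
            |     
            |     # Both are derivative-free local methods; only the
            |     # `method` string changes.
            |     res_nm = minimize(f, x0, method="Nelder-Mead")
            |     res_pw = minimize(f, x0, method="Powell")
            |     print(res_nm.fun, res_pw.fun)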
        
           | woadwarrior01 wrote:
           | Look into zeroth-order optimizers and CMA-ES.
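            | 
            | The zeroth-order idea in a toy sketch: estimate a descent
            | direction from function evaluations alone (a bare-bones
            | two-point random estimator, not CMA-ES itself; the objective
            | and step sizes are placeholders):
            | 
            |     import numpy as np
            |     
            |     def f(x):  # placeholder black-box objective, no gradients
            |         return np.sum((x - 1.0) ** 2)
            |     
            |     rng = np.random.default_rng(0)
            |     x = np.zeros(10)
            |     lr, eps = 0.05, 1e-3
            |     
            |     for _ in range(2000):
            |         u = rng.standard_normal(x.shape)
            |         # Two-point estimate of the directional derivative
            |         # along u; in expectation this points along the
            |         # true gradient.
            |         g = (f(x + eps * u) - f(x - eps * u)) / (2 * eps) * u
            |         x -= lr * g
            |     
            |     print(f(x))  # should end up near 0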
        
           | analog31 wrote:
           | It's with the utmost humility that I confess to falling back
           | on "just use Nelder-Mead" in *scipy.optimize* when something
           | is ill behaved. I consider it to be a sign that I'm doing
           | something wrong, but I certainly respect its use.
        
       | janalsncm wrote:
       | Article is from 2016. It only mentions AdamW at the very end in
       | passing. These days I rarely see much besides AdamW in
       | production.
       | 
       | Messing with optimizers is one of the ways to enter
       | hyperparameter hell: it's like legacy code but on steroids
       | because changing it only breaks your training code
       | stochastically. Much better to stop worrying and love AdamW.
        
         | nkurz wrote:
         | The mention of AdamW is brief, but in his defense he includes a
         | link that gives a gloss of it: "An updated overview of recent
         | gradient descent algorithms"
         | [https://johnchenresearch.github.io/demon/].
        
         | pizza wrote:
          | Luckily we have Shampoo, SOAP, Modula, Schedule-free variants,
          | and many more being researched these days! I am very, very
          | excited by the heavyball library in particular.
        
         | ImageXav wrote:
         | Something that stuck out to me in the updated blog [0] is that
         | Demon Adam performed much better than even AdamW, with very
         | interesting learning curves. I'm wondering now why it didn't
         | become the standard. Anyone here have insights into this?
         | 
         | [0] https://johnchenresearch.github.io/demon/
        
           | gzer0 wrote:
           | Demon Adam didn't become standard largely for the same reason
           | many "better" optimizers never see wide adoption: it's a
           | newer tweak, not clearly superior on every problem, is less
           | familiar to most engineers, and isn't always bundled in major
           | frameworks. By contrast, AdamW is now the "safe default" that
           | nearly everyone supports and knows how to tune, so teams
           | stick with it unless they have a strong reason not to.
           | 
           | Edit: Demon involves decaying the momentum parameter over
           | time, which introduces a new schedule or formula for how
           | momentum should be reduced during training. That can feel
           | like additional complexity or a potential hyperparameter
           | rabbit hole. Teams trying to ship products quickly often
           | avoid adding new hyperparameters unless the gains are
           | decisive.
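            | 
            | If I'm reading the Demon paper right, the decay rule is
            | roughly the following (a sketch; beta_init and the total step
            | count are the knobs you would have to pick):
            | 
            |     def demon_momentum(step, total_steps, beta_init=0.9):
            |         # Decaying momentum ("Demon"): beta shrinks from
            |         # beta_init toward 0 over the course of training.
            |         frac = 1.0 - step / total_steps
            |         return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)
            |     
            |     # Example: beta starts at 0.9 and hits 0 at the last step.
            |     for t in (0, 5000, 10000):
            |         print(t, round(demon_momentum(t, 10000), 3))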
        
       | ipunchghosts wrote:
        | Example of the bitter lesson. None of these nuances matter 8
        | years later, when everyone uses SGD or AdamW.
        
       ___________________________________________________________________
       (page generated 2025-01-25 23:01 UTC)