[HN Gopher] How does gradient descent work?
       ___________________________________________________________________
        
       How does gradient descent work?
        
       Author : jxmorris12
       Score  : 172 points
       Date   : 2025-10-03 20:59 UTC (4 days ago)
        
 (HTM) web link (centralflows.github.io)
 (TXT) w3m dump (centralflows.github.io)
        
       | xnx wrote:
       | Related: Gradient descent visualizer:
       | https://uclaacm.github.io/gradient-descent-visualiser/
        
       | markisus wrote:
       | First I thought this would be just another gradient descent
       | tutorial for beginners. But the article goes quite deep into
       | gradient descent dynamics, looking into third order
       | approximations of the loss function and eventually motivating a
       | concept called "central flows." Their central flow model was able
       | to predict loss graphs for various training runs across different
       | neural network architectures.
        
       | untrimmed wrote:
       | So all the classic optimization theory about staying in the
       | stable region is basically what deep learning doesn't do. The
       | model literally learns by becoming unstable, oscillating, and
       | then using that energy to self-correct.
       | 
       | The chaos is the point. What a crazy, beautiful mess.
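        | 
        | A toy picture of that self-correction (my own sketch, not from
        | the article): take L(x, y) = 0.5 * x² * y², where the curvature
        | in the x direction is y². Start with y² above the stability
        | threshold 2/lr: x oscillates with growing amplitude, and that
        | oscillation creates a gradient in y that drags the curvature
        | back down until training settles.
        | 
        |     lr = 0.05
        |     x, y = 0.01, 7.0      # y**2 = 49 > 2/lr = 40: unstable in x
        |     for step in range(40):
        |         gx, gy = x * y**2, x**2 * y        # gradient of L
        |         x, y = x - lr * gx, y - lr * gy
        |         if step % 5 == 0:
        |             print(f"step {step:2d}  x={x:+.3f}  "
        |                   f"curvature y^2={y*y:6.2f}  (threshold 40)")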
        
         | pkoird wrote:
          | Reminds me of Simulated Annealing. Some randomness has always
          | been part of optimization processes that seek a better
          | equilibrium than a merely local one. Genetic Algorithms have
          | mutation, Simulated Annealing has temperature, and Gradient
          | Descent similarly has random batches.
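          | 
          | As a rough illustration of the temperature analogy (my own
          | sketch, not from the article or thread), a minimal Metropolis-
          | style acceptance rule for simulated annealing on a scalar
          | parameter might look like this:
          | 
          |     import math, random
          | 
          |     def anneal_step(x, loss, temperature, step_size=0.1):
          |         """Propose a random perturbation; accept it if it
          |         lowers the loss, or with probability
          |         exp(-delta / temperature) if it raises it (the
          |         Metropolis criterion)."""
          |         candidate = x + random.uniform(-step_size, step_size)
          |         delta = loss(candidate) - loss(x)
          |         if delta <= 0 or random.random() < math.exp(-delta / temperature):
          |             return candidate  # accepted, possibly an uphill move
          |         return x              # rejected, stay put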
        
         | danielmarkbruce wrote:
          | I don't think that's a good way to think of it.
          | 
          | Researchers are constantly looking to train more expressive
          | models more quickly. Any method that can converge while taking
          | large jumps will be chosen, so you are more or less guaranteed
          | to end up in a regime where the sharpness is high but somehow
          | comes under control. If we weren't there... we'd try new
          | architectures until we arrived there. So deep learning doesn't
          | "do this"; we do it, by whatever method works, and it happened
          | to be the various architectures that currently fall under
          | "deep learning". Keep in mind that many deep architectures do
          | not converge - what you see is survivorship bias.
        
       | DoctorOetker wrote:
        | Fascinating. Do the insights gained allow one to directly compute
        | the central flow in order to speed up convergence? Or is this
        | preliminary exploration to understand _how_ it has been working?
        | 
        | They explicitly ignore momentum and exponentially weighted moving
        | averages, but those should effectively give the time-averaged
        | gradient descent (along the valley, not across it). That requires
        | multiple evaluations, though: do any of the expressions for the
        | central flow admit a fast / computationally efficient
        | calculation?
        
         | markisus wrote:
         | The authors somewhat address your questions in the accompanying
         | paper https://arxiv.org/abs/2410.24206
         | 
         | > We emphasize that the central flow is a theoretical tool for
         | understanding optimizer behavior, not a practical optimization
         | method. In practice, maintaining an exponential moving average
         | of the iterates (e.g., Morales-Brotons et al., 2024) is likely
          | a computationally feasible way to estimate the optimizer's time-
         | averaged trajectory.
         | 
         | They analyze the behavior of RMSProp (Adam without momentum)
         | using their framework to come up with simplified mathematical
         | models that are able to predict actual training behavior in
         | experiments. It looks like their mathematical models explain
         | why RMSProp works, in a way that is more satisfying than the
          | usual hand-waving explanations.
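          | 
          | As a rough sketch of what that EMA-of-iterates estimate could
          | look like in a training loop (my own illustration, not code
          | from the paper; optimizer_step is a hypothetical placeholder):
          | 
          |     import numpy as np
          | 
          |     def update_ema(ema_params, params, decay=0.999):
          |         """EMA of the iterates w_t themselves (not of the
          |         gradients); evaluating the model at the EMA
          |         approximates the time-averaged trajectory that the
          |         central flow describes."""
          |         return [decay * e + (1.0 - decay) * p
          |                 for e, p in zip(ema_params, params)]
          | 
          |     # usage, with params a list of numpy arrays:
          |     # ema = [p.copy() for p in params]
          |     # for step in range(num_steps):
          |     #     params = optimizer_step(params)   # hypothetical update
          |     #     ema = update_ema(ema, params)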
        
           | DoctorOetker wrote:
           | Yes, it certainly provides a lot more clarity than the
           | handwaving.
           | 
            | While momentum seems to work, and the authors clearly state
            | the central flow is not intended as a practical optimization
            | method, I can't rule out that we could improve convergence
            | rates by building on this knowledge.
           | 
            | Is the oscillating behavior guaranteed to have a period of 2
            | steps? Or is, say, a 3-step period also possible (a vector
            | in a plane could alternately point at 0 degrees, 120
            | degrees, and 240 degrees)?
            | 
            | The way I read this presentation, the implication seems to be
            | that it's always a period of 2. Perhaps if the top 2
            | sharpnesses are degenerate (identical), a period of N
            | distinct from 2 could be possible?
           | 
            | It makes you wonder: what if, instead of storing momentum as
            | an exponential moving average, one used the average of the
            | last 2 iterates, so there would be less lag?
            | 
            | It also makes me wonder whether we should perform 2 iterative
            | steps PER sequence, so that the single-sample sequence gives
            | feedback _along its valley_ instead of across it. One would
            | go through the corpus at half the speed, but convergence may
            | be more accurate.
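            | 
            | A toy picture of why period 2 is the generic case (my own
            | sketch, not from the paper): along the top eigenvector the
            | gradient descent map is linear with factor
            | (1 - lr * sharpness), and once lr * sharpness exceeds 2 that
            | factor drops below -1, so the component flips sign each step:
            | 
            |     import numpy as np
            | 
            |     # GD on the 1-D quadratic L(w) = 0.5 * sharpness * w**2,
            |     # i.e. w <- (1 - lr * sharpness) * w.
            |     lr, sharpness = 0.21, 10.0   # lr * sharpness = 2.1 > 2
            |     w, iterates = 1.0, []
            |     for _ in range(8):
            |         w = w - lr * sharpness * w
            |         iterates.append(w)
            | 
            |     print(np.round(iterates, 3))  # signs alternate: -, +, -, +
            |     # Averaging the last 2 iterates largely cancels the
            |     # across-valley oscillation (in a pure quadratic the
            |     # amplitude keeps growing; in a real network the
            |     # third-order term pulls the sharpness back down):
            |     print(0.5 * (iterates[-1] + iterates[-2]))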
        
       | big_hacker wrote:
       | https://youtu.be/ajGX7odA87k?t=613 the stuff is what the stuff,
       | brother.
        
       | ComplexSystems wrote:
       | Very neat stuff! So one question is, if we had an analog computer
       | that could run these flows exactly, would we get better results
       | if we ran the gradient flow or this central flow?
        
       | WithinReason wrote:
       | very nice, but it's only useful if it also works with stochastic
       | gradients.
        
       | programjames wrote:
        | It's a little easier to see what's happening if you fully write
        | out the central flow:
        | 
        |     -1/e * dw/dt = ∇L - ∇S * <∇L, ∇S> / ||∇S||²
        | 
        | We're projecting the loss gradient onto the sharpness gradient,
        | and subtracting it off. If you didn't read the article, the
        | sharpness S is the sum of the eigenvalues of the Hessian of the
        | loss that are larger than 2/e, a measure of how unstable the
        | learning dynamics are.
        | 
        | This is almost Sobolev preconditioning:
        | 
        |     -1/e * dw/dt = ∇L - ∇S = ∇(I - D)L
        | 
        | where this time S is the sum of _all_ the eigenvalues (so, the
        | Laplacian of L).
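        | 
        | A minimal numpy sketch of that projection (my own illustration
        | of the formula above; gradL and gradS stand in for ∇L and ∇S,
        | which would come from autodiff in practice):
        | 
        |     import numpy as np
        | 
        |     def central_flow_direction(gradL, gradS, lr):
        |         """Return lr * (gradL - gradS * <gradL, gradS> / ||gradS||²),
        |         i.e. the loss gradient with its component along the
        |         sharpness gradient projected out; a descent step
        |         subtracts this from the (flattened) parameters."""
        |         coeff = np.dot(gradL, gradS) / (np.dot(gradS, gradS) + 1e-12)
        |         return lr * (gradL - coeff * gradS)
        | 
        |     # toy usage with made-up flattened gradients:
        |     gradL = np.array([1.0, 2.0, 0.5])
        |     gradS = np.array([0.0, 1.0, 0.0])
        |     print(central_flow_direction(gradL, gradS, lr=0.01))
        |     # -> [0.01 0. 0.005]: the gradS component is gone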
        
         | lcnielsen wrote:
          | Yeah, I worked on a lot of traditional optimization problems
          | during my Ph.D.; this type of expression pops up all the time
          | with higher-order gradient-based methods. You rescale or
          | otherwise adjust the gradient based on some system-
          | characteristic eigenvalues to promote convergence without
          | overshooting too much.
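          | 
          | For a concrete, deliberately simplified picture of that kind
          | of rescaling (a generic sketch of diagonal curvature
          | preconditioning, not any specific method from this thread):
          | 
          |     import numpy as np
          | 
          |     def preconditioned_step(grad, curvature, lr, eps=1e-8):
          |         """Divide each gradient coordinate by an estimate of
          |         the local curvature, so sharp directions (large
          |         eigenvalues) take proportionally smaller steps and
          |         stay inside the stable region."""
          |         return -lr * grad / (np.abs(curvature) + eps)
          | 
          |     # toy usage: the second coordinate is 100x sharper,
          |     # so its step is 100x smaller
          |     grad = np.array([0.5, 0.5])
          |     curv = np.array([1.0, 100.0])
          |     print(preconditioned_step(grad, curv, lr=0.1))
          |     # -> [-0.05   -0.0005]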
        
           | d3m0t3p wrote:
            | This sounds a lot like what the Muon / Shampoo optimizers do.
        
           | d3m0t3p wrote:
           | Would you have some literature about that ?
        
       | imtringued wrote:
       | I apparently didn't get the memo and used stochastic gradient
       | descent with momentum outside of deep learning without running
       | into any problems given a sufficiently low learning rate.
       | 
       | I'm not really convinced that their explanation truly captures
       | why this success should be exclusive to deep learning.
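        | 
        | For reference, a bare-bones heavy-ball / SGD-with-momentum
        | update of the kind being described (a generic sketch, not tied
        | to the article):
        | 
        |     import numpy as np
        | 
        |     def momentum_step(w, v, grad, lr=0.05, beta=0.9):
        |         """Accumulate gradients in a velocity buffer, then move
        |         the parameters along it; with a small enough lr this is
        |         well behaved on many non-deep-learning objectives."""
        |         v = beta * v - lr * grad
        |         return w + v, v
        | 
        |     # toy usage on L(w) = 0.5 * ||w||², whose gradient is w:
        |     w, v = np.array([1.0, -2.0]), np.zeros(2)
        |     for _ in range(100):
        |         w, v = momentum_step(w, v, grad=w)
        |     print(np.round(w, 4))   # close to the minimum at the origin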
        
       ___________________________________________________________________
       (page generated 2025-10-07 23:00 UTC)