[HN Gopher] Sophia: Scalable Stochastic 2nd-Order Optimizer for ...
       ___________________________________________________________________
        
       Sophia: Scalable Stochastic 2nd-Order Optimizer for Language Model
       Pre-Training
        
       Author : tosh
       Score  : 49 points
       Date   : 2024-04-07 08:23 UTC (14 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | namibj wrote:
       | I wonder how it relates/compares to
       | https://arxiv.org/abs/2205.08253 : Title: Momentum-Based Policy
       | Gradient with Second-Order Information
       | 
        | Abstract: Variance-reduced gradient estimators for policy
        | gradient methods have been one of the main focuses of research
        | in reinforcement learning in recent years, as they allow
        | acceleration of the estimation process. We propose a variance-
        | reduced policy-gradient method, called SHARP, which incorporates
        | second-order information into stochastic gradient descent (SGD)
        | using momentum with a time-varying learning rate. The SHARP
        | algorithm is parameter-free, achieving an ε-approximate first-
        | order stationary point with O(ε^-3) trajectories, while using a
        | batch size of O(1) at each iteration. Unlike most previous work,
        | our proposed algorithm does not require importance sampling,
        | which can compromise the advantage of the variance reduction
        | process. Moreover, the variance of the estimation error decays
        | at the fast rate of O(1/t^(2/3)), where t is the number of
        | iterations. Our extensive experimental evaluations show the
        | effectiveness of the proposed algorithm on various control tasks
        | and its advantage over the state of the art in practice.
       | 
        | Though I guess it may be more suitable for training on more
        | interactive tasks, like those emphasizing in-context learning,
        | to better exploit RL's ability to handle tasks where early parts
        | of the model's output are left open, especially when combined
        | with curriculum learning to gradually build up capability. It
        | might be better suited to diffusion models than to classic GPTs,
        | though, as RL shines where models have more agency, even if
        | that's just deciding in what order to write the output.
        
         | f_devd wrote:
          | I'm not too familiar with SHARP, but I've implemented SophiaH
          | before. It's effectively momentum SGD, but with the momentum
          | scaled by the inverse of a separate Hessian moment, plus some
          | additional clipping. It works surprisingly well if you can
          | afford the additional memory/compute.
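          | 
          | The per-parameter step is roughly the following (a sketch from
          | memory, not the paper's reference code; the hyperparameter
          | names and values are just illustrative):
          | 
          |     import torch
          | 
          |     @torch.no_grad()
          |     def sophia_step(p, grad, m, h, lr=2e-4, beta1=0.96,
          |                     gamma=0.01, eps=1e-12, weight_decay=0.1):
          |         # m: EMA of gradients (the momentum term)
          |         m.mul_(beta1).add_(grad, alpha=1 - beta1)
          |         # decoupled weight decay
          |         p.mul_(1 - lr * weight_decay)
          |         # h: EMA of a diagonal Hessian estimate (Hutchinson
          |         # u * (H u) for SophiaH), refreshed every k steps.
          |         # Precondition the momentum by it, then clip each
          |         # coordinate so no single step exceeds lr.
          |         update = (m / torch.clamp(gamma * h, min=eps))
          |         p.add_(update.clamp_(-1.0, 1.0), alpha=-lr)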
         | 
          | It seems like SHARP incorporates the Hessian directly into
          | the SGD momentum, and increases the 'alpha' or momentum
          | contribution over time.
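          | 
          | If I'm reading the abstract right, the estimator is in the
          | Hessian-aided recursive-momentum family, something like the
          | sketch below (my guess at the general shape, not SHARP's
          | actual pseudocode; alpha_t is their time-varying weight):
          | 
          |     def hessian_aided_momentum(d_prev, grad, hvp, alpha_t):
          |         # hvp ~ H(theta_t) @ (theta_t - theta_prev): a Hessian-
          |         # vector product that corrects the stale momentum in
          |         # place of importance sampling
          |         return (1 - alpha_t) * grad + alpha_t * (d_prev + hvp)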
        
       ___________________________________________________________________
       (page generated 2024-04-07 23:01 UTC)