[HN Gopher] Sophia: Scalable Stochastic 2nd-Order Optimizer for ...
___________________________________________________________________
Sophia: Scalable Stochastic 2nd-Order Optimizer for Language Model
Pre-Training
Author : tosh
Score : 49 points
Date : 2024-04-07 08:23 UTC (14 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| namibj wrote:
| I wonder how it relates/compares to
| https://arxiv.org/abs/2205.08253 : Title: Momentum-Based Policy
| Gradient with Second-Order Information
|
| Abstract: Variance-reduced gradient estimators for policy
| gradient methods have been one of the main focuses of research
| in reinforcement learning in recent years, as they allow
| acceleration of the estimation process. We propose a variance-
| reduced policy-gradient method, called SHARP, which incorporates
| second-order information into stochastic gradient descent (SGD)
| using momentum with a time-varying learning rate. SHARP algorithm
| is parameter-free, achieving an ε-approximate first-order
| stationary point with O(ε^-3) number of trajectories, while
| using a batch size of O(1) at each iteration. Unlike most
| previous work, our proposed algorithm does not require importance
| sampling which can compromise the advantage of variance reduction
| process. Moreover, the variance of estimation error decays with
| the fast rate of O(1/t^(2/3)) where t is the number of iterations.
| Our extensive experimental evaluations show the effectiveness of
| the proposed algorithm on various control tasks and its advantage
| over the state of the art in practice.
|
| Though I guess it may be more suitable for training on more
| interactive tasks, like those emphasizing in-context learning,
| to better exploit RL's ability to deal with tasks where early
| parts of the model's output are left open, especially when used
| with curriculum learning to gradually build up capability.
| Might be better suited to diffusion models than classic GPTs,
| though, as RL shines where models have more agency, even if it's
| just deciding in what order to write the output.
| f_devd wrote:
| I'm not too familiar with SHARP, but I've implemented SophiaH
| before. It is effectively mSGD, but with the momentum scaled by
| the inverse of a separate Hessian moment estimate, plus some
| additional clipping. It works surprisingly well if you can
| afford the additional memory/compute.
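|
| Roughly, a minimal sketch of that update in NumPy (my own
| paraphrase, not the paper's reference code; hyperparameter names
| and defaults here are made up):
|
|     import numpy as np
|
|     # SophiaH-style step on a flat parameter vector.
|     # grad: stochastic gradient; hess_diag: Hutchinson-style
|     # estimate of the Hessian diagonal, e.g. u * (H @ u) with
|     # u ~ N(0, I), refreshed only every k steps in practice.
|     def sophia_step(theta, grad, hess_diag, state, lr=1e-4,
|                     betas=(0.96, 0.99), rho=0.04, eps=1e-12,
|                     update_hessian=True):
|         m, h = state                   # EMAs of gradient and Hessian
|         b1, b2 = betas
|         m = b1 * m + (1 - b1) * grad   # momentum, as in mSGD
|         if update_hessian:
|             h = b2 * h + (1 - b2) * hess_diag
|         # momentum scaled by the inverse Hessian moment, clipped
|         update = np.clip(m / np.maximum(rho * h, eps), -1.0, 1.0)
|         return theta - lr * update, (m, h)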
|
| It seems like SHARP incorporates the Hessian directly into the
| SGD momentum, and increases the 'alpha' or momentum contribution
| over time.
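|
| Something like the following is how I read it (a generic
| second-order-corrected recursive momentum, not SHARP's actual
| pseudocode; the alpha_t schedule and names are made up):
|
|     # d_prev:  previous momentum/direction estimate
|     # grad:    fresh stochastic policy gradient at theta_t
|     # hvp:     Hessian-vector product H(theta_t) @ (theta_t - theta_prev)
|     # alpha_t: mixing weight, varied over time
|     def second_order_momentum(d_prev, grad, hvp, alpha_t):
|         return alpha_t * grad + (1 - alpha_t) * (d_prev + hvp)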
___________________________________________________________________
(page generated 2024-04-07 23:01 UTC)