[HN Gopher] 100 Pages of raw notes released with the language mo...
___________________________________________________________________
100 Pages of raw notes released with the language model OPT-175
Author : mfiguiere
Score : 66 points
Date : 2022-05-04 14:10 UTC (8 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| flakiness wrote:
| From the note: > "AKA: Help! I'm oncall, it's 3am, and everything
| is on fire!"
|
| I didn't think ML model training ever needed an on-call
| rotation, especially for research-oriented runs like this one.
| But apparently it's a thing. So is this what MLOps is about?
| Nzen wrote:
| This Twitter post points at the PDF rendering [0] of the
| communal logbook that Facebook researchers kept while training
| OPT-175.
|
| [0]
| https://github.com/facebookresearch/metaseq/tree/main/projec...
| SemanticStrengh wrote:
| Did they leverage DeepSpeed? Also, where are the accuracy
| results on popular datasets?
| tomcam wrote:
| Mad props to Meta for releasing these raw notes. Love getting a
| look into their work process as well.
| learndeeply wrote:
| Skimming through this, a lot of it has to do with bad GPU
| hosts.
|
| > CSP fat fingered and deleted our entire cluster when trying to
| replenish our buffer nodes.
|
| Ouch.
| ensan wrote:
| "The paper mentions 35 (!) manual restarts to train OPT-175B due
| to hardware failure (and 70+ automatic restarts)."
|
| https://twitter.com/awnihannun/status/1521572873449533440
| Ameo wrote:
| Wow, they hot-swapped activation functions (GELU -> ReLU) during
| training. They are indeed very similar activation functions, but
| it's kinda crazy to me that you can make that kind of change to
| a model while it's training, preserving all weights and other
| state, and just keep going. They changed weight clipping
| thresholds on the fly too.
|
| They also swapped out the optimizer several times from what I can
| tell, switching between Adam, "Fake SGD", and "Vanilla SGD"
| multiple times.
|
| Even setting aside the huge number of hardware/driver issues
| they seemed to be having with the GPUs in their big training
| cluster(s), this puts into perspective how hard it is to train
| enormous models like this. Many of the failures don't have an
| immediately obvious cause. Plus, there aren't all that many
| places out there doing training at this scale, so I imagine a
| lot of these things have to be figured out in-house.
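|
| Mechanically the swap is less magic than it sounds: GELU and
| ReLU are both parameter-free, so you can rebuild the model from
| the last checkpoint with the modules replaced and none of the
| Linear weights change. Roughly something like this in PyTorch
| (a toy sketch of the idea, not their actual metaseq code; the
| checkpoint keys and paths here are made up):
|
|     import torch
|     import torch.nn as nn
|
|     # toy stand-in for one transformer MLP block
|     def build_model():
|         return nn.Sequential(nn.Linear(1024, 4096), nn.GELU(),
|                              nn.Linear(4096, 1024))
|
|     model = build_model()
|     optim = torch.optim.Adam(model.parameters(), lr=1e-4)
|     # pretend this is the last good checkpoint before the swap
|     torch.save({"model": model.state_dict(),
|                 "optim": optim.state_dict()}, "ckpt.pt")
|
|     # --- restart: rebuild the model, restore the weights ---
|     model = build_model()
|     ckpt = torch.load("ckpt.pt")
|     model.load_state_dict(ckpt["model"])
|
|     # hot-swap GELU -> ReLU; both are parameter-free, so every
|     # Linear weight carries over untouched
|     def swap_gelu_for_relu(module):
|         for name, child in module.named_children():
|             if isinstance(child, nn.GELU):
|                 setattr(module, name, nn.ReLU())
|             else:
|                 swap_gelu_for_relu(child)
|
|     swap_gelu_for_relu(model)
|
|     # switching optimizers works the same way: point a new one
|     # at the same parameters (Adam's moment estimates are
|     # simply dropped here) and keep training
|     optim = torch.optim.SGD(model.parameters(), lr=1e-5,
|                             momentum=0.9)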
| ackbar03 wrote:
| I'm surprised by how hacky the whole process is and how it's
| mostly just about tuning different hyperparameters
| sbierwagen wrote:
| Welcome to ML.
| daenz wrote:
| Can you say more about why you are seeing the process as hacky?
| ackbar03 wrote:
| I've started reading from the bottom and haven't read the
| whole thing yet. But their default action, as stated in their
| log, when facing exploding gradients or unstable training is
| to just roll back a checkpoint and lower the lr. Other
| proposed actions such as clamping activations are also just
| pretty standard things to try.
|
| I guess since their goal is just to end up with a trained
| model it doesn't really matter. But it doesn't seem to be an
| easily reproducible process, and like I said, a bit hacky in
| my opinion.
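|
| The recipe reads roughly like the toy loop below (my own
| PyTorch sketch of "roll back and lower lr", not their actual
| metaseq logic; the spike test and checkpoint cadence are made
| up):
|
|     import copy
|     import torch
|
|     def train_with_rollback(model, optim, data_iter, loss_fn,
|                             lr, spike=2.0):
|         # keep a known-good snapshot; on a loss explosion,
|         # restore it and halve the learning rate
|         good = copy.deepcopy({"m": model.state_dict(),
|                               "o": optim.state_dict()})
|         prev = float("inf")
|         for step, (x, y) in enumerate(data_iter):
|             optim.zero_grad()
|             loss = loss_fn(model(x), y)
|             if torch.isnan(loss) or loss.item() > spike * prev:
|                 # exploded: roll back and lower the lr
|                 model.load_state_dict(good["m"])
|                 optim.load_state_dict(good["o"])
|                 lr *= 0.5
|                 for g in optim.param_groups:
|                     g["lr"] = lr
|                 continue
|             loss.backward()
|             # gradient-norm clipping, another standard
|             # stabilizer
|             torch.nn.utils.clip_grad_norm_(model.parameters(),
|                                            1.0)
|             optim.step()
|             prev = loss.item()
|             if step % 100 == 0:  # refresh the good snapshot
|                 good = copy.deepcopy({"m": model.state_dict(),
|                                       "o": optim.state_dict()})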
| gnulinux wrote:
| They hot-swapped all kinds of model hyperparameters, such as
| the activation function and the optimizer. It doesn't look
| like there was a principled reason why they kept switching
| optimizers or activation functions. Maybe as they were
| training the model their data scientists kept finding ways to
| improve it? Not sure, but it looks extremely hacky to me. Not
| something some team ran one day and forgot about until it
| finished training.
| joshvm wrote:
| Not sure if your comment is meant as a disagreement or a
| question.
|
| Generally the way hyperparameters are adjusted is some mix of
| intuition/experience and random/grid searching. Plus most
| people don't have the resources/infra to do a large scale
| grid search on a model that might take a day or more to
| train. It's somewhat principled, but often a random search is
| just as good as fiddling numbers by hand and often you have
| to figure out why something worked post-hoc. You also accept
| that you might never have a good explanation - for all you
| know it's dataset dependent - and trust that your results are
| good enough to convince peer review (and you can show that
| this other parameter set was worse, so you didn't use it).
| It's hacky in the sense that a lot of the work in getting to
| state of the art (moving the needle on a benchmark by less
| than 1%) involves playing with the numbers until you get the
| best results. For example, here the engineers modify the
| learning rate between various runs. I don't think they really
| had any theoretical reason behind the step changes apart from
| "this will probably work better because we've seen that
| effect when training similarly sized models".
|
| The learning rate schedule is one of the simplest knobs to
| tweak. When you're working with huge models you generally
| want to use as big a batch size as you can get away with to
| reduce training time. That's a bit counter to the earlier
| thinking, when LeCun said something like "friends don't let
| friends use batch sizes > 32".
|
| There may be some guided methods, like exploring the
| parameter space in a Bayesian way (e.g. trying to efficiently
| figure out which knobs make the most difference).
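|
| For smaller models that kind of search is often just a
| handful of lines; it's the cost per run at this scale that
| kills it. A throwaway sketch (the search space and the
| train_and_eval callback here are hypothetical):
|
|     import random
|
|     # hypothetical search space over the usual knobs
|     SPACE = {
|         "lr":         [1e-5, 3e-5, 1e-4, 3e-4],
|         "batch_size": [256, 512, 1024, 2048],
|         "clip_norm":  [0.5, 1.0, 2.5],
|         "optimizer":  ["adam", "sgd"],
|     }
|
|     def random_search(train_and_eval, n_trials=20, seed=0):
|         # sample configs at random and keep the best one;
|         # train_and_eval(cfg) -> validation loss is assumed
|         rng = random.Random(seed)
|         best_cfg, best_loss = None, float("inf")
|         for _ in range(n_trials):
|             cfg = {k: rng.choice(v) for k, v in SPACE.items()}
|             loss = train_and_eval(cfg)
|             if loss < best_loss:
|                 best_cfg, best_loss = cfg, loss
|         return best_cfg, best_loss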
| ackbar03 wrote:
| They seem to be adjusting the lr between epochs when the
| loss explodes as well, not just between runs. But I haven't
| read through the whole thing yet; maybe they trained the
| whole thing properly from start to finish at the end.
| Otherwise that would be extremely hacky and irreproducible.
| dotnet00 wrote:
| Yeah, I think for now they were just trying to get any
| comparable results at all, given the near-complete lack of
| details on GPT-3. They seemed to have a hard deadline for
| the task.
| lumost wrote:
| The time and expense of training a model at this size
| doesn't lend itself to trial and error. It's simply
| impractical to iteratively try ~20 different learning
| schedules.
|
| Hideously inefficient and hacky to have someone manually
| tweaking things, but not terribly different from the
| state of the art for scientific research. As long as they
| state the objectives of their manual control and produce
| a log of what they did, someone else could _try_ to
| replicate it.
| mhh__ wrote:
| Currently I think we are still gluing transistors (networks)
| together (spiritually speaking) like in the very early days of
| the modern computer; it is hacky.
| dotnet00 wrote:
| It's pretty reassuring to see that constantly fiddling with the
| model and trying to adjust learning rates on the fly is also
| normal at leading research labs. On the other hand, it only
| makes the replication crisis even worse.
|
| After a quick look through, I really hope releasing raw notes
| like this becomes more of a trend!
___________________________________________________________________
(page generated 2022-05-04 23:01 UTC)