[HN Gopher] Zero-3 Offload: Scale DL models to trillion paramete...
___________________________________________________________________
Zero-3 Offload: Scale DL models to trillion parameters without code
changes
Author : ghosthamlet
Score : 81 points
Date : 2021-03-13 15:06 UTC (7 hours ago)
(HTM) web link (www.deepspeed.ai)
(TXT) w3m dump (www.deepspeed.ai)
| ansk wrote:
| Question for someone knowledgeable about this: if I have a model
| which is large -- but small enough that I can fit a single
| training example on GPU -- does this approach offer speedups
| compared to simple gradient accumulation? Or is this only useful
| for models which are so large that the model parameters
| themselves are overwhelming GPU memory?
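|
| For reference, a minimal sketch of the "simple gradient
| accumulation" baseline I mean, in plain PyTorch (the tiny model
| and random data below are just stand-ins):
|
|   import torch
|   import torch.nn as nn
|
|   # Toy stand-ins; the real model/data are whatever you already train.
|   model = nn.Linear(512, 10)
|   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
|   loss_fn = nn.CrossEntropyLoss()
|   loader = [(torch.randn(4, 512), torch.randint(0, 10, (4,)))
|             for _ in range(32)]
|
|   accum_steps = 8  # effective batch = 8 micro-batches
|   optimizer.zero_grad()
|   for i, (x, y) in enumerate(loader):
|       loss = loss_fn(model(x), y) / accum_steps  # scale so grads average
|       loss.backward()                            # grads add up in .grad
|       if (i + 1) % accum_steps == 0:
|           optimizer.step()
|           optimizer.zero_grad()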
| singhrac wrote:
| For those searching, DeepSpeed is implemented as a set of
| C++/CUDA extensions on top of PyTorch (compiled using their JIT).
| alphagrep12345 wrote:
| Simple 10 min overview/tutorial (official) if someone is
| interested - https://www.youtube.com/watch?v=ovQC7FqXHXk
| vladf wrote:
| Alternatively, one could get rid of the memory used by optimizers
| entirely by switching to vanilla SGD.
|
| I haven't tried this on transformers, and maybe that's what
| breaks down here, but in "classic" supervised settings I've found
| SGD with schedule tuning just as fast as Adam.
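|
| A quick way to see what is at stake memory-wise: after one step,
| Adam holds two extra fp32 buffers per parameter, while plain SGD
| without momentum holds none. Toy check (names here are just for
| illustration):
|
|   import torch
|   import torch.nn as nn
|
|   model = nn.Linear(1024, 1024)  # stand-in for a real network
|   x, y = torch.randn(8, 1024), torch.randn(8, 1024)
|
|   def optimizer_state_bytes(opt):
|       loss = nn.functional.mse_loss(model(x), y)
|       opt.zero_grad()
|       loss.backward()
|       opt.step()  # one step so the state buffers get allocated
|       return sum(t.numel() * t.element_size()
|                  for s in opt.state.values()
|                  for t in s.values() if torch.is_tensor(t))
|
|   print(optimizer_state_bytes(torch.optim.Adam(model.parameters())))
|   # ~8.4 MB: exp_avg + exp_avg_sq, i.e. 2 x fp32 per parameter
|   print(optimizer_state_bytes(torch.optim.SGD(model.parameters(), lr=0.1)))
|   # 0: no per-parameter state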
| gwern wrote:
| SGD doesn't work on large Transformers, no. You need something
| like AdamW.
| The_rationalist wrote:
| Mish is generally superior to RadamW
| https://lessw.medium.com/meet-mish-new-state-of-the-art-
| ai-a...
| andrewprock wrote:
| How much data do you need to mitigate the risk of over fitting a
| trillion parameter model?
| gwern wrote:
| You ideally need ~500GB of text, or so. EleutherAI's The Pile
| was designed to be just big enough to fit a 1t GPT efficiently,
| and you can get the various scaling curves out of the OA-
| related scaling papers. (You want the amount of data that fits
| into a single epoch, because if you reuse data, you get less
| bang for the FLOPs buck, and FLOPs constraints are right now
| much more binding than data or model size.)
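|
| Back-of-envelope for what one epoch of that looks like in tokens,
| assuming roughly 4 bytes of raw text per BPE token (an assumed
| average, not a figure from the post):
|
|   dataset_bytes = 500e9        # ~500GB of text
|   bytes_per_token = 4          # assumed average for English BPE
|   tokens = dataset_bytes / bytes_per_token
|   print(f"~{tokens / 1e9:.0f}B tokens in a single epoch")  # ~125B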
| andrewprock wrote:
| This feels off by a couple of orders of magnitude, unless a
| significant number of the parameters are not independent.
| gwern wrote:
| It's quite amusing. The standard statistical theory does
| not work at all in estimating data vs model size, and the
| bounds are all vacuously large. It's a very active area of
| research, understanding why models act so simply when
| overparameterized and coming up with real measures of model
| complexity. Lots to read there if you are interested in
| such things.
| singhrac wrote:
| Well, that's the "magic" of modern deep learning. You can
| fit models with p > n somehow without overfitting. In some
| areas you might find this called "the strong inductive bias
| of neural networks" or "double descent", but no one has
| found an explanation that I find convincing.
| The_rationalist wrote:
| See also zeroth-order backpropagation, which allows 300X faster
| training while not reducing throughput that much:
| https://arxiv.org/abs/2011.08895
|
| How much does ZeRO-3 affect accuracy?
|
| See also https://github.com/microsoft/fastformers
| stephenroller wrote:
| Support for this was also added to
| [Fairscale](https://fairscale.readthedocs.io/en/latest/) and
| [Fairseq](https://github.com/pytorch/fairseq) last week. In
| particular, the Fairscale implementation can be used in any
| PyTorch project without requiring the use of the DeepSpeed
| trainer.
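|
| A rough sketch of what that wrapping looks like with Fairscale's
| FSDP class (the module name and hyperparameters below are
| placeholders, and constructor options vary by version, so check
| the Fairscale docs):
|
|   import torch
|   import torch.distributed as dist
|   from fairscale.nn import FullyShardedDataParallel as FSDP
|
|   # Assumes the script is launched with a distributed launcher so
|   # RANK / WORLD_SIZE / MASTER_ADDR env vars are set.
|   dist.init_process_group(backend="nccl")
|   torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
|
|   model = MyBigTransformer().cuda()  # placeholder for your own module
|   model = FSDP(model)                # parameters get sharded across ranks
|   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
|   # ...the usual forward / backward / optimizer.step() loop is unchanged.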
| diptanu wrote:
| What are the relevant commits in Fairseq for this? I couldn't
| figure out the changes by looking at the commits from last
| week.
| mchusma wrote:
| This is super impressive. I could not figure out for a while who
| exactly was running this project, but it looks like it's
| Microsoft. Great work!
| bionhoward wrote:
| Please hook this up to JAX!
| joshlk wrote:
| GPT-NeoX is an example of a project that is using DeepSpeed and
| ZeRO-3 offloading. The wider project intends to train a GPT-3-sized
| model and release it freely to the world.
|
| https://github.com/EleutherAI/gpt-neox
| ma2rten wrote:
| It seems like Zero-3 doesn't work for them:
|
| https://github.com/EleutherAI/gpt-neox/issues/171
| joshlk wrote:
| Looks like they got it working recently
| https://github.com/EleutherAI/gpt-neox/pull/178
| dqpb wrote:
| Did you even read through the issue? I don't see anything
| that indicates it won't work.
| ma2rten wrote:
| Yes, I did. The last comment is a traceback and an
| explanation of what would have to be done to fix it.
| minimaxir wrote:
| Your comment implied it's not possible _at all_ for them
| to use it, not that it's currently not working.
| bevenky wrote:
| This is also being added to pytorch
|
| https://github.com/pytorch/pytorch/pull/46750
| minimaxir wrote:
| I don't think that's the Stage 3 announced in this blog post,
| but it's def a framework for it.
| FL33TW00D wrote:
| Huggingface has been working on integrating this into their
| library, and it has some pretty amazing effects on the size of
| models you can train on a simple Colab.
|
| https://huggingface.co/blog/zero-deepspeed-fairscale
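|
| For the curious, a sketch of the DeepSpeed side of such a setup: a
| ZeRO stage 3 config with CPU offload, written out from Python (key
| names follow the deepspeed.ai docs and have changed across
| versions, so treat this as a template):
|
|   import json
|
|   ds_config = {
|       "train_micro_batch_size_per_gpu": 1,
|       "fp16": {"enabled": True},
|       "zero_optimization": {
|           "stage": 3,  # partition params, grads, optimizer state
|           "offload_param": {"device": "cpu"},
|           "offload_optimizer": {"device": "cpu"},
|       },
|   }
|   with open("ds_config.json", "w") as f:
|       json.dump(ds_config, f)
|
|   # With transformers, this file can be handed to the Trainer via
|   # TrainingArguments(..., deepspeed="ds_config.json").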
| dataangel wrote:
| ELI5? All this techno babble just sounds like "it's faster
| because we optimized it". What are the nontrivial, new
| fundamental tricks?
| jonbaer wrote:
| I think there is some explanation (on the previous model?)
| here, https://www.youtube.com/watch?v=tC01FRB0M7w
| jiofih wrote:
| Third paragraph or so in the overview:
|
| > ZeRO removes the memory redundancies across data-parallel
| processes by partitioning the three model states (optimizer
| states, gradients, and parameters) across data-parallel
| processes instead of replicating them. By doing this, it boosts
| memory efficiency compared to classic data-parallelism while
| retaining its computational granularity and communication
| efficiency
| dataangel wrote:
| Yeah that would be the techno-babble. I've been working on a
| machine learning pipeline for 6 years and I still have no
| idea what this means.
| eugenhotaj wrote:
| If your pipeline uses only "classic" ml models, then this
| won't make too much sense. It's mostly applicable to NNs.
| cambalache wrote:
| The product is obviously not for you but for clueless PHBs
| who want the "latest and best" for the team, so those
| useless ML engineers can finally put their brilliant idea in
| production with less than 1% prediction error.
| zachthewf wrote:
| It doesn't sound like techno-babble to me. They've
| distributed storage across nodes rather than replicating on
| each node, hence the model size is now scalable with number
| of nodes rather than being limited to what could be stored
| on a single node.
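|
| Concretely, following the arithmetic in the ZeRO paper: with
| mixed-precision Adam, the fp16 params, fp16 grads and fp32
| optimizer copies come to roughly 16 bytes per parameter, and
| partitioning divides that by the number of data-parallel GPUs.
|
|   def model_state_gb(params, gpus, partitioned=True):
|       bytes_per_param = 2 + 2 + 12  # fp16 params + fp16 grads + fp32 Adam
|       total = params * bytes_per_param
|       return (total / gpus if partitioned else total) / 2**30
|
|   print(model_state_gb(1.5e9, 64, partitioned=False))  # ~22 GB/GPU, replicated
|   print(model_state_gb(1.5e9, 64, partitioned=True))   # ~0.35 GB/GPU, partitioned
|   print(model_state_gb(1e12, 512))                     # ~29 GB/GPU for 1T params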
| p1esk wrote:
| But it's not clear how they managed to improve training
| on a single GPU: they say they can fit a 40B model on a
| single V100.
| liuliu wrote:
| They offload parameters, gradients and optimizer states
| (such as the momentum and variance moving averages that
| Adam keeps) into CPU memory.
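|
| A naive illustration of that pattern (not DeepSpeed's actual
| implementation, just the idea): the fp16 weights stay on the GPU
| while the fp32 master copy and Adam state live in CPU RAM and the
| update runs there. Needs a CUDA device to run.
|
|   import torch
|
|   gpu_param = torch.randn(1024, 1024, device="cuda",
|                           dtype=torch.float16)
|   cpu_master = gpu_param.detach().float().cpu().requires_grad_()
|   cpu_adam = torch.optim.Adam([cpu_master], lr=1e-4)
|
|   def offloaded_step(gpu_grad_fp16):
|       cpu_master.grad = gpu_grad_fp16.float().cpu()  # grad to CPU
|       cpu_adam.step()             # update + Adam state stay in CPU RAM
|       gpu_param.copy_(cpu_master.detach().half())    # refresh GPU copy
|
|   offloaded_step(torch.randn_like(gpu_param))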
| p1esk wrote:
| They did all that before:
| https://arxiv.org/abs/2101.06840, but they could only fit
| a model with 13B weights on a single V100.
| jiofih wrote:
| You can read the paper here:
| https://arxiv.org/abs/1910.02054
| liuliu wrote:
| It is mostly applicable to transformer models; the ideas in
| the paper will seem alien if you work on computer vision.
|
| In transformer models, a big chunk of the memory goes to
| parameters and optimizer states (because vanilla SGD is not
| used there), so a memory optimization that removes parameter
| duplication on each GPU, or offloads them entirely to CPU,
| makes sense.
|
| In computer vision, a big chunk of the memory is held by
| forward-pass activations, and the applicable memory
| optimization in those cases is binomial checkpointing.
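|
| A minimal PyTorch sketch of that activation trade-off, using the
| built-in checkpointing utility (uniform segments rather than a
| true binomial schedule): activations inside each segment are
| recomputed during the backward pass instead of being stored.
|
|   import torch
|   import torch.nn as nn
|   from torch.utils.checkpoint import checkpoint_sequential
|
|   # Toy CNN trunk standing in for a real vision backbone.
|   trunk = nn.Sequential(*[
|       nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
|       for _ in range(16)
|   ])
|
|   x = torch.randn(8, 64, 56, 56, requires_grad=True)
|   out = checkpoint_sequential(trunk, 4, x)  # keep only 4 checkpoints
|   out.sum().backward()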
___________________________________________________________________
(page generated 2021-03-13 23:01 UTC)