[HN Gopher] How to train large models on many GPUs? (2021)
___________________________________________________________________
How to train large models on many GPUs? (2021)
Author : eternalban
Score : 132 points
Date : 2023-02-11 14:22 UTC (8 hours ago)
(HTM) web link (lilianweng.github.io)
(TXT) w3m dump (lilianweng.github.io)
| dauertewigkeit wrote:
| Why isn't there a framework that does all this automatically for
| you?
|
| I tried torch FSDP, but it only managed to increase the usable
| memory to something like 150% of a single GPU's.
|
| I eventually ended up sharding my model manually with .cuda()
| and .to(), which works much better, but now I am limited to one
| module per GPU. I would like to scale further, and that would
| mean spinning up more nodes and splitting the model across that
| many GPUs by hand.
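|
| Roughly, the manual version looks like this (a simplified
| sketch with placeholder layer sizes, not my actual model):
|
|     import torch
|     import torch.nn as nn
|
|     # Pin each big submodule to its own GPU and move the
|     # activations between devices by hand.
|     class ManuallyShardedModel(nn.Module):
|         def __init__(self):
|             super().__init__()
|             self.part0 = nn.Sequential(
|                 nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
|             self.part1 = nn.Sequential(
|                 nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")
|
|         def forward(self, x):
|             x = self.part0(x.to("cuda:0"))
|             x = self.part1(x.to("cuda:1"))  # hop to next GPU
|             return x
|
|     model = ManuallyShardedModel()
|     out = model(torch.randn(8, 1024))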
|
| I would be interested if anyone knows of a framework that manages
| this automatically and just works.
|
| EDIT: BTW, I am talking about model sharding, not data
| parallelism, which works very well with DDP.
| atty wrote:
| Beyond the other answers, I'll point out that PyTorch is
| developing tools that will make doing this by hand, or
| implementing it in a framework, much easier. They're building a
| native DTensor implementation and testing out SPMD-style
| distributed models with pipelining. DTensor is in
| torch.distributed, and the SPMD code is in the repo called Tau
| under the pytorch org on GitHub.
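|
| For a taste of what the DTensor API looks like right now (it's
| experimental, so the module path and names may still change):
|
|     import torch
|     import torch.distributed as dist
|     # experimental namespace as of early 2023
|     from torch.distributed._tensor import (
|         DeviceMesh, Shard, distribute_tensor)
|
|     # launch with: torchrun --nproc_per_node=2 dtensor_demo.py
|     dist.init_process_group("nccl")
|     torch.cuda.set_device(dist.get_rank())
|     mesh = DeviceMesh("cuda",
|                       list(range(dist.get_world_size())))
|
|     weight = torch.randn(8192, 8192)
|     # shard dim 0 across the mesh; each rank holds one slice
|     dweight = distribute_tensor(weight, mesh,
|                                 placements=[Shard(0)])
|     print(dweight.to_local().shape)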
| amelius wrote:
| > Why isn't there a framework that does all this automatically
| for you?
|
| Question: could this be implemented in PyTorch in an opaque
| way? Or would it require changes to its API?
| guardiantesla wrote:
| >Why isn't there a framework that does all this automatically
| for you?
|
| Check whether MosaicML might help in your case. I haven't tried
| it myself, but they have most of the customizations and
| speed-up optimizations I've come across recently.
|
| https://www.mosaicml.com/blog/supercharge-training-composer
|
| Also worth checking out their "training from scratch" blog
| posts.
|
| Training StableDiffusion:
| https://www.mosaicml.com/blog/training-stable-diffusion-from...
|
| Training GPT-3: https://www.mosaicml.com/blog/billion-
| parameter-gpt-training...
| NerdyDrone wrote:
| Mosaic's open source library is excellent: Composer
| https://github.com/mosaicml/composer.
|
| * It gives you PyTorch DDP for free, makes FSDP about as easy
| as it can be, and provides best-in-class performance monitoring
| tools. https://docs.mosaicml.com/en/v0.12.1/notes/distributed
| _train...
|
| Here's a nice intro to using Huggingface models: https://docs
| .mosaicml.com/en/v0.12.1/examples/finetune_huggi...
|
| I'm just a huge fan of their developer experience. It's up
| there with Transformers and Datasets as the nicest tools to
| use.
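|
| A minimal sketch of what the Trainer wiring looks like (toy
| model and data just to show the shape of it; the fsdp_config
| keys are from my reading of the 0.12 docs, double-check them):
|
|     import torch
|     from torch import nn
|     from torch.utils.data import DataLoader, TensorDataset
|     from composer import Trainer
|     from composer.models import ComposerModel
|
|     # toy stand-in model just to show the Trainer wiring
|     class ToyModel(ComposerModel):
|         def __init__(self):
|             super().__init__()
|             self.net = nn.Sequential(
|                 nn.Linear(64, 128), nn.ReLU(),
|                 nn.Linear(128, 10))
|
|         def forward(self, batch):
|             x, _ = batch
|             return self.net(x)
|
|         def loss(self, outputs, batch):
|             _, y = batch
|             return nn.functional.cross_entropy(outputs, y)
|
|     X, y = torch.randn(256, 64), torch.randint(0, 10, (256,))
|     train_dataloader = DataLoader(TensorDataset(X, y),
|                                   batch_size=32)
|
|     trainer = Trainer(
|         model=ToyModel(),
|         train_dataloader=train_dataloader,
|         max_duration="1ep",
|         device="gpu",
|         # FSDP sharding is turned on via a config dict
|         fsdp_config={"sharding_strategy": "FULL_SHARD"},
|     )
|     trainer.fit()
|
| Multi-GPU runs then go through their launcher, something like
| "composer -n 8 train.py", if I remember right.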
| sandkoan wrote:
| Might this be what you're looking for:
| https://github.com/bigscience-workshop/petals ?
| arcanus wrote:
| It's a fair question.
|
| Nvidia's NCCL and AMD's RCCL provide parallelism constructs
| that really are hidden at the framework level (e.g., in
| PyTorch).
|
| However, I don't think that you would want to hide model, data,
| or tensor parallelism. They are too important a consideration
| for performance and training convergence.
|
| At least in scientific computing, I've never observed effective
| means of automatic parallelism expressed across many nodes
| despite decades of research. I'm not optimistic this will be
| effective anytime soon.
| minimaxir wrote:
| DeepSpeed became popular soon after this post was originally
| published and is natively supported by many PyTorch training
| frameworks.
|
| https://www.deepspeed.ai
|
| https://www.deepspeed.ai/training/
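|
| Full sharding is ZeRO stage 3 and has to be enabled explicitly
| in the config. A rough sketch (see the docs for the full set of
| keys; these are the ones I remember):
|
|     import deepspeed
|     import torch.nn as nn
|
|     # stand-in model; replace with the real thing
|     model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
|                           nn.Linear(4096, 1024))
|
|     ds_config = {
|         "train_micro_batch_size_per_gpu": 8,
|         "zero_optimization": {
|             # stage 3 shards params, grads and optimizer state
|             "stage": 3,
|             # optional CPU offload of parameters
|             "offload_param": {"device": "cpu"},
|         },
|         "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
|         "fp16": {"enabled": True},
|     }
|
|     # launch with: deepspeed --num_gpus=8 train.py
|     model_engine, optimizer, _, _ = deepspeed.initialize(
|         model=model,
|         model_parameters=model.parameters(),
|         config=ds_config,
|     )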
| dauertewigkeit wrote:
| I tried that as well, but maybe I did not use it correctly. I
| did not see the full sharding that I was hoping for; I only saw
| results similar to FSDP.
| cma wrote:
| How about flexflow?
|
| https://huggingface.co/transformers/v4.9.2/parallelism.html
| #...
| buildbot wrote:
| Any framework that "just works" tends to stop working when some
| small change is needed or a new model with a new data/compute
| roofline comes out.
| option wrote:
| There is - https://docs.nvidia.com/deeplearning/nemo/user-
| guide/docs/en...
|
| Supports data, tensor, pipeline, sequence parallelisms,
| activation checkpointing, distributed optimizers, fused kernels
| and more.
| amelius wrote:
| I'm waiting for GPU cards that allow the user to plug in memory
| modules.
| saurik wrote:
| Instead of waiting for the future, maybe you could look to the
| past? That's how graphics cards used to work a couple of
| decades ago.
| TheGuyWhoCodes wrote:
| The AMD Radeon Pro SSG had 4 NVMe slots on the card itself, but
| that was back in 2017. With a DirectStorage-style API, though,
| that approach might see some gains for large models.
| buildbot wrote:
| I could never get a solid answer whether that was presented as
| memory to the GPU or just as a PCIe switch with NVMe drives
| hanging off one side and the GPU on the other.
| TheGuyWhoCodes wrote:
| As far as I remember, it was presented as a drive and was good
| for sequential reads, but you had to use AMD's API to get the
| full benefit.
| dang wrote:
| Discussed (a bit) at the time:
|
| _How to train large models on many GPUs?_ -
| https://news.ycombinator.com/item?id=28657797 - Sept 2021 (9
| comments)
| eternalban wrote:
| Somewhat amazed, dang, that this topic is not discussed more
| widely here or elsewhere. There is a _lot_ of HPC and DS
| expertise out there which lacks understanding of ML system
| architecture (in the sense of the deployed machinery in toto).
|
| Her follow-up post [1] is also recommended for those who (like
| me) are experienced, but not in ML, and finally had things
| click because of the OP's writeup:
|
| _Large Transformer Model Inference Optimization_ (2023)
|
| https://lilianweng.github.io/posts/2023-01-10-inference-opti...
|
| A very cool cite from that article is LLM.int8():
| https://arxiv.org/abs/2208.07339
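|
| For anyone who wants to try LLM.int8() in practice, the usual
| entry point is the bitsandbytes integration in transformers
| (needs bitsandbytes and accelerate installed); roughly:
|
|     from transformers import (AutoModelForCausalLM,
|                               AutoTokenizer)
|
|     # any large causal LM checkpoint works; OPT is what the
|     # paper mostly uses
|     name = "facebook/opt-6.7b"
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(
|         name,
|         device_map="auto",   # spread layers across GPUs
|         load_in_8bit=True,   # int8 weights + fp16 outliers
|     )
|
|     inputs = tok("Sharding a large model means",
|                  return_tensors="pt").to("cuda:0")
|     out = model.generate(**inputs, max_new_tokens=20)
|     print(tok.decode(out[0], skip_special_tokens=True))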
___________________________________________________________________
(page generated 2023-02-11 23:00 UTC)