[HN Gopher] DisTrO - a family of low latency distributed optimizers
___________________________________________________________________
DisTrO - a family of low latency distributed optimizers
Author : SchwKatze
Score : 47 points
Date : 2024-08-27 18:32 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| simonw wrote:
| Most of the information about this is in this PDF (I hate when
| people publish interesting information exclusively in PDFs):
| https://raw.githubusercontent.com/NousResearch/DisTrO/main/A...
|
| I converted it to Markdown (using Gemini 1.5 Pro) and pasted it
| into a Gist here:
| https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdab...
|
| From the abstract:
|
| > Training large scale neural networks typically involves sharing
| gradients between all accelerators, which necessitates
| specialized, high-speed interconnects. To address this, we
| introduce DisTrO, a family of architecture-agnostic and network-
| agnostic distributed optimizers that reduces the inter-GPU
| communication requirements by four to five orders of magnitude
| without relying on amortized analysis, enabling low-latency
| training of large neural networks on slow internet bandwidths
| with heterogeneous networking hardware.
|
| This could be a HUGE deal.
|
| Currently, if you want to train giant LLMs you need a big pile of
| GPUs in the same physical location, due to the amount of information
| that needs to shuffle between them during training.
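|
| To make that concrete, here is a rough sketch (illustrative only, not
| DisTrO's actual method; the worker count and model size are made up)
| of plain synchronous data-parallel training, where every worker ships
| its full gradient over the interconnect on every step:
|
|     # Sketch: synchronous data-parallel SGD with per-step gradient
|     # averaging, the communication volume DisTrO claims to slash.
|     import numpy as np
|
|     N_WORKERS = 4
|     N_PARAMS = 10_000   # a real LLM has billions of parameters
|     rng = np.random.default_rng(0)
|     params = rng.normal(size=N_PARAMS)   # every worker holds a replica
|
|     def local_gradient(params, worker_id):
|         # stand-in for backprop on that worker's shard of the data
|         return rng.normal(size=params.shape)
|
|     for step in range(3):
|         grads = [local_gradient(params, w) for w in range(N_WORKERS)]
|         # the expensive part: an all-reduce moves ~N_PARAMS floats
|         # per worker across the network on every single step
|         params -= 1e-3 * np.mean(grads, axis=0)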
|
| If DisTrO works as intended, it will be possible to train models
| using GPUs in different places - potentially enabling SETI@home
| style training where thousands of people with gaming PCs at home
| could donate their GPU time to a large training effort.
|
| Their tweet about this has more:
| https://twitter.com/NousResearch/status/1828121648383566270
|
| > Nous Research is proud to release a preliminary report on
| DisTrO (Distributed Training Over-the-Internet) a family of
| architecture-agnostic and network-agnostic distributed optimizers
| that reduces the inter-GPU communication requirements by 1000x to
| 10,000x without relying on amortized analysis, and matches
| AdamW+All-Reduce in convergence rates. This enables low-latency
| training of large neural networks on slow internet bandwidths
| with heterogeneous networking hardware.
|
| > DisTrO can increase the resilience and robustness of training
| LLMs by minimizing dependency on a single entity for computation.
| DisTrO is one step towards a more secure and equitable
| environment for all participants involved in building LLMs.
|
| > Without relying on a single company to manage and control the
| training process, researchers and institutions can have more
| freedom to collaborate and experiment with new techniques,
| algorithms, and models. This increased competition fosters
| innovation, drives progress, and ultimately benefits society as a
| whole.
| liuliu wrote:
| As much as I liked the team, there is really no information
| other than the loss graph :(
| az226 wrote:
| That's not quite true. They also ran benchmarks and compared
| against an AdamW-trained model.
| logicchains wrote:
| I'd love to believe it's true but I suspect they're overstating
| the result, or it's a fluke. Presumably teams at large firms like
| Meta would have put a lot of effort into checking whether not-
| synchronise-every-step training could match synchronise-every-
| step training before investing hundreds of millions of dollars
| into the low-latency, high-throughput network hardware necessary
| for the latter.
| regularfry wrote:
| Not if it cost them a month to do so.
| arilotter wrote:
| We're pretty confident it's not a fluke, and paper + code are
| the next step, within a couple months. It's not "synchronize
| every step", but it's "do something every step".
|
| We double and triple and quadruple checked our results, to make
| sure that we are in fact getting results like this while only
| doing our thing every step, and it really keeps holding up.
|
| Don't take our word for it, though; you'll see when the paper
| comes out :)
| RicoElectrico wrote:
| Um, so why announce something before even a paper with
| replicable details is available? To put it bluntly, what are
| we supposed to do with the information?
|
| I could be less harsh if this was some grant requirement to
| release a report before a certain date, but I don't see any
| grant funding declaration.
| CuriouslyC wrote:
| I'm happy to have the project on my radar, and though they
| could be a bit clearer about the provisional nature of the
| research, I don't think it's wrong to want to hype its
| potential a bit.
| arilotter wrote:
| We're excited about the potential and want to find other
| folks who are also excited about it and interested in working
| for/with us to build things on the foundations of DisTrO!
| Plus, it's so cool and mind-boggling to us that we wanted to
| share the hype a little bit; it was hard not being able to
| tell anyone we were working on it.
| SchwKatze wrote:
| I sent an email to you guys yesterday to find a way I can
| help build this pretty, pretty cool idea.
| hobofan wrote:
| Is synchronize-every-step training the status quo for training
| LLMs?
|
| I've not kept up-to-date with training/optimizer research for
| quite some time but during the deep learning craze there were
| papers like the ones about DistBelief/Downpour SGD [0] that
| showed how to scale up training by only doing occasional
| synchronization. Did that not transfer to transformer/LLM
| training?
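|
| For concreteness, "occasional synchronization" in that lineage looks
| roughly like the following local-SGD-style sketch (periodic parameter
| averaging; not Downpour itself, not DisTrO, and all numbers are made
| up):
|
|     # Sketch: each worker takes K local steps with zero network
|     # traffic, then all workers average their parameters once.
|     import numpy as np
|
|     N_WORKERS, N_PARAMS, K_LOCAL = 4, 10_000, 32
|     rng = np.random.default_rng(0)
|     replicas = [rng.normal(size=N_PARAMS) for _ in range(N_WORKERS)]
|
|     def local_gradient(params):
|         return rng.normal(size=params.shape)   # stand-in for backprop
|
|     for outer_round in range(3):
|         for w in range(N_WORKERS):
|             for _ in range(K_LOCAL):          # no communication here
|                 replicas[w] -= 1e-3 * local_gradient(replicas[w])
|         # one parameter exchange every K_LOCAL steps, not every step
|         avg = np.mean(replicas, axis=0)
|         replicas = [avg.copy() for _ in range(N_WORKERS)]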
|
| [0]:
| https://proceedings.neurips.cc/paper_files/paper/2012/hash/6...
| adw wrote:
| Yes, ultimately everyone is currently doing something which
| looks like synchronous data parallel training on the outside.
|
| The linked PDF is very light on detail, but what results they
| do claim are about a 1.2bn parameter model. This is tiny; you
| don't need network-bound distributed training (i.e., anything
| beyond a single datacenter-class machine, or less if you're
| patient) to train a model that size. The comms requirements
| also scale with the model size, so I strongly suspect people
| hoping for embarrassingly-parallel-style scaling properties
| are going to be disappointed.
|
| (They also appear to have, in part, reinvented parameter
| servers.)
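|
| To put a rough number on that comms scaling (back-of-envelope with
| assumed fp16 gradients; a ring all-reduce actually moves about 2x
| this per worker, but the order of magnitude is the point):
|
|     # Bytes shipped per worker per step by a naive fp16 gradient
|     # all-reduce, vs. the same after the claimed 1,000x-10,000x cut.
|     BYTES_PER_GRAD = 2   # fp16
|     for n_params in (1.2e9, 70e9):
|         naive_gb = n_params * BYTES_PER_GRAD / 1e9
|         # naive_gb GB cut by 1,000x is naive_gb MB; by 10,000x, a tenth of that
|         print(f"{n_params/1e9:5.1f}B params: ~{naive_gb:6.1f} GB/step"
|               f" naive, ~{naive_gb:6.1f}-{naive_gb/10:5.2f} MB/step at"
|               f" a 1,000x-10,000x reduction")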
| huac wrote:
| in particular, it appears that they only implement data
| parallelism (DP) - at 1.2B you can fit a full copy of the
| model into memory, but larger models require splitting the
| weights across multiple machines (different techniques, e.g.
| distributed data parallel DDP, tensor parallel TP, pipeline
| parallel PP, ...)
|
| Without more details, it's unclear whether the proposed
| technique keeps its speedups in that case.
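|
| To put rough numbers on "fit a full copy of the model into memory"
| (assumed figures, using the common ~16 bytes/parameter rule of thumb
| for mixed-precision training with AdamW optimizer state):
|
|     # Approximate weights + gradients + optimizer state per model,
|     # ignoring activations; compare against an 80 GB H100.
|     BYTES_PER_PARAM = 16   # fp16 weights/grads + fp32 master + AdamW m, v
|     for n_params in (1.2e9, 70e9):
|         print(f"{n_params/1e9:5.1f}B params -> "
|               f"~{n_params * BYTES_PER_PARAM / 1e9:6.0f} GB of state")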
| iamronaldo wrote:
| This seems huge, no? Couldn't this enable "community-based" AI
| training at home?
| arjvik wrote:
| There's no information about what this is, beyond a teaser of a
| loss graph. Really hoping this is something that gets released to
| the world, not hidden behind closed doors.
| arilotter wrote:
| Paper & code in the next couple months. We're workin on em :)
___________________________________________________________________
(page generated 2024-08-27 23:00 UTC)