[HN Gopher] DisTrO - a family of low latency distributed optimizers
       ___________________________________________________________________
        
       DisTrO - a family of low latency distributed optimizers
        
       Author : SchwKatze
       Score  : 47 points
       Date   : 2024-08-27 18:32 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | simonw wrote:
       | Most of the information about this is in this PDF (I hate when
       | people publish interesting information exclusively in PDFs):
       | https://raw.githubusercontent.com/NousResearch/DisTrO/main/A...
       | 
       | I converted it to Markdown (using Gemini 1.5 Pro) and pasted it
       | into a Gist here:
       | https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdab...
       | 
       | From the abstract:
       | 
       | > Training large scale neural networks typically involves sharing
       | gradients between all accelerators, which necessitates
       | specialized, high-speed interconnects. To address this, we
       | introduce DisTrO, a family of architecture-agnostic and network-
       | agnostic distributed optimizers that reduces the inter-GPU
       | communication requirements by four to five orders of magnitude
       | without relying on amortized analysis, enabling low-latency
       | training of large neural networks on slow internet bandwidths
       | with heterogeneous networking hardware.
       | 
       | This could be a HUGE deal.
       | 
        | Currently, if you want to train giant LLMs you need a big pile of
        | GPUs in the same physical location, because of the sheer amount of
        | information that has to be shuffled between them during training.
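        | 
        | To give a sense of the traffic involved, here's a rough
        | sketch of plain synchronous data-parallel training, where
        | every rank all-reduces its full gradient on every single
        | step. This is the standard baseline (illustrative PyTorch),
        | not DisTrO's method, which isn't published yet:
        | 
        |     # Standard synchronous data-parallel step: every rank
        |     # averages its full gradient with every other rank before
        |     # the optimizer update. This per-step all-reduce of the
        |     # whole gradient is why fast, co-located interconnects
        |     # are normally required.
        |     import torch
        |     import torch.distributed as dist
        | 
        |     def train_step(model, optimizer, loss_fn, batch, targets):
        |         optimizer.zero_grad()
        |         loss = loss_fn(model(batch), targets)
        |         loss.backward()
        |         world_size = dist.get_world_size()
        |         for p in model.parameters():   # full grads, every step
        |             dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        |             p.grad /= world_size       # average across ranks
        |         optimizer.step()
        |         return loss.item()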
       | 
       | If DisTrO works as intended, it will be possible to train models
       | using GPUs in different places - potentially enabling SETI@home
       | style training where thousands of people with gaming PCs at home
       | could donate their GPU time to a large training effort.
       | 
       | Their tweet about this has more:
       | https://twitter.com/NousResearch/status/1828121648383566270
       | 
       | > Nous Research is proud to release a preliminary report on
       | DisTrO (Distributed Training Over-the-Internet) a family of
       | architecture-agnostic and network-agnostic distributed optimizers
       | that reduces the inter-GPU communication requirements by 1000x to
       | 10,000x without relying on amortized analysis, and matches
       | AdamW+All-Reduce in convergence rates. This enables low-latency
       | training of large neural networks on slow internet bandwidths
       | with heterogeneous networking hardware.
       | 
       | > DisTrO can increase the resilience and robustness of training
       | LLMs by minimizing dependency on a single entity for computation.
       | DisTrO is one step towards a more secure and equitable
       | environment for all participants involved in building LLMs.
       | 
       | > Without relying on a single company to manage and control the
       | training process, researchers and institutions can have more
       | freedom to collaborate and experiment with new techniques,
       | algorithms, and models. This increased competition fosters
       | innovation, drives progress, and ultimately benefits society as a
       | whole.
        
         | liuliu wrote:
         | As much as I liked the team, there is really no information
         | other than the loss graph :(
        
           | az226 wrote:
            | That's not quite true. They also ran benchmarks and compared
            | against an AdamW-trained model.
        
       | logicchains wrote:
       | I'd love to believe it's true but I suspect they're overstating
       | the result, or it's a fluke. Presumably teams at large firms like
       | Meta would have put a lot of effort into checking whether not-
       | synchronise-every-step training could match synchronise-every-
       | step training before investing hundreds of millions of dollars
       | into the low-latency, high-throughput network hardware necessary
       | for the latter.
        
         | regularfry wrote:
         | Not if it cost them a month to do so.
        
         | arilotter wrote:
         | We're pretty confident it's not a fluke, and paper + code are
         | the next step, within a couple months. It's not "synchronize
         | every step", but it's "do something every step".
         | 
         | We double and triple and quadruple checked our results, to make
         | sure that we are in fact getting results like this while only
         | doing our thing every step, and it really keeps holding up.
         | 
         | Don't trust our word for it, though, you'll see when the paper
         | comes out :)
        
           | RicoElectrico wrote:
           | Um, so why announce something before even a paper with
           | replicable details is available? To put it bluntly, what are
           | we supposed to do with the information?
           | 
           | I could be less harsh if this was some grant requirement to
           | release a report before a certain date, but I don't see any
           | grant funding declaration.
        
             | CuriouslyC wrote:
             | I'm happy to have the project on my radar, and though they
             | could be a bit clearer about the provisional nature of the
              | research, I don't think it's wrong to want to hype the
             | potential of it a bit.
        
             | arilotter wrote:
              | We're excited about the potential and want to find other
              | folks who are also excited about it and interested in
              | working for/with us to build things on the foundations of
              | DisTrO! Plus, it's so cool and mind-boggling to us that we
              | wanted to share the hype a little bit; it was hard not
              | being able to tell anyone we were working on it.
        
               | SchwKatze wrote:
                | I sent an email to you guys yesterday to find a way I
                | can help build this pretty, pretty cool idea.
        
         | hobofan wrote:
         | Is synchronize-every-step training the status quo for training
         | LLMs?
         | 
         | I've not kept up-to-date with training/optimizer research for
         | quite some time but during the deep learning craze there were
          | papers like the ones about DistBelief/Downpour SGD[0] that
         | showed how to scale up training by only doing occasional
         | synchronization. Did that not transfer to transformer/LLM
         | training?
         | 
         | [0]:
         | https://proceedings.neurips.cc/paper_files/paper/2012/hash/6...
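          | 
          | For anyone who hasn't read it: the Downpour SGD idea was
          | roughly that each worker trains against its own (stale)
          | replica and only exchanges state with a parameter server
          | every few steps. A toy sketch of that shape, with a
          | hypothetical server interface (not the actual DistBelief
          | code):
          | 
          |     # Toy Downpour-SGD-style worker: pull fresh parameters
          |     # every n_fetch steps and push accumulated gradients
          |     # every n_push steps, instead of synchronizing every
          |     # step. The 'server' object is a placeholder.
          |     import numpy as np
          | 
          |     def downpour_worker(server, params, data_iter, grad_fn,
          |                         lr=0.01, n_fetch=5, n_push=5):
          |         acc = {k: np.zeros_like(v) for k, v in params.items()}
          |         for step, (x, y) in enumerate(data_iter, start=1):
          |             if step % n_fetch == 0:
          |                 params = server.pull()   # refresh replica
          |             grads = grad_fn(params, x, y)
          |             for k in params:
          |                 params[k] -= lr * grads[k]   # local SGD step
          |                 acc[k] += grads[k]
          |             if step % n_push == 0:
          |                 server.push(acc)   # async update to server
          |                 for k in acc:
          |                     acc[k][:] = 0   # reset accumulator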
        
           | adw wrote:
           | Yes, ultimately everyone is currently doing something which
           | looks like synchronous data parallel training on the outside.
           | 
           | The linked PDF is very light on detail, but what results they
           | do claim are about a 1.2bn parameter model. This is tiny; you
           | don't need network-bound distributed training (ie, anything
           | beyond a single datacenter class machine, or less if you're
           | patient) to train a model that size. The comms requirements
           | also scale with the model size, so I strongly suspect people
           | hoping for embarrassingly-parallel-style scaling properties
           | are going to be disappointed.
           | 
           | (They also appear to have, in part, reinvented parameter
           | servers.)
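            | 
            | Back-of-the-envelope on that scaling point, assuming bf16
            | gradients (2 bytes per parameter) and a ring all-reduce,
            | which moves roughly 2*(N-1)/N times the gradient size per
            | rank per step:
            | 
            |     # Rough per-step all-reduce traffic per rank, assuming
            |     # bf16 gradients and a ring all-reduce.
            |     def allreduce_gb(n_params, bytes_per=2, world_size=8):
            |         grad_gb = n_params * bytes_per / 1e9
            |         return 2 * (world_size - 1) / world_size * grad_gb
            | 
            |     print(allreduce_gb(1.2e9))   # ~4.2 GB/rank/step
            |     print(allreduce_gb(70e9))    # ~245 GB/rank/step
            | 
            | Hundreds of gigabytes per step is hopeless over home
            | internet connections unless the volume that actually gets
            | communicated really does shrink by 1000x-10,000x.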
        
             | huac wrote:
              | in particular it appears that they only implement data
              | parallelism (DP) - at 1.2B you can fit a full copy of the
              | model into memory, but larger models require splitting the
              | weights across multiple machines (via different techniques,
              | e.g. distributed data parallel DDP, tensor parallel TP,
              | pipeline parallel PP, ...)
             | 
             | without more details it's unclear if the proposed technique
             | keeps its speedups in that case
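              | 
              | rough numbers on why 1.2B is the easy case, assuming the
              | usual mixed-precision Adam bookkeeping of ~16 bytes per
              | parameter (fp16 weights + fp16 grads + fp32 master copy
              | + fp32 momentum + fp32 variance) and ignoring
              | activations:
              | 
              |     # ~16 bytes/param of training state under
              |     # mixed-precision Adam, activations not included
              |     def train_state_gb(n_params, bytes_per=16):
              |         return n_params * bytes_per / 1e9
              | 
              |     print(train_state_gb(1.2e9))  # ~19 GB: fits one GPU
              |     print(train_state_gb(70e9))   # ~1120 GB: must be
              |                                   # sharded (FSDP, TP, PP)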
        
       | iamronaldo wrote:
        | This seems huge, no? Couldn't this enable "community-based" AI
       | training at home?
        
       | arjvik wrote:
       | There's no information about what this is, beyond a teaser of a
       | loss graph. Really hoping this is something that gets released to
       | the world, not hidden behind closed doors.
        
         | arilotter wrote:
         | Paper & code in the next couple months. We're workin on em :)
        
       ___________________________________________________________________
       (page generated 2024-08-27 23:00 UTC)