[HN Gopher] A RoCE network for distributed AI training at scale
       ___________________________________________________________________
        
       A RoCE network for distributed AI training at scale
        
       Author : mikece
       Score  : 47 points
       Date   : 2024-08-05 16:13 UTC (6 hours ago)
        
 (HTM) web link (engineering.fb.com)
 (TXT) w3m dump (engineering.fb.com)
        
       | jauntywundrkind wrote:
        | From the paper, seems like they're using RDMA to/from the
        | video cards, skipping the NIC.
       | 
       | > * These transactions require GPU-to-RDMA NIC support for
       | optimal performance*
       | 
        | Remarkably, consumer computing has similarly found reason to
        | bypass sending data through the CPU: texture streaming.
        | DirectStorage and Sony's Kraken purport to let the GPU read
        | directly from the SSD. It's a storage application rather than
        | a NIC one, but still built around PCIe P2P DMA (at least
        | DirectStorage is, I think).
       | 
        | Table 2, network stats for 128 GPUs, is kind of interesting.
        | Most collectives, such as AllGather and AllReduce, run with
        | only 4 queue pairs. Not my area of expertise at all, but wow,
        | that seems tiny! All this network, and basically everyone's
        | talking to only a few peers? That's what it means, right?
       | 
        | The discussion at the end of the paper talked about flowlets.
        | The description makes me think a little of hash bucket
        | chaining, where you try the first path, and if later a
        | conflict arises or the path degrades, there's a fallback path
        | already planned. Like there would be a fallback chained bucket
        | in a hash table.
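        | (Editor's note: a minimal sketch of that chained-bucket
        | analogy. Path names and the probing scheme are hypothetical,
        | not the paper's mechanism.)

```python
# Toy flowlet-style path selection: hash a flow to a primary path, and
# if that path is congested, fall back to the next paths in hash order,
# loosely like probing chained buckets in a hash table. Illustration
# only; not the actual flowlet switching described in the paper.
import zlib

PATHS = ["spine0", "spine1", "spine2", "spine3"]

def pick_path(flow_id, congested=frozenset()):
    h = zlib.crc32(flow_id.encode())
    primary = PATHS[h % len(PATHS)]
    if primary not in congested:
        return primary
    # Fallback "chain": try the subsequent paths in hash order.
    for step in range(1, len(PATHS)):
        alt = PATHS[(h + step) % len(PATHS)]
        if alt not in congested:
            return alt
    return primary  # everything congested; stay on the primary

first = pick_path("qp-17")
print(first)                                   # the hashed primary path
print(pick_path("qp-17", congested={first}))   # the planned fallback
```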
        
         | wmf wrote:
         | The NIC is still there but they're skipping the data copy from
         | system RAM to GPU RAM. https://developer.nvidia.com/gpudirect
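          | (Editor's note: a rough sketch of the difference, with
          | hypothetical names; real GPUDirect RDMA is configured in the
          | NIC/driver stack, not in Python.)

```python
# Toy model of the receive data path, counting memory hops. Without
# GPUDirect, the NIC DMAs into a host-RAM bounce buffer and the data is
# then copied to GPU RAM; with GPUDirect, the NIC DMAs straight into
# GPU RAM over PCIe. Illustration only, not a real API.

def receive(nbytes, gpudirect=False):
    hops = []
    if gpudirect:
        hops.append(("nic", "gpu_ram", nbytes))       # single DMA to GPU
    else:
        hops.append(("nic", "host_ram", nbytes))      # DMA to bounce buffer
        hops.append(("host_ram", "gpu_ram", nbytes))  # extra staging copy
    return hops

print(len(receive(4096)))                  # 2 hops via system RAM
print(len(receive(4096, gpudirect=True)))  # 1 hop straight to GPU RAM
```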
        
       | eslaught wrote:
       | So they're re-inventing HPC networks in the data center.
       | 
       | https://en.wikipedia.org/wiki/Fat_tree
       | 
       | https://www.cs.umd.edu/class/spring2021/cmsc714/readings/Kim...
       | 
       | I'm sure there are innovations here, but most of this has been
       | standard in HPC for decades. (Fat trees since 1985, Dragonfly
       | since 2008.) This is not new science, folks.
        
         | wmf wrote:
         | It's not new science, but tuning RoCE performance is new
         | engineering.
        
       ___________________________________________________________________
       (page generated 2024-08-05 23:00 UTC)