[HN Gopher] Nvidia DGX GH200 Whitepaper
       ___________________________________________________________________
        
       Nvidia DGX GH200 Whitepaper
        
       Author : volta87
       Score  : 53 points
       Date   : 2023-07-30 17:30 UTC (5 hours ago)
        
 (HTM) web link (resources.nvidia.com)
 (TXT) w3m dump (resources.nvidia.com)
        
       | mmaunder wrote:
        | The memory and bandwidth numbers are mind-blowing. It's going
        | to be very hard to catch Nvidia. It's as if competitors are
        | going through the motions for participation prizes.
        
       | smodad wrote:
        | What's funny is that even though the DGX GH200 is some of the
        | most powerful hardware available, demand is so voracious that
        | it still won't be enough to sate it. In fact, this is one of
        | those cases where I think demand will always outpace supply.
        | Exciting stuff ahead.
       | 
        | I heard Elon say something interesting during the
        | discussion/launch of xAI: "My prediction is that we will go
        | from an extreme silicon shortage today, to probably a
        | voltage-transformer shortage in about a year, and then an
        | electricity shortage in about a year, two years."
       | 
        | I'm not sure about the timeline, but it's an intriguing idea
        | that soon the rate-limiting resource will be electricity. I
        | wonder how true that is and whether we're prepared for it.
        
         | jiggawatts wrote:
         | He's just plain wrong about the electricity usage going up
         | because of AI compute.
         | 
         | To a first approximation, the amount of silicon wafers going
         | through fabs globally is constant. We won't suddenly increase
         | chip manufacturing a hundredfold! There aren't enough fabs or
         | "tools" like the ASML EUV machines for that.
         | 
         | Electricity is used for lots of things, not just compute, and
         | within compute the AI fraction is tiny. We're ramping up a
         | rounding error to a slightly larger rounding error.
         | 
          | What _will_ increase is global energy demand from overall
          | economic activity, as manufacturing and industry are
          | accelerated by AIs.
         | 
          | Anyone who's played games like Factorio knows intuitively
          | that the only two real inputs to the economy are raw
          | materials and energy. Increases in manufacturing speed need
          | matching increases in energy supply!
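
          A rough back-of-envelope sketch (Python) of the "rounding
          error" claim above. Every input is an illustrative
          assumption, not a figure from the thread: ~700 W per
          H100-class GPU, a 1.5x datacenter overhead factor, a
          hypothetical fleet of one million GPUs, and ~25,000 TWh/yr
          of global electricity consumption.

              # Back-of-envelope: what fraction of global electricity
              # would a large hypothetical AI accelerator fleet draw?
              GPU_POWER_KW = 0.7       # assumed ~700 W TDP per GPU
              OVERHEAD = 1.5           # assumed PUE + host overhead
              FLEET_SIZE = 1_000_000   # hypothetical GPU count
              HOURS_PER_YEAR = 8_760
              GLOBAL_TWH = 25_000      # rough global annual total

              fleet_twh = (GPU_POWER_KW * OVERHEAD * FLEET_SIZE
                           * HOURS_PER_YEAR) / 1e9   # kWh -> TWh
              print(f"Fleet draw: {fleet_twh:.1f} TWh/yr "
                    f"({100 * fleet_twh / GLOBAL_TWH:.3f}% of global)")
              # -> Fleet draw: 9.2 TWh/yr (0.037% of global)

          Even under these generous assumptions, the fleet lands well
          under a tenth of a percent of global electricity use, which
          is the commenter's point.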
        
       | m3kw9 wrote:
       | So basically 2x faster than H100
        
         | [deleted]
        
         | luc4sdreyer wrote:
          | They claim 1.1x to 7x, depending on the workload. The lower
          | end (1.1x to 1.5x, i.e. 10% to 50% faster) applies to LLM
          | training at the ~10k-GPU scale, where the main bottleneck
          | tends to be networking:
          | 
          | > DGX GH200 enables more efficient parallel mapping and
          | alleviates the networking communication bottleneck. As a
          | result, up to 1.5x faster training time can be achieved
          | over a DGX H100-based solution for LLM training at scale.
        
       | tikkun wrote:
          | As context: one DGX GH200 comprises 256 GH200 Superchips,
          | each pairing one H100 GPU with one Grace CPU.
        
         | luc4sdreyer wrote:
          | Adding up to "1 exaFLOPS" (sparse FP8). For reference, the
          | fastest FP64 machine, the AMD-based Frontier supercomputer,
          | comes in at 1.1 exaFLOPS.
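
            A quick arithmetic check of that headline figure, assuming
            Nvidia's commonly cited ~3,958 TFLOPS sparse-FP8 rating
            for the H100 SXM (an assumption, not a number from the
            thread):

                # 256 GH200 nodes, one H100 each, at the assumed
                # sparse-FP8 per-GPU rating.
                H100_FP8_SPARSE_TFLOPS = 3_958   # ~4 PFLOPS w/ sparsity
                NUM_GPUS = 256

                total_pflops = H100_FP8_SPARSE_TFLOPS * NUM_GPUS / 1_000
                print(f"{total_pflops:.0f} PFLOPS"
                      f" = {total_pflops / 1_000:.2f} exaFLOPS")
                # -> 1013 PFLOPS = 1.01 exaFLOPS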
        
       | jacquesm wrote:
        | I wonder how much this thing will cost. The best I've been
        | able to find so far is a 'low 8 digits' estimate in an
        | AnandTech article, but nothing more specific than that.
       | 
       | https://www.anandtech.com/show/18877/nvidia-grace-hopper-has...
        
       | LASR wrote:
        | It would be interesting to know what kind of next-gen models
        | this can train.
       | 
       | On the LLM frontier, we're starting to hit the limits of
       | reasoning abilities in the current gen.
        
       | tuetuopay wrote:
        | Why is this called a whitepaper, when it's more of a
        | documentation and architecture overview of the cluster? Wow,
        | a Clos topology for networking, very innovative.
       | 
        | Details on NVLink would be great. For example, the problems
        | solved by the custom cables NVLink seemingly requires would
        | be worth a whitepaper of their own.
       | 
        | Don't get me wrong, it's still great that the general public
        | can get a glimpse into Grace Hopper. And they do a good job
        | of simplifying while throwing around mind-boggling numbers
        | (the NVLink bandwidth is insane, though there's no word on
        | latency, which is crucial for remote memory access).
        
         | syntaxing wrote:
          | Agreed, it reads more like an application note than a white
          | paper.
        
       ___________________________________________________________________
       (page generated 2023-07-30 23:00 UTC)