[HN Gopher] Nvidia DGX GH200 Whitepaper
___________________________________________________________________
Nvidia DGX GH200 Whitepaper
Author : volta87
Score : 53 points
Date : 2023-07-30 17:30 UTC (5 hours ago)
(HTM) web link (resources.nvidia.com)
(TXT) w3m dump (resources.nvidia.com)
| mmaunder wrote:
| The memory and bandwidth numbers are mind blowing. Going to be
| very hard to catch Nvidia. It's as if competitors are going
| through the motions for participation prizes.
| smodad wrote:
 | What's funny is that even though the DGX GH200 is some of the
 | most powerful hardware available, demand is so voracious that
 | it won't be enough to sate it. In fact, this is one of those
 | cases where I think demand will always outpace supply.
 | Exciting stuff ahead.
|
| I heard Elon say something interesting during the
| discussion/launch of xAI: "My prediction is that we will go from
 | an extreme silicon shortage today, to probably a voltage-
 | transformer shortage in about a year, and then an electricity
 | shortage in about a year, two years."
|
| I'm not sure about the timeline, but it's an intriguing idea that
| soon the rate limiting resource will be electricity. I wonder how
| true that is and if we're prepared for that.
| jiggawatts wrote:
| He's just plain wrong about the electricity usage going up
| because of AI compute.
|
| To a first approximation, the amount of silicon wafers going
| through fabs globally is constant. We won't suddenly increase
| chip manufacturing a hundredfold! There aren't enough fabs or
| "tools" like the ASML EUV machines for that.
|
| Electricity is used for lots of things, not just compute, and
| within compute the AI fraction is tiny. We're ramping up a
| rounding error to a slightly larger rounding error.
|
 | What _will_ increase is global energy demand for overall
 | economic activity as manufacturing and industry are
 | accelerated by AIs.
|
| Anyone who's played games like Factorio would know intuitively
| that the only two real inputs to the economy are raw materials
| and energy. Increases to manufacturing speed need matching
| increases to energy supply!
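The "rounding error" claim above can be sanity-checked with a back-of-envelope estimate. The fleet size below is a purely hypothetical assumption, and the TDP and global-generation figures are only rough public ballparks, not numbers from the whitepaper or the thread:

```python
# Back-of-envelope: AI accelerator power draw vs. global electricity.
# All inputs are illustrative assumptions.
GPU_POWER_W = 700          # ~TDP of an H100 SXM (rough public figure)
NUM_GPUS = 1_000_000       # hypothetical fleet of one million GPUs
HOURS_PER_YEAR = 8_760
GLOBAL_TWH = 27_000        # rough annual global electricity generation

ai_twh = GPU_POWER_W * NUM_GPUS * HOURS_PER_YEAR / 1e12  # Wh -> TWh
share = ai_twh / GLOBAL_TWH
print(f"{ai_twh:.1f} TWh/yr, {share:.2%} of global generation")
```

Even a million H100s running flat out would land in the single-digit TWh per year, a small fraction of a percent of global generation, which is consistent with the "rounding error" framing.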
| m3kw9 wrote:
| So basically 2x faster than H100
| [deleted]
| luc4sdreyer wrote:
 | They claim 1.1x to 7x, depending on the workload. The 10% to
 | 50% range (i.e. 1.1x to 1.5x) is for ~10k-GPU LLM training,
 | where the main bottleneck tends to be networking:
|
| > DGX GH200 enables more efficient parallel mapping and
| alleviates the networking communication bottleneck. As a
| result, up to 1.5x faster training time can be achieved over a
| DGX H100-based solution for LLM training at scale.
| tikkun wrote:
 | As context: one DGX GH200 comprises 256 GH200 superchips,
 | each with one H100 GPU and one Grace CPU.
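Those per-superchip numbers also line up with the headline 144 TB of NVLink-addressable memory. A quick tally, assuming the commonly published 480 GB of LPDDR5X per Grace CPU and 96 GB of HBM3 per Hopper GPU (treat these as assumptions, not figures from this thread):

```python
# Tally the DGX GH200's unified memory from per-superchip figures.
NODES = 256                # GH200 superchips per DGX GH200
CPU_GB, GPU_GB = 480, 96   # LPDDR5X (Grace) + HBM3 (Hopper), assumed
total_gb = NODES * (CPU_GB + GPU_GB)
total_tb = total_gb / 1024
print(total_gb, total_tb)  # 147456 GB -> 144.0 TB
```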
| luc4sdreyer wrote:
| Adding up to "1 exaFLOPS" (sparse FP8). For reference, the
| fastest FP64 supercomputer is the AMD-based Frontier
| supercomputer, at 1.1 exaFLOPS.
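The exaFLOPS figure checks out arithmetically if each Hopper GPU contributes roughly the H100 SXM's published ~3,958 TFLOPS of sparse FP8 peak (an assumed per-GPU figure, not one quoted in the thread):

```python
# Sanity-check the "1 exaFLOPS" (sparse FP8) claim for 256 GPUs.
GPUS = 256
SPARSE_FP8_TFLOPS = 3958   # ~H100 SXM sparse FP8 peak (assumed)
total_pflops = GPUS * SPARSE_FP8_TFLOPS / 1000
print(f"{total_pflops:.0f} PFLOPS")  # ~1013 PFLOPS, i.e. ~1 exaFLOPS
```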
| jacquesm wrote:
 | I wonder how much this thing will cost. The best I've been
 | able to find so far is a 'low 8 digits' estimate in an
 | AnandTech article, but nothing more specific than that.
|
| https://www.anandtech.com/show/18877/nvidia-grace-hopper-has...
| LASR wrote:
 | It would be interesting to know what kind of next-gen models
 | this can train.
|
| On the LLM frontier, we're starting to hit the limits of
| reasoning abilities in the current gen.
| tuetuopay wrote:
 | Why is this called a whitepaper, when it's more of a
 | documentation and architecture overview of the cluster? Wow,
 | a Clos topology for networking, very innovative.
|
 | Details on NVLink would be great. For example, the needs met
 | and problems solved by the custom cables NVLink seemingly
 | requires would be worth a whitepaper of their own.
|
 | Don't get me wrong, it's still great that the general public
 | can get a glimpse into Grace Hopper. And they do a good job
 | of simplifying while throwing around mind-boggling numbers
 | (the NVLink bandwidth is insane, though not a word on
 | latency, which is crucial for remote memory access).
| syntaxing wrote:
| Agreed, seems like an application note more than a white paper.
___________________________________________________________________
(page generated 2023-07-30 23:00 UTC)