[HN Gopher] Nvidia Hopper GPU Architecture and H100 Accelerator
___________________________________________________________________
Nvidia Hopper GPU Architecture and H100 Accelerator
Author : jsheard
Score : 214 points
Date : 2022-03-22 15:52 UTC (7 hours ago)
(HTM) web link (www.anandtech.com)
(TXT) w3m dump (www.anandtech.com)
| lmeyerov wrote:
| Seeing the increased bandwidth is super exciting for a lot of
| business analytics cases we get into for
| IT/security/fraud/finance teams: imagine correlating across lots
| of event data from transactions, logs, ... . Every year, it just
| goes up!
|
| The big welcome surprise for us is the secure virtualization.
| Outside of some limited 24/7 ML teams, we mostly see bursty
| multi-tenant scenarios for achieving cost-effective utilization.
| MIG-style static physical partitioning was interesting -- I can
| imagine cloud providers giving that -- but more dynamic & logical
| isolation, with more of a focus on namespace isolation, is more
| relevant to what we see. Once we get into federated learning, and
| further disintermediations around that, even more cool. Imagine
| bursting on 0.1-100 GPUs every 30s-20min. Amazing times!
| algo_trader wrote:
| From the nvidia page,
|
| > 80 billion transistors
|
| > Hopper H100 .. generational leap
|
| > 9x at-scale training performance over A100
|
| > 30x LLM inference throughput
|
| > Transformer Engine .. speed .. 6x without losing accuracy
|
| So another monster chip - same size as the Apple M1 Max thingy...
|
| I guess it comes down to pricing. The A100 is already
| ridiculously expensive at $10K. They could price this one at
| $50K and it would still sell out?
| p1esk wrote:
  | You can buy A100s in a server today; a number of integrators
  | will happily sell one to you.
| chockchocschoir wrote:
    | As someone who's tried for some weeks, it really seems like
    | it's out of stock literally everywhere. The demand seems to
    | be a lot higher than the supply at the moment, so much so
    | that I'm considering buying one myself instead of renting
    | servers with it.
| nqnielsen wrote:
      | And even when vendors did say they had it, or could get
      | it, it ended up taking us 4-6 months before systems were
      | online.
| kovek wrote:
| Does it make sense that all the GPUs are bought out? They
| each provide a return for mining in the short-term. In the
| long term, they can be used to run A(G)I models, which will
      | be very, very useful.
| fennecfoxen wrote:
| This is the GPU the parent is talking about
| https://www.nvidia.com/en-us/data-center/a100/
| kovek wrote:
          | This still makes sense! GPUs are useful for AI, which
          | itself will be very, very useful. It's almost like it's
          | the best investment. That's why smart players buy them
          | all. Maybe I'm going off-topic.
| p1esk wrote:
| Did you check Lambda or Exxact?
| chockchocschoir wrote:
        | Yes; neither Lambda Labs nor Exxact Corporation had them
        | available last time I checked (last week). Both cited
        | high demand as the reason.
| asciimike wrote:
| Howdy, I run [Crusoe Cloud](https://crusoecloud.com/) and
| we just launched an alpha of an A100 and A40 Cloud
| offering--we've got capacity at a reasonable price!
|
| If you're interested in giving us a shot, feel free to
| shoot me an email at mike at crusoecloud dot com.
| sabalaba wrote:
| We (Lambda) have all of the different NVIDIA GPUs in
          | stock -- can you send a message to sales@lambdalabs.com
| and check in again with your requirements? We're seeing a
| lot more stock these days as the supply chain crisis of
| 2021 comes to an end.
| Uehreka wrote:
| Most people who use one of these will be doing so through an
| EC2 VM (or equivalent). Given that cloud platforms can spread
| load, keep these GPUs churning close to 24/7 and more easily
| predict/amortize costs, they'll probably buy the amount that
| they know they need, and Nvidia probably has some approximately
| correct idea of what that number is.
| spoonjim wrote:
| Off topic, but I can't stand it when corporations use for their
| marketing the names of actual people who never gave them
| permission to do so. For something like Shakespeare or Cicero I'm
| OK with it, but Grace Hopper was alive in my lifetime, and even
| Tesla feels a little weird. What gives you the right to use that
| person's reputation to shill your product?
| paxys wrote:
| > What gives you the right to use that person's reputation to
| shill your product?
|
| Practically speaking you have the right to do anything unless
| someone complains about it. A lot of popular figures, even
| those long dead, have estates and organizations that manage
  | their likeness and other related copyright and IP. IDK what
| the situation is in this case, but Nvidia may very well have
| paid for the name.
| erosenbe0 wrote:
| The situation is that various Australian companies (think
| Kangaroo) and DISH network already have Hopper product lines
| and Nvidia didn't care about getting into a legal kerfuffle
| and used the name anyway. As to whether Hopper's estate was
| consulted I don't know.
| cosmiccatnap wrote:
| spoonjim wrote:
| I don't think my kids have any more right to use my name than
| a corporation, unless I specifically grant them that right
    | (like Walt Disney did by naming his company the Walt Disney
    | Company).
| Another sickening one is the Ed Lee Club in SF, who endorses
| political candidates under the name of a much-loved dead SF
| mayor.
| paxys wrote:
| Your kids have the right to everything you own (
| _including_ your name) by default unless you take steps to
| change that, say using a will or estate.
| spoonjim wrote:
| Yes, I know, I'm saying that it should not be that way.
| Rights to your likeness should end at your death unless
| you specifically write down otherwise.
| oblio wrote:
| Do you have kids?
| spoonjim wrote:
| Yes.
| oblio wrote:
| And you think they shouldn't have that right because of
| social concerns like accumulation of wealth?
| erosenbe0 wrote:
| Theranos' "Edison" machine enters the chat...
| foolfoolz wrote:
  | what gives you the right to own a name, ever? especially
  | once you're dead?
| eigenvalue wrote:
| I generally agree with you, but in this case I suspect Grace
| Hopper would be honored by it and also impressed with the
| engineering here. It's not like they slapped her name on a soda
| can or something.
| gautamcgoel wrote:
| This chip is capable of 2000 INT8 Tensor TOPS, or 1000 FP16 Tensor
| TFLOPS. In other words, it is capable of performing over a
| quadrillion operations per second. Absolutely insane... I still
| have fond memories of installing my first NVidia gaming GPU, with
| just 512MB of RAM, probably capable of much less than a single
| teraflop of compute.
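|
| A quick sanity check of that arithmetic (these are NVIDIA's peak
| tensor-core numbers, which include 2:4 sparsity):
|
|   int8_tops = 2000e12     # 2,000 tera-ops/s of INT8
|   fp16_flops = 1000e12    # 1,000 TFLOPS of FP16
|   quadrillion = 1e15
|   print(int8_tops / quadrillion)   # 2.0
|   print(fp16_flops / quadrillion)  # 1.0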
| Rafuino wrote:
| Good lord 700W TDP!
| Symmetry wrote:
| NVidia and AMD datacenter GPUs continue to diverge, focusing
| on deep learning and traditional scientific computing
| respectively.
| orangebeet wrote:
| This would be cool if they had decent drivers for Linux.
| why_only_15 wrote:
  | They do have good Linux drivers for the things this chip is
  | intended to be used for (research, ML).
| johndough wrote:
| If you haven't had any issues with NVIDIA Linux drivers, you
| can count yourself extremely lucky. In the past, I had a
| 50/50 chance of boot failure after installing CUDA drivers
    | across 12 different systems. Mainline Ubuntu drivers are
| somewhat stable, but installing a specific CUDA version from
| the official NVIDIA repos rarely works on the first try.
| Switching from Tensorflow to PyTorch has helped a lot though,
| as Tensorflow was much more picky about the installed CUDA
| version.
|
    | Obligatory Linus Torvalds on NVIDIA:
| https://www.youtube.com/watch?v=_36yNWw_07g
| paxys wrote:
| I can assure you systems that take advantage of this chip for
| scientific/ML workloads aren't running Windows.
| flatiron wrote:
| they may have edited their comment but they were commenting
| on the lack of quality of their Linux drivers (which I agree
| with but only on a consumer level, never used nvidia in a
| server)
| obeliskora wrote:
| Anyone find details about the DPX instructions for dynamic
| programming?
| pjmlp wrote:
| There are some deep dive sessions at GTC that will probably go
| into them.
| trollied wrote:
| Looking forward to seeing Doom run on this.
| [deleted]
| minimaxir wrote:
| So the product naming for Nvidia's server GPUs by compute power
| now goes:
|
| P100 -> V100 -> A100 -> H100
|
| This is not confusing at all.
| torginus wrote:
| I think this is less of an issue since these GPUs are not meant
| for the everyman, so basically the handful of server
| integrators can figure this out by themselves.
|
| And for your typical dev - they'll interact with the GPU
| through a cloud provider, where they can easily know that a G5
| instance is newer than a G4 one.
| gtirloni wrote:
| Isn't it based on the architecture name?
|
| https://en.wikipedia.org/wiki/Category:Nvidia_microarchitect...
| jsheard wrote:
| Yeah it is, but unless you've memorised the history of Nvidia
| architectures it doesn't tell you which is the newer one
|
| Fermi -> Kepler -> Maxwell -> Pascal -> Volta (HPC only) ->
| Turing -> Ampere -> Hopper (HPC only?) -> Lovelace?
| modeless wrote:
| Someone should make a game like "Pokemon or Big Data" [1]
| except you have to choose which of two GPU names is faster.
| Even the consumer naming is bonkers so there's plenty of
| material there!
|
| [1] http://pixelastic.github.io/pokemonorbigdata/
| ksec wrote:
        | Isn't this the norm? Only AMD started the trend of
        | naming the uArch with numbers, as in Zen 4 or RDNA 3,
        | fairly recently. With Intel it is Haswell > Broadwell >
        | ..... Whatever Lake.
| kergonath wrote:
| Intel is using generation numbers in their marketing
| materials. In the technical-oriented slide decks you'd
          | see things like "42nd generation, formerly named Bullshit
| Creek" but they are not supposed to use that for sales.
| And then actual part names like i9-42045K.
|
| We keep using code names in discussions because the
| actual names are ass backwards and not very descriptive.
| jsheard wrote:
| Usually the architecture name isn't the only
          | distinguishing feature of the product name; you don't
| need to remember Intel codenames because a Core 12700 is
| obviously newer than a Core 11700
|
| Nvidia's accelerators are just called <Architecture
| Letter>100 every time so if you don't remember the order
| of the letters it's not obvious
|
| They could have just named them P100, V200, A300 and H400
| instead
| 867-5309 wrote:
| >you don't need to remember Intel codenames because a
| Core 12700 is obviously newer than a Core 11700
|
            | J3710 (7th Gen) vs J3060 (8th Gen)
            |
            | J4205 (8th Gen) vs J4125 (9th Gen)
            |
            | i3-5005U (5th Gen) vs N5095 (10th Gen)
            |
            | i7-3770 (3rd Gen) vs 3865U (7th Gen) vs N3060 (8th Gen)
| paulmd wrote:
| And an AMD 5700U is older than a 5400U as well. A 3400G
| is older than a 3100X. 3300X isn't really distinctive
              | from 3100X; both are quad-core configurations (but
              | different CCD/cache configurations, which of course
              | the name doesn't really disclose to the consumer).
              | It happens; naming is a complex topic and there are
              | a lot of dimensions to a product.
|
| In general, complaining about naming is peak bikeshedding
| for the tech-aware crowd. There are multiple naming
| schemes, all of them are reasonable, and everyone hates
| some of them for completely legitimate reasons (but
| different for every person). And the resulting
| bikeshedding is exactly as you'd expect with that.
|
| The underlying problem is that products have multiple
| dimensions of interest - you've got architecture, big vs
| small core, core count, TDP, clockrate/binning, cache
| configuration/CCD configuration, graphics configuration,
| etc. If you sort them by generation, then an older but
| higher-spec can beat a newer but lower-spec. If you sort
| by date then refreshes break the scheme. If you split
| things out into series (m7 vs i7) to express TDP then
| some people don't like that there's a bunch of different
| series. If you put them into the same naming scheme then
| some people don't like that a 5700U is slower than a
| 5700X. If you try to express all the variables in a
| single name, you end up with a name like "i7 1185G7"
| where it's incomprehensible if you don't understand what
| each of the parts of the name mean.
|
| (as a power user, I personally think the Ice Lake/Tiger
| Lake naming is the best of the bunch, it expresses
| everything you need to know: architecture, core count,
| power, binning, graphics. But then big.LITTLE had to go
| and mess everything up! And other people still hated it
| because it was more complex.)
|
| There are certain ones like AMD's 5000 series or the
| Intel 10th-gen (Comet Lake 10xxxU) that are just really
| ghastly because they're deliberately trying to mix-and-
| match to confuse the consumer (to sell older stuff as
| being new), but in general when people complain about
| "not understanding all those Lakes and Coves" it's
| usually just because they aren't interested in the
| brand/product and don't want to bother learning the
| names, and they will eagerly rattle off a list of
| painters or cities that AMD uses as their codenames.
|
| Like, again, to reiterate here, I literally never have
| seen anyone raise AMD using painter names as being
| "opaque to the consumer" in the same way that people
| repeatedly get upset about lakes. And it's the exact same
| thing. It's people who know the AMD brand and don't know
| the Intel brand and think that's some kind of a problem
| with the branding, as opposed to a reflection of their
| own personal knowledge.
|
| I fully expect that AMD will release 7000 series desktop
| processors this year or early next year, and exactly 0
| people are going to think that a 7600 being newer than a
| 7702 is confusing in the way that we get all these
| aggrieved posts about Intel and NVIDIA. Yes, 7600 and
| 7702 are different product lines, and that's the exact
| same as your "but i7 3770 and N3060 are different!"
| example. It's simply not that confusing, it takes less
| time to learn than to make a single indignant post on
| social media about it.
|
| Similarly, the NVIDIA practice of using inventors/compsci
| people is not particularly confusing either. Basically
| the same as AMD with the painters/cities.
|
| It's just not that interesting, and it's not worth all
| the bikeshedding that gets devoted to it.
|
| </soapbox>
|
| Anyway, your example is all messed up though. J3710 and
| J3060 are both the same gen (Braswell), launched at the
| same time (Q1 2016), that example is entirely wrong.
| J4125 vs J4205 is an older but higher specced processor
| vs a newer but lower spec, it's a 8th gen Pentium vs a
| 9th gen Celeron, like a 3100X vs a 2700X (zomg 3100X is
| bigger number but actually slower!). And the J4125 and
| J4205 are refreshes of the same architecture with
| legitimately very similar performance classes. i3 and
| Atom or i7 and Atom are completely different product
| lines and the naming is not similar at all there, apart
| from both having 3s as their first number (not even first
| character, that is different too, just happen to share
| the first number somewhere in the name).
|
| Again, like with the Tiger Lake 11xxGxx naming, the
| characters and positions in the name have meaning. You
| can come up with better examples than that even within
| the Intel lineup. Just literally picking 3770 and J3060
| as being "similar" because they both have 3s in them.
|
| The one I would legitimately agree on is that the Atom
| lineup is kind of a mess. Braswell, Apollo Lake, Gemini
| Lake, and Gemini Lake Refresh are all crammed into the
| "3000/4000" series space, and there is no "generational
| number" in that scheme either. Braswell is all 3000
| series and Gemini Lake/Gemini Lake Refresh is all 4000
| series but you've got Apollo Lake sitting in the middle
| with both 3000 and 4000 series chips. And a J3455 (Apollo
| Lake 1.5 GHz) is legitimately a better (or at least
| equal) processor to a J3710 (Braswell 1.6 GHz). Like
| 5700U vs 5800U, there are some legitimate architectural
              | differences hidden behind an opaque number there
              | (and on the Intel side it's graphics - Gemini
              | Lake/Gemini Lake Refresh have a much better video
              | block).
|
| (And that's the problem with "performance rating"
| approaches, even if a 3710 and a 3455 are similar in
| performance there's still other differences between them.
| Also, PR naming instantly turns into gamesmanship - what
| benchmark, what conditions, what TDP, what level of
| threading? Is an Intel 37000 the same as an AMD 37000?)
| 867-5309 wrote:
| yes, it's a bit of a shitshow, as mutually evidenced.
| unless consumers brush up on such intricate details (most
| do not), they will inevitably fall into traps such as "i7
| is better than i3" e.g. i7-2600 being outperformed by
| i3-10100 and "quad core is better than dual core".
| marketing is becoming more focused on generations now
| which is a prudent move: "10th Gen is better than 2nd
| Gen" but it will be at least a decade before the shitshow
| is swept
| mywittyname wrote:
| Intel was using Core i[3,5,7] names for multiple
| generations. A Core i7 could be faster or slower than a
| Core i5 depending on which generation each existed in.
|
| It is nice when products have a naming scheme where
| natural ordering of the name maps to performance.
| numpad0 wrote:
| We need a canonical, chronologically monotonic, marketing
            | independent ID scheme. Marketing people always try to
            | disrupt naming schemes, and that's the real problem.
| bee_rider wrote:
| I don't really mind the incomprehensible letters --
| looking up the generation is pretty easy, and these are
| data-center focused products... getting the name right is
| somebody's job and the easiest possible thing.
|
| However, is the number superfluous at this point?
| neogodless wrote:
| You just blew my mind. That did not occur to me, but it is
| obvious in retrospect.
| Uehreka wrote:
| Sure, but the way Nvidia names generations is far from
| obvious. It seems to be "names of famous scientists,
| progressing in alphabetical order, we skip some letters if
| we can't find a well known scientist with a matching last
| name and are excited about a scientist 2 letters from now,
| we wrap around to the beginning of the alphabet when we get
| to the end, and we just skipped from A to H, so expect
| another wraparound in the next 5-10 years."
| [deleted]
| ipsin wrote:
| I mean, I'd already internalized P100 < V100 < A100 as a Colab
| user.
|
| Schedule me on an H100 and I promise I won't mind the
| "confusing" naming.
| jjoonathan wrote:
| Also, the naming drives devs towards the architecture papers,
| which are important if you want to get within sight of
| theoretical perf. When NVidia changes the letter, it's like
| saying "hey, pay attention, at least skim the new
| whitepaper." Over the last decade, I feel like this
| convention has respected my time, so in turn it has earned my
| own respect. I'll read the Hopper whitepaper tonight, or
| whenever it pops up.
| nynx wrote:
| Sounds like we need some new training methods. If training could
| take place locally and asynchronously instead of globally through
| backpropagation, the amount of energy could probably be
| significantly reduced.
| hwers wrote:
| Trying to reduce energy consumption for ML like this is so
| silly.
| mlyle wrote:
    | Training costs are growing exponentially.
|
| The degree to which energy and capital costs can be optimized
| will determine how large they can go.
| thfuran wrote:
| Why?
| oblio wrote:
| Reducing energy consumption for computation is not silly.
|
    | We're at a point where we're turning into a computation-
    | driven society, and computation is becoming a globally
    | relevant power consumption aspect.
|
| > global data centers likely consumed around 205 terawatt-
| hours (TWh) in 2018, or 1 percent of global electricity use
|
| And that's just data centers, if you add all client devices
| you probably double that.
|
| Plus that number will only continue to grow.
| moinnadeem wrote:
| Disclosure: I work at MosaicML
|
| Yeah, I strongly agree. While Nvidia is working on better
| hardware (and they're doing a great job at it!), we believe
| that better training methods should be a big source of
| efficiency. We've released a new PyTorch library for efficient
| training at http://github.com/mosaicml/composer.
|
| Our combinations of methods can train CV models ~4x faster to
| the same accuracy on CV tasks, and ~2x faster to the same
| perplexity/GLUE score on NLP tasks!
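  |
  | For a taste, a minimal sketch of what a Composer training
  | loop can look like (class names here may differ across
  | versions, so treat this as a sketch rather than gospel):
  |
  |   from torch.utils.data import DataLoader
  |   from torchvision import datasets, models, transforms
  |   from composer import Trainer
  |   from composer.algorithms import BlurPool, LabelSmoothing
  |   from composer.models import ComposerClassifier
  |
  |   # Wrap a stock torchvision model so Composer can train it.
  |   model = ComposerClassifier(models.resnet18(num_classes=10))
  |   loader = DataLoader(
  |       datasets.CIFAR10("/tmp/data", train=True, download=True,
  |                        transform=transforms.ToTensor()),
  |       batch_size=256, shuffle=True)
  |
  |   # Speed-up methods are composable; just pass them in a list.
  |   trainer = Trainer(model=model,
  |                     train_dataloader=loader,
  |                     max_duration="10ep",
  |                     algorithms=[BlurPool(), LabelSmoothing()])
  |   trainer.fit()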
| jwuphysics wrote:
| I've been seeing a lot more about MosaicML on my Twitter
| feed. Just wanted to ask -- how are your priorities different
    | than, say, fast.ai?
| zozbot234 wrote:
| The principled way of doing this is via ensemble learning,
| combining the predictions of multiple separately-trained
| models. But perhaps there are ways of improving that by
| including "global" training as well, where the "separate"
| models are allowed to interact while limiting overall training
| costs.
| captainbland wrote:
| Those specs imply some pretty crazy architectural efficiency
| gains, massive theoretical compute performance per transistor
| compared to Ampere. It's all marketing numbers until the
| benchmarks are out, though.
|
| Edit: big TDP, though.
| ksec wrote:
  | And maybe taking this opportunity to ask: what happened to
  | Nvidia's leak? The hackers haven't made any more news, and
  | Nvidia hasn't provided an update either.
| [deleted]
| Melatonic wrote:
| What was the leak?
| ksec wrote:
| https://news.ycombinator.com/item?id=30590752
| TomVDB wrote:
| In the keynote, Jensen made a sly remark about how they
| themselves could benefit a lot from one of their cyberthreat AI
| solutions.
| quotemstr wrote:
| The simplest explanation is that Nvidia just paid up.
| pixel_fcker wrote:
| The new block cluster shared memory and synchronisation stuff
| looks really really nice.
| [deleted]
| throw0101a wrote:
| ortusdux wrote:
| 80 billion transistors boggles my mind. How many molecules are
| there per transistor?
| martini333 wrote:
| wat
| Symmetry wrote:
  | It's a crystal, so just one molecule for all the transistors.
  | In terms of atoms, a transistor is something on the order of a
  | 30 nm cube, and with each silicon atom being 0.2 nm in diameter
  | that's something like 3 million atoms, give or take an order of
  | magnitude or two.
| ortusdux wrote:
| That makes sense. My mistake, I did mean atoms, not
| molecules. Wolfram alpha estimates 1.35 million Si atoms, so
| well within 1 order of magnitude.
|
    | https://www.wolframalpha.com/input?i=30%5E3+cubic+nanometers...
| virtuallynathan wrote:
| How does a DGX Pod w/ the new 3.2Tbps per machine NVLINK switch
| compare to Tesla Dojo?
| virtuallynathan wrote:
| Tesla Dojo Training tile (25x D1): 565 TF FP32 / 9 PF BF16/CFP8
| / 11GB SRAM / 10kW
|
  | NVIDIA DGX H100 (8x H100): 480 TF FP32 / 8 PF+ FP16 / 16 PF
| INT8 / 640GB HBM3 / 10kW
|
| Dojo off-chip BW: 16 TB/s / 36TB/s off-tile
|
| H100 off-chip BW: 3.9TB/s / 400GB/s off-DGX
| TomVDB wrote:
| When you take software support into account, probably very
  | favorably.
|
| I don't know anything about the state of Dojo, but Tesla was
| very hand wavy about their software stack during their
| presentation. And running AI algorithms efficiently on a piece
| of hardware is one of those things that many HW vendors have a
| hard time getting right.
| waynecochran wrote:
| This seems fast: TF32 (tensor core): 1,000 TFLOPS;
| FP64/FP32: 60 TFLOPS.
|
| I am more interested in the 144-core Grace CPU Superchip. nVidia
| is getting into the CPU business...
| [deleted]
| macrolocal wrote:
| 50% sparsity and rated at 700W. The new DGX is 10kW!
| wmwmwm wrote:
| I was recently researching how you'd host systems like this
| in a datacentre and was blown away to find out that you can
| cool 40kW in a single air cooled rack - this might be old
| news for many, but it was 2x or 3x what I expected! Glad I'm
| not paying the electricity bill :)
| jmole wrote:
| Here's what a propane heater of similar output looks like:
| https://www.amazon.com/Dura-Heat-Propane-Forced-
| Heater/dp/B0...
| HelloNurse wrote:
        | Most of the propane heater is a fan in a tube; the flame
        | is probably quite a bit smaller than a CPU package.
| baq wrote:
| I've got an 8kW wood stove and that thing gets rather hot
| to touch - as in, you will get a blister... 40kW is a
        | small city car's worth of power.
| cjbgkagh wrote:
| I think the 1PFLOPS figure for TF32 is with sparsity, which
| should be called out in the name. Maybe 'TFS32'? I mainly use
| dense FP16 so the 1PFLOPS for that looks pretty good.
| lostmsu wrote:
| Asked elsewhere, but why FP16 as opposed to BF16?
| cjbgkagh wrote:
      | I'm using older Turing GPUs; BF16 would require Ampere.
      | The weights in my models tend to be normalized, so the
      | fraction matters more than the exponent, and I would
      | probably still use FP16. I would need to test it though.
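      |
      | The tradeoff is easy to poke at on CPU (FP16 keeps more
      | fraction bits, BF16 keeps FP32's exponent range):
      |
      |   import torch
      |   x = torch.tensor(0.1)
      |   print(x.to(torch.float16).item())   # 0.0999755859375
      |   print(x.to(torch.bfloat16).item())  # 0.10009765625
      |   big = torch.tensor(70000.0)
      |   print(big.to(torch.float16).item())   # inf (max ~65504)
      |   print(big.to(torch.bfloat16).item())  # 70144.0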
| Melatonic wrote:
        | Same - plus it's SUPER
| rafaelero wrote:
| "Combined with the additional memory on H100 and the faster
| NVLink 4 I/O, and NVIDIA claims that a large cluster of GPUs can
| train a transformer up to 9x faster, which would bring down
| training times on today's largest models down to a more
| reasonable period of time, and make even larger models more
| practical to tackle."
|
| Looking good.
| [deleted]
| ml_hardware wrote:
| The 9x speedup is a bit inflated... it's measured at a
| reference point of ~8k GPUs, on a workload that the A100
| cluster is particularly bad at.
|
| When measured at smaller #s of GPUs which are more realistic,
| the speedup is somewhere between 3.5x - 6x. See the GTC Keynote
| video at 38:50: https://youtu.be/39ubNuxnrK8?t=2330
|
| Based on hardware specs alone, I think that training
| transformers with FP8 on H100 systems vs. FP16 on A100 systems
| should only be 3-4x faster. Definitely looking forward to
| external benchmarks over the coming months...
| Melatonic wrote:
  | We have needed wide use of NVLink or something like it for a
  | long time now... here's to hoping mobo manufacturers actually
  | implement it widely!
| learndeeply wrote:
    | The open standard version of NVLink is CXL. It's available
    | in the latest-gen CPUs.
| Melatonic wrote:
      | Interesting - I did not know that. Don't we also need
      | motherboard manufacturers to implement the required
      | hardware more widely, though? It has been a while since I
      | read about NVLink, to be fair.
| komuher wrote:
| 1000 TFLOPS, so I can run my GPT-3 in under 100 ms locally :D
|
| If 1000 TFLOPS is achievable at inference time then I'm
| speechless
| edf13 wrote:
| At what costs I wonder?
| komuher wrote:
    | I would assume about 30-40K USD, but we'll see
| Melatonic wrote:
      | Huge recurrent licensing costs are the killer with these
| ml_hardware wrote:
| At inference time it will be possible to do 4000 TFLOPS using
| sparse FP8 :)
|
  | But keep in mind the model won't fit on a single H100 (80GB),
  | because it's 175B params and ~90GB even with sparse FP8 model
  | weights, and then more is needed for live activation memory. So
  | you'll still want at least 2 H100s to run inference, and more
  | realistically you would rent an 8xH100 cloud instance.
|
| But yeah the latency will be insanely fast given how massive
| these models are!
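  |
  | The rough math, taking 2:4 sparsity as about halving weight
  | storage (an approximation; activation memory comes on top):
  |
  |   params = 175e9             # GPT-3
  |   print(params * 2 / 1e9)    # FP16 weights: 350 GB
  |   print(params * 1 / 1e9)    # FP8 weights: 175 GB
  |   print(params * 0.5 / 1e9)  # sparse FP8: ~88 GB, still more
  |                              # than one 80GB H100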
| TOMDM wrote:
| So, we're about a 25-50% memory increase off of being able to
| run GPT3 on a single machine?
|
| Sounds doable in a generation or two.
| ml_hardware wrote:
| Couple points:
|
| 1) NVIDIA will likely release a variant of H100 with 2x
| memory, so we may not even have to wait a generation. They
| did this for V100-16GB/32GB and A100-40GB/80GB.
|
| 2) In a generation or two, the SOTA model architecture will
| change, so it will be hard to predict the memory reqs...
| even today, for a fixed train+inference budget, it is much
| better to train Mixture-Of-Experts (MoE) models, and even
| NVIDIA advertises MoE models on their H100 page.
|
| MoEs are more efficient in compute, but occupy a lot more
| memory at runtime. To run an MoE with GPT3-like quality,
| you probably need to occupy a full 8xH100 box, or even
| several boxes. So your min-inference-hardware has gone up,
| but your efficiency will be much better (much higher
| queries/sec than GPT3 on the same system).
|
| So it's complicated!
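      |
      | For intuition, a toy top-1 MoE layer in PyTorch (real
      | systems add capacity limits, load-balancing losses, and
      | expert parallelism):
      |
      |   import torch
      |   import torch.nn as nn
      |
      |   class Top1MoE(nn.Module):
      |       # Route each token to one expert: per-token compute
      |       # stays flat while params/memory scale with n_experts.
      |       def __init__(self, d_model, d_ff, n_experts):
      |           super().__init__()
      |           self.router = nn.Linear(d_model, n_experts)
      |           self.experts = nn.ModuleList(
      |               nn.Sequential(nn.Linear(d_model, d_ff),
      |                             nn.ReLU(),
      |                             nn.Linear(d_ff, d_model))
      |               for _ in range(n_experts))
      |
      |       def forward(self, x):  # x: (tokens, d_model)
      |           score, idx = self.router(x).softmax(-1).max(-1)
      |           out = torch.zeros_like(x)
      |           for e, expert in enumerate(self.experts):
      |               sel = idx == e
      |               if sel.any():
      |                   out[sel] = score[sel, None] * expert(x[sel])
      |           return out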
| TOMDM wrote:
        | Oh I totally expect the size of models to grow along with
        | whatever the hardware can provide.
|
| I really do wonder how much more you could squeeze out of
| a full pod of gen2-H100's, obviously the model size would
| be ludicrous, but how far are we into the realm of
        | diminishing returns?
|
| Your point about MoE architectures certainly sounds like
| the more _useful_ deployment, but the research seems to
| be pushing towards ludicrously large models.
|
| You seem to know a fair amount about the field, is there
| anything you'd suggest if I wanted to read more into the
| subject?
| ml_hardware wrote:
| I agree! The models will definitely keep getting bigger,
| and MoEs are a part of that trend, sorry if that wasn't
| clear.
|
| A pod of gen2-H100s might have 256 GPUs with 40 TB of
| total memory, and could easily run a 10T param model. So
| I think we are far from diminishing returns on the
| hardware side :) The model quality also continues to get
| better at scale.
|
| Re. reading material, I would take a look at DeepSpeed's
| blog posts (not affiliated btw). That team is super super
| good at hardware+software optimization for ML. See their
          | post on MoE models here:
          | https://www.microsoft.com/en-us/research/blog/deepspeed-adva...
| algo_trader wrote:
| Is it difficult/desirable to squeeze/compress an open-
| sourced 200B parameter model to fit into 40GB?
|
| Are these techniques for specific architectures or can
            | they be made generic?
| algo_trader wrote:
| Ah, found some stuff already
|
              | https://www.tensorflow.org/model_optimization/guide/pruning
              |
              | https://www.tensorflow.org/model_optimization/guide/pruning/...
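              |
              | Those guides are TensorFlow; the same idea in
              | PyTorch, as a minimal sketch:
              |
              |   import torch
              |   import torch.nn.utils.prune as prune
              |
              |   layer = torch.nn.Linear(1024, 1024)
              |   # Zero the 50% of weights smallest in |value|.
              |   prune.l1_unstructured(layer, "weight", amount=0.5)
              |   print((layer.weight == 0).float().mean())  # ~0.5
              |   prune.remove(layer, "weight")  # bake zeros in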
| ml_hardware wrote:
| I think it depends what downstream task you're trying to
| do... DeepMind tried distilling big language models into
| smaller ones (think 7B -> 1B) but it didn't work too
| well... it definitely lost a lot of quality (for general
| language modeling) relative to the original model.
|
              | See the paper here, Figure A28:
              | https://kstatic.googleusercontent.com/files/b068c6c0e64d6f93...
|
| But if your downstream task is simple, like sequence
| classification, then it may be possible to compress the
| model without losing much quality.
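              |
              | For reference, the usual distillation loss is just
              | a temperature-softened KL between teacher and
              | student logits (the standard recipe, not DeepMind's
              | exact setup):
              |
              |   import torch.nn.functional as F
              |
              |   def distill_loss(student_logits, teacher_logits,
              |                    T=2.0):
              |       # T*T keeps gradient scale independent of T.
              |       return T * T * F.kl_div(
              |           F.log_softmax(student_logits / T, -1),
              |           F.softmax(teacher_logits / T, -1),
              |           reduction="batchmean")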
| learndeeply wrote:
| GPT-3 can't fit in 80GB of RAM.
| savant_penguin wrote:
| 1 petaflop on a chip?? What is the catch?
| dragontamer wrote:
  | Tensor petaflops are useful in only a very few circumstances --
  | one of which, though, is the highly lucrative deep learning
  | market.
| cjbgkagh wrote:
| The main tensor op is a matmul intrinsic which is useful for
| way more than just deep learning.
|
    | Edit: many of these speeds are low precision, which is less
    | useful outside of deep learning, but the higher-precision
    | matmul ops in the tensor cores are still very fast and very
    | useful for a wide variety of tasks.
| dragontamer wrote:
| > but the higher precision matmul ops in the tensor cores
| are still very fast and very useful for wide variety of
| tasks.
|
      | The FP64 matrix multiplication is only 60 TFlops, nowhere
      | near the advertised 1000 TFlops. TF32 matrix multiplication
| is a poorly named 16-bit operation.
| cjbgkagh wrote:
| You are indeed correct, I was (kinda) fooled by the
| marketing and I think that TF32 is deceptively named. I
| think the tensor cores are being used in this
| architecture for FP64 and 60 TFlops is still pretty
| decent.
|
| I'm on Turing architecture so I've never used TF32. I've
| only used FP32 and FP16 but FP32 isn't supported by these
| tensor cores.
| bcatanzaro wrote:
| Well the addition is done in FP32, and it's a 32-bit
| storage format in memory, so calling it a 16-bit format
| isn't right either. It's really a hybrid format where
| everything is 32-bit except multiplication.
|
| Given that it's 32-bit in memory (so all your data
| structures are 32-bit) and also that in my experience
| using it is very transparent (I haven't run into any
| numerical issues compared to full FP32), I think calling
| it a 32-bit format is a reasonable compromise.
| dragontamer wrote:
| > Well the addition is done in FP32
|
          | Addition is done with a 10-bit mantissa. So maybe TF19
          | might be the better name, since it's a 19-bit format
          | (slightly more than 16-bit BFloats).
          |
          | Really, it's a BFloat with a 10-bit mantissa instead of
          | a 7-bit mantissa. The 10-bit mantissa matches FP16,
          | while the 8-bit exponent matches FP32.
          |
          | So TF19 probably would have been the best name, but
          | NVidia likes marketing, so they call it TF32 instead.
| bcatanzaro wrote:
| It's a 32-bit format in memory and the additions are done
| with 32-bits.
| dragontamer wrote:
| I admit that I don't have the hardware to test your
| claims. But pretty much all the whitepapers I can find on
| TF32 explicitly state the 10-bit mantissa, suggesting
              | that this is, at best, a 19-bit format: 1-bit sign + 8-bit
| exponent + 10-bit mantissa.
|
| Yes, the system will read/write the 32-bit value to RAM.
| But if there's only 10-bits of mantissa in the circuits,
| you're only going to get 10-bits of precision (best
| case). The 10-bit mantissa makes sense because these
              | systems have FP16 circuits (1 sign + 5-bit exponent + 10-bit
| mantissa) and BFloat16 circuits (1 sign + 8-bit exponent
| + 7-bit mantissa). So the 8-bit exponent circuit + 10-bit
| mantissa circuit exists physically on those NVidia cores.
|
| -------
|
| But the 'Tensor Cores' do not support 32-bit (aka: 23-bit
| mantissa) or higher.
| my123 wrote:
| Yup, in a semi-related field, NVIDIA has 3xTF32 for cases
| needing higher precision:
| https://github.com/NVIDIA/cutlass/discussions/361
| touisteur wrote:
                  | There's a paper on getting FP32 accuracy using
                  | TF32 tensor cores at a 3x efficiency cost. Can't
                  | wait to try it with CUTLASS... once I figure out
                  | how to use CUTLASS, woof.
| peter303 wrote:
  | DP Linpack flops are what counts in supercomputer rankings.
  | Stuck at 0.44 exaflops in 2021.
| aninteger wrote:
| Given that it's Nvidia, no Linux support. That's the catch.
| jamesfmilne wrote:
| All the AI software running on these data-centre chips is
| almost exclusively running on Linux.
|
| I wish people would stop talking rubbish about NVIDIA's Linux
| support.
| chockchocschoir wrote:
      | That's because Nvidia's Linux support for consumers is
      | indeed trash, while their creator/business/creative
      | software (e.g. CUDA) is not, but you mostly hear
      | consumers trashing Nvidia.
| pjmlp wrote:
        | Only FOSS zealots, actually; the rest of us are quite OK
        | with their binary drivers.
| oblio wrote:
| They don't make (relevant) money from consumer hardware
| on Linux.
| ScaleneTriangle wrote:
| I thought that only applied to their consumer products.
| jsheard wrote:
| Their consumer products have Linux support too, the catch
| is just that the drivers are proprietary binary blobs
| TheRealSteel wrote:
| Don't they provide Linux drivers for their gaming graphics
| cards too, just not open source?
| gpm wrote:
| Yes
| AHTERIX5000 wrote:
| No Linux support? Guess I'll have to keep using Solaris with
| my A4000!
| simulate-me wrote:
| Nvidia provides Linux drivers for their server chips.
| TheRealSteel wrote:
| Don't they provide them for their consumer cards too, just
| that it's a closed source binary blob?
| throw0101a wrote:
| And not just Linux: FreeBSD.
|
      | * https://www.nvidia.com/en-us/drivers/unix/freebsd-x64-archiv...
      |
      | * https://www.freshports.org/x11/nvidia-driver
      |
      | Heck, _Solaris_:
      |
      | * https://www.nvidia.com/en-us/drivers/unix/solaris-display-ar...
|
| * https://www.nvidia.com/en-us/drivers/unix/
| jxy wrote:
| CUDA and related software/libraries only work on Linux or
| Windows.
| lostmsu wrote:
          | Some are even Linux-only, like NCCL (AFAIK required to
          | fully use NVLink)
| kcb wrote:
| That's a strange statement. The vast vast majority of these
| cards will be in systems running Linux.
| savant_penguin wrote:
| I for one suffer deeply when I try to install the nvidia
| drivers on Linux. The website binaries _always_ break my
| system
|
| Only the ppas from graphics-drivers work properly
|
| My experience on windows is much more automatic and it
| never breaks anything. But I'd rather pay the price
| (installing on Linux) to avoid windows at all costs
| riotnrrd wrote:
| If you installed the drivers using the PPAs, you can't
| then update using the NVIDIA-provided binaries without
| doing a very thorough purge, including deleting all
        | dependent installs (cuDNN, cuBLAS, etc.)
|
| I highly recommend sticking with one technique or the
| other; never intermix them.
| kcb wrote:
          | Yeah, it's not ideal, but really no option is. Built into
          | Linux would be a problem too, given the rate of GPU driver
| development. Most Linux installs in the corporate world
| are stuck on the major version of the kernel and system
| packages they shipped with.
| hughrr wrote:
| 700 watts so being NVidia it'll blow up in 6 months and you'll
| need to wait in a queue for 6 months to RMA it because all the
| miners had bought up the entire supply chain.
| touisteur wrote:
    | Those datacenter/HPC GPUs don't seem to get bought up so much
    | by the mining community? I don't have problems sourcing some
    | through the usual channels (HPE, Dell, ...). But you need
    | somewhat deep pockets.
| p1esk wrote:
  | The catch is it's only for TF32 computations (Nvidia's
  | proprietary 19-bit floating point format)
| cjbgkagh wrote:
    | I missed that; to me that makes the '32' in the name
    | misleading.
| p1esk wrote:
| TF32 = FP32 range + FP16 precision
| cjbgkagh wrote:
| Why not call it TF19 then.
| 37ef_ced3 wrote:
| Because it's 32-bits wide in memory.
|
| The effective mantissa is like FP16 but it's padded out
| to be the same size as FP32.
|
| In other words, there's 1 sign bit, 8 exponent bits, 10
| mantissa bits that are USED, and 13 mantissa bits that
| are IGNORED.
|
| 1 + 8 + 10 + 13 = 32
|
| The 13 ignored mantissa bits are part of the memory
| image: they pad the number out to 32-bit alignment.
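          |
          | You can see the effect by zeroing those 13 ignored bits
          | yourself (this truncates, where the hardware actually
          | rounds, but it shows the 10-bit mantissa):
          |
          |   import numpy as np
          |
          |   def tf32(x):
          |       b = np.array([x], dtype=np.float32).view(np.uint32)
          |       return (b & 0xFFFFE000).view(np.float32)[0]
          |
          |   print(np.float32(1 / 3))  # 0.33333334 (23-bit mantissa)
          |   print(tf32(1 / 3))        # 0.33325195 (10-bit mantissa)
          |
          | On Ampere and later, PyTorch gates TF32 matmuls behind
          | torch.backends.cuda.matmul.allow_tf32.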
| cjbgkagh wrote:
| But the user never sees that memory right? Doesn't it go
| in FP32 and come out FP32? I still think it's deceptive
| marketing.
| bcatanzaro wrote:
| The user does see 32-bits and all bits are used because
| all the additions (and other operations besides the
| multiply in matrix ops) are in FP32. So the bottom bits
| are populated with useful information.
| p1esk wrote:
| Because your existing FP32 models should run fine when
| converted to TF32, so TF32 is equivalent to FP32 as far
| as DL practitioners are concerned.
| cjbgkagh wrote:
            | There is a lot of redundancy in DL that forgives all
            | manner of sins; I still think it's sneaky.
| fancyfredbot wrote:
| The Tensor cores will be great for machine learning and the
| FP32/FP64 throughput fantastic for HPC, but I'd be surprised if
| there were a lot of applications using both of these features at
| once. I wonder if there's room for a competitor to come in and
| sell another huge accelerator with only one of the two, either
| at a lower price or with more performance? Perhaps the power
| density would be too high if everything were in use at once?
___________________________________________________________________
(page generated 2022-03-22 23:00 UTC)