[HN Gopher] Nvidia Announces H100 NVL - Max Memory Server Card f...
___________________________________________________________________
Nvidia Announces H100 NVL - Max Memory Server Card for Large
Language Models
Author : neilmovva
Score : 102 points
Date : 2023-03-21 16:55 UTC (6 hours ago)
(HTM) web link (www.anandtech.com)
(TXT) w3m dump (www.anandtech.com)
| int_19h wrote:
| I wonder how soon we'll see something tailored specifically for
| local applications. Basically just tons of VRAM to be able to
| load large models, but not bleeding edge perf. And eGPU form
| factor, ideally.
| frankchn wrote:
| The Apple M-series CPUs with unified RAM are interesting in this
| regard. You can get a 16-inch MBP with an M2 Max and 96GB of RAM
| for $4300 today, and I expect the M2 Ultra to go to 192GB.
| pixl97 wrote:
| I'm not an ML scientist by any means, but perf seems as
| important as RAM from what I'm reading. Running prompts with an
| internal chain of thought (eating up more TPU time) appears to
| give much better output.
| int_19h wrote:
| It's not that perf is not important, but not having enough
| VRAM means you can't load a model of a given size _at all_.
|
| I'm not saying they shouldn't bother with perf at all, mind
| you. But given some target price, it's a balancing act
| between compute and RAM, and right now it seems that RAM is
| the bigger hurdle.
| neilmovva wrote:
| A bit underwhelming - H100 was announced at GTC 2022, and
| represented a huge stride over A100. But a year later, H100 is
| still not generally available at any public cloud I can find, and
| I haven't yet seen ML researchers reporting any use of H100.
|
| The new "NVL" variant adds ~20% more memory per GPU by enabling
| the sixth HBM stack (previously only five out of six were used).
| Additionally, GPUs now come in pairs with 600GB/s bandwidth
| between the paired devices. However, the pair then uses PCIe as
| the sole interface to the rest of the system. This topology is an
| interesting hybrid of the previous DGX (put all GPUs onto a
| unified NVLink graph), and the more traditional PCIe accelerator
| cards (star topology of PCIe links, host CPU is the root node).
| Probably not an issue; I think PCIe 5.0 x16 is already fast
| enough to not bottleneck multi-GPU training too much.
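|
| Not from the article -- just a quick PyTorch sketch to put a
| rough number on that link, by timing a 1 GiB device-to-device
| copy:
|
|     import time, torch
|
|     assert torch.cuda.device_count() >= 2
|     # 256 Mi float32 elements = 1 GiB, allocated on the first GPU
|     x = torch.empty(256 * 1024**2, dtype=torch.float32, device="cuda:0")
|
|     torch.cuda.synchronize(0); torch.cuda.synchronize(1)
|     t0 = time.time()
|     y = x.to("cuda:1")  # crosses PCIe, or NVLink if the GPUs are peered
|     torch.cuda.synchronize(0); torch.cuda.synchronize(1)
|     print(f"~{1.0 / (time.time() - t0):.1f} GiB/s GPU0 -> GPU1")
|
| Running the same measurement with and without the NVLink bridge
| is an easy way to see what the pairing actually buys you.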
| __anon-2023__ wrote:
| Yes, I was expecting a RAM-doubled edition of the H100, but this
| is just a higher-binned version of the same part.
|
| I got an email from vultr, saying that they're "officially
| taking reservations for the NVIDIA HGX H100", so I guess all
| public clouds are going to get those soon.
| [deleted]
| rerx wrote:
| You can also join a pair of regular PCIe H100 GPUs with an
| NVLink bridge. So that topology is not so new either.
| binarymax wrote:
| It is interesting that Hopper isn't widely available yet.
|
| I have seen some benchmarks from academia but nothing in the
| private sector.
|
| I wonder if they thought they were moving too fast and wanted
| to milk Ampere/Ada as long as possible.
|
| Not having any competition whatsoever means Nvidia can release
| what they like when they like.
| pixl97 wrote:
| The question is, do they not have much production, or is
| OpenAI and Microsoft buying every single one they produce?
| TylerE wrote:
| Why bother when you can get cryptobros paying way over MSRP
| for 3090s?
| binarymax wrote:
| Not just cryptobros. A100s are the current top of the line
| and it's hard to find them available on AWS and Lambda.
| Vast.AI has plenty if you trust renting from a stranger.
|
| AMD really needs to pick up the pace and make a solid
| competitive offering in deep learning. They're slowly
| getting there but they are at least 2 generations out.
| fbdab103 wrote:
| I would take a huge performance hit to just not deal with
| Nvidia drivers. Unless things have changed, it is still
| not really possible to operate on AMD hardware without a
| list of gotchas.
| brucethemoose2 wrote:
| It's still basically impossible to find MI200s in the
| cloud.
|
| On desktops, only the 7000 series is kinda competitive
| for AI in particular, and you have to go out of your way
| to get it running quickly in PyTorch. The 6000 and 5000
| series just weren't designed for AI.
| breatheoften wrote:
| It's crazy to me that no other hardware company has
| sought to compete for the deep learning
| training/inference market yet ...
|
| The existing ecosystems (CUDA, PyTorch, etc.) are all
| pretty garbage anyway -- aside from the massive number of
| tutorials, it doesn't seem like it would actually be hard
| to build a vertically integrated competitor ecosystem ...
| it feels a little like the rise of Rails to me -- is a
| million articles about how to build a blog engine really
| that deep a moat ...?
| wmf wrote:
| There are tons of companies trying; they just aren't
| succeeding.
| andy81 wrote:
| GPU mining died last year.
|
| There's so little liquidity post-merge that it's only worth
| mining as a way to launder stolen electricity.
|
| The bitcoin people still waste raw materials, and prices
| are relatively sticky with so few suppliers and a backlog
| of demand, but we've already seen prices drop heavily since
| then.
| TylerE wrote:
| Right, that's why NVidia is actually trying again. The
| money printer has run out of ink.
| ecshafer wrote:
| I was wondering today if we would start to see the reverse of
| this: small ASICs or some kind of LLM-optimized GPU for
| desktops, or maybe even laptops or mobile. I think it is evident
| that LLMs are here to stay and will be a major part of computing
| for a while. Getting this local, so we aren't reliant on clouds,
| would be a huge boon for personal computing. Even if it's a
| "worse" experience, being able to load up an LLM on our
| computer and tell it to only look at this directory and help out
| would be cool.
| Sol- wrote:
| Why is the sentiment here so much that LLMs will somehow be
| decentralized and run locally at some point? Has the story of
| the internet so far not been that centralization has pretty
| much always won?
| jacquesm wrote:
| Because that is pretty much the pendulum swinging in the IT
| world. Right now it is solidly in 'centralization' territory;
| hopefully it will swing back towards decentralization in
| the future. The whole PC revolution was an excellent
| datapoint for decentralization. Now we're back to 'dumb
| terminals', but as local compute strengthens, the things that
| you need a whole farm of servers for today can probably fit
| in your pocket tomorrow, or at the latest in a few years.
| kaoD wrote:
| I think it's because it feels more similar to Google Stadia
| than to Facebook.
| throwaway743 wrote:
| Sure, for big business, but torrents are still alive and
| well.
| wmf wrote:
| Hackers want to run LLMs locally just because. It's not a
| mainstream thing.
| psychlops wrote:
| I think the sentiment is both. There will be advanced
| centralized LLMs, and people want the option to have a
| personal one (or two). There needn't be a single solution.
| 01100011 wrote:
| A couple of the big players are already looking at developing
| their own chips.
| JonChesterfield wrote:
| Have been for years. Maybe lots of years. It's expensive to
| have a go (many engineers plus the cost of making the things) and
| it's difficult to beat the established players unless you see
| something they're doing wrong or your particular niche really
| cares about something the off-the-shelf hardware doesn't.
| wmf wrote:
| Apple, Intel, AMD, Qualcomm, Samsung, etc. already have "neural
| engines" in their SoCs. These engines continue to evolve to
| better support common types of models.
| ethbr0 wrote:
| Software/hardware co-evolution. Wouldn't be the first time we
| went down that road to good effect.
|
| For anything that _can_ be run remotely, it 'll always be
| deployed and optimized server-side first. Higher utilization
| means more economy.
|
| Then trickle down to local and end user devices if it makes
| sense.
| wyldfire wrote:
| In fact, Qualcomm has announced a "Cloud AI" PCIe card designed
| for inference (as opposed to training) [1, 2]. It's
| populated with NSPs like the ones in mobile SoCs.
|
| [1]
| https://www.qualcomm.com/products/technology/processors/clou...
|
| [2] https://github.com/quic/software-kit-for-qualcomm-cloud-
| ai-1...
| sargun wrote:
| What exactly is an SXM5 socket? It sounds like a PCIe competitor,
| but proprietary to nvidia. Looking at it, it seems specific to
| nvidia DGX (mother?)boards. Is this just a "better" alternative
| to PCIe (with power delivery, and such), or fundamentally a new
| technology?
| koheripbal wrote:
| Yes to all your questions. It's specifically designed for
| commercial compute servers. It provides significantly more
| bandwidth and speed than PCIe.
|
| It's also enormously more expensive and I'm not sure if you can
| buy it new without getting the nvidia compute server.
| 0xbadc0de5 wrote:
| It's one of those /If you have to ask, you can't afford it/
| scenarios.
| g42gregory wrote:
| I wonder how this compares to AMD Instinct MI300 128GB HBM3
| cards?
| tpmx wrote:
| Does AMD have a chance here in the short term (say 24 months)?
| Symmetry wrote:
| AMD seems to be focusing on traditional HPC; they've got a ton
| of FP64 FLOPS in their recent compute parts. I expect
| their server GPUs are mostly for chasing supercomputer
| contracts, which can be pretty lucrative, while they cede model
| training to NVidia.
| tromp wrote:
| The TDP row in the comparison table must be in error. It shows
| the card with dual GH100 GPUs at 700W and the one with a single
| GH100 GPU at 700-800W ?!
| rerx wrote:
| That's the SXM version, used for instance in servers like the
| DGX. It's also faster than the PCIe variant.
| metadat wrote:
| How is this card (which is really two physical cards occupying 2
| PCIe slots) exposed to the OS? Does it show up as a single
| /dev/gfx0 device, or is the unification a driver trick?
| rerx wrote:
| The two cards show as two distinct GPUs to the host, connected
| via NVLink. Unification / load balancing happens via software.
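|
| A minimal sketch of what that looks like from PyTorch (assuming
| a stock CUDA setup, nothing specific to the NVL part): each card
| enumerates separately, and peer access is queryable.
|
|     import torch
|
|     for i in range(torch.cuda.device_count()):
|         print(i, torch.cuda.get_device_name(i))
|
|     # True if GPU 0 can directly access GPU 1's memory
|     # (over NVLink or PCIe peer-to-peer)
|     print("peer access 0<->1:", torch.cuda.can_device_access_peer(0, 1))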
| sva_ wrote:
| Kinda depressing if you consider how they removed NVLink in
| the 4090, stating the following reason:
|
| > "The reason we took [NVLink] off is that we need I/O for
| other things, so we're using that area to cram in as many AI
| processors as possible," Jen-Hsun Huang explained of the
| reason for axing NVLink.[0]
|
| "NVLink is bad for your games and AI, trust me bro."
|
| But then this card, actually aimed at ML applications, uses
| it.
|
| 0. https://www.techgoing.com/nvidia-rtx-4090-no-longer-
| supports...
| aliljet wrote:
| I'm super duper curious if there are ways to pool VRAM
| across consumer-grade hardware to make this whole market more
| accessible to the common hacker.
| bick_nyers wrote:
| I remember reading about a guy who soldered 2GB VRAM modules on
| his 3060 12GB (replacing the 1GB modules) and was able to
| attain 24GB on that card. Or something along those lines.
| rerx wrote:
| You can, for instance, connect two RTX 3090 with an NVLink
| bridge. That gives you 48 GB in total. The 4090 doesn't support
| NVLink anymore.
| mk_stjames wrote:
| You actually can split a model [0] onto multiple GPUs even
| without NVLink, just using PCIe for the transfers.
|
| Depending on the model, the performance is sometimes not all
| that different. I believe that for inference alone, on some
| models the speed difference may barely be noticeable, whereas
| for training it may make a 10+% difference [1].
|
| [0] https://pytorch.org/tutorials/intermediate/model_parallel
| _tu...
|
| [1]
| https://huggingface.co/transformers/v4.9.2/performance.html
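|
| A toy sketch of the idea in [0] (my own example, not the
| tutorial's code): put half the layers on each GPU and move only
| the activations across in forward().
|
|     import torch
|     import torch.nn as nn
|
|     class TwoGPUNet(nn.Module):
|         def __init__(self):
|             super().__init__()
|             self.part1 = nn.Sequential(
|                 nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
|             self.part2 = nn.Linear(4096, 1024).to("cuda:1")
|
|         def forward(self, x):
|             h = self.part1(x.to("cuda:0"))
|             return self.part2(h.to("cuda:1"))  # only activations cross the bus
|
|     out = TwoGPUNet()(torch.randn(8, 1024))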
| koheripbal wrote:
| > The 4090 doesn't support NVLink anymore.
|
| Are you sure about that?
| homarp wrote:
| that's what the press said:
| https://www.tomshardware.com/news/gigabyte-leaves-nvlink-
| tra...
| andrewstuart wrote:
| GPUs are going to be weird, underconfigured and overpriced until
| there is real competition.
|
| Whether or not there is real competition depends entirely on
| whether Intel's Arc line of GPUs stays in the market.
|
| AMD strangely has decided not to compete. Its newest GPU, the
| 7900 XTX, is an extremely powerful card, close to the
| top-of-the-line Nvidia RTX 4090 in raster performance.
|
| If AMD had introduced it at an aggressively low price, then
| they could have wedged Nvidia, which is determined to exploit
| its market dominance by squeezing the maximum money out of
| buyers.
|
| Instead, AMD has decided to simply follow Nvidia in squeezing for
| maximum prices, with AMD prices slightly behind Nvidia's.
|
| It's a strange decision from AMD, which is well behind in market
| share and apparently disinterested in increasing that share by
| competing aggressively.
|
| So a third player is needed - Intel - since it's a lot harder for
| three companies to sit on outrageously high prices for years
| rather than compete with each other for market share.
| dragontamer wrote:
| The root cause is that TSMC raised prices on everyone.
|
| Since Intel GPUs are again TSMC manufactured, you really aren't
| going to see price improvements unless Intel subsidizes all of
| this.
| andrewstuart wrote:
| >> The root cause is that TSMC raised prices on everyone.
|
| This is not correct.
| dragontamer wrote:
| https://www.tomshardware.com/news/tsmc-ups-chip-
| production-p...
| andrewstuart wrote:
| You are correct that the manufacturing cost has gone up.
|
| You are incorrect that this is the root cause of GPU
| prices being sky high.
|
| If manufacturing cost were the root cause, then it would be
| simply impossible to bring prices down without losing
| money.
|
| The root cause of GPU prices being so high is lack of
| competition - AMD and Nvidia are choosing to maximise
| profit, and they are deliberately undersupplying the
| market to create scarcity and therefore prop up prices.
|
| "AMD 'undershipping' chips to help prop prices up"
| https://www.pcgamer.com/amd-undershipping-chips-to-help-
| prop...
|
| "AMD is 'undershipping' chips to balance CPU, GPU supply
| Less supply to balance out demand--and keep prices high."
| https://www.pcworld.com/article/1499957/amd-is-
| undershipping...
|
| In summary, GPU prices are ridiculously high because Nvidia
| and AMD believe this is what gamers will pay, NOT because
| manufacturing costs have forced prices to be high.
| JonChesterfield wrote:
| GPUs strike me as absurdly cheap given the performance they can
| offer. I'd just like them to be easier to program.
| andrewstuart wrote:
| Depends on the GPU of course but at the top end of the market
| AUD$3000 / USD$1,600 is not cheap and certainly not absurdly
| cheap.
|
| Much less powerful GPUs represent better value but the market
| is ridiculously overpriced at the moment.
| brucethemoose2 wrote:
| The really interesting upcoming LLM products are from AMD and
| Intel... with catches.
|
| - The Intel Falcon Shores XPU is basically a big GPU that can use
| DDR5 DIMMs directly, hence it can fit absolutely enormous models
| into a single pool. But it has been delayed to 2025 :/
|
| - AMD have not mentioned anything about the (not delayed) MI300
| supporting DIMMs. If it doesn't, it's capped to 128GB, and it's
| being marketed as an HPC product like the MI200 anyway (which you
| basically cannot find on cloud services).
|
| Nvidia also has some DDR5 Grace CPUs, but the memory is embedded
| and I'm not sure how much of a GPU they have. Other startups
| (Tenstorrent, Cerebras, Graphcore and such) seem to have
| underestimated the memory requirements of future models.
| YetAnotherNick wrote:
| > DDR5 DIMMs directly
|
| That's the problem. Good DDR5 RAM's bandwidth is <100GB/s,
| while Nvidia's HBM goes up to 2TB/s, and the bottleneck for
| most applications still lies in memory bandwidth.
| brucethemoose2 wrote:
| Not if the bus is wide enough :P. EPYC Genoa is already
| ~450GB/s, and the M2 Max is 400GB/s.
|
| Anyway, what I was implying is that simply _fitting_ a
| trillion-parameter model into a single pool is probably more
| efficient than splitting it up over a power-hungry
| interconnect. Bandwidth is much lower, but so is latency, and
| you are shuffling _much_ less data around.
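|
| Back-of-envelope, assuming decoding is purely bandwidth-bound
| and every weight is read once per token -- the ceiling on
| tokens/s is just bandwidth divided by model size:
|
|     model_gb = 350  # e.g. a ~175B-parameter model in fp16
|     for name, gb_per_s in {"12-channel DDR5": 450, "HBM": 2000}.items():
|         print(f"{name}: ~{gb_per_s / model_gb:.1f} tokens/s ceiling")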
| virtuallynathan wrote:
| Grace can be paired with Hopper via a 900GB/s NVLINK bus
| (500GB/s memory bandwidth), 1TB of LPDDR5 on the CPU and
| 80-94GB of HBM3 on the GPU.
| brucethemoose2 wrote:
| That does sound pretty good, but it's still going chip-to-chip
| over NVLink.
| enlyth wrote:
| Please give us consumer cards with more than 24GB VRAM, Nvidia.
|
| It was a slap in the face when the 4090 had the same memory
| capacity as the 3090.
|
| A6000 is 5000 dollars, ain't no hobbyist at home paying for that.
| nullc wrote:
| Nvidia can't do a large 'consumer' card without cannibalizing
| their commercial ML business. ATI doesn't have that problem.
|
| ATI seems to be holding the idiot ball.
|
| Port Stable Diffusion and CLIP to their hardware. Train an
| upsized version sized for a 48GB card. Release a prosumer 48GB
| card... get huge uptake from artists and creators using the
| tech.
| andrewstuart wrote:
| Nvidia don't want consumers using consumer GPUs for business.
|
| If you are a business user then you must pay Nvidia gargantuan
| amounts of money.
|
| This is the outcome of a market leader with no real competition
| - you pay much more for less power than the consumer GPUs, and
| you are forced into using their business GPUs through software
| license restrictions on the drivers.
| Melatonic wrote:
| That was always why the Titan line was so great - they
| typically unlocked features in between Quadro and Gaming
| cards. Sometimes it was subtle (like very good FP32 AND FP16
| performance), or adding full 10-bit colour support only if you
| had a Titan. Now it seems like they have opened up even more
| of those features to consumer cards (at least the creative
| ones) with the Studio drivers.
| koheripbal wrote:
| Isn't a new Titan RTX 4090 coming out soon?
| andrewstuart wrote:
| Hmmm ... "Studio Drivers" ... how are these tangibly
| different from gaming drivers?
|
| According to this, the difference seems to be that Studio
| Drivers are older and better tested, nothing else.
|
| https://nvidia.custhelp.com/app/answers/detail/a_id/4931/~/
| n...
|
| What am I missing in my understanding of Studio Drivers?
|
| """ How do Studio Drivers differ from Game Ready Drivers
| (GRD)?
|
| In 2014, NVIDIA created the Game Ready Driver program to
| provide the best day-0 gaming experience. In order to
| accomplish this, the release cadence for Game Ready Drivers
| is driven by the release of major new game content giving
| our driver team as much time as possible to work on a given
| title. In similar fashion, NVIDIA now offers the Studio
| Driver program. Designed to provide the ultimate in
| functionality and stability for creative applications,
| Studio Drivers provide extensive testing against top
| creative applications and workflows for the best
| performance possible, and support any major creative app
| updates to ensure that you are ready to update any apps on
| Day 1. """
| koheripbal wrote:
| We're NOT business users, we just want to run our own LLM at
| home.
|
| Given the size of LLMs, this should be possible with just a
| little bit of extra VRAM.
| enlyth wrote:
| Exactly, we're just below that sweet spot right now.
|
| For example on 24GB, Llama 30B runs only in 4bit mode and
| very slowly, but I can imagine an RLHF-finetuned 30B or 65B
| version running in at least 8bit would be actually useful,
| and you could run it on your own computer easily.
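|
| A minimal sketch of the kind of setup I mean, using HF
| transformers with bitsandbytes int8 (the checkpoint name is just
| a placeholder; int8 is roughly 1 byte per parameter, so ~30GB of
| weights for a 30B model before activations):
|
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "llama-30b"  # placeholder checkpoint id
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(
|         name,
|         load_in_8bit=True,   # int8 weights via bitsandbytes
|         device_map="auto",   # offload layers that don't fit in VRAM
|     )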
| riku_iki wrote:
| > For example on 24GB, Llama 30B runs only in 4bit mode
| and very slowly
|
| why do you think adding VRAM, but not cores, will make it
| run faster?
| enlyth wrote:
| I've been told the 4 bit quantization slows it down, but
| don't quote me on this since I was unable to benchmark at
| 8 bit locally
|
| In any case, you're right, it might not be as significant;
| however, the quality of the output increases with 8/16bit,
| and running 65B is completely impossible on 24GB.
| bick_nyers wrote:
| Do you know where the cutoff is? Does 32GB VRAM give us
| 30B int8 with/without an RLHF layer? I don't think the 5090 is
| going to go straight to 48GB; I'm thinking either 32 or
| 40GB (if not 24GB).
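|
| Rough weight-only arithmetic (ignoring KV cache and runtime
| overhead):
|
|     params = 30e9
|     for fmt, b in {"fp16": 2, "int8": 1, "int4": 0.5}.items():
|         print(fmt, f"~{params * b / 2**30:.0f} GiB of weights")
|
| So 32GB looks tight for 30B int8 once the KV cache is added,
| while 40GB would leave some headroom.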
| bick_nyers wrote:
| I don't think you understand though, they don't WANT you.
| They WANT the version of you who makes $150k+ a year and
| will splurge $5k on a Quadro.
|
| If they had trouble selling stock we would see this niche
| market get catered to.
| garbagecoder wrote:
| Sarah Connor is totally coming for NVIDIA.
| ipsum2 wrote:
| I would sell a kidney for one of these. It's basically impossible
| to train language models on a consumer 24GB card. The jump up is
| the A6000 ADA, at 48GB for $8,000. This one will probably be
| priced somewhere in the $100k+ range.
| solarmist wrote:
| You think? It's double 48 GB (per card) so why wouldn't it be
| in the $20k range?
| ipsum2 wrote:
| Machine learning is so hyped right now (with good reason) that
| customers are price-insensitive.
| solarmist wrote:
| I guess we'll see.
| YetAnotherNick wrote:
| Use 4 consumer-grade 4090s then. It would be much cheaper and
| better in almost every aspect. Also, even with this, forget
| about training foundation models. Meta spent 82k GPU-hours on
| the smallest LLaMA and 1M hours on the largest.
| throwaway743 wrote:
| Go with 2x 3090s instead. 4000 series doesn't support SLI, so
| you're stuck with the max of whatever one card you get.
| bick_nyers wrote:
| If I remember correctly, NVLink adds 100GB/s (where PCIe
| 4.0 is 64GB/s). Is it really worth getting 3090 performance
| (roughly half) for that extra bus speed?
| 0xbadc0de5 wrote:
| So it's essentially two H100s in a trenchcoat? (plus a
| sprinkling of "latest")
___________________________________________________________________
(page generated 2023-03-21 23:01 UTC)