[HN Gopher] Nvidia Announces H100 NVL - Max Memory Server Card f...
       ___________________________________________________________________
        
       Nvidia Announces H100 NVL - Max Memory Server Card for Large
       Language Models
        
       Author : neilmovva
       Score  : 102 points
       Date   : 2023-03-21 16:55 UTC (6 hours ago)
        
 (HTM) web link (www.anandtech.com)
 (TXT) w3m dump (www.anandtech.com)
        
       | int_19h wrote:
       | I wonder how soon we'll see something tailored specifically for
       | local applications. Basically just tons of VRAM to be able to
       | load large models, but not bleeding edge perf. And eGPU form
       | factor, ideally.
        
         | frankchn wrote:
          | The Apple M-series CPUs with unified RAM are interesting in
          | this regard. You can get a 16-inch MBP with an M2 Max and 96GB
          | of RAM for $4300 today, and I expect the M2 Ultra to go to
          | 192GB.
        
         | pixl97 wrote:
          | I'm not an ML scientist by any means, but perf seems as
          | important as RAM from what I'm reading. Running prompts with an
          | internal chain of thought (eating up more TPU time) appears to
          | give much better output.
        
           | int_19h wrote:
            | It's not that perf is not important, but not having enough
            | VRAM means you can't load a model of a given size _at all_.
           | 
           | I'm not saying they shouldn't bother with RAM at all, mind
           | you. But given some target price, it's a balance thing
           | between compute and RAM, and right now it seems that RAM is
           | the bigger hurdle.
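            | 
            | Rough arithmetic for the weights alone (fp16, ignoring
            | activations and KV cache), just to illustrate the point:
            | 
            |     # Rough VRAM needed just to hold fp16 weights, in GB.
            |     def weight_vram_gb(params_billion, bytes_per_param=2):
            |         return params_billion * bytes_per_param
            | 
            |     for p in (7, 13, 30, 65):
            |         print(f"{p}B fp16: ~{weight_vram_gb(p):.0f} GB")
            |     # 7B: ~14 GB, 13B: ~26 GB, 30B: ~60 GB, 65B: ~130 GB
            | 
            | So a 13B model already won't fit on a 24GB card without
            | quantization, no matter how fast the card is.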
        
       | neilmovva wrote:
       | A bit underwhelming - H100 was announced at GTC 2022, and
       | represented a huge stride over A100. But a year later, H100 is
       | still not generally available at any public cloud I can find, and
       | I haven't yet seen ML researchers reporting any use of H100.
       | 
       | The new "NVL" variant adds ~20% more memory per GPU by enabling
       | the sixth HBM stack (previously only five out of six were used).
       | Additionally, GPUs now come in pairs with 600GB/s bandwidth
       | between the paired devices. However, the pair then uses PCIe as
       | the sole interface to the rest of the system. This topology is an
       | interesting hybrid of the previous DGX (put all GPUs onto a
       | unified NVLink graph), and the more traditional PCIe accelerator
       | cards (star topology of PCIe links, host CPU is the root node).
       | Probably not an issue, I think PCIe 5.0 x16 is already fast
       | enough to not bottleneck multi-GPU training too much.
        
         | __anon-2023__ wrote:
          | Yes, I was expecting a RAM-doubled edition of the H100; this is
          | just a higher-binned version of the same part.
         | 
         | I got an email from vultr, saying that they're "officially
         | taking reservations for the NVIDIA HGX H100", so I guess all
         | public clouds are going to get those soon.
        
           | [deleted]
        
         | rerx wrote:
         | You can also join a pair of regular PCIe H100 GPUs with an
         | NVLink bridge. So that topology is not so new either.
        
         | binarymax wrote:
          | It is interesting that Hopper isn't widely available yet.
         | 
         | I have seen some benchmarks from academia but nothing in the
         | private sector.
         | 
          | I wonder if they thought they were moving too fast and wanted
          | to milk Ampere/Ada as long as possible.
         | 
         | Not having any competition whatsoever means Nvidia can release
         | what they like when they like.
        
           | pixl97 wrote:
            | The question is, do they not have much production, or are
            | OpenAI and Microsoft buying every single one they produce?
        
           | TylerE wrote:
           | Why bother when you can get cryptobros paying way over MSRP
           | for 3090s?
        
             | binarymax wrote:
             | Not just cryptobros. A100s are the current top of the line
             | and it's hard to find them available on AWS and Lambda.
             | Vast.AI has plenty if you trust renting from a stranger.
             | 
             | AMD really needs to pick up the pace and make a solid
             | competitive offering in deep learning. They're slowly
             | getting there but they are at least 2 generations out.
        
               | fbdab103 wrote:
               | I would take a huge performance hit to just not deal with
               | Nvidia drivers. Unless things have changed, it is still
               | not really possible to operate on AMD hardware without a
               | list of gotchas.
        
               | brucethemoose2 wrote:
                | It's still basically impossible to find MI200s in the
               | cloud.
               | 
               | On desktops, only the 7000 series is kinda competitive
               | for AI in particular, and you have to go out of your way
                | to get it running quickly in PyTorch. The 6000 and 5000
               | series just weren't designed for AI.
        
               | breatheoften wrote:
               | It's crazy to me that no other hardware company has
               | sought to compete for the deep learning
               | training/inference market yet ...
               | 
                | The existing ecosystems (CUDA, PyTorch, etc.) are all
                | pretty garbage anyway -- aside from the massive number of
                | tutorials, it doesn't seem like it would actually be hard
                | to build a vertically integrated competitor ecosystem ...
                | it feels a little like the rise of Rails to me -- are a
                | million articles about how to build a blog engine really
                | that deep a moat?
        
               | wmf wrote:
               | There are tons of companies trying; they just aren't
               | succeeding.
        
             | andy81 wrote:
             | GPU mining died last year.
             | 
             | There's so little liquidity post-merge that it's only worth
             | mining as a way to launder stolen electricity.
             | 
             | The bitcoin people still waste raw materials, and prices
             | are relatively sticky with so few suppliers and a backlog
             | of demand, but we've already seen prices drop heavily since
             | then.
        
               | TylerE wrote:
                | Right, that's why Nvidia is actually trying again. The
                | money printer has run out of ink.
        
       | ecshafer wrote:
        | I was wondering today if we would start to see the reverse of
        | this: small ASICs, or some kind of LLM-optimized GPU for
        | desktops, or maybe even laptops or mobile. It is evident, I
        | think, that LLMs are here to stay and will be a major part of
        | computing for a while. Getting this local, so we aren't reliant
        | on clouds, would be a huge boon for personal computing. Even if
        | it's a "worse" experience, being able to load up an LLM onto our
        | computer and tell it to only look at this directory and help out
        | would be cool.
        
         | Sol- wrote:
         | Why is the sentiment here so much that LLMs will somehow be
         | decentralized and run locally at some point? Has the story of
         | the internet so far not been that centralization has pretty
         | much always won?
        
           | jacquesm wrote:
            | Because that is pretty much the pendulum swinging in the IT
            | world. Right now it is solidly in 'centralization' territory;
            | hopefully it will swing back towards decentralization in the
            | future. The whole PC revolution was an excellent datapoint
            | for decentralization. Now we're back to 'dumb terminals', but
            | as local compute strengthens, the things that you need a
            | whole farm of servers for today can probably fit in your
            | pocket tomorrow, or at the latest in a few years.
        
           | kaoD wrote:
           | I think it's because it feels more similar to Google Stadia
           | than to Facebook.
        
           | throwaway743 wrote:
           | Sure, for big business, but torrents are still alive and
           | well.
        
           | wmf wrote:
           | Hackers want to run LLMs locally just because. It's not a
           | mainstream thing.
        
           | psychlops wrote:
            | I think the sentiment is both. There will be advanced
            | centralized LLMs, and people want the option to have a
            | personal one (or two). There needn't be a single solution.
        
         | 01100011 wrote:
         | A couple of the big players are already looking at developing
         | their own chips.
        
           | JonChesterfield wrote:
           | Have been for years. Maybe lots of years. It's expensive to
           | have a go (many engineers plus cost of making the things) and
           | it's difficult to beat the established players unless you see
           | something they're doing wrong or your particular niche really
           | cares about something the off the shelf hardware doesn't.
        
         | wmf wrote:
         | Apple, Intel, AMD, Qualcomm, Samsung, etc. already have "neural
         | engines" in their SoCs. These engines continue to evolve to
         | better support common types of models.
        
         | ethbr0 wrote:
         | Software/hardware co-evolution. Wouldn't be the first time we
         | went down that road to good effect.
         | 
          | For anything that _can_ be run remotely, it'll always be
         | deployed and optimized server-side first. Higher utilization
         | means more economy.
         | 
         | Then trickle down to local and end user devices if it makes
         | sense.
        
         | wyldfire wrote:
          | In fact, Qualcomm has announced a "Cloud AI" PCIe card designed
          | for inference (as opposed to training) [1, 2]. It's populated
          | with NSPs like the ones in mobile SoCs.
         | 
         | [1]
         | https://www.qualcomm.com/products/technology/processors/clou...
         | 
         | [2] https://github.com/quic/software-kit-for-qualcomm-cloud-
         | ai-1...
        
       | sargun wrote:
       | What exactly is an SXM5 socket? It sounds like a PCIe competitor,
       | but proprietary to nvidia. Looking at it, it seems specific to
       | nvidia DGX (mother?)boards. Is this just a "better" alternative
       | to PCIe (with power delivery, and such), or fundamentally a new
       | technology?
        
         | koheripbal wrote:
         | Yes to all your questions. It's specifically designed for
         | commercial compute servers. It provides significantly more
         | bandwidth and speed over PCIe.
         | 
         | It's also enormously more expensive and I'm not sure if you can
         | buy it new without getting the nvidia compute server.
        
         | 0xbadc0de5 wrote:
         | It's one of those /If you have to ask, you can't afford it/
         | scenarios.
        
       | g42gregory wrote:
       | I wonder how this compares to AMD Instinct MI300 128GB HBM3
       | cards?
        
       | tpmx wrote:
       | Does AMD have a chance here in the short term (say 24 months)?
        
         | Symmetry wrote:
          | AMD seems to be focusing on traditional HPC; they've got a ton
          | of 64-bit FLOPS in their recent compute cards. I expect their
          | server GPUs are mostly for chasing supercomputer contracts,
          | which can be pretty lucrative, while they cede model training
          | to Nvidia.
        
       | tromp wrote:
       | The TDP row in the comparison table must be in error. It shows
       | the card with dual GH100 GPUs at 700W and the one with a single
       | GH100 GPU at 700-800W ?!
        
         | rerx wrote:
         | That's the SXM version, used for instance in servers like the
          | DGX. It's also faster than the PCIe variant.
        
       | metadat wrote:
       | How is this card (which is really two physical cards occupying 2
       | PCIe slots) exposed to the OS? Does it show up as a single
       | /dev/gfx0 device, or is the unification a driver trick?
        
         | rerx wrote:
         | The two cards show as two distinct GPUs to the host, connected
         | via NVLink. Unification / load balancing happens via software.
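          | 
          | A minimal sketch of what that looks like from PyTorch (nothing
          | here is NVL-specific; the pair just shows up as two devices):
          | 
          |     import torch
          | 
          |     # The NVL pair appears as two ordinary CUDA devices.
          |     for i in range(torch.cuda.device_count()):
          |         print(i, torch.cuda.get_device_name(i))
          | 
          |     # Peer-to-peer access (over NVLink here) lets one GPU read
          |     # the other's memory directly; frameworks use this when
          |     # splitting work across the pair.
          |     if torch.cuda.device_count() >= 2:
          |         print(torch.cuda.can_device_access_peer(0, 1))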
        
           | sva_ wrote:
           | Kinda depressing if you consider how they removed NVLink in
           | the 4090, stating the following reason:
           | 
           | > "The reason we took [NVLink] off is that we need I/O for
           | other things, so we're using that area to cram in as many AI
           | processors as possible," Jen-Hsun Huang explained of the
           | reason for axing NVLink.[0]
           | 
           | "NVLink is bad for your games and AI, trust me bro."
           | 
           | But then this card, actually aimed at ML applications, uses
           | it.
           | 
           | 0. https://www.techgoing.com/nvidia-rtx-4090-no-longer-
           | supports...
        
       | aliljet wrote:
       | I'm super duper curious if there are ways to glob together VRAM
       | between consumer-grade hardware to make this whole market more
       | accessible to the common hacker?
        
         | bick_nyers wrote:
         | I remember reading about a guy who soldered 2GB VRAM modules on
         | his 3060 12GB (replacing the 1GB modules) and was able to
         | attain 24GB on that card. Or something along those lines.
        
         | rerx wrote:
          | You can, for instance, connect two RTX 3090s with an NVLink
         | bridge. That gives you 48 GB in total. The 4090 doesn't support
         | NVLink anymore.
        
           | mk_stjames wrote:
           | You actually can split a model [0] onto multiple GPUs even
           | without NVLink, just using the PCIe for the transfers.
           | 
            | Depending on the model, the performance is sometimes not all
            | that different. I believe for pure inference on some models
            | the speed difference may be barely noticeable, whereas for
            | training it may make a 10+% difference [1].
           | 
           | [0] https://pytorch.org/tutorials/intermediate/model_parallel
           | _tu...
           | 
           | [1]
           | https://huggingface.co/transformers/v4.9.2/performance.html
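            | 
            | For reference, a minimal sketch of that kind of split in
            | PyTorch (assumes two visible GPUs; layer sizes are made up):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     class TwoGPUModel(nn.Module):
            |         def __init__(self):
            |             super().__init__()
            |             # First half of the network lives on GPU 0...
            |             self.part1 = nn.Sequential(
            |                 nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
            |             # ...second half on GPU 1. Only activations cross
            |             # the bus (NVLink if present, otherwise PCIe).
            |             self.part2 = nn.Sequential(
            |                 nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")
            | 
            |         def forward(self, x):
            |             x = self.part1(x.to("cuda:0"))
            |             return self.part2(x.to("cuda:1"))
            | 
            |     model = TwoGPUModel()
            |     out = model(torch.randn(8, 4096))  # ends up on cuda:1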
        
           | koheripbal wrote:
           | > The 4090 doesn't support NVLink anymore.
           | 
           | Are you sure about that?
        
             | homarp wrote:
             | that's what the press said:
             | https://www.tomshardware.com/news/gigabyte-leaves-nvlink-
             | tra...
        
       | andrewstuart wrote:
       | GPUs are going to be weird, underconfigured and overpriced until
       | there is real competition.
       | 
        | Whether or not there is real competition depends entirely on
        | whether Intel's Arc line of GPUs stays in the market.
        | 
        | AMD strangely has decided not to compete. Its newest GPU, the
        | 7900 XTX, is an extremely powerful card, close to the
        | top-of-the-line Nvidia RTX 4090 in raster performance.
        | 
        | If AMD had introduced it with an aggressively low price then they
        | could have wedged Nvidia, which is determined to exploit its
        | market dominance by squeezing the maximum money out of buyers.
        | 
        | Instead, AMD has decided to simply follow Nvidia in squeezing for
        | maximum prices, with AMD prices slightly behind Nvidia's.
        | 
        | It's a strange decision from AMD, which is well behind in market
        | share and apparently disinterested in increasing it by competing
        | aggressively.
        | 
        | So a third player is needed - Intel - it's a lot harder for three
        | companies to sit on outrageously high prices for years rather
        | than compete with each other for market share.
        
         | dragontamer wrote:
          | The root cause is that TSMC raised prices on everyone.
         | 
         | Since Intel GPUs are again TSMC manufactured, you really aren't
         | going to see price improvements unless Intel subsidizes all of
         | this.
        
           | andrewstuart wrote:
            | >> The root cause is that TSMC raised prices on everyone.
           | 
           | This is not correct.
        
             | dragontamer wrote:
             | https://www.tomshardware.com/news/tsmc-ups-chip-
             | production-p...
        
               | andrewstuart wrote:
               | You are correct that the manufacturing cost has gone up.
               | 
               | You are incorrect that this is the root cause of GPU
               | prices being sky high.
               | 
               | If manufacturing cost was the root cause then it would be
               | simply impossible to bring prices down without losing
               | money.
               | 
               | The root cause of GPU prices being so high is lack of
               | competition - AMD and Nvidia are choosing to maximise
               | profit, and they are deliberately undersupplying the
               | market to create scarcity and therefore prop up prices.
               | 
               | "AMD 'undershipping' chips to help prop prices up"
               | https://www.pcgamer.com/amd-undershipping-chips-to-help-
               | prop...
               | 
               | "AMD is 'undershipping' chips to balance CPU, GPU supply
               | Less supply to balance out demand--and keep prices high."
               | https://www.pcworld.com/article/1499957/amd-is-
               | undershipping...
               | 
                | In summary, GPU prices are ridiculously high because
                | Nvidia and AMD are overpricing them, believing this is
                | what gamers will pay, NOT because manufacturing costs
                | have forced prices to be high.
        
         | JonChesterfield wrote:
         | GPUs strike me as absurdly cheap given the performance they can
         | offer. I'd just like them to be easier to program.
        
           | andrewstuart wrote:
           | Depends on the GPU of course but at the top end of the market
           | AUD$3000 / USD$1,600 is not cheap and certainly not absurdly
           | cheap.
           | 
           | Much less powerful GPUs represent better value but the market
           | is ridiculously overpriced at the moment.
        
       | brucethemoose2 wrote:
       | The really interesting upcoming LLM products are from AMD and
       | Intel... with catches.
       | 
        | - The Intel Falcon Shores XPU is basically a big GPU that can use
        | DDR5 DIMMs directly, hence it can fit absolutely enormous models
        | into a single pool. But it has been delayed to 2025 :/
        | 
        | - AMD have not mentioned anything about the (not delayed) MI300
        | supporting DIMMs. If it doesn't, it's capped to 128GB, and it's
        | being marketed as an HPC product like the MI200 anyway (which you
        | basically cannot find on cloud services).
        | 
        | Nvidia also has the LPDDR5-based Grace CPUs, but the memory is
        | soldered on-package and I'm not sure how much of a GPU they have.
        | Other startups (Tenstorrent, Cerebras, Graphcore and such) seem
        | to have underestimated the memory requirements of future models.
        
         | YetAnotherNick wrote:
          | > DDR5 DIMMs directly
          | 
          | That's the problem. Good DDR5 RAM tops out below 100GB/s of
          | memory bandwidth, while Nvidia's cards have up to 2TB/s, and
          | the bottleneck still lies in memory speed for most
          | applications.
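          | 
          | Back-of-the-envelope, assuming generation is bandwidth-bound
          | and every fp16 weight is read once per generated token (a 65B
          | model is just an example):
          | 
          |     model_bytes = 65e9 * 2  # 65B params in fp16, ~130 GB
          |     for name, bw in [("DDR5 host RAM", 100e9), ("HBM", 2e12)]:
          |         print(f"{name}: ~{bw / model_bytes:.1f} tokens/s")
          |     # DDR5 host RAM: ~0.8 tokens/s
          |     # HBM: ~15.4 tokens/s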
        
           | brucethemoose2 wrote:
           | Not if the bus is wide enough :P. EPYC Genoa is already
           | ~450GB/s, and the M2 max is 400GB/s.
           | 
            | Anyway, what I was implying is that simply _fitting_ a
            | trillion-parameter model into a single pool is probably more
            | efficient than splitting it up over a power-hungry
            | interconnect. Bandwidth is much lower, but latency is also
            | lower, and you are shuffling _much_ less data around.
        
         | virtuallynathan wrote:
         | Grace can be paired with Hopper via a 900GB/s NVLINK bus
         | (500GB/s memory bandwidth), 1TB of LPDDR5 on the CPU and
         | 80-94GB of HBM3 on the GPU.
        
           | brucethemoose2 wrote:
            | That does sound pretty good, but it's still going chip to
            | chip over NVLink.
        
       | enlyth wrote:
       | Please give us consumer cards with more than 24GB VRAM, Nvidia.
       | 
       | It was a slap in the face when the 4090 had the same memory
       | capacity as the 3090.
       | 
       | A6000 is 5000 dollars, ain't no hobbyist at home paying for that.
        
         | nullc wrote:
         | Nvidia can't do a large 'consumer' card without cannibalizing
         | their commercial ML business. ATI doesn't have that problem.
         | 
         | ATI seems to be holding the idiot ball.
         | 
          | Port Stable Diffusion and CLIP to their hardware. Train an
          | upsized version sized for a 48GB card. Release a prosumer 48GB
          | card... get huge uptake from artists and creators using the
         | tech.
        
         | andrewstuart wrote:
         | Nvidia don't want consumers using consumer GPUs for business.
         | 
         | If you are a business user then you must pay Nvidia gargantuan
         | amounts of money.
         | 
            | This is the outcome of a market leader with no real
            | competition - you pay much more for less performance than the
            | consumer GPUs, and you are forced into using their business
            | GPUs through software license restrictions on the drivers.
        
           | Melatonic wrote:
            | That was always why the Titan line was so great - they
            | typically unlocked features in between Quadro and gaming
            | cards. Sometimes it was subtle, like very good FP32 AND FP16
            | performance, or full 10-bit colour support that you only got
            | with a Titan. Now it seems like they have opened up even more
            | of those features to consumer cards (at least the creative
            | ones) with the Studio drivers.
        
             | koheripbal wrote:
             | Isn't a new Titan RTX 4090 coming out soon?
        
             | andrewstuart wrote:
             | Hmmm ... "Studio Drivers" ... how are these tangibly
             | different to gaming drivers?
             | 
             | According to this, the difference seems to be that Studio
             | Drivers are older and better tested, nothing else.
             | 
             | https://nvidia.custhelp.com/app/answers/detail/a_id/4931/~/
             | n...
             | 
             | What am I missing in my understanding of Studio Drivers?
             | 
             | """ How do Studio Drivers differ from Game Ready Drivers
             | (GRD)?
             | 
             | In 2014, NVIDIA created the Game Ready Driver program to
             | provide the best day-0 gaming experience. In order to
             | accomplish this, the release cadence for Game Ready Drivers
             | is driven by the release of major new game content giving
             | our driver team as much time as possible to work on a given
             | title. In similar fashion, NVIDIA now offers the Studio
             | Driver program. Designed to provide the ultimate in
             | functionality and stability for creative applications,
             | Studio Drivers provide extensive testing against top
             | creative applications and workflows for the best
             | performance possible, and support any major creative app
             | updates to ensure that you are ready to update any apps on
              | Day 1. """
        
           | koheripbal wrote:
           | We're NOT business users, we just want to run our own LLM at
           | home.
           | 
           | Given the size of LLMs, this should be possible with just a
           | little bit of extra VRAM.
        
             | enlyth wrote:
             | Exactly, we're just below that sweet spot right now.
             | 
              | For example, on 24GB, LLaMA 30B runs only in 4-bit mode and
              | very slowly, but I can imagine an RLHF-finetuned 30B or 65B
              | version running in at least 8-bit would be actually useful,
              | and you could run it on your own computer easily.
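              | 
              | For reference, loading in 8-bit is roughly this with
              | transformers + bitsandbytes (the checkpoint name is a
              | placeholder; 30B in int8 still needs well over 24GB):
              | 
              |     from transformers import (AutoModelForCausalLM,
              |                               AutoTokenizer)
              | 
              |     model_id = "path/to/llama-30b"  # placeholder
              |     tok = AutoTokenizer.from_pretrained(model_id)
              |     model = AutoModelForCausalLM.from_pretrained(
              |         model_id,
              |         device_map="auto",  # spread layers over GPUs/CPU
              |         load_in_8bit=True,  # int8 weights via bitsandbytes
              |     )
              |     inputs = tok("The H100 NVL is", return_tensors="pt")
              |     inputs = inputs.to(model.device)
              |     out = model.generate(**inputs, max_new_tokens=20)
              |     print(tok.decode(out[0]))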
        
               | riku_iki wrote:
                | > For example, on 24GB, LLaMA 30B runs only in 4-bit mode
                | and very slowly
               | 
                | why do you think adding VRAM, but not cores, will make it
                | run faster?
        
               | enlyth wrote:
                | I've been told the 4-bit quantization slows it down, but
                | don't quote me on this since I was unable to benchmark at
                | 8-bit locally.
                | 
                | In any case, you're right that it might not be as
                | significant. However, the quality of the output increases
                | with 8/16-bit, and running 65B is completely impossible
                | on 24GB.
        
               | bick_nyers wrote:
                | Do you know where the cutoff is? Does 32GB VRAM give us
                | 30B int8 with/without an RLHF layer? I don't think the
                | 5090 is going to go straight to 48GB; I'm thinking either
                | 32 or 40GB (if not 24GB).
        
             | bick_nyers wrote:
             | I don't think you understand though, they don't WANT you.
             | They WANT the version of you who makes $150k+ a year and
             | will splurge $5k on a Quadro.
             | 
             | If they had trouble selling stock we would see this niche
             | market get catered to.
        
       | garbagecoder wrote:
       | Sarah Connor is totally coming for NVIDIA.
        
       | ipsum2 wrote:
       | I would sell a kidney for one of these. It's basically impossible
       | to train language models on a consumer 24GB card. The jump up is
          | the A6000 Ada, at 48GB for $8,000. This one will probably be
       | priced somewhere in the $100k+ range.
        
         | solarmist wrote:
         | You think? It's double 48 GB (per card) so why wouldn't it be
         | in the $20k range?
        
           | ipsum2 wrote:
           | Machine learning is so hyped right now (with good reason) so
           | customers are price insensitive.
        
             | solarmist wrote:
             | I guess we'll see.
        
         | YetAnotherNick wrote:
          | Use 4 consumer-grade 4090s then. It would be much cheaper and
          | better in almost every aspect. Even so, forget about training
          | foundation models: Meta spent 82k GPU hours on the smallest
          | LLaMA and 1M hours on the largest.
        
           | throwaway743 wrote:
           | Go with 2x 3090s instead. 4000 series doesn't support SLI, so
           | you're stuck with the max of whatever one card you get.
        
             | bick_nyers wrote:
              | If I remember correctly, NVLink adds 100GB/s (where PCIe
              | 4.0 is 64GB/s). Is it really worth getting 3090 performance
              | (roughly half of a 4090's) for that extra bus speed?
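              | 
              | Taking those link numbers at face value, a rough worst-case
              | sync cost for naive data parallelism (full fp16 gradients
              | of a 7B model exchanged each step, ignoring overlap with
              | compute):
              | 
              |     grad_gb = 7e9 * 2 / 1e9  # ~14 GB of fp16 gradients
              |     for link, bw in [("NVLink", 100), ("PCIe 4.0", 64)]:
              |         print(f"{link}: ~{grad_gb / bw * 1000:.0f} ms/sync")
              |     # NVLink: ~140 ms/sync
              |     # PCIe 4.0: ~219 ms/sync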
        
       | 0xbadc0de5 wrote:
       | So it's essentially two H100's in a trenchcoat? (plus a
       | sprinkling of "latest")
        
       ___________________________________________________________________
       (page generated 2023-03-21 23:01 UTC)