[HN Gopher] GPUs for Deep Learning in 2023 - An In-depth Analysis
___________________________________________________________________
GPUs for Deep Learning in 2023 - An In-depth Analysis
Author : itvision
Score : 173 points
Date : 2023-01-18 18:48 UTC (4 hours ago)
(HTM) web link (timdettmers.com)
(TXT) w3m dump (timdettmers.com)
| moneycantbuy wrote:
| Anyone know if high-RAM Apple silicon such as the 128 GB M1 Ultra
| is useful for training large models? RAM seems like the limiting
| factor in DL, so I'm hoping apple can put some pressure on nvidia
| to offer consumer GPUs with more than 24GB RAM.
| threeseed wrote:
| It is definitely useful but does not beat the top Nvidia cards.
| [1]
|
| Would be very interesting to retest with the M2 and also in the
| months/years to come as the software reaches the level of
| optimisations we see on the PC side.
|
| [1] https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-
| Learning...
| carbocation wrote:
| Not exactly an answer to your question, but from a pytorch
| standpoint, there are still many operations that are not
| supported on MPS[1]. I can't recall a circumstance where an
| architecture I wanted to train was fully supported on MPS, so
| some of the training ends up happening on the CPU at that point.
|
| 1 = https://github.com/pytorch/pytorch/issues/77764
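|
| A minimal sketch of the usual workaround, assuming a recent
| PyTorch build with MPS support (the fallback variable has to be
| set before torch is imported; unsupported ops then run on the
| CPU instead of erroring out):
|
|   import os
|   os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
|
|   import torch
|
|   # use the Apple GPU when available, otherwise plain CPU
|   device = torch.device(
|       "mps" if torch.backends.mps.is_available() else "cpu")
|   model = torch.nn.Linear(1024, 1024).to(device)
|   x = torch.randn(8, 1024, device=device)
|   print(model(x).shape, device)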
| JonathanFly wrote:
| It's an interesting option. My gut instinct is that if you need
| 128GB of memory for a giant model but don't need much compute -
| like fine-tuning a very large model, maybe - you might as well
| just use a consumer high-core-count CPU and wait 10x as long.
|
| 5950X CPU ($500) with 128GB of memory ($400).
| metadat wrote:
| CPU-addressable RAM is not interchangeable with graphics card
| RAM, so unfortunately this strategy isn't quite in the right
| direction.
|
| AFAIK it flat out won't work with the DL frameworks.
|
| If I'm mistaken please do speak up.
| dwrodri wrote:
| Having large amounts of unified memory is useful for training
| large models, but there are two problems with using Apple
| Silicon for model training:
|
| 1. Apple doesn't provide public interfaces to the ANE or the
| AMX modules, making it difficult to fully leverage the
| hardware, even from PyTorch's MPS runtime or Apple's
| "Tensorflow for macOS" package
|
| 2. Even if we could fully leverage the onboard hardware, the M1
| Ultra isn't going to outpace top-of-the-line consumer-grade
| Nvidia GPUs (3090 Ti/4090), because it simply doesn't have the
| compute.
|
| The upcoming Mac Pro is rumored to be preserving the
| configurability of a workstation [1]. If that's the case, then
| there's a (slim) chance we might see Apple Silicon GPUs in the
| future, which would then potentially make it possible to build
| an Apple Silicon Machine that could compete with an Nvidia
| workstation.
|
| At the end of the day, Nvidia is winning the software war with
| CUDA. There are far, far more resources that help developers
| write compute-intensive code for Nvidia hardware than for any
| other GPU compute ecosystem out there.
|
| Apple's Metal API, Intel's oneAPI/SYCL, and AMD's ROCm/HIP
| platform are closing the gap each year, but their success in
| the ML space depends on how many people they can peel away from
| the Nvidia hegemony.
|
| 1:https://www.bloomberg.com/news/newsletters/2022-12-18/when-w.
| ..
| samspenc wrote:
| This basically seems to line up with what another tech
| community (outside of AI/ML) agrees on: that Apple hardware is
| not that great for 3D modeling or animation work.
| Their M1 and M2 chips rank terribly on the open-source
| Blender benchmark: https://opendata.blender.org/
|
| Despite Apple engineers contributing hardware support to
| Blender: https://9to5mac.com/2022/03/09/blender-3-1-update-
| adds-metal...
|
| So it looks like Nvidia (and, to some extent, AMD) is winning
| the 3D war as well for now.
| zone411 wrote:
| If you are training a model that requires this much memory, it
| will also require a lot of compute, so it would be too slow and
| not cost-effective. It may be useful for inference.
| ianbutler wrote:
| The flow chart was nice, but I am not an organization and I
| like training multi-billion-parameter models. My next two cards
| were going to be the RTX 6000 Ada. The memory capacity alone
| almost makes it necessary.
| leecarraher wrote:
| Is there more context for: "Well, with the addition of the
| sparse matrix multiplication feature for Tensor Cores, my
| algorithm, or other sparse training algorithms, now actually
| provide speedups of up to 2x during training"? 2x what? If it's
| 2x over a CPU, then it's hardly worth it. 2x per core, so
| #cores x 2? I'd imagine the level of sparsity is also
| important, but there is no mention of it.
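|
| For what it's worth, the feature being referenced is presumably
| the 2:4 structured sparsity of Ampere/Hopper tensor cores (two
| of every four weights zero), where Nvidia's advertised figure
| is up to 2x over the dense tensor core path on the same GPU,
| not over a CPU. A toy sketch of pruning weights to that pattern
| in plain PyTorch (illustration only - the real speedup needs
| the sparse tensor core kernels via cuSPARSELt or similar):
|
|   import torch
|
|   def prune_2_4(w: torch.Tensor) -> torch.Tensor:
|       # keep 2 largest |w| per group of 4, zero the rest
|       rows, cols = w.shape
|       groups = w.reshape(rows, cols // 4, 4)
|       keep = groups.abs().topk(2, dim=-1).indices
|       mask = torch.zeros_like(groups, dtype=torch.bool)
|       mask.scatter_(-1, keep, True)
|       return (groups * mask).reshape(rows, cols)
|
|   w = torch.randn(8, 16)
|   ws = prune_2_4(w)
|   # at most 2 nonzeros in every group of 4
|   assert (ws.reshape(8, -1, 4) != 0).sum(-1).max() <= 2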
| ipsum2 wrote:
| The graphs ranking GPUs may not be accurate, as they don't
| represent real-world results and have factual inaccuracies. For
| example:
|
| > Shown is raw relative performance of GPUs. For example, an RTX
| 4090 has about 0.33x performance of a H100 SMX for 8-bit
| inference. In other words, a H100 SMX is three times faster for
| 8-bit inference compared to a RTX 4090.
|
| RTX 4090 GPUs are not able to use 8-bit inference (fp8 cores)
| because NVIDIA has not (yet) made the capability available via
| CUDA.
|
| > 8-bit Inference and training are much more effective on
| Ada/Hopper GPUs because of Tensor Memory Accelerator (TMA) which
| saves a lot of registers
|
| Ada does not have TMA, only Hopper does.
|
| If people are interested I can run my own benchmarks on the
| latest 4090 and compare it to previous generations.
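|
| For anyone who wants to reproduce that kind of comparison, a
| minimal matmul benchmark sketch (assuming a CUDA build of
| PyTorch; fp16 here, since the 8-bit paths need extra
| libraries):
|
|   import torch
|
|   def bench_matmul(n=8192, dtype=torch.float16, iters=50):
|       a = torch.randn(n, n, device="cuda", dtype=dtype)
|       b = torch.randn(n, n, device="cuda", dtype=dtype)
|       for _ in range(5):      # warm-up
|           a @ b
|       torch.cuda.synchronize()
|       start = torch.cuda.Event(enable_timing=True)
|       end = torch.cuda.Event(enable_timing=True)
|       start.record()
|       for _ in range(iters):
|           a @ b
|       end.record()
|       torch.cuda.synchronize()
|       ms = start.elapsed_time(end) / iters
|       tflops = 2 * n ** 3 / (ms / 1e3) / 1e12
|       print(f"{dtype}: {ms:.2f} ms/matmul, ~{tflops:.0f} TFLOPS")
|
|   bench_matmul()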
| varunkmohan wrote:
| Would be curious to see your benchmarks. Btw, Nvidia will be
| providing support for fp8 in a future release of CUDA -
| https://github.com/NVIDIA/TransformerEngine/issues/15
|
| I think TMA may not matter as much for consumer cards given the
| disproportionate amount of fp32 / int32 compute that they have.
|
| Would be interesting to see how close to the theoretical
| numbers folks are able to get once CUDA support comes through.
| touisteur wrote:
| Well every since I've read these papers doing FFTs faster
| than cuFFT using tensor cores (although FFT isn't supposed to
| be helped by more flops but only better memory bandwidth) and
| also fp32-level accuracy on convolutions with 3x tf32 tensor
| cores sweeps (available in cutlass) I'm quite ready to
| believe some hype about TMA.
|
| Anything improving memory bandwidth for any compute part of
| the GPU is welcome.
|
| Also, I'd like someone to crack open RT cores and get the
| ray-triangle intersection acceleration out of OptiX. Have you
| seen the FLOPS on these things?
| lostmsu wrote:
| Can you please run bench from
| https://github.com/karpathy/nanoGPT ?
| moneycantbuy wrote:
| Is a 4090 practically better than a 3090? I just built a new home
| DL PC with two 3090s because I knew I could fit them both in the
| case, whereas with the 4090 it seems more than one could be
| difficult. Also wondering if I can pool the RAM somehow; NVLink
| won't work because the 3090s are different sizes, and apparently
| NVLink doesn't do much more than PCIe anyway.
| shaunsingh0207 wrote:
| NVLink was designed for exactly that use case, pooling memory.
| JonathanFly wrote:
| Basically 'it depends' is the answer to all your questions, but
| dual 3090s is a perfectly fine choice. Though ideally you
| _would_ have NVLINK since it is an advantage over the 4090. In
| some specialized situations it is possible to have NVLINK act a
| lot like 48GB of memory, but if you don't already know whether
| you can leverage NVLINK, you very likely aren't in that
| situation.
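|
| Even without NVLink you can get at the combined 48GB for big
| models with plain (naive) model parallelism - splitting the
| layers across the two cards and shipping activations between
| them. A minimal sketch, assuming two visible GPUs:
|
|   import torch
|   import torch.nn as nn
|
|   class TwoGPUModel(nn.Module):
|       # naive model parallelism: half the layers on each card
|       def __init__(self):
|           super().__init__()
|           self.part0 = nn.Sequential(
|               nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
|           self.part1 = nn.Sequential(
|               nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")
|
|       def forward(self, x):
|           x = self.part0(x.to("cuda:0"))
|           # activations cross PCIe (or NVLink, if present) here
|           return self.part1(x.to("cuda:1"))
|
|   model = TwoGPUModel()
|   out = model(torch.randn(8, 4096))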
| lvl102 wrote:
| RTX 3090/3080 are great value for ML/DL especially at used
| prices.
| kika wrote:
| I hope I'm not getting downvoted into complete white on white for
| asking: Is there a good resource to learn myself a little ML for
| greater good, if I'm a complete math idiot? Something very
| practical, start with {this} and build an image recognizer to
| tell birds from dogs. Or start with {that} and build an algo
| trading machine that will make me a trillionaire. I do have a
| 4090 in a Windows machine which I can turn into a Linux machine.
| gmac wrote:
| I found Francois Chollet's _Deep Learning with Python_ to be an
| excellent intro. https://www.manning.com/books/deep-learning-
| with-python-seco...
| moneycantbuy wrote:
| https://course.fast.ai/
| jamessb wrote:
| If you're interested in Deep Learning specifically, the fast.ai
| "Practical Deep Learning for coders" course [1] is often
| recommended. It says "You don't need any university math either
| -- we'll teach you the calculus and linear algebra you need
| during the course".
|
| [1]: https://course.fast.ai/
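|
| The first lesson of that course builds something very close to
| the birds-vs-dogs classifier asked about above. A rough sketch
| of what the fastai code looks like (API names from memory, so
| treat it as approximate), assuming the images are sorted into
| data/birds and data/dogs folders:
|
|   from fastai.vision.all import *
|
|   # images laid out as data/birds/*.jpg and data/dogs/*.jpg
|   dls = ImageDataLoaders.from_folder(
|       "data", valid_pct=0.2, item_tfms=Resize(224))
|
|   learn = vision_learner(dls, resnet18, metrics=error_rate)
|   learn.fine_tune(3)   # a few epochs of transfer learning
|
|   pred, _, probs = learn.predict("some_photo.jpg")
|   print(pred, probs)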
| kika wrote:
| Thanks! I don't know if it's DL or not, but what I'm mostly
| interested in is "finding patterns". Like I have this very long
| stream of numbers (or pairs of numbers, like coordinates) and
| I use such streams to train. Then I have a shorter stream and
| the model infers the rest. Not sure if I'm talking complete
| nonsense or not :-)
| knolan wrote:
| Andrew Ng's introductory courses on Coursera are a good way to
| start.
|
| If you're a bit more comfortable with command-line stuff, the
| fast.ai course is good.
| mjburgess wrote:
| Youtube is pretty full with things like "Make a Netflix Clone"
|
| See, then, eg.,
|
| https://www.youtube.com/results?search_query=build+an+image+...
| kika wrote:
| What I'm looking for is more like the Rust Book: concepts
| explained with examples of more or less real-world problems.
| Like they don't just tell you there's this awkward thing called
| "interior mutability"; they also explain why you may need it.
| gmiller123456 wrote:
| Math isn't an absolute requirement for training an AI, even
| programming isn't a requirement, as there are quite a few pre-
| built tools. I would recommend browsing through some of the
| many, many books available and see which one(s) speak to your
| current experience level. You won't compete against world-class
| systems without going deeper, but you have to start somewhere,
| and starting small is a good way to see if it's something you
| like doing.
| fxtentacle wrote:
| I find this article odd, with its fixation on compute speed and
| 8-bit.
|
| For most current models, you need 40+ GB of RAM to train them.
| Gradient accumulation doesn't work with batch norms so you really
| need that memory.
|
| That means either dual 3090/4090 or one of the extra expensive
| A100/H100 options. Their table suggests the 3080 would be a good
| deal, but it's not. It doesn't have enough RAM for most problems.
|
| If you can do 8-bit inference, don't use a GPU. A CPU will be
| much cheaper and potentially also lower latency.
|
| Also: Almost everyone using GPUs for work will join NVIDIA's
| Inception program and get rebates... So why look at retail
| prices?
| lostmsu wrote:
| > Almost everyone using GPUs for work will join NVIDIA's
| Inception program and get rebates... So why look at retail
| prices?
|
| They need to advertise it better; this is the first time I've
| heard of it.
|
| What are the prices like there? GPUs/workstations?
| nonbirithm wrote:
| The 4090 Ti is rumored to have 48GB of VRAM, so one can only
| hope.
| meragrin_ wrote:
| > Almost everyone using GPUs for work will join NVIDIA's
| Inception program and get rebates... So why look at retail
| prices?
|
| So maybe they were including information for the
| hobbyists/students who do not need or cannot afford the
| latest and greatest professional cards?
| varunkmohan wrote:
| I'm not sure any of this is accurate. 8-bit inference on a 4090
| can do 660 TFLOPS and on an H100 can do 2 PFLOPS. Not to
| mention, there is no native support for FP8 (which is
| significantly better for deep learning) on existing CPUs.
|
| The memory on a 4090 can serve extremely large models.
| Currently, int4 is starting to become proven out. With 24GB of
| memory, you can serve 40-billion-parameter models. That,
| coupled with the fact that GPU memory bandwidth is
| significantly higher than CPU memory bandwidth, means that CPUs
| should rarely ever be cheaper / lower latency than GPUs.
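|
| The back-of-the-envelope math for that claim (weights only,
| ignoring activations and KV cache):
|
|   params = 40e9          # 40B parameters
|   bytes_per_param = 0.5  # int4 = 4 bits
|   gib = params * bytes_per_param / 2**30
|   print(f"{gib:.1f} GiB of weights")  # ~18.6 GiB, fits in 24GB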
| ftufek wrote:
| > Also: Almost everyone using GPUs for work will join NVIDIA's
| Inception program and get rebates... So why look at retail
| prices?
|
| Out of curiosity, does that also apply for consumer grade GPUs?
| tambre wrote:
| I would be surprised if it did. But you probably shouldn't do
| professional work on GPUs that lack ECC memory.
| chrisMyzel wrote:
| You can get RTX A6000s, but not 3090s or 4090s, via Inception.
| mota7 wrote:
| > Gradient accumulation doesn't work with batch norms so you
| really need that memory.
|
| Last I looked, very few SOTA models are trained with batch
| normalization. Most of the LLMs use layer norms which can be
| accumulated? (precisely because of the need to avoid the memory
| blowup).
|
| Note also that batch normalization can be done in a memory
| efficient way: It just requires aggregating the batch
| statistics outside the gradient aggregation.
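|
| For reference, the gradient accumulation pattern under
| discussion looks roughly like this in PyTorch. LayerNorm is
| unaffected because it normalizes per sample; BatchNorm only
| ever sees the micro-batch statistics, which is the limitation
| mentioned above:
|
|   import torch
|   import torch.nn as nn
|
|   model = nn.Sequential(
|       nn.Linear(32, 64), nn.LayerNorm(64), nn.Linear(64, 1))
|   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
|   criterion = nn.MSELoss()
|   data = [(torch.randn(4, 32), torch.randn(4, 1))
|           for _ in range(32)]
|
|   accum_steps = 8   # effective batch = 4 * 8 = 32
|   optimizer.zero_grad()
|   for step, (x, y) in enumerate(data):
|       # scale so accumulated grads match one big-batch step
|       loss = criterion(model(x), y) / accum_steps
|       loss.backward()              # grads add up in .grad
|       if (step + 1) % accum_steps == 0:
|           optimizer.step()
|           optimizer.zero_grad()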
| jszymborski wrote:
| > It doesn't have enough RAM for most problems.
|
| It might not be as glamorous or make as many headlines, but
| there is plenty of research that goes on below 40GB.
|
| While I most commonly use A100s for my research, all my models
| fit on my personal RTX 2080.
| theLiminator wrote:
| Anyone know if the GPUs are relatively affordable through
| Inception?
| binarymax wrote:
| Thanks to Tim for writing and sharing this awesome analysis. I
| stumbled upon the previous version this summer, and I'm so glad
| to see it updated with new hardware. Really great stuff and
| I learned a ton.
| JayStavis wrote:
| I would love to see the NVIDIA T4 and A10 in the inference price
| comparison chart given their prevalence / cloud availability
| xeyownt wrote:
| All those FPS wasted... Why do we keep calling these chips GPUs?
| camel-cdr wrote:
| (almost) general processing unit
| asciimike wrote:
| > What is the carbon footprint of GPUs? How can I use GPUs
| without polluting the environment?
|
| > I worked on a project that produced carbon offsets about ten
| years ago. The carbon offsets were generated by burning leaking
| methane from mines in China. UN officials tracked the process,
| and they required clean digital data and physical inspections of
| the project site. In that case, the carbon offsets that were
| produced were highly reliable. I believe many other projects have
| similar quality standards.
|
| Crusoe Cloud (https://crusoecloud.com) does the same thing;
| powering GPUs off otherwise-flared methane (and behind-the-meter
| renewables) to get carbon-reducing GPUs. A year of A100 usage
| offsets the equivalent emissions of taking a car off the road.
|
| Disclosure: I run product at Crusoe Cloud
| throw_pm23 wrote:
| > burning methane
|
| Probably not the first thing that comes to mind when people
| hear about carbon offsets. Things have gone further than I
| thought.
| CrazyStat wrote:
| Methane that has been burned to convert it to CO2 is a much
| less potent greenhouse gas than methane that has been allowed
| to just float off into the atmosphere.
|
| Perhaps counterintuitive given how much we usually go around
| saying burning fossil fuels is bad for the environment, but
| the science is sound.
| asciimike wrote:
| Methane has an 80x higher GWP than CO2 over the first 20
| years, so ending routine flaring (even if the answer is
| "complete combustion") is immensely important to help address
| the effects of climate change. Another source:
| https://www.edf.org/climate/methane-crucial-opportunity-
| clim...
|
| Some other info we've published:
| https://www.crusoeenergy.com/digital-flare-mitigation
| karolist wrote:
| You can score a 3090 for $800-900 used; with 24GB of VRAM it's
| superb value.
| humanistbot wrote:
| But how many of those used 3090s have been run 24/7 by crypto
| farms?
| kkielhofner wrote:
| I'll take a retired crypto mining card over a random
| desktop/gamer card any day.
|
| Crypto cards are almost always run with lower power limits,
| managed operating temperature ranges, fixed fan speeds,
| cleaner environments, etc. They're also typically installed,
| configured, and managed by "professionals". Eventual resell
| of cards is part of the profit/business model of miners so
| they're generally much better about care and feeding of the
| cards, storing boxes/accessories/packing supplies, etc.
|
| Compared to a desktop/gamer card with sporadic usage (more
| hot/cold cycles), power maxed out for a couple more FPS,
| running in unknown environmental conditions (temperature
| control for humans, vape/smoke residue, cat hair, who knows)
| and installed by anyone.
| smoldesu wrote:
| Hard disagree, honestly. A mining card might be
| undervolted, but it will always have lived under sustained
| VRAM temps of 80c+. That's awful for the lifespan of the
| GPU (even relative to bursty gaming workloads) and once the
| memory dies, it's game over for the card. Used GPUs are
| always a gamble, but mostly because it depends on how used
| they are. No matter how you slice it, a mining card is more
| likely to hit the bathtub curve than a gaming one.
| kkielhofner wrote:
| Source on the VRAM temp issue?
|
| The common lore after the last mining crash (~2018) was
| to avoid mining cards at all costs. We're far enough
| along at this point that even most of the people at
| /r/pcmasterrace no longer subscribe to the "mining card
| bad" viewpoint - presumably because plenty of people
| chime in with "bought an ex-mining card in 2018, was
| skeptical, still running strong".
|
| Time will tell again in this latest post-mining era.
| smoldesu wrote:
| While you're not wrong, PCMR is basically the single
| largest hub of survivorship bias on the internet.
|
| There probably are people with functional mining cards,
| but like I said in my original comment, hardware failure
| runs along a bathtub curve[0]. The chance of mechanical
| failure increases proportionally with use, leaving mining
| GPUs particularly affected. How _strong_ that influence
| is can be debated, but people are right to highlight an
| increased failure rate in mining cards.
|
| [0] https://en.wikipedia.org/wiki/Bathtub_curve
| metadat wrote:
| You know what also contributes to hardware failure?
| Thermal cycles, because over time they stress the solder
| joints.
|
| If a component has been powered on and running
| continuously under load within its rated temperature
| band, it'll have fewer heat cycles. This seems preferable
| to a second-hand card from a gamer which gets run at
| random loads and powered down a lot.
| kika wrote:
| And also often crazily overclocked
| kkielhofner wrote:
| Exactly. It's more or less common practice for the
| average gamer to push the card right up to the point of
| visual artifacts, desktop freezing, etc.
|
| Basically "go in MSI Afterburner and crank it up until
| the machine crashes".
| jszymborski wrote:
| A surprising number are claimed to be brand new in box.
| Perhaps old scalper stock, or maybe even resealed crypto
| cards.
| jszymborski wrote:
| Depends on your market. I just checked in Montreal, Canada and
| you can't get anything used for less than 1600 CAD (around 1200
| USD).
| nightski wrote:
| Wow that is crazy, Microcenter in the USA had 4090s going for
| under $1100 USD not too long ago.
| wellthisisgreat wrote:
| 4090s?? No way
| nanidin wrote:
| I just checked Canadian eBay recently sold and there were
| several that went for 1000-1100 CAD in the last week.
| jszymborski wrote:
| I was looking at local classifieds (Kijiji), but it is true
| there are plenty of pretty reasonable (~1K CAD) listings on
| ebay.ca that ship from abroad.
| sebow wrote:
| I really hope AMD cleans up and invests some money in their
| software stack. Granted, it's really hard to catch up to
| Nvidia, but I think it's doable in ~5 years. The barrier to
| entry for ROCm compared to CUDA is pretty high, even accounting
| for the fact that things have gotten better in recent years.
| AMD has potent hardware for AI/ML (see Instinct); they just
| don't have it in the consumer space. However, one of the key
| factors in getting adoption in the consumer space is the
| aforementioned software stack, which I recall was a pain to set
| up. The fact that they're going FOSS for ROCm shows promise in
| this regard.
| karolist wrote:
| In its current state Nvidia has no competition, and you buy AMD
| only because you hate Nvidia, not because it's better. I've had
| tons of AMD cards, my first being an ATI 9000, but if you want
| to do more than game, the hassle is not worth it.
|
| My last AMD card was a Radeon VII; once I got a bit serious
| about Blender there was just no comparison - even with the
| enterprise drivers, the crashes, random issues, and comparative
| slowness are just not worth it. What took 3 minutes to render
| on the Radeon VII takes 10s on a 3090 Ti, Stable Diffusion
| renders take 5-6s without any of the hassle of fiddling with
| ROCm, and gaming is also no comparison with RTX (I don't even
| use DLSS). Fun fact: I sold my Radeon VII after 3 years and
| added $500 for a new 3090 Ti.
|
| Nvidia sucks for their business practices, but technically they
| dominate because they invested in software really early on and
| established themselves with CUDA, OptiX, RTX, and DLSS. Older
| AMD cards are nice for a hackintosh if you're into that,
| though - Apple dropped Nvidia hard. There's also the Linux
| driver blob thing if you want to be a purist, but IIRC that's
| supposedly changing (sorry, I don't have a source right now).
___________________________________________________________________
(page generated 2023-01-18 23:00 UTC)