[HN Gopher] GPUs for Deep Learning in 2023 - An In-depth Analysis
       ___________________________________________________________________
        
       GPUs for Deep Learning in 2023 - An In-depth Analysis
        
       Author : itvision
       Score  : 173 points
       Date   : 2023-01-18 18:48 UTC (4 hours ago)
        
 (HTM) web link (timdettmers.com)
 (TXT) w3m dump (timdettmers.com)
        
       | moneycantbuy wrote:
       | Anyone know if high-RAM Apple silicon such as the 128 GB M1 Ultra
       | is useful for training large models? RAM seems like the limiting
        | factor in DL, so I'm hoping Apple can put some pressure on
        | Nvidia to offer consumer GPUs with more than 24GB of RAM.
        
         | threeseed wrote:
         | It is definitely useful but does not beat the top Nvidia cards.
         | [1]
         | 
         | Would be very interesting to retest with the M2 and also in the
         | months/years to come as the software reaches the level of
         | optimisations we see on the PC side.
         | 
         | [1] https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-
         | Learning...
        
         | carbocation wrote:
         | Not exactly an answer to your question, but from a pytorch
         | standpoint, there are still many operations that are not
         | supported on MPS[1]. I can't recall a circumstance where an
         | architecture I wanted to train was fully supported on MPS, so
          | some of the training ends up happening on the CPU at that point.
         | 
         | 1 = https://github.com/pytorch/pytorch/issues/77764
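          | 
          | A minimal sketch of the workaround (assuming PyTorch >= 1.12
          | with the MPS backend; the env var is the CPU-fallback switch
          | discussed in that issue):
          | 
          |     import os
          |     # Set before importing torch (the usual guidance) so ops
          |     # that MPS doesn't implement fall back to the CPU instead
          |     # of raising NotImplementedError.
          |     os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
          | 
          |     import torch
          | 
          |     use_mps = torch.backends.mps.is_available()
          |     device = torch.device("mps" if use_mps else "cpu")
          |     x = torch.randn(8, 3, 224, 224, device=device)
          |     print(device, x.mean().item())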
        
         | JonathanFly wrote:
          | It's an interesting option. My gut instinct is that if you need
          | 128GB of memory for a giant model, but you don't need much
         | compute - like fine tuning a very large model maybe - you might
         | as well just use a consumer high core CPU and wait 10x as long.
         | 
         | 5950X CPU ($500) with 128GB of memory ($400).
        
           | metadat wrote:
           | CPU-addressable RAM is not interchangeable with graphics card
           | RAM, so unfortunately this strategy isn't quite in the right
           | direction.
           | 
           | AFAIK it flat out won't work with the DL frameworks.
           | 
           | If I'm mistaken please do speak up.
        
         | dwrodri wrote:
         | Having large amounts of unified memory is useful for training
         | large models, but there are two problems with using Apple
         | Silicon for model training:
         | 
         | 1. Apple doesn't provide public interfaces to the ANE or the
         | AMX modules, making it difficult to fully leverage the
         | hardware, even from PyTorch's MPS runtime or Apple's
         | "Tensorflow for macOS" package
         | 
         | 2. Even if we could fully leverage the onboard hardware, even
         | the M1 Ultra isn't going to outpace top-of-the-line consumer-
         | grade Nvidia GPUs (3090Ti/4090) because it doesn't have the
         | compute.
         | 
         | The upcoming Mac Pro is rumored to be preserving the
         | configurability of a workstation [1]. If that's the case, then
         | there's a (slim) chance we might see Apple Silicon GPUs in the
         | future, which would then potentially make it possible to build
         | an Apple Silicon Machine that could compete with an Nvidia
         | workstation.
         | 
         | At the end of the day, Nvidia is winning the software war with
          | CUDA. There are far, far more resources that enable software
          | developers to write compute-intensive code for Nvidia
          | hardware than for any other GPU compute ecosystem out there.
         | 
          | Apple's Metal API, Intel's oneAPI/SYCL, and AMD's ROCm/HIP
          | platform are closing the gap each year, but their success in
          | the ML space depends on how many people they can peel away
          | from the Nvidia Hegemony.
         | 
         | 1:https://www.bloomberg.com/news/newsletters/2022-12-18/when-w.
         | ..
        
           | samspenc wrote:
           | This basically seems to line up with what another tech
            | community (outside of AI / ML) seems to agree on: that Apple
           | hardware is not that great for 3D modeling or animation work.
           | Their M1 and M2 chips rank terribly on the open-source
           | Blender benchmark: https://opendata.blender.org/
           | 
           | Despite Apple engineers contributing hardware support to
           | Blender: https://9to5mac.com/2022/03/09/blender-3-1-update-
           | adds-metal...
           | 
           | So looks like NVidia (and AMD to some extent) are winning the
           | 3D usage war as well for now.
        
         | zone411 wrote:
         | If you are training a model that requires this much memory, it
         | will also require a lot of compute, so it would be too slow and
         | not cost-effective. It may be useful for inference.
        
       | ianbutler wrote:
       | Flow chart was nice but I am not an organization and I like
        | training multi-billion-param models. My next two cards were going
        | to be the RTX 6000 Ada. The memory capacity alone almost
       | makes it necessary.
        
       | leecarraher wrote:
        | Is there more context for: "Well, with the addition of the
        | sparse matrix multiplication feature for Tensor Cores, my
        | algorithm, or other sparse training algorithms, now actually
        | provide speedups of up to 2x during training"? 2x what? If
        | it's 2x over a CPU, it's hardly worth it. If it's 2x per
        | core, that's #cores x 2. I'd imagine the level of sparsity
        | is also important, but there is no mention of it.
        
       | ipsum2 wrote:
       | The graphs ranking GPUs may not be accurate, as they don't
       | represent real-world results and have factual inaccuracies. For
       | example:
       | 
       | > Shown is raw relative performance of GPUs. For example, an RTX
       | 4090 has about 0.33x performance of a H100 SMX for 8-bit
       | inference. In other words, a H100 SMX is three times faster for
       | 8-bit inference compared to a RTX 4090.
       | 
       | RTX 4090 GPUs are not able to use 8-bit inference (fp8 cores)
       | because NVIDIA has not (yet) made the capability available via
       | CUDA.
       | 
       | > 8-bit Inference and training are much more effective on
       | Ada/Hopper GPUs because of Tensor Memory Accelerator (TMA) which
       | saves a lot of registers
       | 
       | Ada does not have TMA, only Hopper does.
       | 
       | If people are interested I can run my own benchmarks on the
       | latest 4090 and compare it to previous generations.
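        | 
        | For reference, a minimal sketch of the kind of timing I'd start
        | from (plain PyTorch, fp16 matmuls only, since fp8 isn't exposed
        | through public CUDA/PyTorch APIs on the 4090 yet; shape and
        | iteration counts are arbitrary):
        | 
        |     import time
        |     import torch
        | 
        |     n = 8192
        |     a = torch.randn(n, n, device="cuda", dtype=torch.float16)
        |     b = torch.randn(n, n, device="cuda", dtype=torch.float16)
        | 
        |     for _ in range(10):        # warm-up
        |         a @ b
        |     torch.cuda.synchronize()
        | 
        |     iters = 100
        |     t0 = time.time()
        |     for _ in range(iters):
        |         a @ b
        |     torch.cuda.synchronize()
        |     dt = (time.time() - t0) / iters
        |     # 2 * n^3 FLOPs per matmul
        |     print(f"{2 * n ** 3 / dt / 1e12:.1f} TFLOPS")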
        
         | varunkmohan wrote:
         | Would be curious to see your benchmarks. Btw, Nvidia will be
         | providing support for fp8 in a future release of CUDA -
         | https://github.com/NVIDIA/TransformerEngine/issues/15
         | 
         | I think TMA may not matter as much for consumer cards given the
         | disproportionate amount of fp32 / int32 compute that they have.
         | 
         | Would be interesting to see how close to theoretical folks are
         | able to get once CUDA support comes through.
        
           | touisteur wrote:
            | Well, ever since I read those papers doing FFTs faster
            | than cuFFT using tensor cores (even though FFT is
            | supposed to be limited by memory bandwidth rather than
            | flops) and getting fp32-level accuracy on convolutions
            | with 3x TF32 tensor-core passes (available in CUTLASS),
            | I'm quite ready to believe some hype about TMA.
           | 
           | Anything improving memory bandwidth for any compute part of
           | the GPU is welcome.
           | 
           | Also I'd like for someone to crack open RT cores and get the
            | ray-triangle intersection acceleration out of OptiX. Have
           | you seen the FLOPS on these things?
        
         | lostmsu wrote:
         | Can you please run bench from
         | https://github.com/karpathy/nanoGPT ?
        
       | moneycantbuy wrote:
       | Is a 4090 practically better than a 3090? I just built a new home
       | DL PC with two 3090s because I knew I could fit them both in the
       | case, whereas with the 4090 it seems more than one could be
        | difficult. Also wondering if I can pool the RAM somehow; NVLink
        | won't work because my 3090s are different sizes, and apparently
        | NVLink doesn't do much more than PCIe anyway.
        
         | shaunsingh0207 wrote:
         | NVLink was designed for exactly that use case, pooling memory.
        
         | JonathanFly wrote:
         | Basically 'it depends' is the answer to all your questions, but
         | dual 3090s is a perfectly fine choice. Though ideally you
         | _would_ have NVLINK since it is an advantage over the 4090. In
         | some specialized situations it is possible to have NVLINK act a
          | lot like 48GB of memory, but if you don't already know if you
         | can leverage NVLINK, you very likely aren't in that situation.
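          | 
          | One common way dual cards can act like more memory, even
          | without NVLink, is just splitting the model across them. A
          | toy sketch of naive model parallelism in PyTorch (made-up
          | layer sizes; activations cross GPUs over plain PCIe):
          | 
          |     import torch
          |     from torch import nn
          | 
          |     # First half of the model on GPU 0, second half on GPU 1.
          |     part1 = nn.Sequential(nn.Linear(4096, 4096),
          |                           nn.ReLU()).to("cuda:0")
          |     part2 = nn.Linear(4096, 10).to("cuda:1")
          | 
          |     x = torch.randn(32, 4096, device="cuda:0")
          |     y = part2(part1(x).to("cuda:1"))  # activations hop GPUs
          |     print(y.shape)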
        
       | lvl102 wrote:
       | RTX 3090/3080 are great value for ML/DL especially at used
       | prices.
        
       | kika wrote:
       | I hope I'm not getting downvoted into complete white on white for
        | asking: Is there a good resource to learn myself a little ML
        | for the greater good, if I'm a complete math idiot? Something
        | very practical: start with {this} and build an image
        | recognizer to tell birds from dogs. Or start with {that} and
        | build an algo trading machine that will make me a
        | trillionaire. I do have a
       | 4090 in a windows machine which I can turn into a linux machine.
        
         | gmac wrote:
         | I found Francois Chollet's _Deep Learning with Python_ to be an
         | excellent intro. https://www.manning.com/books/deep-learning-
         | with-python-seco...
        
         | moneycantbuy wrote:
         | https://course.fast.ai/
        
         | jamessb wrote:
         | If you're interested in Deep Learning specifically, the fast.ai
         | "Practical Deep Learning for coders" course [1] is often
         | recommended. It says "You don't need any university math either
         | -- we'll teach you the calculus and linear algebra you need
         | during the course".
         | 
         | [1]: https://course.fast.ai/
        
           | kika wrote:
           | Thanks! I don't know if it's DL or not, but what I'm mostly
           | interested is "finding patterns". Like I have this very long
           | stream of numbers (or pair of numbers, like coordinates) and
           | I use such streams to train. Then I have a shorter stream and
           | the model infers the rest. Not sure if I'm talking complete
           | nonsense or not :-)
        
         | knolan wrote:
         | Andrew Ng's introductory courses on Coursera are a good way to
         | start.
         | 
          | If you're a bit more comfortable with command-line stuff,
          | the fast.ai course is good.
        
         | mjburgess wrote:
          | YouTube is pretty full of things like "Make a Netflix Clone"
         | 
         | See, then, eg.,
         | 
         | https://www.youtube.com/results?search_query=build+an+image+...
        
           | kika wrote:
           | What I'm looking for is more like Rust Book. Concepts
           | explained with examples of more or less real world problems.
            | Like they don't just tell you there's this awkward thing
            | called "interior mutability", but also why you may need it.
        
         | gmiller123456 wrote:
          | Math isn't an absolute requirement for training an AI; even
         | programming isn't a requirement, as there are quite a few pre-
         | built tools. I would recommend browsing through some of the
         | many, many books available and see which one(s) speak to your
         | current experience level. You won't compete against world class
         | systems without going deeper, but as long as you're willing to
         | start small, you have to start somewhere just to see if it's
         | something you like doing.
        
       | fxtentacle wrote:
        | I find this article odd with its fixation on compute speed
        | and 8-bit.
       | 
       | For most current models, you need 40+ GB of RAM to train them.
       | Gradient accumulation doesn't work with batch norms so you really
       | need that memory.
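        | 
        | (For context, a minimal sketch of plain gradient accumulation,
        | assuming a generic PyTorch model/loader/criterion/optimizer;
        | the catch is that any BatchNorm layer computes its statistics
        | per micro-batch rather than over the full effective batch:)
        | 
        |     # model, loader, criterion, optimizer: assumed to exist
        |     accum_steps = 8   # effective batch = 8 micro-batches
        |     optimizer.zero_grad()
        |     for step, (x, y) in enumerate(loader):
        |         loss = criterion(model(x), y) / accum_steps
        |         loss.backward()   # grads sum across micro-batches
        |         if (step + 1) % accum_steps == 0:
        |             optimizer.step()
        |             optimizer.zero_grad()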
       | 
       | That means either dual 3090/4090 or one of the extra expensive
       | A100/H100 options. Their table suggests the 3080 would be a good
       | deal, but it's not. It doesn't have enough RAM for most problems.
       | 
       | If you can do 8bit inference, don't use a GPU. CPU will be much
       | cheaper and potentially also lower latency.
       | 
       | Also: Almost everyone using GPUs for work will join NVIDIA's
       | Inception program and get rebates... So why look at retail
       | prices?
        
         | lostmsu wrote:
         | > Almost everyone using GPUs for work will join NVIDIA's
         | Inception program and get rebates... So why look at retail
         | prices?
         | 
         | They need to advertise it better. First time I hear about it.
         | 
         | What are the prices like there? GPUs/workstations?
        
         | nonbirithm wrote:
         | The 4090 Ti is rumored to have 48GB of VRAM, so one can only
         | hope.
        
         | meragrin_ wrote:
         | > Almost everyone using GPUs for work will join NVIDIA's
         | Inception program and get rebates... So why look at retail
         | prices?
         | 
         | So maybe they were including information for the
          | hobbyists/students who do not need or cannot afford the
         | latest and greatest professional cards?
        
         | varunkmohan wrote:
         | I'm not sure any of this is accurate. 8 bit inference on a 4090
         | can do 660 Tflops and on an H100 can do 2 Pflops. Not to
         | mention, there is no native support for FP8 (which are
         | significantly better for deep learning) on existing CPUs.
         | 
         | The memory on a 4090 can serve extremely large models.
          | Currently, int4 is starting to be proven out. With 24GB of
         | memory, you can serve 40 billion parameter models. That coupled
         | with the fact that GPU memory bandwidth is significantly higher
         | than CPU memory bandwidth means that CPUs should rarely ever be
         | cheaper / lower latency than GPUs.
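          | 
          | (Back-of-the-envelope on the memory claim, ignoring
          | activations and the KV cache:)
          | 
          |     params = 40e9
          |     bytes_per_param = 0.5      # int4 = 4 bits
          |     gib = params * bytes_per_param / 2 ** 30
          |     print(f"{gib:.1f} GiB")    # ~18.6 GiB, fits in 24 GB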
        
         | ftufek wrote:
         | > Also: Almost everyone using GPUs for work will join NVIDIA's
         | Inception program and get rebates... So why look at retail
         | prices?
         | 
         | Out of curiosity, does that also apply for consumer grade GPUs?
        
           | tambre wrote:
           | I would be surprised if it did. But you probably shouldn't do
           | professional work on GPUs that lack ECC memory.
        
           | chrisMyzel wrote:
            | You can get RTX/A6000s, but not 3090s or 4090s, via Inception.
        
         | mota7 wrote:
         | > Gradient accumulation doesn't work with batch norms so you
         | really need that memory.
         | 
         | Last I looked, very few SOTA models are trained with batch
          | normalization. Most of the LLMs use layer norms, which can
          | be accumulated (precisely because of the need to avoid the
          | memory blowup).
         | 
         | Note also that batch normalization can be done in a memory
         | efficient way: It just requires aggregating the batch
         | statistics outside the gradient aggregation.
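          | 
          | (A toy check of the layer norm point in plain PyTorch:
          | LayerNorm normalizes each sample independently, so splitting
          | a batch into micro-batches gives identical outputs, while
          | BatchNorm's training-mode output depends on which samples
          | share a batch:)
          | 
          |     import torch
          |     torch.manual_seed(0)
          | 
          |     x = torch.randn(8, 16)
          |     ln = torch.nn.LayerNorm(16)
          |     bn = torch.nn.BatchNorm1d(16)
          | 
          |     same_ln = torch.allclose(
          |         ln(x), torch.cat([ln(x[:4]), ln(x[4:])]))
          |     same_bn = torch.allclose(
          |         bn(x), torch.cat([bn(x[:4]), bn(x[4:])]))
          |     print(same_ln, same_bn)   # True, False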
        
         | jszymborski wrote:
         | > It doesn't have enough RAM for most problems.
         | 
         | It might not be as glamorous or make as many headlines, but
          | there is plenty of research that goes on below 40 GB.
         | 
         | While I most commonly use A100s for my research, all my models
         | fit on my personal RTX 2080.
        
         | theLiminator wrote:
         | Anyone know if the GPUs are relatively affordable through
         | Inception?
        
       | binarymax wrote:
        | Thanks to Tim for writing and sharing this awesome analysis. I
        | stumbled upon the previous version this summer, and I'm so
        | glad to see it updated with new hardware. Really great stuff
        | and I learned a ton.
        
       | JayStavis wrote:
       | I would love to see the NVIDIA T4 and A10 in the inference price
       | comparison chart given their prevalence / cloud availability
        
       | xeyownt wrote:
       | All those FPS wasted... Why do we keep calling these chips GPUs?
        
         | camel-cdr wrote:
         | (almost) general processing unit
        
       | asciimike wrote:
       | > What is the carbon footprint of GPUs? How can I use GPUs
       | without polluting the environment?
       | 
       | > I worked on a project that produced carbon offsets about ten
       | years ago. The carbon offsets were generated by burning leaking
       | methane from mines in China. UN officials tracked the process,
       | and they required clean digital data and physical inspections of
       | the project site. In that case, the carbon offsets that were
       | produced were highly reliable. I believe many other projects have
       | similar quality standards.
       | 
        | Crusoe Cloud (https://crusoecloud.com) does the same thing:
        | powering GPUs off otherwise-flared methane (and behind-the-
        | meter renewables) to get carbon-reducing GPUs. A year of A100
        | usage offsets the equivalent emissions of taking a car off
        | the road.
       | 
       | Disclosure: I run product at Crusoe Cloud
        
         | throw_pm23 wrote:
         | > burning methane
         | 
          | probably not the first thing that comes to mind when people
          | hear about carbon offsets. Things have gone further than I
          | thought.
        
           | CrazyStat wrote:
           | Methane that has been burned to convert it to CO2 is much
           | less bad of a greenhouse gas than methane that has been
           | allowed to just float off into the atmosphere.
           | 
           | Perhaps counterintuitive given how much we usually go around
           | saying burning fossil fuels is bad for the environment, but
           | the science is sound.
        
             | asciimike wrote:
              | Methane has an 80x higher GWP (global warming potential)
              | than CO2 over the first 20 years, so ending routine
              | flaring (even if the answer is
             | "complete combustion") is an immensely important problem to
             | help address the effects of climate change. Another source:
             | https://www.edf.org/climate/methane-crucial-opportunity-
             | clim...
             | 
             | Some other info we've published:
             | https://www.crusoeenergy.com/digital-flare-mitigation
        
       | karolist wrote:
        | You can score a 3090 for $800-900 used; with 24GB of VRAM
        | it's superb value.
        
         | humanistbot wrote:
         | But how many of those used 3090s have been run 24/7 by crypto
         | farms?
        
           | kkielhofner wrote:
           | I'll take a retired crypto mining card over a random
           | desktop/gamer card any day.
           | 
           | Crypto cards are almost always run with lower power limits,
           | managed operating temperature ranges, fixed fan speeds,
           | cleaner environments, etc. They're also typically installed,
           | configured, and managed by "professionals". Eventual resell
           | of cards is part of the profit/business model of miners so
           | they're generally much better about care and feeding of the
           | cards, storing boxes/accessories/packing supplies, etc.
           | 
           | Compared to a desktop/gamer card with sporadic usage (more
           | hot/cold cycles), power maxed out for a couple more FPS,
           | running in unknown environmental conditions (temperature
           | control for humans, vape/smoke residue, cat hair, who knows)
           | and installed by anyone.
        
             | smoldesu wrote:
             | Hard disagree, honestly. A mining card might be
             | undervolted, but it will always have lived under sustained
                | VRAM temps of 80C+. That's awful for the lifespan of the
             | GPU (even relative to bursty gaming workloads) and once the
             | memory dies, it's game over for the card. Used GPUs are
             | always a gamble, but mostly because it depends on how used
             | they are. No matter how you slice it, a mining card is more
             | likely to hit the bathtub curve than a gaming one.
        
               | kkielhofner wrote:
               | Source on the VRAM temp issue?
               | 
               | The common lore after the last mining crash (~2018) was
               | to avoid mining cards at all costs. We're far enough
               | along at this point where even most of the people at
               | /r/pcmasterrace no longer subscribe to the "mining card
               | bad" viewpoint - presumably because plenty of people
               | chime in with "bought an ex-mining card in 2018, was
               | skeptical, still running strong".
               | 
               | Time will tell again in this latest post-mining era.
        
               | smoldesu wrote:
               | While you're not wrong, PCMR is basically the single
                | largest hub of survivorship bias on the internet.
               | 
               | There probably are people with functional mining cards,
               | but like I said in my original comment, hardware failure
               | runs along a bathtub curve[0]. The chance of mechanical
               | failure increases proportionally with use; leaving mining
               | GPUs particularly affected. How _strong_ that influence
               | is can be debated, but people are right to highlight an
               | increased failure rate in mining cards.
               | 
               | [0] https://en.wikipedia.org/wiki/Bathtub_curve
        
               | metadat wrote:
               | You know what also contributes to hardware failure?
               | Thermal cycles, because over time they stress the solder
               | joints.
               | 
               | If a component has been powered on and running
                | continuously under load within its rated temperature
               | band, it'll have fewer heat cycles. This seems preferable
               | to a second-hand card from a gamer which gets run at
               | random loads and powered down a lot.
        
             | kika wrote:
             | And also often crazily overclocked
        
               | kkielhofner wrote:
               | Exactly. It's more or less common practice for the
               | average gamer to push the card right up to the point of
               | visual artifacts, desktop freezing, etc.
               | 
               | Basically "go in MSI Afterburner and crank it up until
               | the machine crashes".
        
           | jszymborski wrote:
            | A surprising number are claiming to be Brand New in Box.
           | Perhaps old scalper stock or maybe even resealed crypto
           | cards.
        
         | jszymborski wrote:
         | Depends on your market. I just checked in Montreal, Canada and
         | you can't get anything used for less than 1600 CAD (around 1200
         | USD).
        
           | nightski wrote:
           | Wow that is crazy, Microcenter in the USA had 4090s going for
           | under $1100 USD not too long ago.
        
             | wellthisisgreat wrote:
             | 4090s?? No way
        
           | nanidin wrote:
           | I just checked Canadian eBay recently sold and there were
           | several that went for 1000-1100 CAD in the last week.
        
             | jszymborski wrote:
             | I was looking at local classifieds (Kijiji), but it is true
             | there are plenty of pretty reasonable (~1K CAD) listings on
             | ebay.ca that ship from abroad.
        
       | sebow wrote:
        | I really hope AMD cleans up and invests some money into their
        | software stack. Granted, it's really hard to catch up to
        | Nvidia, but I think it's doable in ~5 years. The barrier to
        | entry into ROCm compared to CUDA is pretty high, even
        | accounting for the fact that things have gotten better in the
        | last few years. AMD has potent hardware for AI/ML (see
        | Instinct); they just don't have it in the consumer space.
        | However, one of the key factors in getting adoption in the
        | consumer space is the aforementioned software stack, which I
        | recall was a pain to set up. The fact that they're going FOSS
        | for ROCm shows promise in this regard.
        
         | karolist wrote:
          | In its current state Nvidia has no competition, and you buy
          | AMD only because you hate Nvidia, not because it's better.
          | I've had tons of AMD cards, my first being an ATI 9000, but
          | if you want to do more than game, the hassle is not worth
          | it.
          | 
          | My last AMD card was a Radeon VII. Once I got a bit serious
          | about Blender there was just no comparison: even with
          | enterprise drivers, the crashes, random issues and
          | comparative slowness were just not worth it. What took me 3
          | minutes to render on the Radeon VII takes 10s on a 3090 Ti,
          | StableDiffusion renders take 5-6s without any of the hassle
          | of playing with ROCm, and gaming is also no comparison with
          | RTX (I don't even use DLSS). Fun fact: I sold my Radeon VII
          | after 3 years and added $500 for a new 3090 Ti.
          | 
          | Nvidia sux for their business practices, but technically
          | they dominate because they invested in software really
          | early on and established themselves with CUDA, OptiX, RTX
          | and DLSS. Older AMD cards are nice for a hackintosh if
          | you're into that, though; Apple dropped Nvidia hard. There's
          | also the Linux driver blob thing if you want to be a purist,
          | but IIRC that's supposedly changing (sorry, don't have a
          | source right now).
        
       ___________________________________________________________________
       (page generated 2023-01-18 23:00 UTC)