[HN Gopher] Reasonable GPUs
       ___________________________________________________________________
        
       Reasonable GPUs
        
        What is the status of GPUs for general compute? Last I looked,
        NVidia worked well, and AMD was horrible. Right now, it looks like
        the major limiting factor (if you don't care about a ~3x
        difference in performance, which I don't) is RAM. More is better,
        and good models need >10GB, while LLMs can be up to 350GB.

        * Intel Arc A770 has 16GB for <$300. I have no idea about
          compatibility with Hugging Face, Blender, etc.

        * NVidia 4060 Ti has 16GB for <$500. 100% compatible with
          everything.

        * Older NVidia (e.g. Pascal era) can be had with 24GB for <$300
          used, without a graphics port. Not clear how CUDA compute
          capability lines up with what's needed for modern tools, or how
          well things work without a graphics port.

        * Several cards may or may not work together. I'm not sure.

        Is there any way to figure this stuff out, and what's reasonable /
        practical / easy? Something which explains CUDA compute levels,
        vendor compatibility, multi-card compatibility, and all that jazz.
        It'd be nice to have a guide generic enough to cover both pro and
        amateur use, e.g.:

        - A770 x21, if someone got it working, could handle Facebook's
          OPT-175B for <$10k via Alpa. That brings it into "rich hobbyist"
          or "justifiable business expense" range. Not clear if that's
          practical.

        - Kids learning AI would be much easier if the hardware were
          cheaper (e.g. A770).

        - "General compute" also includes things like Blender or
          accelerating rendering in kdenlive, etc.

        - Etc.

        This stuff is getting useful to a broader and broader audience,
        but it's confusing.
        
       Author : frognumber
       Score  : 57 points
       Date   : 2023-11-26 16:26 UTC (6 hours ago)
        
       | pizza wrote:
        | nvidia-specific, best write-up I know:
       | https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
       | 
       | Across vendors, generally, Nvidia still dominates currently.
       | People are adding more support into ML libraries for other
       | vendors via (second-class imo) alternate backends but expect to
       | be patient if you're waiting for the day when there is healthy
       | competition.
       | 
        | IMO: if you can save up for it, get a 4090; if you can save up
        | for half a 4090, get a 3090 - I've seen many going for $600-800
        | now. If you can save up for half a 3090, I'm not sure - it
        | depends on whether you prefer speed or VRAM. If it were me, I'd
        | pick more VRAM first.
       | 
       | re: compute capability, you can see here:
       | 
       | - which GPUs have what cc: https://developer.nvidia.com/cuda-gpus
       | 
       | - what cc comes with what features:
       | https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
       | 
        | I think the main qualitative change (beyond bigger numbers in the
        | spec) for an end user of machine learning libraries from 8.6 ->
        | 8.9 (ie 3090 -> 4090) is this line:
       | 
       | > 4 mixed-precision Fourth-Generation Tensor Cores supporting
       | fp8, fp16, __nv_bfloat16, tf32, sub-byte and fp64 for compute
       | capability 8.9 (see Warp matrix functions for details)
       | 
        | ie the new precisions will be built into eg pytorch with hw-
        | level/tensor core support
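        | 
        | If you want to see what your own card reports, here's a minimal
        | sketch in pytorch (which features you actually get from a given
        | cc still depends on your torch/CUDA build):
        | 
        |     import torch
        | 
        |     # compute capability as a (major, minor) tuple,
        |     # e.g. (8, 6) for a 3090 or (8, 9) for a 4090
        |     major, minor = torch.cuda.get_device_capability(0)
        |     print(torch.cuda.get_device_name(0))
        |     print("compute capability:", f"{major}.{minor}")
        | 
        |     # bf16 needs Ampere (cc >= 8.0) or newer
        |     print("bf16 supported:", torch.cuda.is_bf16_supported())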
       | 
       | edit: btw you probably ought to stick to a consumer gpu (ie not
       | professional) if you want it to be generally versatile while also
       | easy to use at home.
        
         | smoldesu wrote:
         | This. If you insist on being as cheap as possible, shoot for a
         | 12gb card but be aware that you'll be missing out on the
         | throughput of higher-end models. The 3060 is popular for this I
         | think, but you'll probably want a better card with more CUDA
         | cores to max out performance.
         | 
         | Cards like the A770 are awesome, but barely even support raster
         | drivers on DirectX. Your best bang-for-buck options are going
         | to be Nvidia-only for now, with a few competing AMD cards that
         | have fast-tracked Pytorch support.
        
           | matthewaveryusa wrote:
            | I purchased a 3060 specifically for the 12GB of memory last
            | November and I've been able to run llama, alpaca, and stable
            | diffusion out of the box without ever hitting memory issues.
            | Training is usually overnight, a stable diffusion image will
            | render in ~5 seconds, and llama will do 20 tokens/second.
           | 
           | I would say start with the 3060 for 250 bucks, and if you're
           | still loving it after a couple months, drop 10x more on a
           | quadro.
           | 
            | My only word of advice is to get Docker set up and install
            | the nvidia container toolkit to pass your gpu through to
            | docker containers -- the package management for all these
            | python ai tools is a hell-scape, especially if you want to
            | try a bunch of different things.
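            | 
            | For what it's worth, once the toolkit is in place, something
            | like `docker run --gpus all ...` on a CUDA base image passes
            | the card through, and a quick pytorch sanity check from
            | inside the container is:
            | 
            |     import torch
            | 
            |     # confirms the GPU was actually passed through to the
            |     # container and is visible to pytorch
            |     print("CUDA available:", torch.cuda.is_available())
            |     if torch.cuda.is_available():
            |         print("Device:", torch.cuda.get_device_name(0))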
        
         | frognumber wrote:
         | Thank you. This is super-helpful.
         | 
         | > re: compute capability, you can see here:
         | 
         | My key question is much more pragmatic:
         | 
         | 1) If I grab a random model from Hugging Face, will it
         | accelerate?
         | 
         | 2) If I run Blender, kdenlive, or DaVinci Resolve, will it
         | accelerate?
         | 
         | Is there a line where things break?
         | 
         | I definitely prefer more VRAM to more speed. As an occasional
         | user, speed doesn't really matter. Things working does.
        
           | smoldesu wrote:
           | > If I grab a random model from Hugging Face, will it
           | accelerate?
           | 
           | Probably, it depends more on how you configure the
           | inferencing software. Most software that supports
           | acceleration starts with CUDA or CUBLAS, so you should be
           | good.
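            | 
            | For a typical transformers model it mostly comes down to
            | telling it where to load - a rough sketch (the model name is
            | just a placeholder, and device_map="auto" assumes the
            | accelerate package is installed):
            | 
            |     import torch
            |     from transformers import (
            |         AutoModelForCausalLM, AutoTokenizer)
            | 
            |     model_id = "some-org/some-7b-model"  # placeholder
            |     tok = AutoTokenizer.from_pretrained(model_id)
            | 
            |     # device_map="auto" spreads the weights across available
            |     # GPUs (spilling to CPU RAM if needed); fp16 halves the
            |     # VRAM footprint vs the default fp32
            |     model = AutoModelForCausalLM.from_pretrained(
            |         model_id, device_map="auto",
            |         torch_dtype=torch.float16)
            | 
            |     inputs = tok("Hello", return_tensors="pt")
            |     inputs = inputs.to(model.device)
            |     out = model.generate(**inputs, max_new_tokens=20)
            |     print(tok.decode(out[0]))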
           | 
           | > If I run Blender, kdenlive, or DaVinci Resolve, will it
           | accelerate?
           | 
            | Yep. If you're running Linux, some distros might be a little
            | iffy about shipping the proprietary/accelerated versions of
            | this software, but most are fine. If you do encounter any
            | issues, the Flatpak versions should all have Nvidia
            | acceleration working out of the box.
           | 
           | > Is there a line where things break?
           | 
           | Yes, but you can avoid it by choosing smaller quantizations
           | and giving yourself a few gigs of VRAM headroom. In my
           | experience, it's always better to select a model smaller than
           | you need so you're not risking an OOM crash (I've got a
           | 3070ti).
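            | 
            | A rough way to check your headroom before loading anything
            | (the numbers are only illustrative):
            | 
            |     import torch
            | 
            |     # free vs total VRAM on the first GPU, in GiB
            |     free, total = torch.cuda.mem_get_info(0)
            |     print(free / 2**30, "GiB free of", total / 2**30)
            | 
            |     # rule of thumb: weights need roughly params *
            |     # bytes-per-weight, plus a few GiB of headroom for
            |     # activations / KV cache; e.g. 7B at ~4-bit:
            |     print(7e9 * 0.5 / 2**30, "GiB of weights")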
           | 
           | Lotta other great advice in this thread, though! Good luck
           | picking something out.
        
         | yread wrote:
          | What about the 4060 Ti 16GB? It was released after this guide,
          | costs ~500 EUR, and is a bit faster and newer (and a lot more
          | efficient) than a 3060.
        
       | ilaksh wrote:
       | I would really like a "reasonable" monthly price for a VPS with a
       | GPU. Even a consumer card like a 3090.
       | 
        | vast.ai has the best prices I have seen, but it comes with
        | limitations, and it's still not the best deal for an entire
        | month.
        
         | cma wrote:
         | Nvidia has a datacenter tax in their driver terms of use, so
          | you won't find consumer-card VPSes at consumer-like prices.
        
         | xcv123 wrote:
         | Given the cost of the hardware and power and everything else
         | required to run it and support it, how can you say that 20
         | cents an hour is unreasonable? There's very little profit
         | margin there. At this price it would take roughly a year for
         | them to make a profit. If you need continuous usage at the
         | lowest price then you need to buy a GPU on ebay.
        
         | tommy_axle wrote:
          | Checked out vast.ai, but you can get down to ~$0.34/hr at
          | Runpod depending on how much VRAM you need.
        
         | theyinwhy wrote:
          | Most offers are fine; the A100 I rented, however, was a scam:
          | advertised as an A100, performing like a 1080. I guess the
          | seller partitioned the card or rigged the ID. You can report
          | fraud like this on their page, but only while you are renting.
        
           | buildbot wrote:
           | I would hope vast.ai would be able to detect MIG at least.
           | 
            | It could also be a low power cap - I had a Dell C4140 for a
            | bit with 220V power supplies on 120V power, which basically
            | locked the entire thing to ~50% of the max power cap per GPU.
        
       | startupsfail wrote:
       | Older generation RTX 8000 / 48GB are reasonable.
       | 
        | One big disadvantage of the older Turing cards: no bfloat16. But
        | if you run a quantized/mixed-precision model or QLoRA, it doesn't
        | hurt as much.
        
       | diffeomorphism wrote:
        | Is shared RAM at all useful? For instance, a mini PC with an AMD
        | 7940HS chip and 64GB of DDR5 RAM costs about 800 EUR. At less
        | than half the price of just a GPU, I am not expecting any great
        | results, but is it usable?
        
         | pella wrote:
          | As far as I know, only 16GB of VRAM is allowed on the 7940HS
          | (via BIOS), so you can probably expect results similar to
          | these:
         | 
         | https://old.reddit.com/r/Amd/comments/15t0lsm/i_turned_a_95_...
         | 
         | HN https://news.ycombinator.com/item?id=37162762
        
         | maerF0x0 wrote:
         | I was curious about this topic too because M3 macs have
         | "Unified memory" which is shared amongst their CPU/GPUs. Anyone
         | have a link or explanation of how this works?
        
           | treesciencebot wrote:
            | One of the main bottlenecks for inference is memory bandwidth
           | (esp when dealing with huge models, like SD/SDXL) and for
           | that, nothing I know of comes close to matching memory speeds
           | on Apple Silicon (up to 400GB/s).
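            | 
            | Back of the envelope: for single-stream decoding you read
            | every weight once per token, so bandwidth divided by model
            | size gives a rough ceiling on tokens/s (illustrative numbers
            | only):
            | 
            |     # rough upper bound on decode speed: every weight is
            |     # read once per generated token
            |     def max_tokens_per_s(bandwidth_gb_s, model_gb):
            |         return bandwidth_gb_s / model_gb
            | 
            |     # a 70B model at ~4-bit is roughly 35 GB of weights
            |     print(max_tokens_per_s(400, 35))   # ~11 tok/s ceiling
            |     print(max_tokens_per_s(1000, 35))  # ~28 tok/s (4090-ish)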
        
             | svnt wrote:
             | You can run very large models on a Macbook M2 with 96 GB.
             | They run 1/3 to 1/4 slower in tokens/s than the faster
             | hardware, but they fit in memory.
             | 
             | (400 GB/s is a lot in the form factor but the 4090 and
             | equivalent have 1 TB/s, and H100s several times that)
             | 
              | Edit: Here someone asked the same question:
              | https://www.reddit.com/r/LocalLLaMA/comments/14319ra/rtx_409...
        
       | pella wrote:
       | Raja Koduri - 2023-11-25 -
       | https://twitter.com/RajaXg/status/1728465097243406482
       | 
        |  _"Very encouraging to see the steady increase of viable
       | hardware options that can handle various AI models.
       | 
       | At the beginning of the year, there was only one practical option
       | - nVidia. Now we see at-least 3 vendors providing reasonable
       | options. Apple, AMD and Intel. We have been profiling several
       | options and I will share some of our findings here.
       | 
       | The good stuff
       | 
       | - Apple Macs were a pleasant surprise on how easy it is to get
       | various models running
       | 
       | - AMD also made impressive progress with PyTorch and a lot more
       | models run now than even 4-5 months ago on MI2XX and Radeon
       | 
       | - We tried both Intel Arc and Ponte Vecchio and they were able to
       | execute everything we have thrown at them.
       | 
       | - Intel Gaudi has very impressive performance on the models that
       | work on that architecture. It's our current best option for LLM
       | inference on select models.
       | 
       | - Ponte Vecchio surprised us with its performance on our custom
       | face swap model, beating everyone including the mighty H100. We
       | suspect that our model may be fitting largely in the Rambo cache.
       | 
       | The wishlist
       | 
       | - For training and inference of large models that don't fit in
       | memory - nVidia is still the only practical option. Wishing that
       | there are more options in 2024 here
       | 
       | - While compatibility is getting better, a ton of performance is
       | still left on the table on Apple, AMD and Intel. Wishing that
       | software will keep getting better and increase their HW
       | utilization. There is still room on compatibility as well,
       | particularly with supporting various encodings and model
       | parameter sizes on AMD.
       | 
       | - Intel Gaudi looks very promising performance-wise and wishing
       | that more models seamlessly work out of the box without Intel
       | intervention.
       | 
       | - Wishing that both AMD and Intel release new gaming GPUs with
       | more memory capacity and bandwidth.
       | 
       | - Wishing that Intel releases a PVC kicker with more memory
       | capacity and bandwidth. Currently it's the best option we have to
       | bring our artists workflow with face swap training from 3-days to
       | a few hours. It scales linearly from 1-GPU to 16-GPUs.
       | 
       | - Wishing Intel support for PyTorch is as frictionless as AMD and
       | nVdia. May be Intel should consider supporting PyTorch RocM or
       | up-stream OneAPI support under CUDA device.
       | 
       | really grateful to all vendors for providing access to hardware
       | and developer support.
       | 
       | Looking forward to continue filling our data center with
        | interesting mix of architectures."_
        
       | noughtme wrote:
        | If I just need a GPU for learning purposes, are 2x GPUs
        | necessary? Would a single 24GB GPU significantly bottleneck
        | training with any publicly available datasets? I just need
        | something faster than my laptop; if it takes twice as long,
        | that's not really an issue.
        
         | buildbot wrote:
         | Training what on what - Resnet50 on imagenet? Yeah sure a
         | single 4090 is fine. Will take a bit.
         | 
         | A 1.5B parameter LLM? That's a few weeks with 64 V100s - on a
         | small dataset.
         | 
          | Training something Llama 7B class (not using LoRA)? Weeks with
          | the same number of A100s.
         | 
          | With LoRA? Back to a single 4090 - depending on your dataset.
         | It still might take weeks to go through 2000 examples for
         | finetuning with a large context size.
        
       | buildbot wrote:
        | An eBay 3090, or a new 4090 at $1599 (Founders Edition or
        | Gigabyte Windforce V2), is the best price/performance/ease of use
        | in my opinion.
       | 
        | AMD is still too funky for most. I have an Mi60 that won't load
        | drivers due to some missing PSP (platform security processor)
        | firmware on the GPU...
        
       | uniqueuid wrote:
        | To be honest, the best idea for most people is probably to buy
        | whatever GPU you can easily afford and then rent a big-iron GPU
        | when you need one.
       | 
       | There is almost no way you will make back the $5k for a 40GB+ ram
       | card, so just save yourself all the hassle and go for something
       | that ticks all the rest of your boxes.
       | 
       | Non-CUDA cards may be ok if you have very simple requirements,
       | but I'd expect many hours of debugging if you want something
       | that's not ready to go out of the box.
        
         | civilitty wrote:
          | I agree. Availability is a pain in the ass, which might be a
          | dealbreaker for urgent interactive use cases, but a 48GB A6000
         | on LambdaLabs is $0.80/hr [1]. A newer 80GB H100 is $1.99/hr so
         | especially if you're trying to do batch processing and can
         | script a bot to wait for availability, it's often a much better
         | option.
         | 
         | With that aforementioned A6000 ($5k retail) you'd have to use
         | it for at least _six thousand_ hours to break even on the cloud
         | cost.
         | 
         | [1] https://lambdalabs.com/service/gpu-cloud#pricing
        
           | buildbot wrote:
           | That seems like a lot, but that's only ~8 months of usage. If
           | you are doing consistent work with large models, or plan to
           | for over a year, then it makes sense to at least have some
           | hardware.
           | 
            | Something people forget too is that if you have no Nvidia
            | GPUs at all locally, you'll need to spend a significant
            | amount of time installing a new node, copying data, and
            | debugging in your cloud instance, each time you want to do
            | something, while being charged for it. Being able to develop
            | locally and then scale to the cloud once something smaller-
            | scale is working is a pretty big boost in terms of my time.
        
             | uniqueuid wrote:
             | I agree especially with the second argument.
             | 
              | But most people who toy with LLMs will probably never make
              | money out of them. Even those who do will often spend a lot
              | of time getting their bearings, during which the GPU sits
              | idle. Then you begin to ramp up your use, but by then
              | there's a new generation of GPUs out.
             | 
             | That's why my recommendation is to start with something
             | lightweight.
             | 
             | It's also much less frustrating to start working for a few
             | hours on a rented A100 rather than running into OOMs all
             | the time while fine-tuning batch sizes and waiting for the
             | nth highly quantized model to download.
        
       | qwertyforce wrote:
        | A used 3090 is the best option imo.
        
       | jszymborski wrote:
       | It really depends on what you're trying to do.
       | 
       | This is sorta _the_ guide on GPUs for DL and has a great decision
       | tree https://timdettmers.com/2023/01/30/which-gpu-for-deep-
       | learni...
       | 
       | Personally, I'm limited to an RTX 2080 for my personal projects
       | at the moment, and I find the constraint pretty rewarding. It
       | forces me to find alternatives to the huge models, and you'd be
        | surprised what you can eke out when you pour in the time to tweak
        | models. Of course, good data is also paramount.
        
       | ChrisArchitect wrote:
       | Ask HN:
        
       | Avlin67 wrote:
        | > Last I looked, NVidia worked well, and AMD was horrible.
        | 
        | Are you sure?
        
       ___________________________________________________________________
       (page generated 2023-11-26 23:01 UTC)