[HN Gopher] Reasonable GPUs
___________________________________________________________________
Reasonable GPUs
What is the status of GPUs for general compute? Last I looked,
NVidia worked well, and AMD was horrible. Right now, it looks like
the major limiting factor (if you don't care about a ~3x difference
in performance, which I don't) is RAM. More is better, and good
models need >10GB, while LLMs can be up to 350GB.

* Intel Arc A770 has 16GB for <$300. I have no idea about
  compatibility with Hugging Face, Blender, etc.

* NVidia 4060 has 16GB for <$500. 100% compatible with everything.

* Older NVidia (e.g. Pascal era) can be had with 24GB for <$300
  used, without a graphics port. Not clear how CUDA compute
  capability lines up to what's needed for modern tools, or how
  well things work without a graphics port.

* Several cards may or may not work together. I'm not sure.

Is there any way to figure this stuff out, and what's reasonable /
practical / easy? Something which explains CUDA compute levels,
vendor compatibility, multi-card compatibility, and all that jazz.
It'd be nice to have a generic enough guide to understand both pro
and amateur use, e.g.:

- A770 x21, if someone got it working, could handle Facebook's
  OPT-175 for <$10k via Alpa. That brings it into "rich hobbyist"
  or "justifiable business expense" range. Not clear if that's
  practical (rough arithmetic below).

- Kids learning AI would be much easier if it's cheaper (e.g. A770)

- "General compute" also includes things like Blender or
  accelerating rendering in kdenlive, etc.

- Etc.

This stuff is getting useful to a broader and broader audience, but
it's confusing.
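
(Rough arithmetic behind the 350GB and ~21-card figures -- a sketch
only, assuming 2 bytes per parameter at fp16 and ignoring activation
and KV-cache overhead:)

      params = 175e9                  # OPT-175B
      gb = params * 2 / 1e9           # ~350 GB of weights at fp16
      cards = gb / 16                 # 16 GB per A770
      print(gb, cards)                # 350.0, 21.875 -> ~22 cards' worth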
Author : frognumber
Score : 57 points
Date : 2023-11-26 16:26 UTC (6 hours ago)
| pizza wrote:
| nvidia specific, best write up I know:
| https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
|
| Across vendors, generally, Nvidia still dominates currently.
| People are adding more support into ML libraries for other
| vendors via (second-class imo) alternate backends but expect to
| be patient if you're waiting for the day when there is healthy
| competition.
|
| IMO, I'd say: if you can save up for it, get a 4090; if you can
| save up for half a 4090, get a 3090 - seen many going for 600-800
| now. If you can save up for half a 3090, I'm not sure - depends
| on if you prefer speed or VRAM. If it were me, I'd pick more VRAM
| first.
|
| re: compute capability, you can see here:
|
| - which GPUs have what cc: https://developer.nvidia.com/cuda-gpus
|
| - what cc comes with what features:
| https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
|
| I think the main qualitative change (beyond bigger numbers in the
| spec) for an end user of machine learning libraries from 8.6 ->
| 8.9 (ie 3090 -> 4090) is this line:
|
| > 4 mixed-precision Fourth-Generation Tensor Cores supporting
| fp8, fp16, __nv_bfloat16, tf32, sub-byte and fp64 for compute
| capability 8.9 (see Warp matrix functions for details)
|
| ie new precisions will be builtin to eg pytorch with hw-
| level/tensor core support
|
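| A quick way to see what a given card reports, and whether the
| newer precisions are available (a minimal sketch, assuming a
| CUDA build of PyTorch is installed):
|
|       import torch
|       print(torch.cuda.get_device_name(0))
|       print(torch.cuda.get_device_capability(0))  # (8, 6) = 3090, (8, 9) = 4090
|       print(torch.cuda.is_bf16_supported())       # True on Ampere and newer
|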
| edit: btw you probably ought to stick to a consumer gpu (ie not
| professional) if you want it to be generally versatile while also
| easy to use at home.
| smoldesu wrote:
| This. If you insist on being as cheap as possible, shoot for a
| 12gb card but be aware that you'll be missing out on the
| throughput of higher-end models. The 3060 is popular for this I
| think, but you'll probably want a better card with more CUDA
| cores to max out performance.
|
| Cards like the A770 are awesome, but barely even support raster
| drivers on DirectX. Your best bang-for-buck options are going
| to be Nvidia-only for now, with a few competing AMD cards that
| have fast-tracked Pytorch support.
| matthewaveryusa wrote:
| I purchased a 3060 specifically for the 12GB of memory last
| November, and I've been able to run llama, alpaca and Stable
| Diffusion out of the box without ever hitting memory issues.
| Training is usually overnight; Stable Diffusion renders an image
| in ~5 seconds, and llama does about 20 tokens/second.
|
| I would say start with the 3060 for 250 bucks, and if you're
| still loving it after a couple months, drop 10x more on a
| quadro.
|
| My only word of advice is to get Docker set up and install the
| NVIDIA container toolkit to pass your GPU through to Docker
| images -- the package management for all these Python AI
| tools is a hellscape, especially if you want to try a bunch
| of different things.
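|
| For the passthrough part, a rough sketch via the docker-py SDK
| (the Python equivalent of "docker run --gpus all"; assumes the
| NVIDIA container toolkit is installed on the host, and the image
| tag is only an example):
|
|       import docker
|       client = docker.from_env()
|       out = client.containers.run(
|           "nvidia/cuda:12.2.0-base-ubuntu22.04",
|           "nvidia-smi",
|           device_requests=[docker.types.DeviceRequest(
|               count=-1, capabilities=[["gpu"]])],
|           remove=True,
|       )
|       print(out.decode())  # should list the GPU if passthrough works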
| frognumber wrote:
| Thank you. This is super-helpful.
|
| > re: compute capability, you can see here:
|
| My key question is much more pragmatic:
|
| 1) If I grab a random model from Hugging Face, will it
| accelerate?
|
| 2) If I run Blender, kdenlive, or DaVinci Resolve, will it
| accelerate?
|
| Is there a line where things break?
|
| I definitely prefer more VRAM to more speed. As an occasional
| user, speed doesn't really matter. Things working does.
| smoldesu wrote:
| > If I grab a random model from Hugging Face, will it
| accelerate?
|
| Probably; it depends more on how you configure the
| inference software. Most software that supports
| acceleration starts with CUDA or cuBLAS, so you should be
| good.
|
| > If I run Blender, kdenlive, or DaVinci Resolve, will it
| accelerate?
|
| Yep. If you're running Linux, some distros might be a little
| iffy about shipping the proprietary/accelerated versions of
| this software, but most are fine. If you do encounter any
| issues, the Flatpak versions should all have Nvidia
| acceleration working out of the box.
|
| > Is there a line where things break?
|
| Yes, but you can avoid it by choosing smaller quantizations
| and giving yourself a few gigs of VRAM headroom. In my
| experience, it's always better to select a model smaller than
| you need so you're not risking an OOM crash (I've got a
| 3070ti).
|
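| In transformers terms, the happy path looks roughly like this (a
| sketch only; the model name is just an example, and it assumes
| transformers, accelerate and a CUDA build of PyTorch):
|
|       import torch
|       from transformers import AutoModelForCausalLM, AutoTokenizer
|       name = "facebook/opt-1.3b"
|       tok = AutoTokenizer.from_pretrained(name)
|       model = AutoModelForCausalLM.from_pretrained(
|           name, torch_dtype=torch.float16, device_map="auto")
|       inputs = tok("Hello", return_tensors="pt").to(model.device)
|       out = model.generate(**inputs, max_new_tokens=20)
|       print(tok.decode(out[0]))
|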
| Lotta other great advice in this thread, though! Good luck
| picking something out.
| yread wrote:
| What about the 4060 Ti 16GB? It was released after this guide,
| costs ~500EUR, and is a bit faster and newer (and a lot more
| efficient) than a 3060.
| ilaksh wrote:
| I would really like a "reasonable" monthly price for a VPS with a
| GPU. Even a consumer card like a 3090.
|
| vast.ai has the best prices I have seen, but it comes with
| limitations, and it's still not the best deal for an entire month.
| cma wrote:
| Nvidia has a datacenter tax in their driver terms of use, so
| you won't find consumer-card VPSes at consumer-like prices.
| xcv123 wrote:
| Given the cost of the hardware and power and everything else
| required to run it and support it, how can you say that 20
| cents an hour is unreasonable? There's very little profit
| margin there. At this price it would take roughly a year for
| them to make a profit. If you need continuous usage at the
| lowest price then you need to buy a GPU on ebay.
| tommy_axle wrote:
| Checked out vast.ai, but you can get down to ~$0.34/hr at
| Runpod depending on how much VRAM you need.
| theyinwhy wrote:
| Most offers are fine; the A100 I rented was a scam, however. A
| scam in the sense that it was advertised as an A100 but performed
| like a 1080. I guess the seller partitioned the card or rigged
| the ID. You can report fraud like this on their page, but only
| while you are renting.
| buildbot wrote:
| I would hope vast.ai would be able to detect MIG at least.
|
| It could also be a low power cap - I had a Dell C4140 for a
| bit with 220V power supplies and 120V power, locking the
| entire thing to ~50% of the max power cap per GPU basically.
| startupsfail wrote:
| Older generation RTX 8000 / 48GB are reasonable.
|
| One big disadvantage of the older Turing cards is no bfloat16.
| But if you run a quantized/mixed-precision model or QLoRA, it
| doesn't hurt as much.
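|
| For example, a 4-bit (QLoRA-style) load can use fp16 as the
| compute dtype instead of bf16 on pre-Ampere cards. A sketch only
| (the model name is an example; assumes transformers, accelerate
| and bitsandbytes):
|
|       import torch
|       from transformers import AutoModelForCausalLM, BitsAndBytesConfig
|       bnb = BitsAndBytesConfig(
|           load_in_4bit=True,
|           bnb_4bit_quant_type="nf4",
|           bnb_4bit_compute_dtype=torch.float16)  # fp16, not bf16
|       model = AutoModelForCausalLM.from_pretrained(
|           "mistralai/Mistral-7B-v0.1",
|           quantization_config=bnb, device_map="auto")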
| diffeomorphism wrote:
| Is shared RAM of any use? For instance, a mini PC with an AMD
| 7940HS chip and 64GB of DDR5 RAM costs about 800EUR. At less than
| half the price of just a GPU, I am not expecting any great
| results, but is it usable?
| pella wrote:
| As far as I know, only 16GB of VRAM is allowed on the 7940HS (via
| BIOS), so you can probably expect similar results:
|
| https://old.reddit.com/r/Amd/comments/15t0lsm/i_turned_a_95_...
|
| HN https://news.ycombinator.com/item?id=37162762
| maerF0x0 wrote:
| I was curious about this topic too because M3 macs have
| "Unified memory" which is shared amongst their CPU/GPUs. Anyone
| have a link or explanation of how this works?
| treesciencebot wrote:
| One of the main bottlenecks for inference is memory bandwidth
| (esp. when dealing with huge models, like SD/SDXL), and for
| that, nothing I know of comes close to matching the memory
| speeds of Apple Silicon (up to 400GB/s).
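|
| Back-of-envelope, assuming each generated token has to stream
| the full weight set through memory once:
|
|       bandwidth_gb_s = 400           # Apple Silicon class unified memory
|       model_gb = 7e9 * 2 / 1e9       # 7B params at fp16 ~= 14 GB
|       print(bandwidth_gb_s / model_gb)  # ~28 tokens/s ceiling, pre-overhead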
| svnt wrote:
| You can run very large models on a Macbook M2 with 96 GB.
| They run 1/3 to 1/4 slower in tokens/s than the faster
| hardware, but they fit in memory.
|
| (400 GB/s is a lot in the form factor but the 4090 and
| equivalent have 1 TB/s, and H100s several times that)
|
| Edit: Here someone asked the same question:
| https://www.reddit.com/r/LocalLLaMA/comments/14319ra/rtx_409...
| pella wrote:
| Raja Koduri - 2023-11-25 -
| https://twitter.com/RajaXg/status/1728465097243406482
|
| _" " Very encouraging to see the steady increase of viable
| hardware options that can handle various AI models.
|
| At the beginning of the year, there was only one practical option
| - nVidia. Now we see at least 3 vendors providing reasonable
| options: Apple, AMD and Intel. We have been profiling several
| options and I will share some of our findings here.
|
| The good stuff
|
| - Apple Macs were a pleasant surprise on how easy it is to get
| various models running
|
| - AMD also made impressive progress with PyTorch and a lot more
| models run now than even 4-5 months ago on MI2XX and Radeon
|
| - We tried both Intel Arc and Ponte Vecchio and they were able to
| execute everything we have thrown at them.
|
| - Intel Gaudi has very impressive performance on the models that
| work on that architecture. It's our current best option for LLM
| inference on select models.
|
| - Ponte Vecchio surprised us with its performance on our custom
| face swap model, beating everyone including the mighty H100. We
| suspect that our model may be fitting largely in the Rambo cache.
|
| The wishlist
|
| - For training and inference of large models that don't fit in
| memory - nVidia is still the only practical option. Wishing that
| there are more options in 2024 here
|
| - While compatibility is getting better, a ton of performance is
| still left on the table on Apple, AMD and Intel. Wishing that
| software will keep getting better and increase their HW
| utilization. There is still room on compatibility as well,
| particularly with supporting various encodings and model
| parameter sizes on AMD.
|
| - Intel Gaudi looks very promising performance-wise and wishing
| that more models seamlessly work out of the box without Intel
| intervention.
|
| - Wishing that both AMD and Intel release new gaming GPUs with
| more memory capacity and bandwidth.
|
| - Wishing that Intel releases a PVC kicker with more memory
| capacity and bandwidth. Currently it's the best option we have to
| bring our artists workflow with face swap training from 3-days to
| a few hours. It scales linearly from 1-GPU to 16-GPUs.
|
| - Wishing Intel support for PyTorch is as frictionless as AMD and
| nVidia. Maybe Intel should consider supporting PyTorch ROCm or
| up-stream OneAPI support under CUDA device.
|
| really grateful to all vendors for providing access to hardware
| and developer support.
|
| Looking forward to continue filling our data center with
| interesting mix of architectures."
| noughtme wrote:
| If I just needed a GPU for learning purposes, are two GPUs
| necessary? Would a single 24GB GPU significantly bottleneck
| training with any publicly available datasets? I just need
| something faster than my laptop, but if it takes twice as long,
| that's not really an issue.
| buildbot wrote:
| Training what on what - ResNet-50 on ImageNet? Yeah, sure, a
| single 4090 is fine. It will take a bit.
|
| A 1.5B-parameter LLM? That's a few weeks with 64 V100s - on a
| small dataset.
|
| Training something Llama-7B class (not using LoRA)? Weeks with
| the same number of A100s.
|
| With LoRA? Back to a single 4090 - depending on your dataset.
| It still might take weeks to go through 2000 examples for
| finetuning with a large context size.
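|
| A minimal LoRA sketch with the peft library (hyperparameters and
| model name are illustrative, not a recipe) -- the point is that
| only the small adapter matrices train:
|
|       from transformers import AutoModelForCausalLM
|       from peft import LoraConfig, get_peft_model
|       model = AutoModelForCausalLM.from_pretrained(
|           "mistralai/Mistral-7B-v0.1")
|       lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
|                         target_modules=["q_proj", "v_proj"],
|                         task_type="CAUSAL_LM")
|       model = get_peft_model(model, lora)
|       model.print_trainable_parameters()  # ~0.1% of weights trainable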
| buildbot wrote:
| An eBay 3090, or a new 4090 at $1599 (Founders Edition, Gigabyte
| Windforce V2), are the best price/performance/ease-of-use in my
| opinion.
|
| AMD is too funky for most still. I have an MI60 that won't load
| drivers due to some missing PSP (platform security processor)
| firmware on the GPU...
| uniqueuid wrote:
| To be honest, the best idea for most people is probably to buy
| whatever GPU you can easily afford and then rent a big-iron GPU
| when you need one.
|
| There is almost no way you will make back the $5k for a 40GB+
| VRAM card, so just save yourself the hassle and go for something
| that ticks all the rest of your boxes.
|
| Non-CUDA cards may be ok if you have very simple requirements,
| but I'd expect many hours of debugging if you want something
| that's not ready to go out of the box.
| civilitty wrote:
| I agree. Availability is a pain in the ass, which might be a
| dealbreaker for urgent interactive use cases, but a 48GB A6000
| on LambdaLabs is $0.80/hr [1]. A newer 80GB H100 is $1.99/hr,
| so especially if you're trying to do batch processing and can
| script a bot to wait for availability, it's often a much better
| option.
|
| With that aforementioned A6000 ($5k retail) you'd have to use
| it for at least _six thousand_ hours to break even on the cloud
| cost.
|
| [1] https://lambdalabs.com/service/gpu-cloud#pricing
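|
| (The arithmetic, counting hardware price only and ignoring power
| and resale value:)
|
|       a6000_price = 5000            # USD, retail
|       cloud_rate = 0.80             # USD/hr, 48GB A6000 on Lambda
|       hours = a6000_price / cloud_rate       # 6250 hours
|       print(hours / (24 * 30))               # ~8.7 months of 24/7 use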
| buildbot wrote:
| That seems like a lot, but that's only ~8 months of usage. If
| you are doing consistent work with large models, or plan to
| for over a year, then it makes sense to at least have some
| hardware.
|
| Something people forget too is that if you have no Nvidia
| GPUs at all locally, you'll need to spend a significant
| amount of time setting up a new node, copying data, and
| debugging in your cloud instance each time you want to do
| something, while being charged for it. It's a pretty big
| boost in terms of my time to develop locally and then scale
| to the cloud once something smaller scale is working.
| uniqueuid wrote:
| I agree especially with the second argument.
|
| But most people who toy with LLMs will probably never make
| money out of them. Even those who do will often spend a lot
| of time getting their bearings, during which the GPU sits
| idle. Then you begin to ramp up your use, but by then
| there's a new generation of GPUs out.
|
| That's why my recommendation is to start with something
| lightweight.
|
| It's also much less frustrating to start working for a few
| hours on a rented A100 rather than running into OOMs all
| the time while fine-tuning batch sizes and waiting for the
| nth highly quantized model to download.
| qwertyforce wrote:
| used 3090 is the best option imo
| jszymborski wrote:
| It really depends on what you're trying to do.
|
| This is sorta _the_ guide on GPUs for DL and has a great decision
| tree: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
|
| Personally, I'm limited to an RTX 2080 for my personal projects
| at the moment, and I find the constraint pretty rewarding. It
| forces me to find alternatives to the huge models, and you'd be
| surprised what you can eke out when you pour in the time to tweak
| models. Of course, good data is also paramount.
| ChrisArchitect wrote:
| Ask HN:
| Avlin67 wrote:
| > Last I looked, NVidia worked well, and AMD was horrible.
|
| Are you sure?
___________________________________________________________________
(page generated 2023-11-26 23:01 UTC)