[HN Gopher] The state of silicon and the GPU poors
___________________________________________________________________
The state of silicon and the GPU poors
Author : swyx
Score : 46 points
Date : 2023-11-17 19:41 UTC (3 hours ago)
(HTM) web link (www.latent.space)
(TXT) w3m dump (www.latent.space)
| brucethemoose2 wrote:
| A flood of giant H200s and MI300s is great and all, but what
| about personal accelerators?
|
| The current state of local inference/finetuning is _insane_,
| where the hardware that makes any financial sense is ancient
| Nvidia GPUs, like the 3090 (2020) or the RTX 8000 (2018). Or
| _maybe_ the rare used AMD Radeon Pro, if you can actually find
| one.
|
| The only hope seems to be the rumored large Intel/AMD APUs and
| maybe Intel Battlemage, since AMD is seemingly complicit with
| Nvidia in preserving the low-VRAM status quo.
| treesciencebot wrote:
| VRAM is not the main constraint, is it? The computational power
| of any of the new graphics cards (aside from the highest-end
| models like the RTX 4090, where VRAM is the actual constraint)
| is absurdly low on the stuff that actually matters (tensor
| cores, CUDA cores, etc.). They are graphics cards, equipped
| with consumer-grade VRAM (instead of HBM), and IMHO it will
| take a very big shift before we see them being used as real AI
| accelerators.
| sp332 wrote:
| Well if you don't have enough VRAM to hold the whole model,
| you have to swap out to main RAM for every iteration, which
| for most home/local setups means once for every single token
| generated. So having enough is kind of a minimum.
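|
| Rough arithmetic for what "enough" means (just the weights,
| ignoring KV cache and activations; numbers are illustrative):
|
|     # GiB needed just to hold the weights
|     def weight_gib(params_billion, bits_per_weight):
|         return params_billion * 1e9 * bits_per_weight / 8 / 2**30
|
|     print(weight_gib(7, 16))   # ~13 GiB: fits a 16GB card
|     print(weight_gib(70, 16))  # ~130 GiB: multi-GPU territory
|     print(weight_gib(70, 4))   # ~33 GiB: still over a 24GB 3090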
| Me1000 wrote:
| Nah, VRAM is definitely the main constraint for most people
| trying to do local inference of LLMs. If you look at a lot of
| the local LLM communities, for people who aren't super
| interested in training, many people suggest the M2 Ultra or
| M3 Max with 128GB+ of unified memory just because they have so
| much of it. Again, you might not be able to train or fine-tune
| as well, but the first step is just being able to keep the
| weights in memory. Inference isn't that computationally
| intensive relative to just how much RAM you need.
| brucethemoose2 wrote:
| Yeah. I didn't mention them because the 64GB+ configs are
| very expensive, and I don't think you can even finetune on
| them.
| Me1000 wrote:
| They're not cheap, but they're not _that_ expensive compared
| to buying four NVLink'd Nvidia cards with a similar combined
| amount of VRAM. Plus you get a whole computer with it, and
| you don't have to worry about cases and power supplies, etc.
| But yeah, you will be more compute-constrained if you go down
| that route.
| brucethemoose2 wrote:
| The price is similar to 2x RTX 8000s, or even A6000s, but
| yeah, your point stands. Power efficiency is a factor too.
|
| You run into immense pain the moment you venture outside
| of llama inference though.
| brucethemoose2 wrote:
| VRAM is _everything_.
|
| The more VRAM you have, the less aggressively you have to
| quantize models for inference, which in turn has huge
| speed/quality implications. You can run higher batch sizes,
| or draft models, or more caching, which increases efficiency.
| For LLMs specifically, you can load bigger models into VRAM
| in the first place. You can load more of a multimodal
| pipeline in VRAM without having to constantly swap everything
| out.
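|
| To give a sense of how quickly batching and the KV cache eat
| into that capacity, a rough sketch (assuming Llama-2-7B-ish
| attention dimensions and an fp16 cache; numbers are
| illustrative):
|
|     # KV cache bytes per token = 2 (K and V) * layers * heads
|     #                            * head_dim * bytes per value
|     def kv_cache_gib(layers=32, heads=32, head_dim=128,
|                      val_bytes=2, seq_len=4096, batch=1):
|         per_token = 2 * layers * heads * head_dim * val_bytes
|         return per_token * seq_len * batch / 2**30
|
|     print(kv_cache_gib())         # ~2 GiB per 4096-token sequence
|     print(kv_cache_gib(batch=8))  # ~16 GiB of cache at batch 8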
|
| This is all 10x true for finetuning. Quality and speed are
| essentially determined by VRAM capacity, as long as you are
| not on a truly ancient GPU like a P40 that can't even do
| fp16.
|
| As for architecture... TBH, many operations are heavily
| bandwidth bound these days. Sometimes a 3090 and a 4090 are
| essentially the same speed. And waiting a little longer for a
| finetune is no big deal vs not being able to do it at all, or
| doing it at low quality.
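|
| A rough sketch of why that happens (assuming single-stream
| decoding, where every weight is read from VRAM once per token,
| so memory bandwidth sets the ceiling; the GB/s figures are the
| published specs):
|
|     # ceiling on tokens/sec: bandwidth / weight bytes per token
|     def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb):
|         return bandwidth_gb_s / weights_gb
|
|     # a 13B model at 4-bit is roughly 6.5 GB of weights
|     print(decode_ceiling_tok_s(936, 6.5))   # 3090: ~144 tok/s
|     print(decode_ceiling_tok_s(1008, 6.5))  # 4090: ~155 tok/s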
|
| > They are graphics cards, equipped with consumer-grade VRAM
| (instead of HBM), and IMHO it will take a very big shift
| before we see them being used as real AI accelerators
|
| I think it's important for users to break away from the cloud
| and APIs, and try to run stuff themselves, lest we get locked
| into an OpenAI monopoly.
|
| But setting that aside, running locally is also extremely
| useful for prototyping and testing. You can see if something
| works without burning dollars every second you spend
| debugging on a big cloud instance. Even if that's affordable,
| just feeling like I'm on the clock when debugging/optimizing
| is stressful to me.
| kimixa wrote:
| Looks like you can get an MI60 with 32GB of VRAM for ~$500 on
| eBay - if you're VRAM-limited, it might be a good option for
| less than what a 3090 seems to be going for, and it seems to
| be supported by ROCm (being an MI50 just with more HBM).
|
| Not sure how well it will perform, though, as it lacks some of
| the smaller-data-type acceleration available on newer cores.
| superkuh wrote:
| AMD has a bad habit of dropping ROCm support for their
| devices after 4 years. Trying to work with an old dependency
| chain is something I wouldn't wish on an enemy. I don't know
| about the MI60 but much of the MI* line has already had
| support for compute under ROCm dropped.
| kimixa wrote:
| It's probably not surprising that cheaper options sacrifice
| some level of software or hardware support - that's why
| they're cheaper. Software support is why the "pro" versions
| exist as separate (more expensive) SKUs, after all.
|
| Of course you'll get a better service if you pay more.
| Expecting a high performance, easy, turnkey solution for a
| low price is expecting to have your cake and eat it. It's
| never happened on any other technology with this level of
| demand, and expecting it now seems naive at best. There's
| too much money (and hype) sloshing around in the sector,
| and limited supply.
|
| Enthusiasts and experimenters have always traded their time
| for cost - if only because their time investment and
| learning (and maybe fun) is the whole point, not the end
| results of their experiments. If you expect financial
| returns due to your incredible idea, you should be looking
| for investors not old hardware.
| brucethemoose2 wrote:
| It's kinda funky, and AMD dropped the MI50 already. See this
| very interesting reddit discussion:
|
| https://old.reddit.com/r/LocalLLaMA/comments/17vcsf9/somethi...
|
| Another thing that jumps out:
|
| > I also have a couple W6800's and they are actually as fast
| or faster than the MI100s with the same software...
|
| That's insane. The MI100 should be so much faster than the
| W6800 (a ~6900XT) that it's not even funny.
| wmf wrote:
| Apple Silicon
| coldcode wrote:
| I wonder if we can afford the energy cost of the ever-increasing
| need to compile more and more information into ever larger AI
| baskets. Is there a point when the cost exceeds even what a
| Zuckerberg can pay?
| fermuch wrote:
| I am hopeful that developments like Mistral (a 7B model with
| the performance of ChatGPT for most of the tasks I've used it
| for) keep happening. Seems like we can still compress a lot
| of knowledge.
| ReactiveJelly wrote:
| I'm GPU poor. Sick of buying hardware that is not only not fast
| enough after a year, but not even supported by software that just
| multiplies and adds numbers in fancy ways.
|
| Debian runs like a champ on my 8-year-old CPU. When will GPUs
| last that long?
| tedunangst wrote:
| Can I theoretically buy a 4090 and rent it out to some cloud?
| petercooper wrote:
| Yes - https://vast.ai/ is one option.
| codetrotter wrote:
| I'm not a cloud but I'd be happy to rent a 4090 from you now
| and then :D
| darklycan51 wrote:
| Ever since the 3000 generation, Nvidia GPUs have simply
| stopped being a possibility for anyone in the third world;
| they are way too expensive.
|
| Add in the increase in energy costs... I don't know, really.
| brucethemoose2 wrote:
| Yeah, CPUs too. Power use is unacceptable: desktops and big
| laptops turbo like mad to achieve really tiny performance
| boosts.
| gumby wrote:
| I've been experimenting with fixed-point weights on the
| integer vector hardware (AVX) with pretty good results, though
| I haven't built enormous models yet. The GPU may turn out to
| be a spur line.
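|
| A minimal sketch of the general idea (in numpy rather than AVX
| intrinsics, with toy shapes, and not my actual code): symmetric
| per-row int8 quantization, an integer multiply-accumulate, then
| a single float rescale per output.
|
|     import numpy as np
|
|     def quantize_rows(w):
|         # per-row scale so each row maps into [-127, 127]
|         scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
|         return np.round(w / scale).astype(np.int8), scale
|
|     rng = np.random.default_rng(0)
|     w = rng.standard_normal((4, 64)).astype(np.float32)  # toy weights
|     x = rng.standard_normal(64).astype(np.float32)       # toy input
|
|     wq, w_scale = quantize_rows(w)
|     x_scale = np.abs(x).max() / 127.0
|     xq = np.round(x / x_scale).astype(np.int8)
|
|     # integer dot products (what the AVX integer units handle),
|     # then one float rescale per output element
|     acc = wq.astype(np.int32) @ xq.astype(np.int32)
|     y = acc * w_scale[:, 0] * x_scale
|
|     print(np.max(np.abs(y - w @ x)))  # small error vs fp32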
___________________________________________________________________
(page generated 2023-11-17 23:00 UTC)