[HN Gopher] The state of silicon and the GPU poors
       ___________________________________________________________________
        
       The state of silicon and the GPU poors
        
       Author : swyx
       Score  : 46 points
       Date   : 2023-11-17 19:41 UTC (3 hours ago)
        
 (HTM) web link (www.latent.space)
 (TXT) w3m dump (www.latent.space)
        
       | brucethemoose2 wrote:
       | A flood of giant H200s and MI300s is great and all, but what
       | about personal accelerators?
       | 
        | The current state of local inference/finetuning is _insane_,
       | where the hardware that makes any financial sense is ancient
       | Nvidia GPUs, like the 3090 (2020) or the RTX 8000 (2018). Or
       | _maybe_ the rare used AMD Radeon Pro, if you can actually find
       | one.
       | 
        | The only hope seems to be the rumored large Intel/AMD APUs and
        | maybe Intel Battlemage, since AMD is seemingly complicit in
        | preserving the low-VRAM status quo with Nvidia.
        
         | treesciencebot wrote:
          | VRAM is not the main constraint, is it? The computational
          | power of any of the new graphics cards (aside from the
          | highest-end models like the RTX 4090, where VRAM is the actual
          | constraint) is absurdly low on the stuff that actually matters
          | (tensor cores, CUDA cores, etc.). They are graphics cards,
          | equipped with consumer-grade VRAM (instead of HBM), and IMHO it
          | will take a very big shift before we see them being used as
          | real AI accelerators.
        
           | sp332 wrote:
           | Well if you don't have enough VRAM to hold the whole model,
           | you have to swap out to main RAM for every iteration, which
           | for most home/local setups means once for every single token
           | generated. So having enough is kind of a minimum.
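
            A rough sketch of why that swap hurts: every generated token
            has to touch every weight, so once the weights sit in system
            RAM the decode-speed ceiling is set by the slowest link they
            cross rather than by compute. A minimal illustration, with
            assumed (not benchmarked) sizes and bandwidths:

                # Ceiling on decode speed when weights are streamed from
                # system RAM each token vs. kept resident in VRAM.
                def max_tokens_per_sec(model_bytes, link_bytes_per_sec):
                    # each token reads every weight once, so throughput is
                    # capped at bandwidth / model size
                    return link_bytes_per_sec / model_bytes

                model_7b_q4 = 7e9 * 0.5  # ~7B params at ~4 bits/param (assumed)
                pcie4_x16   = 32e9       # ~32 GB/s practical PCIe 4.0 x16 (assumed)
                vram_bw     = 900e9      # ~900 GB/s GDDR6X, 3090-class (assumed)

                for name, bw in (("swapped over PCIe", pcie4_x16),
                                 ("resident in VRAM ", vram_bw)):
                    tps = max_tokens_per_sec(model_7b_q4, bw)
                    print(f"{name}: ~{tps:.0f} tok/s ceiling")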
        
           | Me1000 wrote:
            | Nah, VRAM is definitely the main constraint for most people
            | trying to do local inference of LLMs. If you look at the
            | local LLM communities, for people who aren't super interested
            | in training the common suggestion is the M2 Ultra or M3 Max
            | with 128GB+ of unified RAM, just because it has so much
            | memory. Again, you might not be able to train or fine-tune as
            | well, but the first step is just being able to keep the
            | weights in memory. Inference isn't that computationally
            | intensive relative to just how much RAM you need.
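
            The "keep the weights in memory" step is simple arithmetic:
            bytes per parameter times parameter count, before any KV cache
            or activations. A back-of-envelope sketch, assuming common
            precisions:

                # Weights-only footprint at common precisions; the KV cache
                # and activations come on top of this.
                BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4-bit": 0.5}

                def weights_gb(n_params, precision):
                    return n_params * BYTES_PER_PARAM[precision] / 1e9

                for n in (7e9, 34e9, 70e9):
                    row = ", ".join(f"{p}: ~{weights_gb(n, p):.0f} GB"
                                    for p in BYTES_PER_PARAM)
                    print(f"{n/1e9:.0f}B model -> {row}")

            At fp16 a 70B model is roughly 140 GB of weights alone, which
            is why the huge unified-memory Macs come up so often.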
        
             | brucethemoose2 wrote:
             | Yeah. I didn't mention them because the 64GB+ configs are
             | very expensive, and I don't think you can even finetune on
             | them.
        
               | Me1000 wrote:
                | They're not cheap, but they're not _that_ expensive
                | compared to buying four NVLink'd Nvidia cards with a
                | similar combined amount of VRAM. Plus you get a whole
                | computer with it, and you don't have to worry about cases
                | and power supplies, etc. But yeah, you will be more
                | compute-constrained if you go down that route.
        
               | brucethemoose2 wrote:
                | The price is similar to 2x RTX 8000s, or even A6000s,
                | but yeah, your point stands. Power efficiency is worth
                | something too.
               | 
               | You run into immense pain the moment you venture outside
               | of llama inference though.
        
           | brucethemoose2 wrote:
           | VRAM is _everything_.
           | 
            | The more VRAM you have, the less aggressively you have to
            | quantize models for inference, which in turn has huge
            | speed/quality implications. You can run higher batch sizes,
            | or draft models, or more caching, which increases efficiency.
           | For LLMs specifically, you can load bigger models into VRAM
           | in the first place. You can load more of a multimodal
           | pipeline in VRAM without having to constantly swap everything
           | out.
           | 
            | This is all 10x true for finetuning. Quality and speed are
            | essentially determined by VRAM capacity, as long as you are
            | not on a truly ancient GPU like a P40 that can't even do
            | fp16.
           | 
           | As for architecture... TBH, many operations are heavily
           | bandwidth bound these days. Sometimes a 3090 and a 4090 are
           | essentially the same speed. And waiting a little longer for a
           | finetune is no big deal vs not being able to do it at all, or
           | doing it at low quality.
           | 
            | > They are graphics cards, equipped with consumer-grade VRAM
            | (instead of HBM), and IMHO it will take a very big shift
            | before we see them being used as real AI accelerators
           | 
            | I think it's important for users to break away from the
            | cloud and APIs and try to run stuff themselves, lest we get
            | locked into an OpenAI monopoly.
           | 
            | But setting that aside, running local is also extremely
            | useful for prototyping and testing. You can see if something
            | works without burning dollars every second you spend
            | debugging on a big cloud instance. Even if that's affordable,
            | just feeling like I'm on the clock when debugging/optimizing
            | is stressful to me.
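
            The batch-size and caching point above is mostly KV-cache
            arithmetic: the cache grows linearly with batch size and
            context length, on top of the fixed weight footprint. A rough
            sketch, assuming Llama-7B-like shapes (32 layers, 32 KV heads,
            head dim 128, fp16 cache):

                # KV-cache size grows linearly with batch and context,
                # on top of the static weights (shapes are assumptions).
                def kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128,
                                seq_len=4096, batch=1, bytes_per_elem=2):
                    # K and V each hold n_layers * n_kv_heads * head_dim
                    # values per token in the sequence
                    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
                    return per_token * seq_len * batch / 1e9

                for batch in (1, 4, 16):
                    print(f"batch {batch:2d}: ~{kv_cache_gb(batch=batch):.1f} GB at 4k ctx")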
        
         | kimixa wrote:
          | Looks like you can get an MI60 with 32GB of VRAM for ~$500 on
          | eBay - if you're VRAM-limited it might be a good option for
          | less than what a 3090 seems to be going for, and it seems
          | supported by ROCm (being an MI50 just with more HBM).
          | 
          | Not sure how well it will perform though, as it lacks some of
          | the smaller data type acceleration available on newer cores.
        
           | superkuh wrote:
           | AMD has a bad habit of dropping ROCm support for their
           | devices after 4 years. Trying to work with an old dependency
           | chain is something I wouldn't wish on an enemy. I don't know
           | about the MI60 but much of the MI* line has already had
           | support for compute under ROCm dropped.
        
             | kimixa wrote:
             | It's probably not surprising that cheaper options sacrifice
             | some level of software or hardware support - that's why
             | they're cheaper. Software support is why the "pro" versions
             | exist as separate (more expensive) SKUs, after all.
             | 
             | Of course you'll get a better service if you pay more.
              | Expecting a high-performance, easy, turnkey solution for a
              | low price is expecting to have your cake and eat it. It's
             | never happened on any other technology with this level of
             | demand, and expecting it now seems naive at best. There's
             | too much money (and hype) sloshing around in the sector,
             | and limited supply.
             | 
             | Enthusiasts and experimenters have always traded their time
             | for cost - if only because their time investment and
             | learning (and maybe fun) is the whole point, not the end
              | results of their experiments. If you expect financial
              | returns from your incredible idea, you should be looking
              | for investors, not old hardware.
        
           | brucethemoose2 wrote:
            | It's kinda funky, and AMD dropped the MI50 already. See this
           | very interesting reddit discussion:
           | 
           | https://old.reddit.com/r/LocalLLaMA/comments/17vcsf9/somethi.
           | ..
           | 
           | Another thing that jumps out:
           | 
           | > I also have a couple W6800's and they are actually as fast
           | or faster than the MI100s with the same software...
           | 
            | That's insane. The MI100 should be so much faster than the
            | W6800 (a ~6900XT) that it's not even funny.
        
         | wmf wrote:
         | Apple Silicon
        
       | coldcode wrote:
       | I wonder if we can afford the energy cost of the ever-increasing
       | need to compile more and more information into ever larger AI
       | baskets. Is there a point when the cost exceeds even what a
       | Zuckerberg can pay?
        
         | fermuch wrote:
          | I am hopeful that developments like Mistral (a 7B model with
          | the performance of ChatGPT for most of the tasks I've used it
          | for) keep happening. It seems like we can still compress a lot
          | of knowledge.
        
       | ReactiveJelly wrote:
       | I'm GPU poor. Sick of buying hardware that is not only not fast
       | enough after a year, but not even supported by software that just
       | multiplies and adds numbers in fancy ways.
       | 
        | Debian runs like a champ on my 8-year-old CPU. When will GPUs
        | last that long?
        
       | tedunangst wrote:
       | Can I theoretically buy a 4090 and rent it out to some cloud?
        
         | petercooper wrote:
         | Yes - https://vast.ai/ is one option.
        
         | codetrotter wrote:
         | I'm not a cloud but I'd be happy to rent a 4090 from you now
         | and then :D
        
       | darklycan51 wrote:
        | Ever since the 3000 generation, Nvidia GPUs have simply stopped
        | being a possibility for anyone in the third world; they are way
        | too expensive.
       | 
       | Add in the increase in energy costs... I don't know, really.
        
         | brucethemoose2 wrote:
          | Yeah, CPUs too. Power use is unacceptable; desktops and big
          | laptops turbo like mad to achieve really tiny performance
          | boosts.
        
       | gumby wrote:
        | I've been experimenting with fixed-point weights on the integer
        | vector hardware (AVX) with pretty good results, though I haven't
        | built enormous models yet. The GPU may turn out to be a spur
        | line.
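
        A minimal sketch of the fixed-point idea in numpy (an illustration
        under assumed symmetric per-tensor int8 scales, not the
        hand-written AVX kernels the comment describes): quantize weights
        and activations to int8, accumulate the dot product in integers,
        and apply a single float rescale at the end.

            import numpy as np

            def quantize_int8(v):
                # symmetric per-tensor scale: map the max magnitude to 127
                scale = float(np.abs(v).max()) / 127.0
                return np.round(v / scale).astype(np.int8), scale

            rng = np.random.default_rng(0)
            w = rng.standard_normal(4096).astype(np.float32)  # one weight row
            x = rng.standard_normal(4096).astype(np.float32)  # activations

            wq, ws = quantize_int8(w)
            xq, xs = quantize_int8(x)

            # integer multiply-accumulate (the part the AVX integer units
            # would handle), then one float rescale to undo both scales
            y_q = int(np.dot(wq.astype(np.int32), xq.astype(np.int32))) * ws * xs
            y_f = float(np.dot(w, x))
            print(f"fp32: {y_f:+.3f}   int8 fixed-point: {y_q:+.3f}")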
        
       ___________________________________________________________________
       (page generated 2023-11-17 23:00 UTC)