[HN Gopher] Building a deep learning rig
       ___________________________________________________________________
        
       Building a deep learning rig
        
       Author : dvcoolarun
       Score  : 139 points
        Date   : 2024-02-23 13:52 UTC (1 day ago)
        
 (HTM) web link (samsja.github.io)
 (TXT) w3m dump (samsja.github.io)
        
       | infogulch wrote:
       | I'm eyeing Tinybox as a deep learning rig.
       | 
       | https://tinygrad.org/
       | 
       | https://twitter.com/__tinygrad__/status/1760988080754856210
        
         | Smith42 wrote:
         | $15k!
        
           | KeplerBoy wrote:
           | Which is not unreasonable for that amount of hardware.
           | 
            | You have to ask yourself if you want to drop that kind of
            | money on consumer GPUs, which launched in late 2022. But
            | then again, with that kind of money you are stuck with
            | consumer GPUs either way, unless you want to buy Ada
            | workstation cards for 6k each, and those are just 4090s
            | with p2p memory enabled. Hardly worth the premium if you
            | don't absolutely need that.
        
             | cyanydeez wrote:
              | I believe the Ada workstation cards are typically
              | single-slot cards, which means you could build a 4-GPU
              | server from normal cases.
              | 
              | Most of the 4090 cards are 2-3 slot cards.
        
               | KeplerBoy wrote:
               | The beefy workstation cards are 2 slots, but yeah the
               | 4090 cards are usually 3.something slots, which is
               | ridiculous. The few dual slot ones are water cooled.
        
               | cyanydeez wrote:
                | The workstation cards also run at 300 watts, and it
                | looks like the 4090 goes to 450.
                | 
                | So you are getting a more practical card for the price.
                | 
                | If you are building a mining-type rig, then yeah, the
                | extra price is wasted money.
                | 
                | But if you want to build a normal machine, the
                | workstation cards are the most reasonable choice for
                | anything more than 2 GPUs.
        
               | kkielhofner wrote:
                | I find it challenging to get my 4090s to consume more
                | than 300 watts. There are also a lot of articles,
                | benchmarks, etc. around showing you can dramatically
                | limit power while reducing perf by insignificant
                | amounts (single digit %).
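                | 
                | As a rough, hypothetical illustration (mine, not from
                | the thread): capping the limit is a one-liner around
                | nvidia-smi. The exact floor depends on the card's
                | VBIOS, and setting the limit needs root.
                | 
                |   import subprocess
                | 
                |   # Cap GPU 0 at 300 W, then report draw vs. limit.
                |   subprocess.run(["nvidia-smi", "-i", "0", "-pl", "300"], check=True)
                |   out = subprocess.run(
                |       ["nvidia-smi", "-i", "0",
                |        "--query-gpu=power.draw,power.limit",
                |        "--format=csv,noheader"],
                |       capture_output=True, text=True, check=True,
                |   )
                |   print(out.stdout.strip())  # e.g. "287.13 W, 300.00 W"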
        
               | justsomehnguy wrote:
                | > which means you could build a 4-GPU server from
                | normal cases.
                | 
                | Only if you already live near an airport and are
                | accustomed to the sounds of planes taking off and
                | flying away.
        
         | treprinum wrote:
         | Sure, if you want to waste your time on getting stuff working
         | on AMD instead of spending it on actual model training...
        
           | kkielhofner wrote:
           | This.
           | 
           | People complain about the "Nvidia tax". I don't like
           | monopolies and I fully support the efforts of AMD, Intel,
           | Apple, anyone to chip away at this.
           | 
            | That said, as-is with ROCm you will:
            | 
            | - Absolutely burn hours/days/weeks getting many (most?)
            | things to work at all. If you get it working you need to
            | essentially "freeze" the configuration, because an upgrade
            | means doing it all over again.
           | 
           | - In the event you get it to work at all you'll realize
           | performance is nowhere near the hardware specs.
           | 
           | - Throw up your hands and go back to CUDA.
           | 
           | Between what it takes to get ROCm to work and the performance
           | issues the Nvidia tax becomes a dividend nearly instantly
           | once you factor in human time, less-than-optimal performance,
           | and opportunity cost.
           | 
           | Nvidia says roughly 30% of their costs are on software.
           | That's what you need to do to deliver something that's
           | actually usable in the real world. With the "Nvidia tax"
           | they're also reaping the benefit of the ~15 years they've
           | been sinking resources into CUDA.
        
             | dkjaudyeqooe wrote:
              | Wow! It's incredible how Nvidia has created the dark
              | voodoo magic, and how only they can deliver the strong
              | juju for AI. How are they so incredibly smart and
              | powerful!?
             | 
             | I wonder if it has anything to do with the strategy they
             | used in 3D graphics, where game developers ended up writing
             | for NV drivers in order to maximise performance, and Nvidia
             | abused their market position and used every trick in the
             | book to make AMD cards run poorly. People complained about
             | AMD driver quality, but the actual problem was that they
             | were not NV drivers and AMD couldn't defeat their software
             | moat.
             | 
              | So here we are again, this time with AI. You'd think
              | we'd have learnt our lesson, but people are fooled yet
              | again; instead of understanding that diversity and
              | competition are the lifeblood of their art, myopia and
              | amnesia are the order of the day.
             | 
             | Tinygrad are doing god's work and I won't be giving Nvidia
             | a single fucking cent of my money until the software is
             | hardware neutral and there is real competition.
        
               | whimsicalism wrote:
                | No, I think it is much more due to AMD absolutely
                | failing to invest in software at all. Honestly, they
                | have had years - it is difficult to see how Nvidia
                | abused their market position in AI when this is
                | effectively a new market.
               | 
               | I find this vague reflexive anti-corpo leftism that seems
               | to have become extremely popular post-2020 really
               | tiresome.
        
               | infogulch wrote:
                | For my part I'm not a leftist, and I'm not so much
                | anti-corpo as pro-free market. I can still acknowledge
                | that Nvidia is persistently pursuing anticompetitive
                | policies and abusing their market position, without
                | putting AMD on a pedestal and assuming it would be
                | different if the shoe were on the other foot.
        
               | dkjaudyeqooe wrote:
               | > I find this vague reflexive anti-corpo leftism that
               | seems to have become extremely popular post-2020 really
               | tiresome.
               | 
                | Ah ideology, such a great alternative to actual
                | thinking. Don't investigate or reason, just blame it
                | on the 'lefties'. Tiresome indeed.
               | 
                | Not sure how me simply stating the obvious makes me a
                | 'lefty'. If you think monopolies, regardless of how
                | they come about, are a good idea, and that companies
                | should be allowed to lock up an important market for
                | any reason, then that makes you a corporatist fascist,
                | right? Wow, this mindless name-calling is so much fun!
                | I feel like a total genius.
               | 
                | The simple fact is that the nature of software - its
                | complexity and dependence on a multitude of fairly
                | arbitrary technical choices - makes it very effective
                | as a moat, even if that's not intentional. CUDA, etc.
                | is 100% a software compatibility issue, and that's it.
                | There's more than one way to skin a cat, but we're
                | stuck with this one. Nvidia isn't interested in
                | interoperability, even though it's critical for the
                | industry in the longer term. I wouldn't be either if
                | it were money in my pocket.
               | 
               | The point that is entirely missed here is that we, as a
               | community, are screwing up by not steering the field
               | toward better hardware compatibility, as in anyone being
                | able to produce new hardware. In the rush to improve
                | or try out the latest model or software we have lost
                | sight of this, and it will be to our great detriment.
               | With the concentration of money in one company we will
               | have a lot less innovation overall. Prices will be higher
               | and resources will be misallocated. Everyone suffers.
               | 
               | It's very possible that AI withers on the vine due to
               | lagging hardware. It's going to need a lot of compute,
               | and maybe a different kind of compute to boot. We may
               | need a million or a billion times what we have to even
               | get close to AGI. But if one company locks that up, and
               | uses that position to squeeze out every dollar from its
               | customers (really, have a look at the almost comical
               | 'upgrades' Nvidia offers in their GPUs other than at the
                | very high end) then it's going to take much longer to
                | progress, and maybe we never get there because some
                | small group of talented maverick researchers were never
                | able to get their hands on the hardware they needed and
                | never produced some critical breakthrough.
        
               | kkielhofner wrote:
               | Sarcasm aside...
               | 
               | Can we drop the "Nvidia is the only self-interested evil
               | company in existence" schtick?
               | 
               | I'm not being "fooled" by anyone. I've been trying to use
               | ROCm since the initial release six years ago (on Vega at
               | the time). I've spent thousands of dollars on AMD
               | hardware over the years hoping to see progress for
               | myself. I've burned untold amounts of time fighting with
               | ROCm, hoping it's even remotely a viable competitor to
               | CUDA/Nvidia.
               | 
               | Here we are in 2024 and they're still doing braindead
               | stuff like dropping a new ROCm release to support their
               | flagship $1000 consumer card a full year after release...
               | 
                | ROCm 6 looks good? Check the docker containers[0].
                | Their initial ROCm 6 containers only supported Python
                | 3.9 for some strange reason, even though the previous
                | ROCm 5.7 containers were based on Python 3.10. Python
                | 3.10 is more-or-less the minimum for nearly anything
                | out there.
               | 
               | It took them 1.5 months to address this... This is merely
               | one example, spend some time actually working with this
               | and you will find dozens of similar "WTF?!?" bombs all
               | over the place.
               | 
               | I suggest you put your money and time where your mouth is
               | (as I have) to actually try to work with ROCm. You will
               | find that it is nowhere near the point of actually being
               | a viable competitor to CUDA/Nvidia for anyone who's
               | trying to get work done.
               | 
               | > Tinygrad are doing god's work
               | 
                | Tinygrad is packaging hardware with off-the-shelf
                | components plus a substantial markup. There is nothing
                | special about this hardware and they aren't doing
                | anything you couldn't have done in the past year. They
                | have been vocal about calling out AMD, but show me
                | their commits to ROCm and I'll agree they are "doing
                | god's work".
               | 
               | We'll save the work being done on their framework for
               | another thread.
               | 
               | [0] - https://hub.docker.com/r/rocm/pytorch/tags
        
               | latchkey wrote:
               | > There is nothing special about this hardware and they
               | aren't doing anything you couldn't have done in the past
               | year.
               | 
               | What they are doing is all of the hardware engineering
               | work that it takes to build something like this. You're
               | dismissing the amount of time they spent on figuring
               | stuff like this out:
               | 
               | "Beating back all the PCI-E AER errors was hard, as
               | anyone knows who has tried to build a system like this."
        
               | kkielhofner wrote:
               | > "Beating back all the PCI-E AER errors was hard, as
               | anyone knows who has tried to build a system like this."
               | 
               | Define "hard".
               | 
               | The crypto mining community has had this working for at
               | least half a decade with AMD cards. With Nvidia it's a
               | non-issue. I'd be very, very curious to get more
               | technical details on what new work they did here.
        
               | latchkey wrote:
               | I ran 150,000 AMD cards for mining and we didn't run into
               | that problem because we bought systems with PCIe
               | baseboards (12x cards) instead of dumb risers. I'd be
               | interested in finding out more details as well, but it
               | seems he doesn't want to share that in public.
               | 
               | That said, if you think any of this is easy, you're the
               | one who should define that word.
        
               | kkielhofner wrote:
               | I never used the word easy, I never used the word hard.
               | He used the word hard, you used the word easy.
               | 
               | With that said.
               | 
                | Easy: Assembling off-the-shelf PC components to
                | provide what is fundamentally no different from what
                | gamers/miners build every day. Six cards in a machine
                | and two power supplies is low-end mining. Also see the
                | 8x GPU machines with multiple power supplies that have
                | been around forever. I'm not quite sure why you're
                | arguing this so hard; you're more than familiar with
                | these things.
               | 
               | Hard: Show me something with a BOM. Some manufacturing?
               | PCB? Fab? Anything.
               | 
                | FWIW, for someone who is frequently promoting their
                | startup here, you come across as pretty antagonistic.
                | I'm not attacking you, just saying that for someone
                | like myself who has been intrigued by what you're
                | working on, it gives me pause in terms of what I'd
                | charitably refer to as potential personality/
                | relationship issues.
               | 
               | Everyone has those days, just thought it was worth
               | mentioning.
        
               | fragmede wrote:
               | isn't tinygrad's value add the software they provide on
               | top of the open source drivers to make it all work? why
                | should they commit to ROCm if that's the product
               | they're trying to sell?
        
             | hackerlight wrote:
             | > Between what it takes to get ROCm to work
             | 
              | It's not that bad. You just copy and paste ~10 bash
              | commands from the official guide. The 7900 XTX is now
              | officially supported by AMD. Andrew Ng says it's much
              | better than 1 year ago and isn't as bad as people say.
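              | 
              | If you want a quick sanity check after those commands,
              | something like this (my sketch, assuming the ROCm build
              | of PyTorch is installed; the ROCm backend reports itself
              | through the torch.cuda API) confirms the card is usable:
              | 
              |   import torch
              | 
              |   print(torch.version.hip)             # non-None on ROCm builds
              |   print(torch.cuda.is_available())     # True if the 7900 XTX is visible
              |   print(torch.cuda.get_device_name(0))
              |   x = torch.randn(4096, 4096, device="cuda")
              |   print((x @ x).sum().item())          # run a matmul on the GPU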
        
           | dkjaudyeqooe wrote:
            | Resistance is useless! Let's just accept our fate and toe
            | the line. Why feel bad about paying essentially double, or
            | getting half the compute for our money, when we can just
            | choose the easy route and feed the monopoly a little more
            | money so they can charge us even more. Who needs
            | competition!
        
           | lbotos wrote:
            | Isn't the entire point of tinygrad and the tinybox the
            | "Apple style" of "we are building this software to work
            | best on our hardware"?
        
             | whimsicalism wrote:
             | it's not there yet - and i don't really understand why they
             | don't just try to upstream stuff to pytorch
        
         | tutfbhuf wrote:
         | This is the new startup from George Hotz. I would like him to
         | succeed, but I'm not so optimistic about their chances of
         | selling a $15k box that is most likely less than $10k in parts.
         | Most people would do much better by buying a second-hand 3090
         | or similar and connecting them into a rig.
        
           | segmondy wrote:
            | Not necessarily. I'm not sure about AMD GPUs, but he
            | tweeted that AMD supports linking all 6 together. If
            | that's the case, then 6 of those XTXs should crush 6
            | 3090s. Techies like us will definitely decide to build vs
            | buy; businesses, however, would definitely decide to buy
            | vs build.
        
           | downrightmike wrote:
           | He keeps hopping from thing to thing
        
       | abra0 wrote:
        | I was thinking of doing something similar, but I am a bit
        | sceptical about how the economics of this work out. On
        | vast.ai, renting a 3x3090 rig is $0.6/hour. The electricity
        | cost of operating this in e.g. Germany is somewhere around
        | $0.05/hour. If the OP paid 1700 EUR for the cards, the
        | breakeven point would be around (haha) 3090 hours in, or ~128
        | days, assuming non-stop usage. It's probably cool to do that
        | if you have a specific goal in mind, but to tinker around with
        | LLMs and for unfocused exploration I'd advise folks to just
        | rent.
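        | 
        | Back-of-the-envelope version of that math (my own sketch,
        | using the same assumed figures as above):
        | 
        |   # Break-even for buying vs. renting, using the numbers above.
        |   purchase_eur = 1700          # what the OP paid for the cards
        |   rent_per_hour = 0.60         # vast.ai 3x3090 rig
        |   electricity_per_hour = 0.05  # rough estimate for Germany
        |   savings_per_hour = rent_per_hour - electricity_per_hour
        | 
        |   hours = purchase_eur / savings_per_hour
        |   print(hours)       # ~3090.9 hours (hence the joke)
        |   print(hours / 24)  # ~128.8 days of non-stop usage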
        
         | cyanydeez wrote:
          | The current pricing is a lowball to get customers. It's
          | absolutely not going to be the market price once commercial
          | interests have locked in their products.
          | 
          | But if you're just goofing around and not planning to create
          | anything production-worthy, it's a great deal.
        
           | whimsicalism wrote:
            | > The current pricing is a lowball to get customers.
           | 
           | vast.ai is basically a clearinghouse. they are not doing some
           | VC subsidy thing
           | 
           | in general, community clouds are not suitable for commercial
           | use.
        
         | imiric wrote:
         | > On vast.ai renting a 3x3090 rig is $0.6/hour. The electricity
         | price of operating this in e.g. Germany is somewhere about
         | $0.05/hour.
         | 
         | Are you factoring in the varying power usage in that
         | electricity price?
         | 
         | The electricity cost of operating locally will vary depending
         | on the actual system usage. When idle, it should be much
         | cheaper. Whereas in cloud hosts you pay the same price whether
         | the system is in use or not.
         | 
          | Plus, with cloud hosts, reliability is not guaranteed.
          | Especially with vast.ai, where you're renting other people's
          | home infrastructure. You might get good bandwidth and
          | availability on one host, but when that host disappears,
          | you'd better hope you made a backup, which vast.ai charges
          | for separately. If so, you then need to spend time restoring
          | the backup to another, hopefully equally reliable host,
          | which can take hours depending on the amount of data and
          | bandwidth.
         | 
         | I recently built an AI rig and went with 2x3090s, and am very
         | happy with the setup. I evaluated vast.ai beforehand, and my
         | local experience is much better, while my electricity bill is
         | not much higher (also in EU).
        
           | KeplerBoy wrote:
            | Well, rented cloud instances shouldn't idle in the first
            | place.
        
             | imiric wrote:
              | Sure, but unless you're using them for training, the
              | power usage for inference will vary a lot. And it's
              | cumbersome to shut down the instance while you're working
              | on something else and start it back up when you need to
              | use it again. During that time, the vast.ai host could
              | disappear.
        
               | segmondy wrote:
                | Most people don't think of storage costs and network
                | bandwidth. I have about 2 TB of local models. What's
                | the cost of storing this in the cloud? If I decide not
                | to store them in the cloud, I have to transfer them in
                | any time I want to run experiments. Build your own rig
                | so you can run experiments daily. This is a budget rig
                | and you can even build cheaper.
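                | 
                | Rough numbers for that trade-off (my assumptions, not
                | the poster's; both the storage price and the line
                | speed are hypothetical ballparks):
                | 
                |   size_gb = 2000               # ~2 TB of model weights
                |   storage_per_gb_month = 0.02  # assumed object-storage price, $/GB-month
                |   print(size_gb * storage_per_gb_month)  # ~$40/month just to keep them
                | 
                |   bandwidth_gbit = 1.0         # assumed 1 Gbit/s link
                |   seconds = size_gb * 8 / bandwidth_gbit
                |   print(seconds / 3600)        # ~4.4 hours per full re-transfer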
        
               | nightski wrote:
                | Data as well. I have a 100TB NAS I can use for data
                | storage and it was honestly pretty cheap overall.
        
               | isoprophlex wrote:
               | Let me add that moving data in and out of vast.ai is
               | extremely painful. I might be overprivileged with a 1000
               | MBit line but these vast.ai instances have highly
               | variable bandwidth in my experience; plus even when
               | advertising good speeds I'm sometimes doing transfers in
               | the 10-100 KiB/s range.
        
           | abra0 wrote:
           | Well if you are not using a rented machine during a period of
           | time, you should release it.
           | 
           | Agreed on reliability and data transfer, that's a good point.
           | 
            | Out of curiosity, what do you use a 2x3090 rig for? Bulk,
            | non-time-sensitive inference on down-quantized models?
        
             | imiric wrote:
             | > Well if you are not using a rented machine during a
             | period of time, you should release it.
             | 
             | If you're using them for inference, your usage pattern is
             | unpredictable. I could spend hours between having to use
             | it, or minutes. If you shut it down and release it, the
             | host might be gone the next time you want to use it.
             | 
              | > what do you use a 2x3090 rig for? Bulk, non-time-
              | sensitive inference on down-quantized models?
             | 
             | Yeah. I can run 7B models unquantized, ~13-33B at q8, and
             | ~70B at q4, at fairly acceptable speeds (>10tk/s).
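              | 
              | Rough weights-only math for why those quant levels line
              | up with 2x24 GB (my sketch; it ignores KV cache and
              | activations, which add a few more GB on top):
              | 
              |   # Approximate VRAM needed just for the weights, in GB.
              |   def weights_gb(params_billions, bits_per_weight):
              |       return params_billions * bits_per_weight / 8
              | 
              |   print(weights_gb(7, 16))   # ~14 GB -> 7B unquantized (fp16)
              |   print(weights_gb(33, 8))   # ~33 GB -> 33B at q8
              |   print(weights_gb(70, 4))   # ~35 GB -> 70B at q4, fits in 48 GB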
        
               | whimsicalism wrote:
               | if you are just using it for inference, i think an
               | appropriate comparison would just be like a together.ai
               | endpoint or something - which allows you to scale up
               | pretty immediately and likely is more economical as well.
        
               | imiric wrote:
               | Perhaps, but self-hosting is non-negotiable for me. It's
               | much more flexible, gives me control of my data and
               | privacy, and allows me to experiment and learn about how
               | these systems work. Plus, like others mentioned, I can
               | always use the GPUs for other purposes.
        
               | whimsicalism wrote:
                | to each their own. if you are having really highly
                | sensitive conversations with your GAI such that someone
                | would bother snooping in your docker container,
                | figuring out how you are doing inference, and then
                | capturing it in real time - you have a different risk
                | tolerance than me.
               | 
               | i do think that cloud GPUs can cover most of this
               | experimentation/learning need.
        
               | algo_trader wrote:
                | together.ai is really good but there is a price
                | mismatch for small models (a 1B model is not 10x
                | cheaper than a 10B model).
                | 
                | This is obviously because they are forced to use
                | high-memory cards.
                | 
                | Are there ideal cards for low-memory (1-2B) models? So
                | higher flops/$ on crippled memory.
        
           | whimsicalism wrote:
           | with runpod/vast, you can request a set amount of time -
           | generally if I request from Western EU or North America the
           | availability is fine on the week-to-month timescale.
           | 
           | fwiw I find runpod's vast clone significantly better than
           | vast and there isn't really a price premium.
        
           | algo_trader wrote:
           | > built an AI rig and went with 2x3090s,
           | 
            | Is there a go-to card for low-memory (1-2B) models?
            | 
            | Something with much better flops/$ but purposely crippled
            | with low memory.
        
         | mirekrusin wrote:
         | For me "economics" are:
         | 
         | - if I have it locally, I'll play with it
         | 
         | - if not, I won't (especially with my data)
         | 
         | - if I have something ready for a long run I may or may not
         | want to send it somewhere (it's not going to be on 3090s for
         | sure if I send it)
         | 
          | - if I have a requirement to have something public I'd
          | probably go for per-usage pricing with e.g. [0].
         | 
         | [0] https://www.runpod.io/serverless-gpu
        
         | kkielhofner wrote:
         | With the current more-or-less dependency on CUDA and thus
         | Nvidia hardware it's about making sure you actually have the
         | hardware available consistently.
         | 
          | I've had VERY hit-or-miss results with Vast.ai and I'm
          | convinced people are cheating their evaluation stuff,
          | because when the rubber meets the road it's very clear
          | performance isn't what it's claimed to be. Then you still
          | need to be able to actually get them...
        
           | whimsicalism wrote:
            | use runpod and yeah i think vast.ai has some scams,
            | especially in the Asian and Eastern European nodes.
        
         | KuriousCat wrote:
          | When you compute the break-even point, did you factor in
          | that you still own the cards and can resell them? I bought
          | my 3090s for $1000 and after 1 year I think they would go
          | for more on the open market if I resold them now.
        
         | wiradikusuma wrote:
          | For me the economics are that when I'm not using it for AI
          | stuff, I can use it to play games at max settings.
          | 
          | Unfortunately my CFO (a.k.a. my wife) does not share the
          | same understanding.
        
           | ejb999 wrote:
           | I fear that someday I will die and my wife will sell off all
           | my stuff for what I said I paid for it.
           | 
           | (not really, but it is a joke I read someplace and I think it
           | applies to a lot of couples).
        
         | verticalscaler wrote:
         | Well maybe you could rent it out to others for 256 days at
         | $0.3/hour, tinker, and sell it for parts after you get bored
         | with it. ;)
        
         | ametrau wrote:
         | Interesting. I checked it out. The providers running your
         | docker container have access to all your data.
        
         | Luc wrote:
         | Breakeven point would be less than 128 days due to the
         | (depreciating) resale value of the rig.
        
           | segmondy wrote:
            | Well, almost. GPUs have not been depreciating. The cost
            | of 3090s and 4090s has gone up. Folks are selling them for
            | what they paid or even more. With the recent 40-series
            | SUPER cards from Nvidia, I'm not expecting any new
            | releases for a year. AMD & Intel still have a ways to go
            | before major adoption. Startups are buying up consumer
            | cards. So I sadly expect prices to stay more or less the
            | same.
        
             | svnt wrote:
             | If it isn't depreciating that supports the parent's bigger
             | point even more.
        
         | segmondy wrote:
          | Unless you are training, you never hit peak wattage. When
          | inferring, the power draw is still minimal. I'm running
          | inference now and using 20%. GPU 0 is using more because I
          | have it as the main GPU. Idle power sits at about 5%.
         | 
          | Device 0 [NVIDIA GeForce RTX 3060]  PCIe GEN 3@16x
          |   RX 0.000 KiB/s  TX 55.66 MiB/s  GPU 1837MHz  MEM 7300MHz
          |   TEMP 43degC  FAN 0%  POW 43/170 W
          |   GPU util 5%  MEM 9.769Gi/12.000Gi
          | 
          | Device 1 [Tesla P40]  PCIe GEN 3@16x
          |   RX 977.5 MiB/s  TX 52.73 MiB/s  GPU 1303MHz  MEM 3615MHz
          |   TEMP 22degC  FAN N/A  POW 50/250 W
          |   GPU util 9%  MEM 18.888Gi/24.000Gi
          | 
          | Device 2 [Tesla P40]  PCIe GEN 3@16x
          |   RX 164.1 MiB/s  TX 310.5 MiB/s  GPU 1303MHz  MEM 3615MHz
          |   TEMP 32degC  FAN N/A  POW 48/250 W
          |   GPU util 11%  MEM 18.966Gi/24.000Gi
        
         | karolist wrote:
          | He can use these cards for 128 days non-stop and re-sell
          | them, claiming back the purchase price almost fully since OP
          | bought them cheap. Buying doesn't mean you use the GPUs to
          | the point where they end up worth 0; yes, there is a risk of
          | a GPU going bad, but c'mon... Renting is money you will
          | never see again.
        
       | cyanydeez wrote:
        | Just ordered a 15k Threadripper platform because it's the
        | only way to cheaply deal with the PCIe x16 bottleneck. The
        | mining rigs are neat because the space you need for consumer
        | GPUs is a big issue.
        | 
        | Those rigs need PCIe riser slots, which are also limited.
        | 
        | Looks like the primary value is the rig and the cards. They'll
        | need another 1-2k for a Threadripper and then the riser slots.
        
         | dijit wrote:
          | Availability is tight I think, but check out the Ampere
          | Altra stuff; they have an absurd number of PCIe lanes
          | compared to AMD and especially Intel, if you can suffer the
          | ARM architecture.
          | 
          | They also have some ML inference stuff on-chip themselves.
        
           | choppaface wrote:
            | But then you need to deal with ARM compile issues. A lot
            | of common packages are available for ARM, but x86 is still
            | the least likely to distract your development.
        
         | segmondy wrote:
          | Unless you are training, maximizing the PCIe lanes is truly
          | overrated. You certainly don't want to be running at 1x
          | speed, but 8x speed is enough with minimal impact. 8*3 = 24
          | lanes; most CPUs can provide that. I'm running off a 2012 HP
          | Z820, which yields 3x16/1x8. So for anyone going for a
          | build, don't throw money at CPUs. IMHO, GPU first, then your
          | motherboard second (read the spec sheets), then
          | CPU-supported PCIe lanes & storage speed.
        
       | kaycebasques wrote:
       | I really enjoy and am inspired by the idea that people like
       | Dettmer (and probably this Samsja person) are the spiritual
       | successors to homebrew hackers in the 70s and 80s. They have
       | pretty intimate knowledge of many parts of the whole goddamn
       | stack, from what's going on in each hardware component, to how to
       | assemble all the components into a rig, up to all the software
       | stuff: algorithms, data, orchestration, etc.
       | 
       | Am also inspired by embedded developers for the same reason
        
       | nirav72 wrote:
        | This is nice. I would've used one of those ETH mining cases
        | that support multiple GPUs. eBay has them for $100-150 these
        | days.
        
       | whoisthemachine wrote:
       | I've been slowly expanding my HTPC/media server into a gaming
       | server and box for running LLMs (and possibly diffusion models?)
       | locally for playing around with. I think it's becoming clear that
       | the future of LLM's will be local!
       | 
       | My box has a Gigabyte B450M, Ryzen 2700X, 32GB RAM, Radeon 6700XT
       | (for gaming/streaming to steam link on Linux), and an "old"
       | Geforce GTX 1650 with a paltry 6GB of RAM for running models on.
       | Currently it works nicely with smaller models on ollama :) and
       | it's been fun to get it set up. Obviously, now that the software
       | is running I could easily swap in a more modern NVidia card with
       | little hassle!
       | 
        | I've also been eyeing the B450 Steel Legend as a more capable
        | board for expansion than the Gigabyte board; this article
        | gives me some confidence that it is a solid board.
        
       | Uehreka wrote:
       | > I just got my hands on a mining rig with 3 rtx 3090 founder
       | edition for the modest sum of 1.7k euros.
       | 
       | I would prefer a tutorial on how to do this.
        
       | gigatexal wrote:
       | I thought this looked like a cryptocurrency miner. Seems the
       | crypto to AI pivot is legit happening. And good. Would rather we
       | boiled the oceans for something marginally more valuable than in-
       | game tokens we traded for fiat funds in this video game we call
       | life.
        
       | neilv wrote:
       | For large VRAM models, what about selling one of the 3090s, and
       | putting the money towards an NVLink and a motherboard with two
       | x16 PCIe slots (and preferably spaced so you don't need riser
       | cables)?
        
         | p1esk wrote:
         | Why do you need x16 pcie slots if you can use nvlink?
        
           | elorant wrote:
            | NVLink is to connect the cards to each other. To connect
            | them to the board you need the PCIe slots.
        
             | p1esk wrote:
             | We are talking about increasing the intercard bandwidth,
             | assuming that's a bottleneck. It can be done by either
             | increasing pcie bandwidth, or using nvlink. If you use
             | nvlink, increasing pcie does not provide any additional
             | benefit because nvlink is much faster than pcie.
             | 
             | p.s. the mobo (B450 Steel Legend) already has 2 pcie x16
             | slots, so the recommendation does not make sense to me.
        
         | segmondy wrote:
          | Full riser cables like they used don't impact performance.
          | Hanging it off an open-air frame is IMO better; it keeps
          | everything cooler, not just the GPUs but the motherboard and
          | surrounding components. With only 2x 24GB GPUs they are not
          | going to be able to run larger models. You can't experiment
          | with 70B models without offloading to CPU, which is super
          | slow. The best models are 70B+ models.
        
           | ImprobableTruth wrote:
            | 48 GB suffices for 4-bit inference and QLoRA training of
            | a 70B model. ~80 GB allows you to push it to 8-bit (which
            | is nice of course), but full-precision finetuning is
            | completely out of reach either way.
            | 
            | Though you're right of course that PCIe will totally
            | suffice for this case.
        
         | ImprobableTruth wrote:
          | IME NVLink would be overkill for this. Model parallelism
          | means you only need bandwidth to transfer the intermediate
          | activations (/gradients + optimizer state) at the seams, and
          | inference speed is generally slow enough that even PCIe x8
          | won't be a bottleneck.
        
       | whimsicalism wrote:
       | I strongly, strongly suspect most people doing this are
       | significantly short of the breakeven prices for transitioning
       | from cloud 3090s.
       | 
       | inb4 there are no cloud 3090s: yes there are, just not in formal
       | datacenters
        
         | soraki_soladead wrote:
         | It's not always about cost. Sometimes the ergonomics of a local
         | machine are nicer.
        
       | smokeydoe wrote:
        | Does anyone have any good recommendations for an EPYC
        | server-grade motherboard that can take 3x3090? My current
        | motherboard (Strix TRX40-XE) has memory issues now: 2 slots
        | cause boot errors no matter what memory is inserted. I plan to
        | sell the Threadripper. The other option is to just swap out
        | the current motherboard for a TRX Zenith Extreme, but after
        | experiencing these issues I feel server grade would be better
        | at this point. Is Supermicro worth it?
        
         | KuriousCat wrote:
          | It might not be the answer you are looking for, but I would
          | take a look at the components published by System76/Lambda
          | Labs, such as this, to pick the one that would suit me:
          | https://github.com/system76/thelio/blob/master/Thelio%20Comm...
        
         | segmondy wrote:
          | If you're just going to stick to 3 GPUs, then a lot of
          | consumer gaming motherboards would be more than sufficient.
          | Check out the Z270, X99, X299. If you really want EPYC, go
          | to eBay and search for "gigabyte mz32-ar0 motherboard". The
          | majority of them are going to come from China and they are
          | all pretty much used. If you have plans to go even bigger,
          | then I say go for a new WRX80.
        
           | buildbot wrote:
            | I have this motherboard - a big downside is that many of
            | the PCIe slots will overhang into the RAM if used for a
            | GPU. I can't use two channels in my current ML machine
            | because of this, and I have single-slot 4090s.
        
         | devbug wrote:
         | H12SSL-i or H12SSL-NT
         | 
         | ROMED8U-2T
        
       | Yenrabbit wrote:
       | Note that they shared part two recently:
       | https://samsja.github.io/blogs/rig/part_2/
       | 
       | For those talking about breakeven points and cheap cloud compute,
       | you need to factor in the mental difference it makes running a
       | test locally (which feels free) vs setting up a server and
       | knowing you're paying per hour it's running. Even if the cost is
       | low, I do different kinds of experiments knowing I'm not 'wasting
       | money' every minute the GPU sits idle. Once something is working,
       | then sure scaling up on cheap cloud compute makes sense. But it's
       | really, really nice having local compute to get to that state.
        
         | buildbot wrote:
         | Lots of people really underestimate the impact of that mental
         | state and the activation energy it creates towards doing
         | experiments - having some local compute is essential!
        
           | krallistic wrote:
           | This. In the second article, the author touches on this a
           | bit.
           | 
            | With a local setup, I often think, "Might as well run
            | that weird xyz experiment overnight" (instead of idling).
            | On a cloud setup, the opposite is often the case: "Do I
            | really need that experiment, or can I shut down the server
            | to save money?". Makes a huge difference over longer
            | periods.
           | 
           | For companies or if you just want to try a bit, then the
           | cloud is a good option, but for (Ph.D.) researchers, etc.,
           | the frictionless local system is quite powerful.
        
           | ummonk wrote:
           | I have the same attitude towards gym memberships - it really
           | helps to know I can just go in for 30 minutes when I feel
           | like it without worrying whether I'd be getting my money's
           | worth.
        
       | 0x20cowboy wrote:
       | If you would like to put Kubernetes on top of this kind of setup
       | this repo is helpful https://github.com/robrohan/skoupidia
       | 
       | The main benefit is you can shut off nodes entirely when not
       | using them, and then when you turn them back on they just rejoin
       | the cluster.
       | 
        | It also helps with managing different types of devices and
        | workloads (TPU vs GPU vs CPU).
        
         | 2OEH8eoCRo0 wrote:
         | I love the idea of a "poor man's cluster" of hardware that I
         | can continually add to. Old ereaders, phones, tablets, family
         | laptops, everything.
         | 
         | I'm not sure what I'd use it for.
        
       | bick_nyers wrote:
       | Somewhat tangential question, but I'm wondering if anyone knows
       | of a solution (or Google search terms for this):
       | 
       | I have a 3U supermicro server chassis that I put an AM4
       | motherboard into, but I'm looking at upgrading the Mobo so that I
       | can run ~6 3090s in it. I don't have enough physical PCIE
       | slots/brackets in the chassis (7 expansion slots), so I either
       | need to try to do some complicated liquid cooling setup to make
       | the cards single slot (I don't want to do this), or I need to get
        | a bunch of riser cables and mount the GPUs above the chassis.
        | Is there a JBOD-equivalent enclosure for PCIe cards? I don't
       | really think I can run the risers out the back of the case, so
       | I'll likely need to take off/modify the top panel somehow. What
       | I'm picturing in my head is basically a 3U to 6U case conversion,
       | but I'm trying to minimize cost (let's say $200 for the
       | chassis/mount component) as well as not have to cut metal.
        
         | choppaface wrote:
         | Comino sells a 6x 4090 box as a product:
         | https://www.comino.com/
         | 
         | They have single-slot GPU waterblocks but would want something
         | like $400 or more each for them individually.
        
         | ftufek wrote:
         | You'll need something like EPYC/Xeon CPUs and motherboards
         | which not only have many more PCIe lanes, but also allow
         | bifurcation. Once you have that, you can get bifurcated risers
          | and have many GPUs. And these risers use normal cables, not
          | the typical gamer PCIe risers, which are pretty hard to
          | arrange. You won't get this for just $200 though.
         | 
         | For the chassis, you could try a 4U rosewill like this:
         | https://www.youtube.com/watch?v=ypn0jRHTsrQ, not sure if 6
          | 3090s would fit though. You're probably better off getting
          | a mining chassis; it's easier to set up and cool, and also
          | cheaper, unless you plan on putting them in a server rack.
        
       | jeffybefffy519 wrote:
        | Are M1/M2/M3 Max Macs any good for this?
        
         | downrightmike wrote:
          | Way slower than one GPU, at many times the cost. If you
          | don't mind waiting minutes instead of seconds, Macs are
          | reasonable.
        
           | fragmede wrote:
           | It depends on what you're trying to do, but I've got an M1,
           | and doing inference with llama2-uncensored using Ollama, I
           | get results within seconds.
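            | 
            | For reference, a minimal way to drive that from a script
            | is Ollama's local HTTP API (my sketch; it assumes the
            | Ollama server is running on its default port 11434 and
            | that the model has already been pulled):
            | 
            |   import json
            |   import urllib.request
            | 
            |   req = urllib.request.Request(
            |       "http://localhost:11434/api/generate",
            |       data=json.dumps({
            |           "model": "llama2-uncensored",
            |           "prompt": "Why is the sky blue?",
            |           "stream": False,   # one JSON object instead of a stream
            |       }).encode("utf-8"),
            |       headers={"Content-Type": "application/json"},
            |   )
            |   with urllib.request.urlopen(req) as resp:
            |       print(json.loads(resp.read())["response"])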
        
       ___________________________________________________________________
       (page generated 2024-02-24 23:01 UTC)