[HN Gopher] Building a deep learning rig
___________________________________________________________________
Building a deep learning rig
Author : dvcoolarun
Score : 139 points
Date : 2024-02-23 13:52 UTC (1 day ago)
(HTM) web link (samsja.github.io)
(TXT) w3m dump (samsja.github.io)
| infogulch wrote:
| I'm eyeing Tinybox as a deep learning rig.
|
| https://tinygrad.org/
|
| https://twitter.com/__tinygrad__/status/1760988080754856210
| Smith42 wrote:
| $15k!
| KeplerBoy wrote:
| Which is not unreasonable for that amount of hardware.
|
| You have to ask yourself if you want to drop that kind of
| money on consumer GPUs that launched in late 2022. But then
| again, with that kind of money you are stuck with consumer
| GPUs either way, unless you want to buy Ada workstation cards
| for 6k each, and those are just 4090s with p2p memory enabled.
| Hardly worth the premium if you don't absolutely need that.
| cyanydeez wrote:
| I believe the ada workstation cards are typically 1slot
| cards
|
| which means you could build a 4gpu server from normal
| cases.
|
| most of the 4090 cards are 2-3 slot cards
| KeplerBoy wrote:
| The beefy workstation cards are 2 slots, but yeah the
| 4090 cards are usually 3.something slots, which is
| ridiculous. The few dual slot ones are water cooled.
| cyanydeez wrote:
| the work station cards also run on 300 watts and looks
| like the 4090 goes to 450.
|
| so you are getting a better practical card for the price
|
| if you are making a mining type rig, then yeah, the extra
| price is wasting money.
|
| but if you wanted to build a normal machine, the
| workstation cards are the most reasonable choice for
| anything more than 2 gpus
| kkielhofner wrote:
| I find it challenging to get my 4090s to consume more
| than 300 watts. There are also a lot of articles,
| benchmarks, etc. showing that you can dramatically limit
| power while reducing perf by insignificant amounts (single
| digit %).
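|
| For reference, power limiting is scriptable. Below is a minimal
| sketch using the nvidia-ml-py (pynvml) bindings; the 300 W figure
| is just the number from this thread, the GPU index is assumed to
| be 0, and actually setting the limit requires root:
|
|     # Sketch: cap a GPU's power limit via NVML (pip install nvidia-ml-py).
|     # Equivalent in spirit to `nvidia-smi -pl 300`; needs root to apply.
|     import pynvml
|
|     pynvml.nvmlInit()
|     handle = pynvml.nvmlDeviceGetHandleByIndex(0)
|
|     # Current limit and the min/max the card allows, in milliwatts.
|     current = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
|     lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
|     print(f"limit {current / 1000:.0f} W (allowed {lo / 1000:.0f}-{hi / 1000:.0f} W)")
|
|     # Cap at 300 W, clamped to the range the card supports.
|     target_mw = max(lo, min(hi, 300_000))
|     pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
|
|     pynvml.nvmlShutdown()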
| justsomehnguy wrote:
| > which means you could build a 4gpu server from normal
| cases.
|
| Only if you already live near an airport and you are
| accustomed to the sound of planes taking off and flying away.
| treprinum wrote:
| Sure, if you want to waste your time on getting stuff working
| on AMD instead of spending it on actual model training...
| kkielhofner wrote:
| This.
|
| People complain about the "Nvidia tax". I don't like
| monopolies and I fully support the efforts of AMD, Intel,
| Apple, anyone to chip away at this.
|
| That said as-is with ROCm you will:
|
| - Absolutely burn hours/days/weeks getting many (most?)
| things to work at all. If you get it working you need to
| essentially "freeze" the configuration because an upgrade
| means do it all over again.
|
| - In the event you get it to work at all you'll realize
| performance is nowhere near the hardware specs.
|
| - Throw up your hands and go back to CUDA.
|
| Between what it takes to get ROCm to work and the performance
| issues, the Nvidia tax becomes a dividend nearly instantly
| once you factor in human time, less-than-optimal performance,
| and opportunity cost.
|
| Nvidia says roughly 30% of their costs are on software.
| That's what you need to do to deliver something that's
| actually usable in the real world. With the "Nvidia tax"
| they're also reaping the benefit of the ~15 years they've
| been sinking resources into CUDA.
| dkjaudyeqooe wrote:
| Wow! It's incredible how Nvidia has created the dark voodoo
| magic, and how only they can deliver the strong juju for
| AI. How are they so incredibly smart and powerful!?
|
| I wonder if it has anything to do with the strategy they
| used in 3D graphics, where game developers ended up writing
| for NV drivers in order to maximise performance, and Nvidia
| abused their market position and used every trick in the
| book to make AMD cards run poorly. People complained about
| AMD driver quality, but the actual problem was that they
| were not NV drivers and AMD couldn't defeat their software
| moat.
|
| So here we are again, this time with AI. You'd think we'd
| have learnt our lesson, but instead people are fooled yet
| again, and instead of understanding that diversity and
| competition are the lifeblood of their art, myopia and
| amnesia are the order of the day.
|
| Tinygrad are doing god's work and I won't be giving Nvidia
| a single fucking cent of my money until the software is
| hardware neutral and there is real competition.
| whimsicalism wrote:
| No, I think it is much more due to AMD absolutely failing
| to invest heavily in software at all. Honestly, they have
| had years - it is difficult to see how Nvidia abused
| their market position in AI when this is effectively a
| new market.
|
| I find this vague reflexive anti-corpo leftism that seems
| to have become extremely popular post-2020 really
| tiresome.
| infogulch wrote:
| For my part I'm not a leftist, and I'm not so much anti-
| corpo as pro-free market. I can still acknowledge that
| Nvidia is persistently pursuing anticompetitive policies
| and abusing their market position, without holding AMD up on
| a pedestal or assuming it would be different if the shoe
| were on the other foot.
| dkjaudyeqooe wrote:
| > I find this vague reflexive anti-corpo leftism that
| seems to have become extremely popular post-2020 really
| tiresome.
|
| Ah ideology, such a great alternative to actual thinking.
| Don't investigate or reason, just blame it on the
| 'lefties'. Tiresome indeed.
|
| Not sure how me simply stating the obvious makes me a
| 'lefty'. If you think monopolies, regardless of how they
| come about, are a good idea, that companies should be
| allowed to lock up an important market for any reason,
| then that makes you a corporatist fascist, right? Wow,
| this mindless name calling is so much fun! I feel like a
| total genius.
|
| The simple fact is that the nature of software, its
| complexity and its dependence on a multitude of fairly
| arbitrary technical choices, makes it very effective as
| a moat, even if that's not intentional. CUDA, etc. is 100% a
| software compatibility issue, and that's it. There's more
| than one way to skin a cat but we're stuck with this one.
| Nvidia isn't interested in interoperability, even though
| it's critical for the industry in the longer term. I
| wouldn't be either if it was money in my pocket.
|
| The point that is entirely missed here is that we, as a
| community, are screwing up by not steering the field
| toward better hardware compatibility, as in anyone being
| able to produce new hardware. In the rush to improve or
| try out the latest model or software we have lost
| sight of this, and it will be to our great detriment.
| With the concentration of money in one company we will
| have a lot less innovation overall. Prices will be higher
| and resources will be misallocated. Everyone suffers.
|
| It's very possible that AI withers on the vine due to
| lagging hardware. It's going to need a lot of compute,
| and maybe a different kind of compute to boot. We may
| need a million or a billion times what we have to even
| get close to AGI. But if one company locks that up, and
| uses that position to squeeze out every dollar from its
| customers (really, have a look at the almost comical
| 'upgrades' Nvidia offers in their GPUs other than at the
| very high end) then it's going to take much longer to
| progress, and maybe we never get there because some small
| group of talented maverick researchers were never able to
| get their hands on the hardware they needed and never
| produce some critical breakthrough.
| kkielhofner wrote:
| Sarcasm aside...
|
| Can we drop the "Nvidia is the only self-interested evil
| company in existence" schtick?
|
| I'm not being "fooled" by anyone. I've been trying to use
| ROCm since the initial release six years ago (on Vega at
| the time). I've spent thousands of dollars on AMD
| hardware over the years hoping to see progress for
| myself. I've burned untold amounts of time fighting with
| ROCm, hoping it's even remotely a viable competitor to
| CUDA/Nvidia.
|
| Here we are in 2024 and they're still doing braindead
| stuff like dropping a new ROCm release to support their
| flagship $1000 consumer card a full year after release...
|
| ROCm 6 looks good? Check the Docker containers[0]. Their
| initial containers for ROCm 6 only supported Python 3.9 for
| some strange reason, even though the previous ROCm 5.7
| containers were based on Python 3.10. Python 3.10 is
| more-or-less the minimum for nearly anything out there.
|
| It took them 1.5 months to address this... This is merely
| one example, spend some time actually working with this
| and you will find dozens of similar "WTF?!?" bombs all
| over the place.
|
| I suggest you put your money and time where your mouth is
| (as I have) to actually try to work with ROCm. You will
| find that it is nowhere near the point of actually being
| a viable competitor to CUDA/Nvidia for anyone who's
| trying to get work done.
|
| > Tinygrad are doing god's work
|
| Tinygrad is packaging hardware with off-the-shelf
| components plus a substantial markup. There is nothing
| special about this hardware and they aren't doing
| anything you couldn't have done in the past year. They
| have been vocal in calling out AMD, but show me their
| commits to ROCm and I'll agree they are "doing god's
| work".
|
| We'll save the work being done on their framework for
| another thread.
|
| [0] - https://hub.docker.com/r/rocm/pytorch/tags
| latchkey wrote:
| > There is nothing special about this hardware and they
| aren't doing anything you couldn't have done in the past
| year.
|
| What they are doing is all of the hardware engineering
| work that it takes to build something like this. You're
| dismissing the amount of time they spent on figuring
| stuff like this out:
|
| "Beating back all the PCI-E AER errors was hard, as
| anyone knows who has tried to build a system like this."
| kkielhofner wrote:
| > "Beating back all the PCI-E AER errors was hard, as
| anyone knows who has tried to build a system like this."
|
| Define "hard".
|
| The crypto mining community has had this working for at
| least half a decade with AMD cards. With Nvidia it's a
| non-issue. I'd be very, very curious to get more
| technical details on what new work they did here.
| latchkey wrote:
| I ran 150,000 AMD cards for mining and we didn't run into
| that problem because we bought systems with PCIe
| baseboards (12x cards) instead of dumb risers. I'd be
| interested in finding out more details as well, but it
| seems he doesn't want to share that in public.
|
| That said, if you think any of this is easy, you're the
| one who should define that word.
| kkielhofner wrote:
| I never used the word easy, I never used the word hard.
| He used the word hard, you used the word easy.
|
| With that said.
|
| Easy: Assembling off the shelf PC components to provide
| what is fundamentally no different than what
| gamers/miners build every day. Six cards in a machine and
| two power supplies is low-end mining. Also see the x8 GPU
| machines with multiple power supplies that have been
| around forever. I'm not quite sure why you're arguing
| this so hard, you're more than familiar with these
| things.
|
| Hard: Show me something with a BOM. Some manufacturing?
| PCB? Fab? Anything.
|
| FWIW, for someone who frequently promotes their startup
| here, you come across as pretty antagonistic. I'm not
| attacking you, just saying that for someone like me who has
| been intrigued by what you're working on, it gives me pause
| in terms of what I'd charitably refer to as potential
| personality/relationship issues.
|
| Everyone has those days, just thought it was worth
| mentioning.
| fragmede wrote:
| isn't tinygrad's value add the software they provide on
| top of the open source drivers to make it all work? why
| should they commit to ROCm if that's the product
| they're trying to sell?
| hackerlight wrote:
| > Between what it takes to get ROCm to work
|
| It's not that bad. You just copy and paste ~10 bash
| commands from the official guide. The 7900 XTX is now
| officially supported by AMD. Andrew Ng says it's much
| better than a year ago and isn't as bad as people say.
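|
| For what it's worth, once a ROCm build of PyTorch is installed
| the sanity check looks the same as on Nvidia, since the ROCm
| wheels reuse the torch.cuda namespace. A rough sketch (the device
| name printed will vary by card):
|
|     # Sketch: confirm a ROCm PyTorch build can see the GPU (e.g. a 7900 XTX).
|     import torch
|
|     print(torch.__version__)          # ROCm wheels report e.g. "2.x.x+rocmX.Y"
|     print(torch.version.hip)          # HIP version on ROCm builds, None on CUDA builds
|     print(torch.cuda.is_available())  # True if the card is visible
|     print(torch.cuda.get_device_name(0))
|
|     # Quick smoke test: a small matmul on the GPU.
|     x = torch.randn(1024, 1024, device="cuda")
|     print((x @ x).sum().item())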
| dkjaudyeqooe wrote:
| Resistance is useless! Let's just accept our fate and toe the
| line. Why feel bad about paying essentially double, or
| getting half the compute for our money, when we can just
| choose the easy route, accept our fate, and feed the
| monopoly a little more money so they can charge us even more
| money? Who needs competition!
| lbotos wrote:
| Isn't the entire point of tinygrad and the tinybox the "Apple
| style" approach of building the software to work best on
| their own hardware?
| whimsicalism wrote:
| it's not there yet - and i don't really understand why they
| don't just try to upstream stuff to pytorch
| tutfbhuf wrote:
| This is the new startup from George Hotz. I would like him to
| succeed, but I'm not so optimistic about their chances of
| selling a $15k box that is most likely less than $10k in parts.
| Most people would do much better by buying a second-hand 3090
| or similar and connecting them into a rig.
| segmondy wrote:
| Not necessarily. I'm not sure about AMD GPUs, but he tweeted
| that AMD supports linking all 6 together. If that's the case,
| then 6 of those XTXs should crush 6 3090s. Techies like us
| will usually decide to build rather than buy, but businesses
| will usually decide to buy rather than build.
| downrightmike wrote:
| He keeps hopping from thing to thing
| abra0 wrote:
| I was thinking of doing something similar, but I am a bit
| sceptical about how the economics of this work out. On vast.ai,
| renting a 3x3090 rig is $0.6/hour. The electricity price of
| operating this in e.g. Germany is somewhere around $0.05/hour. If
| the OP paid 1700 EUR for the cards, the breakeven point would be
| around (haha) 3090 hours in, or ~128 days, assuming non-stop
| usage. It's probably cool to do that if you have a specific goal
| in mind, but to tinker around with LLMs and for unfocused
| exploration I'd advise folks to just rent.
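|
| The arithmetic behind that estimate, as a quick sketch (treating
| EUR and USD as roughly equal, and assuming the prices quoted
| above):
|
|     # Sketch of the breakeven arithmetic from the comment above.
|     hardware_cost = 1700    # what the OP paid for the 3x3090 rig (EUR ~ USD)
|     rent_per_hour = 0.60    # vast.ai price for a 3x3090 instance
|     power_per_hour = 0.05   # electricity cost of running it locally
|
|     savings_per_hour = rent_per_hour - power_per_hour     # $0.55 saved per hour owned
|     breakeven_hours = hardware_cost / savings_per_hour    # ~3091 hours (hence the "haha")
|     print(f"{breakeven_hours:.0f} hours ~= {breakeven_hours / 24:.0f} days non-stop")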
| cyanydeez wrote:
| the current pricing is a lowball to get customers. it's
| absolutely not going to be the market price once commercial
| interests have locked in their products.
|
| but if you're just goofing around and not planning to create
| anything production-worthy, it's a great deal.
| whimsicalism wrote:
| > the current pricing is a lowball to get customers.
|
| vast.ai is basically a clearinghouse. they are not doing some
| VC subsidy thing
|
| in general, community clouds are not suitable for commercial
| use.
| imiric wrote:
| > On vast.ai renting a 3x3090 rig is $0.6/hour. The electricity
| price of operating this in e.g. Germany is somewhere about
| $0.05/hour.
|
| Are you factoring in the varying power usage in that
| electricity price?
|
| The electricity cost of operating locally will vary depending
| on the actual system usage. When idle, it should be much
| cheaper. Whereas in cloud hosts you pay the same price whether
| the system is in use or not.
|
| Plus, with cloud hosts reliability is not guaranteed,
| especially with vast.ai, where you're renting other people's
| home infrastructure. You might get good bandwidth and
| availability on one host, but when that host disappears you'd
| better hope you made a backup (which vast.ai charges for
| separately), and then you need to spend time restoring the
| backup to another, hopefully equally reliable, host, which can
| take hours depending on the amount of data and bandwidth.
|
| I recently built an AI rig and went with 2x3090s, and am very
| happy with the setup. I evaluated vast.ai beforehand, and my
| local experience is much better, while my electricity bill is
| not much higher (also in EU).
| KeplerBoy wrote:
| Well rented cloud instances shouldn't idle in the first
| place.
| imiric wrote:
| Sure, but unless you're using them for training, the power
| usage for inference will vary a lot. And it's cumbersome to
| shut down the instance while you're working on something
| else, and have to start it back up when you need to use it
| again. During that time, the vast.ai host could disappear.
| segmondy wrote:
| Most people don't think of storage costs and network
| bandwidth. I have about 2 TB of local models. What's the
| cost of storing this in the cloud? If I decide not to
| store them in the cloud, I have to transfer them in
| any time I want to run experiments. Build your own rig so
| you can run experiments daily. This is a budget rig and
| you could build even cheaper.
| nightski wrote:
| Data as well. I have a 100TB NAS I can use for data
| storage and it was honestly pretty cheap overall.
| isoprophlex wrote:
| Let me add that moving data in and out of vast.ai is
| extremely painful. I might be overprivileged with a 1000
| MBit line but these vast.ai instances have highly
| variable bandwidth in my experience; plus even when
| advertising good speeds I'm sometimes doing transfers in
| the 10-100 KiB/s range.
| abra0 wrote:
| Well if you are not using a rented machine during a period of
| time, you should release it.
|
| Agreed on reliability and data transfer, that's a good point.
|
| Out of curiosity, what do you use a 2x3090 rig for? Bulk,
| non-time-sensitive inference on down-quantized models?
| imiric wrote:
| > Well if you are not using a rented machine during a
| period of time, you should release it.
|
| If you're using them for inference, your usage pattern is
| unpredictable. I could spend hours between having to use
| it, or minutes. If you shut it down and release it, the
| host might be gone the next time you want to use it.
|
| > what do you use a 2x3090 rig for? Bulk not time-sensitive
| inference on down quanted models?
|
| Yeah. I can run 7B models unquantized, ~13-33B at q8, and
| ~70B at q4, at fairly acceptable speeds (>10tk/s).
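|
| Those speeds are roughly what a memory-bandwidth-bound estimate
| predicts. A back-of-the-envelope sketch, where the bandwidth and
| model-size figures are ballpark assumptions rather than
| measurements:
|
|     # Rough ceiling on single-stream decode speed for a ~70B model at 4-bit,
|     # split across two 3090s, assuming decoding is memory-bandwidth bound.
|     bandwidth_gb_s = 936                 # approximate RTX 3090 memory bandwidth
|     weights_gb = 70e9 * 4 / 8 / 1e9      # ~35 GB of weights at 4 bits/param
|
|     # Each generated token streams all weights once; with a pipeline split
|     # the two cards work one after the other, so the total read time adds up.
|     ceiling_tok_s = bandwidth_gb_s / weights_gb
|     print(f"theoretical ceiling ~{ceiling_tok_s:.0f} tok/s")   # ~27 tok/s
|     # Real-world numbers land well below the ceiling, so >10 tok/s is plausible.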
| whimsicalism wrote:
| if you are just using it for inference, i think an
| appropriate comparison would just be like a together.ai
| endpoint or something - which allows you to scale up
| pretty immediately and likely is more economical as well.
| imiric wrote:
| Perhaps, but self-hosting is non-negotiable for me. It's
| much more flexible, gives me control of my data and
| privacy, and allows me to experiment and learn about how
| these systems work. Plus, like others mentioned, I can
| always use the GPUs for other purposes.
| whimsicalism wrote:
| to each their own. if you are having really highly
| sensitive conversations with your GAI that someone would
| bother snooping in your docker container, figuring out
| how you are doing inference, and then capturing it real-
| time - you have a different risk tolerance than me.
|
| i do think that cloud GPUs can cover most of this
| experimentation/learning need.
| algo_trader wrote:
| together.ai is really good but there is a price mismatch
| for small models (a 1BN model is not 10x cheaper than a
| 10BN model)
|
| This is obviously because they are forced to use high
| memory cards.
|
| Are there ideal cards for low memory (1-2BN) models? So
| higher flops/$ on crippled memory
| whimsicalism wrote:
| with runpod/vast, you can request a set amount of time -
| generally if I request from Western EU or North America the
| availability is fine on the week-to-month timescale.
|
| fwiw I find runpod's vast clone significantly better than
| vast and there isn't really a price premium.
| algo_trader wrote:
| > built an AI rig and went with 2x3090s,
|
| Is there a go-to card for low memory (1-2BN) models?
|
| Something with much better flops/$ but purposely crippled
| with low memory.
| mirekrusin wrote:
| For me "economics" are:
|
| - if I have it locally, I'll play with it
|
| - if not, I won't (especially with my data)
|
| - if I have something ready for a long run I may or may not
| want to send it somewhere (it's not going to be on 3090s for
| sure if I send it)
|
| - if I have a requirement to have something public I'd
| probably go for per-usage pricing with e.g. [0].
|
| [0] https://www.runpod.io/serverless-gpu
| kkielhofner wrote:
| With the current more-or-less dependency on CUDA and thus
| Nvidia hardware it's about making sure you actually have the
| hardware available consistently.
|
| I've had VERY hit-or-miss results with Vast.ai and I'm
| convinced people are cheating their evaluation stuff because
| when the rubber meets the road it's very clear performance
| isn't what it's claimed to be. Then you still need to be able
| to actually get them...
| whimsicalism wrote:
| use runpod and yeah i think vast.ai has some scams,
| especially in the asian and eastern european nodes.
| KuriousCat wrote:
| When you computed the breakeven point, did you factor in that
| you still own the cards and can resell them? I bought my
| 3090s for $1000 and after 1 year I think they would go for
| more on the open market if I resold them now.
| wiradikusuma wrote:
| For me the economics is when I'm not using it to do AI stuff, I
| can use it to play games with max settings.
|
| Unfortunately my CFO (a.k.a Wife) does not share the same
| understanding.
| ejb999 wrote:
| I fear that someday I will die and my wife will sell off all
| my stuff for what I said I paid for it.
|
| (not really, but it is a joke I read someplace and I think it
| applies to a lot of couples).
| verticalscaler wrote:
| Well maybe you could rent it out to others for 256 days at
| $0.3/hour, tinker, and sell it for parts after you get bored
| with it. ;)
| ametrau wrote:
| Interesting. I checked it out. The providers running your
| docker container have access to all your data.
| Luc wrote:
| Breakeven point would be less than 128 days due to the
| (depreciating) resale value of the rig.
| segmondy wrote:
| Well, almost. GPUs have not been depreciating. The cost of
| 3090s and 4090s has gone up. Folks are selling them for what
| they paid or even more. With the recent 40-series SUPER cards
| from Nvidia, I'm not expecting any new releases in a year.
| AMD & Intel still have a ways to go before major adoption.
| Startups are buying up consumer cards. So I sadly expect
| prices to stay more or less the same.
| svnt wrote:
| If it isn't depreciating that supports the parent's bigger
| point even more.
| segmondy wrote:
| Unless you are training, you never hit peak watts. When
| inferring, the power draw is still minimal. I'm running
| inference now and using 20%. GPU 0 is using more because I
| have it as the main GPU. Idle power sits at about 5%.
|
|     Device 0 [NVIDIA GeForce RTX 3060]  PCIe GEN 3@16x
|       RX 0.000 KiB/s  TX 55.66 MiB/s  GPU 1837MHz  MEM 7300MHz
|       TEMP 43degC  FAN 0%  POW 43/170 W
|       GPU 5%  MEM 9.769Gi/12.000Gi
|
|     Device 1 [Tesla P40]  PCIe GEN 3@16x
|       RX 977.5 MiB/s  TX 52.73 MiB/s  GPU 1303MHz  MEM 3615MHz
|       TEMP 22degC  FAN N/A  POW 50/250 W
|       GPU 9%  MEM 18.888Gi/24.000Gi
|
|     Device 2 [Tesla P40]  PCIe GEN 3@16x
|       RX 164.1 MiB/s  TX 310.5 MiB/s  GPU 1303MHz  MEM 3615MHz
|       TEMP 32degC  FAN N/A  POW 48/250 W
|       GPU 11%  MEM 18.966Gi/24.000Gi
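|
| The same readings can be pulled programmatically; a minimal
| sketch with the nvidia-ml-py (pynvml) bindings, printing roughly
| what the readout above shows:
|
|     # Sketch: per-GPU power draw, utilization and memory use via NVML
|     # (pip install nvidia-ml-py).
|     import pynvml
|
|     pynvml.nvmlInit()
|     for i in range(pynvml.nvmlDeviceGetCount()):
|         h = pynvml.nvmlDeviceGetHandleByIndex(i)
|         name = pynvml.nvmlDeviceGetName(h)
|         power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000    # milliwatts -> watts
|         util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu    # percent
|         mem = pynvml.nvmlDeviceGetMemoryInfo(h)
|         print(f"{i}: {name}  {power_w:.0f} W  {util}% GPU  "
|               f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
|     pynvml.nvmlShutdown()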
| karolist wrote:
| He can use these cards for 128 days non-stop and re-sell,
| claiming back the purchase price almost fully since the OP
| bought them cheap. Buying doesn't mean you use the GPUs to the
| point where they end up costing 0; yes, there is some risk of
| the GPUs losing value, but c'mon... Renting is money you will
| never see again.
| cyanydeez wrote:
| just ordered a 15k Threadripper platform because it's the only
| way to cheaply maximize the PCIe x16 bottleneck. the mining
| rigs are neat because the space you need for consumer GPUs is
| a big issue.
|
| those rigs need PCIe risers, which are also limited.
|
| looks like the primary value is the rig and the cards. they'll
| need another 1-2k for a Threadripper and then the risers.
| dijit wrote:
| availability is tight i think but check out the Ampere Altra
| stuff, they have an absurd number of PCIe lanes compared to
| AMD and especially Intel, if you can suffer the ARM
| architecture.
|
| They also have some ML inference stuff on chip themselves.
| choppaface wrote:
| But then you need to deal with arm compile issues. A lot of
| common packages are available for arm, but x86 is still least
| likely to distract your development.
| segmondy wrote:
| Unless you are training, maximizing the PCIe lanes is truly
| overrated. You certainly don't want to be running at x1 speed,
| but x8 speed is enough with minimal impact. 8*3 = 24 lanes.
| Most CPUs can provide that. I'm running off a 2012 HP Z820,
| which yields 3x16/1x8. So for anyone going for a build, don't
| throw money at the CPU. IMHO: GPU first, then your motherboard
| second (read the spec sheets), then CPU-supported PCIe lanes
| & storage speed.
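|
| If you want to verify what link each card actually negotiated,
| NVML exposes it; a quick sketch (query-only, no root needed):
|
|     # Sketch: report the PCIe link width/generation each GPU negotiated
|     # (pip install nvidia-ml-py).
|     import pynvml
|
|     pynvml.nvmlInit()
|     for i in range(pynvml.nvmlDeviceGetCount()):
|         h = pynvml.nvmlDeviceGetHandleByIndex(i)
|         gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
|         width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
|         max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
|         print(f"GPU {i}: PCIe gen{gen} x{width} (card supports up to x{max_width})")
|     pynvml.nvmlShutdown()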
| kaycebasques wrote:
| I really enjoy and am inspired by the idea that people like
| Dettmer (and probably this Samsja person) are the spiritual
| successors to homebrew hackers in the 70s and 80s. They have
| pretty intimate knowledge of many parts of the whole goddamn
| stack, from what's going on in each hardware component, to how to
| assemble all the components into a rig, up to all the software
| stuff: algorithms, data, orchestration, etc.
|
| Am also inspired by embedded developers for the same reason
| nirav72 wrote:
| This is nice. I would've used one of those ETH mining cases
| that support multiple GPUs. eBay has them for $100-150 these
| days.
| whoisthemachine wrote:
| I've been slowly expanding my HTPC/media server into a gaming
| server and box for running LLMs (and possibly diffusion models?)
| locally for playing around with. I think it's becoming clear
| that the future of LLMs will be local!
|
| My box has a Gigabyte B450M, Ryzen 2700X, 32GB RAM, Radeon 6700XT
| (for gaming/streaming to Steam Link on Linux), and an "old"
| GeForce GTX 1650 with a paltry 6GB of RAM for running models on.
| Currently it works nicely with smaller models on ollama :) and
| it's been fun to get it set up. Obviously, now that the software
| is running I could easily swap in a more modern Nvidia card with
| little hassle!
|
| I've also been eyeing the B450 Steel Legend as a more capable
| board for expansion than the Gigabyte board, and this article
| gives me some confidence that it is a solid board.
| Uehreka wrote:
| > I just got my hands on a mining rig with 3 rtx 3090 founder
| edition for the modest sum of 1.7k euros.
|
| I would prefer a tutorial on how to do this.
| gigatexal wrote:
| I thought this looked like a cryptocurrency miner. Seems the
| crypto to AI pivot is legit happening. And good. Would rather we
| boiled the oceans for something marginally more valuable than in-
| game tokens we traded for fiat funds in this video game we call
| life.
| neilv wrote:
| For large VRAM models, what about selling one of the 3090s, and
| putting the money towards an NVLink and a motherboard with two
| x16 PCIe slots (and preferably spaced so you don't need riser
| cables)?
| p1esk wrote:
| Why do you need x16 PCIe slots if you can use NVLink?
| elorant wrote:
| NVLink connects the cards to each other. To connect them
| to the board you need the PCIe slots.
| p1esk wrote:
| We are talking about increasing the inter-card bandwidth,
| assuming that's a bottleneck. It can be done by either
| increasing PCIe bandwidth or using NVLink. If you use
| NVLink, increasing PCIe does not provide any additional
| benefit because NVLink is much faster than PCIe.
|
| p.s. the mobo (B450 Steel Legend) already has 2 PCIe x16
| slots, so the recommendation does not make sense to me.
| segmondy wrote:
| Full riser cables like they used don't impact performance.
| Hanging it off an open-air frame IMO is better; it keeps
| everything cooler, not just the GPUs but the motherboard and
| surrounding components. With only 2 24GB GPUs they are not
| going to be able to run larger models. You can't experiment
| with 70B models without offloading to CPU, which is super
| slow. The best models are 70B+ models.
| ImprobableTruth wrote:
| 48 GB suffices for 4-bit inference and QLoRA training of a
| 70B model. ~80 GB allows you to push it to 8-bit (which is
| nice of course), but full-precision finetuning is completely
| out of reach either way.
|
| Though you're right of course that PCIe will totally suffice
| for this case.
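|
| The weight-memory arithmetic behind those numbers, as a rough
| sketch; this counts weights only, and the KV cache, activations
| and (for QLoRA) adapter/optimizer state add several GB on top:
|
|     # Rough weight-only footprint of a 70B-parameter model at different precisions.
|     params = 70e9
|     for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
|         gib = params * bits / 8 / 2**30
|         print(f"{label}: ~{gib:.0f} GiB of weights")
|     # 4-bit ~33 GiB fits in 2x24 GB; 8-bit ~65 GiB wants ~80 GB; fp16 ~130 GiB.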
| ImprobableTruth wrote:
| IME NVLink would be overkill for this. Model parallelism means
| you only need bandwidth to transfer the intermediate
| activations (/gradients + optimizer state) at the seams, and
| inference speed is generally slow enough that even PCIe x8
| won't be a bottleneck.
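|
| A quick sketch of why the inter-card link barely matters for
| pipelined inference; the hidden size and token rate below are
| ballpark assumptions for a 70B-class model:
|
|     # Sketch: per-token traffic across the "seam" between two GPUs when the
|     # model is split by layers. Numbers are illustrative assumptions.
|     hidden_size = 8192           # Llama-2-70B hidden dimension
|     bytes_per_value = 2          # fp16 activations
|     tokens_per_second = 20       # optimistic decode rate
|
|     per_token_bytes = hidden_size * bytes_per_value            # ~16 KiB per token
|     traffic_mb_s = per_token_bytes * tokens_per_second / 1e6   # ~0.3 MB/s
|     pcie_x8_gen3_gb_s = 7.9                                    # usable PCIe 3.0 x8
|     print(f"~{traffic_mb_s:.2f} MB/s across the seam vs ~{pcie_x8_gen3_gb_s} GB/s of PCIe x8")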
| whimsicalism wrote:
| I strongly, strongly suspect most people doing this are
| significantly short of the breakeven prices for transitioning
| from cloud 3090s.
|
| inb4 there are no cloud 3090s: yes there are, just not in formal
| datacenters
| soraki_soladead wrote:
| It's not always about cost. Sometimes the ergonomics of a local
| machine are nicer.
| smokeydoe wrote:
| Does anyone have any good recommendations for an EPYC
| server-grade motherboard that can take 3x3090? My current
| motherboard (Strix TRX40-XE) has memory issues now: two slots
| cause boot errors no matter what memory is inserted. I plan to
| sell the Threadripper. The other option is to just swap out the
| current motherboard with a TRX Zenith Extreme, but I feel server
| grade would be better at this point after experiencing issues.
| Is Supermicro worth it?
| KuriousCat wrote:
| It might not be the answer you are looking for, but I would
| take a look at the component lists published by System76/Lambda
| Labs, such as this one, to pick the one that would suit me:
| https://github.com/system76/thelio/blob/master/Thelio%20Comm...
| segmondy wrote:
| If you're just going to stick to 3 GPUs, then a lot of consumer
| gaming motherboards would be more than sufficient. Check out the
| Z270, X99, X299. If you really want EPYC, go to eBay and search
| for "gigabyte mz32-ar0 motherboard". The majority of them are
| going to come from China and they are all pretty much used. If
| you have plans to go even bigger, then I say go for a new WRX80.
| buildbot wrote:
| I have this motherboard - a big downside is that many of the
| PCIe slots will overhang the RAM if used for a GPU. I can't
| use two channels in my current ML machine because of this,
| and I have single-slot 4090s.
| devbug wrote:
| H12SSL-i or H12SSL-NT
|
| ROMED8U-2T
| Yenrabbit wrote:
| Note that they shared part two recently:
| https://samsja.github.io/blogs/rig/part_2/
|
| For those talking about breakeven points and cheap cloud compute,
| you need to factor in the mental difference it makes running a
| test locally (which feels free) vs setting up a server and
| knowing you're paying per hour it's running. Even if the cost is
| low, I do different kinds of experiments knowing I'm not 'wasting
| money' every minute the GPU sits idle. Once something is working,
| then sure scaling up on cheap cloud compute makes sense. But it's
| really, really nice having local compute to get to that state.
| buildbot wrote:
| Lots of people really underestimate the impact of that mental
| state and the activation energy it creates towards doing
| experiments - having some local compute is essential!
| krallistic wrote:
| This. In the second article, the author touches on this a
| bit.
|
| With a local setup, I often think, "Might as well run that
| weird xyz experiment overnight" (instead of idling). On a
| cloud setup, the opposite is often the case: "Do I really
| need that experiment or can I shut down the server to save
| money?" It makes a huge difference over longer periods.
|
| For companies or if you just want to try a bit, then the
| cloud is a good option, but for (Ph.D.) researchers, etc.,
| the frictionless local system is quite powerful.
| ummonk wrote:
| I have the same attitude towards gym memberships - it really
| helps to know I can just go in for 30 minutes when I feel
| like it without worrying whether I'd be getting my money's
| worth.
| 0x20cowboy wrote:
| If you would like to put Kubernetes on top of this kind of setup
| this repo is helpful https://github.com/robrohan/skoupidia
|
| The main benefit is you can shut off nodes entirely when not
| using them, and then when you turn them back on they just rejoin
| the cluster.
|
| It also helps with managing different types of devices and
| workloads (TPU vs GPU vs CPU)
| 2OEH8eoCRo0 wrote:
| I love the idea of a "poor man's cluster" of hardware that I
| can continually add to. Old ereaders, phones, tablets, family
| laptops, everything.
|
| I'm not sure what I'd use it for.
| bick_nyers wrote:
| Somewhat tangential question, but I'm wondering if anyone knows
| of a solution (or Google search terms for this):
|
| I have a 3U Supermicro server chassis that I put an AM4
| motherboard into, but I'm looking at upgrading the mobo so that I
| can run ~6 3090s in it. I don't have enough physical PCIe
| slots/brackets in the chassis (7 expansion slots), so I either
| need to try to do some complicated liquid cooling setup to make
| the cards single-slot (I don't want to do this), or I need to get
| a bunch of riser cables and mount the GPUs above the chassis. Is
| there a JBOD-equivalent enclosure for PCIe cards? I don't
| really think I can run the risers out the back of the case, so
| I'll likely need to take off/modify the top panel somehow. What
| I'm picturing in my head is basically a 3U to 6U case conversion,
| but I'm trying to minimize cost (let's say $200 for the
| chassis/mount component) as well as not have to cut metal.
| choppaface wrote:
| Comino sells a 6x 4090 box as a product:
| https://www.comino.com/
|
| They have single-slot GPU waterblocks but would want something
| like $400 or more each for them individually.
| ftufek wrote:
| You'll need something like EPYC/Xeon CPUs and motherboards,
| which not only have many more PCIe lanes but also allow
| bifurcation. Once you have that, you can get bifurcated risers
| and have many GPUs. These risers use normal cables, not the
| typical gamer PCIe risers, which are pretty hard to arrange. You
| won't get this for just $200 though.
|
| For the chassis, you could try a 4U Rosewill like this:
| https://www.youtube.com/watch?v=ypn0jRHTsrQ, not sure if 6
| 3090s would fit though. You're probably better off getting a
| mining chassis; it's easier to set up and cool, and also
| cheaper, unless you plan on putting them in a server rack.
| jeffybefffy519 wrote:
| Are M1/M2/M3 Max Macs any good for this?
| downrightmike wrote:
| Way slower than 1 GPU, at many times the cost. If you don't
| mind waiting minutes instead of seconds, Macs are reasonable.
| fragmede wrote:
| It depends on what you're trying to do, but I've got an M1,
| and doing inference with llama2-uncensored using Ollama, I
| get results within seconds.
___________________________________________________________________
(page generated 2024-02-24 23:01 UTC)