[HN Gopher] Building an AI server on a budget
       ___________________________________________________________________
        
       Building an AI server on a budget
        
       Author : mful
       Score  : 58 points
       Date   : 2025-06-06 02:33 UTC (2 days ago)
        
 (HTM) web link (www.informationga.in)
 (TXT) w3m dump (www.informationga.in)
        
       | vunderba wrote:
       | The RTX market is particularly irritating right now, even second-
       | hand 4090s are still going for MSRP if you can find them at all.
       | 
       | Most of the recommendations for this budget AI system are on
       | point - the only thing I'd recommend is more RAM. 32GB is not a
       | lot, particularly if you start to load larger models through
       | formats such as GGUF and want to take advantage of system RAM to
       | split the layers at the cost of inference speed (see the sketch
       | below). I'd recommend at least _2 x 32GB_ or even _4 x 32GB_ if
       | you can swing it budget-wise.
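       | 
       | For illustration, splitting a GGUF model between VRAM and system
       | RAM might look like the following, assuming the llama-cpp-python
       | bindings; the model path and layer count are placeholder values,
       | not recommendations:
       | 
       |   from llama_cpp import Llama
       | 
       |   # Offload only as many layers as fit in 12GB of VRAM; the rest
       |   # stay in system RAM at the cost of inference speed.
       |   llm = Llama(
       |       model_path="models/example-32b-q4_k_m.gguf",  # placeholder path
       |       n_gpu_layers=24,  # layers kept in VRAM
       |       n_ctx=8192,       # context length; larger contexts use more memory
       |   )
       |   out = llm("Explain GGUF in one sentence.", max_tokens=64)
       |   print(out["choices"][0]["text"])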
       | 
       | Author mentioned using Claude for recommendations, but another
       | great resource for building machines is PC Part Picker. They'll
       | even show warnings if you try pairing incompatible parts or try
       | to use a PSU that won't supply the minimum recommended power.
       | 
       | https://pcpartpicker.com
        
       | uniposterz wrote:
       | I had a similar setup for a local LLM; 32GB was not enough. I
       | recommend going for 64GB.
        
       | golly_ned wrote:
       | Whenever I get to a section that was clearly autogenerated by an
       | LLM I lose interest in the entire article. Suddenly the entire
       | thing is suspect and I feel like I'm wasting my time, since I'm
       | no longer encountering the mind of another person, just
       | interacting with a system.
        
         | bravesoul2 wrote:
         | I didn't see anything like that here. Yeah they used bullets.
        
           | golly_ned wrote:
            | There's a section that lists what the parts of a PC are, and
            | what each part does.
        
             | Nevermark wrote:
             | > I used the AI-generated recommendations as a starting
             | point, and refined the options with my own research.
             | 
             | Referring to this section?
             | 
             | I don't see a problem with that. This isn't an article
             | about a design intended for 10,000 systems. Just one
             | person's follow through on an interesting project. With
             | disclosure of methodology.
        
         | throwaway314155 wrote:
         | Eh, yeah - the article starts off pretty specific but then gets
         | into the weeds of stuff like how to put your PC together, which
         | is far from novel information and certainly not on-topic in my
         | opinion.
        
       | 7speter wrote:
       | I dunno, everyone, but I think Intel has something big on their
       | hands with their announced workstation GPUs. The B50 is a low-
       | profile card that doesn't have a power supply hookup because it
       | only uses something like 60 watts, and it comes with 16GB of
       | VRAM at an MSRP of 300 dollars.
       | 
       | I imagine companies will have first dibs via the likes of
       | agreements with suppliers like CDW, etc., but if Intel had
       | enough of these Battlemage dies accumulated, it could also
       | drastically change the local AI enthusiast/hobbyist landscape;
       | for starters, this could drive down the price of workstation
       | cards that are ideal for inference, at the very least. I'm
       | cautiously excited.
       | 
       | On the AMD front (really, a sort of open-compute front), Vulkan
       | Kompute is picking up steam, and it would be really cool to have
       | a standard that mostly(?) ships with Linux, with older ports
       | available for FreeBSD, so that we can actually run free-as-in-
       | freedom inference locally.
        
       | Uehreka wrote:
       | Love the attention to detail, I can tell this was a lot of work
       | to put together and I hope it helps people new to PC building.
       | 
       | I will note though, 12GB of VRAM and 32GB of system RAM is a
       | ceiling you're going to hit pretty quickly if you're into messing
       | with LLMs. There's basically no way to do a better job at the
       | budget you're working with though.
       | 
       | One thing I hear about a lot is people using things like RunPod
       | to briefly get access to powerful GPUs/servers when they need
       | one. If you spend $2/hr you can get access to an H100. If you
       | have a budget of $1300 that could get you about 600 hours of
       | compute time, which (unless you're doing training runs) should
       | last you several months.
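       | 
       | The arithmetic there is trivial, but as a sketch (the $2/hr rate
       | and $1300 budget are just the example figures above, not a quote
       | from any particular provider):
       | 
       |   def rentable_hours(budget_usd: float, rate_per_hour: float) -> float:
       |       """GPU-hours a fixed hardware budget buys as cloud rental instead."""
       |       return budget_usd / rate_per_hour
       | 
       |   # -> 650.0, i.e. roughly the "600 hours" above once you leave
       |   # some slack for storage and idle time.
       |   print(rentable_hours(1300, 2.0))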
       | 
       | In several months' time the specs required to run good models will
       | be different again in ways that are hard to predict, so this
       | approach can help save on the heartbreak of buying an RTX 5090
       | only to find that even that doesn't help much with LLM inference
       | and we're all gonna need the cheaper-but-more-VRAM Intel Arc
       | B60s.
        
         | semi-extrinsic wrote:
         | > save on the heartbreak of buying an RTX 5090 only to find
         | that even that doesn't help much with LLM inference and we're
         | all gonna need the cheaper-but-more-VRAM Intel Arc B60s
         | 
         | When going for more VRAM, with an RTX 5090 currently sitting at
         | $3000 for 32GB, I'm curious why people aren't trying to get the
         | Dell C4140s. Those seem to go for $3000-$4000 for the whole
         | server with 4x V100 16GB, so 64GB total VRAM.
         | 
         | Maybe it's just because they produce heat and noise like a
         | small turbojet.
        
       | Jedd wrote:
       | In January 2024 there was a similar post (
       | https://news.ycombinator.com/item?id=38985152 ) wherein the
       | author selected dual NVidia 4060 Ti's for an at-home-LLM-with-
       | voice-control -- because they were the cheapest cost per GB of
       | well-supported VRAM at the time.
       | 
       | (They probably still are, or at least pretty close to it.)
       | 
       | That informed my decision shortly after, when I built something
       | similar - that video card model was widely panned by gamers (or
       | more accurately, gamer 'influencers'), but it was an excellent
       | choice if you wanted 16GB of VRAM with relatively low power draw
       | (150W peak).
       | 
       | TFA doesn't say where they are, or what currency they're using
       | (which implies the hubris of a North American) - at which point
       | that pricing for a second-hand, smaller-capacity, higher-power-
       | drawing 4070 just seems weird.
       | 
       | Appreciate the 'on a budget' aspect, it just seems like an
       | objectively worse path, as upgrades are going to require
       | replacement rather than augmentation.
       | 
       | As per other comments here, 32 / 12 is going to be _really_
       | limiting. Yes - lower parameter  / smaller-quant models are
       | becoming more capable, but at the same time we're seeing
       | increasing interest in larger context for these at home use
       | cases, and that chews up memory real fast.
        
         | throwaway314155 wrote:
         | > which implies the hubris of a North American
         | 
         | No need for that.
        
           | topato wrote:
           | True, though
        
           | topato wrote:
            | He did soften the blow by saying North American, rather than
            | the more apropos American.
        
         | T-A wrote:
         | > TFA doesn't say where they are
         | 
         | "the 1,440W limit on wall outlets in California" is a pretty
         | good hint.
        
       | rcarmo wrote:
       | The trouble with these things is that "on a budget" doesn't
       | deliver much when most interesting and truly useful models are
       | creeping beyond the 16GB VRAM limit and/or require a lot of
       | wattage. Even a Mac mini with enough RAM is starting to look like
       | an expensive proposition, and the AMD Strix Halo APUs (the SKUs
       | that matter, like the Framework Desktop at 128GB) are around $2K.
       | 
       | As someone who built a period-equivalent rig (with a 12GB 3060
       | and 128GB RAM) a few years ago, I am not overly optimistic that
       | local models will keep being a cheap alternative (never mind the
       | geopolitics). And yeah, there are very cheap ways to run
       | inference, but they become pointless - I can run Qwen and Phi4
       | locally on an ARM chip like the RK3588, but it is still dog slow.
        
       | v5v3 wrote:
       | I thought prevailing wisdom was that a used 3090 with its larger
       | VRAM was the best budget GPU choice?
       | 
       | And in general, if on a budget then why not buy used and not new?
       | And more so as the author himself talks about the resale value
       | for when he sells it on.
        
         | olowe wrote:
          | > I thought prevailing wisdom was that a used 3090 with its
          | larger VRAM was the best budget GPU choice?
         | 
          | The trick is that memory bandwidth - not just the amount of
          | VRAM - is important for LLM inference. For example, the B50
          | specs list a memory bandwidth of 224 GB/s [1], whereas the
          | Nvidia RTX 3090 has over 900GB/s [2]. The 4070's bandwidth is
          | "just" 500GB/s [3].
         | 
          | More VRAM helps you run larger models, but with lower
          | bandwidth, tokens can be generated so slowly that it's not
          | really practical for day-to-day use or experimenting (see the
          | rough estimate after the links below).
         | 
         | [1]:
         | https://www.intel.com/content/www/us/en/products/sku/242615/...
         | 
         | [2]: https://www.techpowerup.com/gpu-specs/geforce-
         | rtx-3090.c3622
         | 
         | [3]: https://www.thefpsreview.com/gpu-family/nvidia-geforce-
         | rtx-4...
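          | 
          | As a back-of-the-envelope illustration: each generated token
          | has to stream essentially all of the active weights out of
          | VRAM, so bandwidth divided by model size gives a rough upper
          | bound on tokens per second. A minimal sketch using the figures
          | quoted above and an assumed ~4 GB of weights for a 7B model at
          | 4-bit quantization (real throughput will be lower):
          | 
          |   def max_tokens_per_second(bandwidth_gb_s: float,
          |                             model_size_gb: float) -> float:
          |       """Rough ceiling: each token reads ~all weights from VRAM once."""
          |       return bandwidth_gb_s / model_size_gb
          | 
          |   for card, bw in [("Intel B50", 224), ("RTX 4070", 500),
          |                    ("RTX 3090", 900)]:
          |       print(card, round(max_tokens_per_second(bw, 4.0)), "tok/s ceiling")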
        
           | lelanthran wrote:
           | > The trick is memory bandwidth - not just the amount of VRAM
           | - is important for LLM inference.
           | 
           | I'm not really knowledgeable about this space, so maybe I'm
           | missing something:
           | 
           | Why does the bus performance affect token generation? I would
           | expect it to cause a slow startup when loading the model, but
           | once the model is loaded, just how much bandwidth can the
           | token generation possibly use?
           | 
            | Token generation is completely on the card using the memory
            | _on the card_, without any bus IO at all, no?
           | 
            | IOW, I'm trying to think of what IO the card is going to need
            | for token generation, and I can't think of any other than
            | returning the tokens (which, even on a slow 100MB/s transfer,
            | is still going to be about 100x the rate at which tokens are
            | being generated).
        
         | retinaros wrote:
         | yes it is
        
       | politelemon wrote:
       | If the author is reading this, I'll point out that the CUDA
       | toolkit you find in the distro repositories is generally older.
       | You can find the latest versions straight from Nvidia:
       | https://developer.nvidia.com/cuda-downloads?target_os=Linux&...
       | 
       | The caveat is that sometimes a library might be expecting an
       | older version of CUDA.
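       | 
       | A quick way to check which CUDA build your Python stack actually
       | sees (a sketch assuming PyTorch is installed; other frameworks
       | expose similar checks):
       | 
       |   import torch
       | 
       |   print("CUDA available:", torch.cuda.is_available())
       |   print("PyTorch built against CUDA:", torch.version.cuda)
       |   if torch.cuda.is_available():
       |       print("GPU:", torch.cuda.get_device_name(0))
       |       print("Compute capability:", torch.cuda.get_device_capability(0))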
       | 
       | The VRAM on the GPU does make a difference, so it would at some
       | point be worth looking at another GPU or increasing your system
       | RAM if you start running into limits.
       | 
       | However, I wouldn't worry too much right away; it's more
       | important to get started, get an understanding of how these
       | local LLMs operate, and take advantage of the optimisations the
       | community is making to make them more accessible. Not everyone
       | has a 5090, and if LLMs remain in the realm of high-end
       | hardware, it's not worth the time.
        
         | throwaway314155 wrote:
         | The other main caveat is that installing from custom sources
         | using apt is a massive pain in the ass.
        
       | burnt-resistor wrote:
       | Reminds me of https://cr.yp.to/hardware/build-20090123.html
       | 
       | I'll be that guy(tm) that says if you're going to do any
       | computing half-way reliably, only use ECC RAM. Silent bit flips
       | suck.
        
       | DogRunner wrote:
       | I used a similar budget and built something like this:
       | 
       | 7x RTX 3060 12GB, which results in 84GB of VRAM, plus an AMD
       | Ryzen 5 5500GT with 32GB of RAM.
       | 
       | All in a 19-inch rack with a nice cooling solution and a beefy
       | power supply.
       | 
       | My costs? 1300 Euro, but yeah, I sourced my parts on ebay /
       | second hand.
       | 
       | (Added some 3d printed parts into the mix:
       | https://www.printables.com/model/1142963-inter-tech-and-gene...
       | https://www.printables.com/model/1142973-120mm-5mm-rised-noc...
       | https://www.printables.com/model/1142962-cable-management-fu...
       | if you think about building something similar)
       | 
       | My power consumption is below 500 Watt at the wall when using
       | LLMs, since I did some optimizations:
       | 
       | * Worked on power optimizations; after many weeks of
       | benchmarking, the sweet spot on the RTX 3060 12GB cards is a 105
       | Watt limit (see the sketch after this list)
       | 
       | * Created patches for Ollama (
       | https://github.com/ollama/ollama/pull/10678) to group models by
       | exact memory allocation instead of spreading them over all
       | available GPUs (this also reduces the VRAM overhead)
       | 
       | * ensured that ASPM is used on all relevant PCI components
       | (Powertop is your friend)
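       | 
       | For the power limit, a minimal sketch of setting it through the
       | nvidia-ml-py (pynvml) bindings rather than nvidia-smi; the 105 W
       | figure is just the sweet spot mentioned above, and changing
       | limits requires root:
       | 
       |   import pynvml
       | 
       |   pynvml.nvmlInit()
       |   for i in range(pynvml.nvmlDeviceGetCount()):
       |       handle = pynvml.nvmlDeviceGetHandleByIndex(i)
       |       # The limit is specified in milliwatts.
       |       pynvml.nvmlDeviceSetPowerManagementLimit(handle, 105_000)
       |   pynvml.nvmlShutdown()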
       | 
       | It's not all shiny:
       | 
       | * I still use PCIe3 x1 for most of the cards, which limits their
       | capability, but all I have found so far (PCIe Gen4 x4 extenders
       | and bifurcation/special PCIe routers) are just too expensive to
       | be used on such low-powered cards
       | 
       | * Due to the slow PCIe bandwidth, the performance drops
       | significantly
       | 
       | * Max VRAM per GPU is king. If you split up a model over several
       | cards, the RAM allocation overhead is huge! (See the examples in
       | my Ollama patch above.) I would rather use 3x 48GB instead of 7x
       | 12GB.
       | 
       | * Some RTX 3060 12GB Cards do idle at 11-15 Watt, which is
       | unacceptable. Good BIOSes like the one from Gigabyte (Windforce
       | xxx) do idle at 3 Watt, which is a huge difference when you use 7
       | or more cards. These BIOSes can be patched, but this can be risky
       | 
       | All in all, this server idles at 90-100Watt currently, which is
       | perfect as a central service for my tinkerings and my family
       | usage.
        
       | incomingpain wrote:
       | I've been dreaming on pcpartpicker.
       | 
       | I think the Radeon RX 7900 XT - 20 GB has been the best bang for
       | your buck. Enables a full-GPU 32B?
       | 
       | Looking at what other people have been doing lately, they aren't
       | doing this.
       | 
       | They are getting 64+ core CPUs and 512GB of RAM, keeping it on
       | the CPU and enabling massive models. This setup lets you do
       | DeepSeek 671B.
       | 
       | It makes me wonder, how much better is 671B vs 32B?
        
       | djhworld wrote:
       | With system builds like this I always feel the VRAM is the
       | limiting factor when it comes to what models you can run, and
       | consumer-grade stuff tends to max out at 16GB or (sometimes)
       | 24GB for more expensive models.
       | 
       | It does make me wonder whether we'll start to see more and more
       | computers with unified memory architecture (like the Mac) - I
       | know nvidia have the Digits thing which has been renamed to
       | something else
        
         | JKCalhoun wrote:
          | Go with a server GPU (TESLA) and 24 GB is not unusual. (And
          | also about $300 used on eBay.)
        
       | atentaten wrote:
       | Enjoyed the article as I am interested in the same. I would like
       | to have seen more about the specific use cases and how they
       | performed on the rig.
        
       | ww520 wrote:
       | I use a 10-year-old laptop to run a local LLM. The time between
       | prompts is 10-30 seconds. Not for speedy interactive usage.
        
       | JKCalhoun wrote:
       | Someone posted that they had used a "mining rig" [0] from
       | AliExpress for less than $100. It even has RAM and a CPU. He
       | picked up a 2000W (!) DELL server PS for cheap off eBay. The GPUs
       | were NVIDIA TESLAs (M40 for example) since they often have a lot
       | of RAM and are less expensive.
       | 
       | I followed in those footsteps to create my own [1] (photo [2]).
       | 
       | I picked up a 24GB M40 for around $300 off eBay. I 3D printed a
       | "cowl" for the GPU that I found online and picked up two small
       | fans from Amazon that fit in the cowl. Attached, the cowl + fans
       | keep the GPU cool. (These TESLA server GPUs have no fan since
       | they're expected to live in one of those wind tunnels called a
       | server rack.)
       | 
       | I bought the same cheap DELL server PS that the original person
       | had used and I also had to get a break-out board (and power-
       | supply cables and adapters) for the GPU.
       | 
       | Thanks to LLMs, I was able to successfully install Rocky Linux as
       | well as CUDA and NVIDIA drivers. I SSH into it and run ollama
       | commands.
       | 
       | My own hurdle at this point is: I have a 2nd 24 GB M40 TESLA but
       | when installed on the motherboard, Linux will not boot. LLMs are
       | helping me try to set up BIOS correctly or otherwise determine
       | what the issue is. (We'll see.) I would love to get to 48 GB.
       | 
       | [0] https://www.aliexpress.us/item/3256806580127486.html
       | 
       | [1]
       | https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4...
       | 
       | [2]
       | https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:oxjqlam...
        
       | iJohnDoe wrote:
       | Details about the ML software or AI software?
        
       | jacekm wrote:
       | For $100 more you could get a used 3090 with twice as much VRAM.
       | You could also get a 4060 Ti, which is cheaper than the 4070 and
       | has 16 GB VRAM (although it's less powerful too, so I guess it
       | depends on the use case).
        
       | msp26 wrote:
       | > 12GB vram
       | 
       | waste of effort, why would you go through the trouble of building
       | + blogging for this?
        
       | pshirshov wrote:
       | A 3090 for ~1000 is a much more solid choice. Also, these old
       | mining mobos play very well for multi-GPU Ollama.
        
       | usercvapp wrote:
       | I have a server at home sitting IDLE for the last 2 years with 2
       | TB of RAM and 4 CPUs.
       | 
       | I am gonna push it this week and launch some LLM models to see
       | how they perform!
       | 
       | How efficient are they on the electric bill when running
       | locally?
        
       | T-A wrote:
       | I would consider adding $400 for something like this instead:
       | 
       | https://www.bosgamepc.com/products/bosgame-m5-ai-mini-deskto...
        
       ___________________________________________________________________
       (page generated 2025-06-08 23:00 UTC)