[HN Gopher] Building a personal, private AI computer on a budget
       ___________________________________________________________________
        
       Building a personal, private AI computer on a budget
        
       Author : marban
       Score  : 296 points
       Date   : 2025-02-10 11:59 UTC (1 day ago)
        
 (HTM) web link (ewintr.nl)
 (TXT) w3m dump (ewintr.nl)
        
       | hexomancer wrote:
       | Isn't the fact the P40 has horrible fp16 performance a deal
       | breaker for local setups?
        
         | behohippy wrote:
         | You probably won't be running fp16 anything locally. We
         | typically run Q5 or Q6 quants to maximize the size of the model
         | and context length we can run with the VRAM we have available.
          | The quality loss is negligible at Q6.
        
           | Eisenstein wrote:
           | But the inference doesn't necessarily run at the quant
           | precision.
        
             | wkat4242 wrote:
              | As far as I understand it does if you quantize the K/V
              | cache as well (the context). And that's pretty standard now
              | because it can increase the maximum context size a lot.
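              | Rough numbers on why that helps, using generic transformer
              | KV-cache arithmetic (the dims below are a hypothetical
              | 70B-class config, not any specific model):
              | 
              |     # KV cache bytes = 2 (K and V) * layers * kv_heads
              |     #   * head_dim * bytes per value * context length
              |     def kv_cache_gb(layers, kv_heads, head_dim,
              |                     ctx, bytes_per_val):
              |         return (2 * layers * kv_heads * head_dim
              |                 * ctx * bytes_per_val) / 1e9
              | 
              |     # hypothetical 70B-class dims, 32k context
              |     print(kv_cache_gb(80, 8, 128, 32768, 2))  # fp16 ~10.7 GB
              |     print(kv_cache_gb(80, 8, 128, 32768, 1))  # q8   ~5.4 GB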
        
         | numpad0 wrote:
          | Is it even cheaper in $/GB than a used Vega 56 (8GB HBM2)?
          | There are mining boards with a bunch of x1 slots that could
          | probably run half a dozen of them for the same 48GB.
        
           | 42lux wrote:
           | Would take a bunch of time just to load the model...
        
           | magicalhippo wrote:
           | AFAIK this doesn't really work for interactive use, as LLMs
           | process data serially. So your request needs to pass through
           | all of the cards for each token, one at a time. Thus a lot of
           | PCIe traffic and hence latency. Better than nothing, but only
           | really useful if you can batch requests so you can keep each
           | GPU working all the time, rather than just one at a time.
        
         | Havoc wrote:
         | Best as I can tell most of the disadvantages relate to larger
         | batches. And for home use you're likely running batch of 1
         | anyway
        
       | reacharavindh wrote:
        | The thing is though... the locally hosted models on such
        | hardware are cute as toys, and sure, they write funny jokes and,
        | importantly, perform private tasks that I would never consider
        | passing to non-selfhosted models, but they pale in comparison to
        | the models accessible over APIs (Claude 3.5 Sonnet, OpenAI, etc).
        | If I could run deepseek-r1-671b locally without breaking the
        | bank, I would. But for now, opex > capex at a consumer level.
        
         | walterbell wrote:
         | 200+ comments, https://news.ycombinator.com/item?id=42897205
         | 
         |  _> This runs the 671B model in Q4 quantization at 3.5-4.25 TPS
         | for $2K on a single socket Epyc server motherboard using 512GB
         | of RAM._
        
           | elorant wrote:
            | "Runs" is an overstatement though. With 4 tokens/second you
            | can't use it in production.
        
             | deadbabe wrote:
             | Isn't 4 tps good enough for local use by a single user,
             | which is the point of a personal AI computer?
        
               | ErikBjare wrote:
               | I tend to get impatient at less than 10tok/s: If the
               | answer is 600tok (normal for me) that's a minute.
        
               | JKCalhoun wrote:
               | It is for me. I'm happy to switch over to another task
               | (and maybe that task is refilling my coffee) and come
               | back when the answer is fully formed.
        
             | walterbell wrote:
             | Some LLM use cases are async, e.g. agents, "deep research"
             | clones.
        
               | diggan wrote:
               | Not to mention even simpler things, like wanting to tag
               | all of your local notes based on the content, basically a
               | bash loop you can run indefinitely and speed doesn't
               | matter much, as long as it eventually finishes
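                | Something like this, as a rough sketch against a local
                | Ollama server (Python with the requests package; the
                | model name and tag format are just placeholders):
                | 
                |     # tag every note via a local Ollama server; speed
                |     # barely matters, it just runs until it's done
                |     import pathlib, requests
                | 
                |     MODEL = "qwen2.5:7b"  # placeholder model name
                | 
                |     for note in pathlib.Path("notes").glob("*.md"):
                |         prompt = ("Suggest 3-5 short topic tags, "
                |                   "comma separated, for this note:\n\n"
                |                   + note.read_text())
                |         r = requests.post(
                |             "http://localhost:11434/api/generate",
                |             json={"model": MODEL, "prompt": prompt,
                |                   "stream": False},
                |             timeout=600)
                |         tags = r.json()["response"].strip()
                |         note.with_suffix(".tags").write_text(tags + "\n")
                |         print(note.name, "->", tags)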
        
             | mechagodzilla wrote:
             | I have a similar setup running at about 1.5 tokens/second,
             | and it's perfectly usable for the sorts of difficult tasks
             | one needs a frontier model like this for - give it a prompt
             | and come back an hour or two later. You interact with it
             | like e-mailing a coworker. If I need an answer back in
             | seconds, it's probably not a very complicated question, and
             | a much smaller model will do.
        
               | xienze wrote:
               | I get where you're coming from, but the problem with LLMs
               | is that you very regularly need a lot of back-and-forth
               | with them to tease out the information you're looking
               | for. A more apt analogy might be a coworker that you have
               | to follow up with three or four times, at an hour per.
               | Not so appealing anymore. Doubly so when you have to
               | stand up $2k+ of hardware for the privilege. If I'm
               | paying good money to host something locally, I want
               | decent performance.
        
               | unshavedyak wrote:
                | Agreed. Furthermore, for some tasks like large context
                | code assistant windows I want really fast responses. I've
                | not found a UX I'm happy with yet, but for anything I
                | care about I'd want very fast token responses. Small
                | blocks of code which instantly autocomplete, basically.
        
               | MonkeyClub wrote:
               | > If I'm paying good money to host something locally
               | 
               | The thing is, however, that at 2k one is not paying good
               | money, one is paying near the least amount possible. TFA
               | specifically is about building a machine on a budget, and
               | as such cuts corners to save costs, e.g. by buying older
               | cards.
               | 
               | Just because 2k is not a negligible amount in itself,
               | that doesn't also automatically make it adequate for the
               | purpose. Look for example at the 15k, 25k, and 40k price
               | range tinyboxes:
               | 
               | https://tinygrad.org/#tinybox
               | 
               | It's like buying a 2k-worth used car, and expecting it to
               | perform as well as a 40k one.
        
               | Aurornis wrote:
               | > give it a prompt and come back an hour or two later.
               | 
               | This is the problem.
               | 
               | If your use case is getting a small handful of non-urgent
               | responses per day then it's not a problem. That's not how
               | most people use LLMs, though.
        
             | CamperBob2 wrote:
             | What I'd like to know is how well those dual-Epyc machines
             | run the 1.58 bit dynamic quant model. It really does seem
             | to be almost as good as the full Q8.
        
             | Cascais wrote:
              | I agree with elorant. Indirectly, some YouTubers have ended
              | up demonstrating that it's difficult to run the best models
              | for less than $7k, even if NVIDIA hardware is very
              | efficient.
             | 
             | In the future, I expect this to not be the case, because
             | models will be far more efficient. At this pace, maybe even
             | 6 months can make a difference.
        
         | vanillax wrote:
          | Huh? Toys? You can run DeepSeek 70B on a 36GB RAM MacBook Pro.
         | You can run Phi4, Qwen2.5, or llama3.3. They work great for
         | coding tasks
        
           | 3s wrote:
           | Yeah but as one of the replies points out the resulting
           | tokens/second would be unusable in production environments
        
         | CamperBob2 wrote:
         | The 1.58-bit DeepSeek R1 dynamic quant model from Unsloth is no
         | joke. It just needs a lot of RAM and some patience.
        
           | jaggs wrote:
           | There seems to be a LOT of work going on to optimize the
           | 1.58-bit option in terms of hardware and add-ons. I get the
           | feeling that someone from Unsloth is going to have a genuine
           | breakthrough shortly, and the rig/compute costs are going to
           | plummet. Hope I'm not being naive or over-confident.
        
         | cratermoon wrote:
         | This is not because the models are better. These services have
         | unknown and opaque levels of shadow prompting[1] to tweak the
         | behavior. The subject article even mentions "tweaking their
         | outputs to the liking of whoever pays the most". The more I
          | play with LLMs locally, the more I realize how much the
          | prompting going on under the covers shapes the results from
          | the big tech services.
         | 
          | 1 https://www.techpolicy.press/shining-a-light-on-shadow-promp...
        
       | dgrabla wrote:
        | Great breakdown! The "own your own AI" at home is a terrific
       | hobby if you like to tinker, but you are going to spend a ton of
       | time and money on hardware that will be underutilized most of the
       | time. If you want to go nuts check out Mitko Vasilev's dream
       | machine. It makes no sense if you don't have a very clear use
       | case that only requires small models or really slow token
       | generation speeds.
       | 
       | If the goal however is not to tinker but to really build and
       | learn AI, it is going to be financially better to rent those
       | GPUs/TPUs as needs arise.
        
         | memhole wrote:
         | This is correct. The cost makes no sense outside of hobby and
         | interest. You're far better off renting. I think there is some
         | merit to having a local inference server if you're doing
         | development. You can manage models and have a little more
         | control over your infra as the main benefits.
        
         | theshrike79 wrote:
         | Any M-series Mac is "good enough" for home LLMs. Just grab LM
          | Studio and a model that fits in memory.
         | 
         | Yes, it will not rival OpenAI, but it's 100% local with no
         | monthly fees and depending on the model no censoring or limits
         | on what you can do with it.
        
         | lioeters wrote:
         | > spend a ton of time and money
         | 
         | Not necessarily. For non-professional purposes, I've spent zero
         | dollars (no additional memory or GPU) and I'm running a local
         | language model that's good enough to help with many kinds of
         | tasks including writing, coding, and translation.
         | 
         | It's a personal, private, budget AI that requires no network
         | connection or third-party servers.
        
           | ImPostingOnHN wrote:
           | on what hardware (and how much did you spend on it)?
        
         | JKCalhoun wrote:
         | Terrific hobby? Sign me up!
        
         | jrm4 wrote:
          | For what purpose? I'm asking this as someone who threw in one
          | of the cheap $500 Nvidia cards with 16GB of VRAM, and I'm
          | already overwhelmed with what I can do with Ollama,
          | Krita+ComfyUI, etc.
        
       | rcarmo wrote:
       | Given the power and noise involved, a Mac Mini M4 seems like a
       | much nicer approach, although the RAM requirements will drive up
       | the price.
        
       | axegon_ wrote:
       | I did something similar but using a K80 and M40 I dug up from
       | eBay for pennies. Be advised though, stay as far away as possible
       | from the K80 - the drivers were one of the most painful tech
       | things I've ever had to endure, even if 24GB of VRAM for 50 bucks
        | sounds incredibly appealing. That said, I had a decent-ish HP
        | workstation lying around with a 1200 watt power supply, so I had
        | somewhere to put the two of them. The one thing to note here is
        | that these types of GPUs do not have cooling of their own. My
        | solution was to 3D print a bunch of brackets and attach several
        | Noctua fans and have them blow at full speed 24/7. Surprisingly
        | it worked way better than I expected - I've never gone above 60
        | degrees. As a side effect, the CPUs are also benefiting from
        | this hack: at idle, they are in the mid-20 degrees range. Mind
        | you, the Noctua fans are located on the front and the back of the
        | case: the ones on the front act as intake and the ones on the
        | back as exhaust, and there are two more inside the case that are
        | stuck in front of the GPUs.
       | 
       | The workstation was refurbished for just over 600 bucks, and
       | another 120 bucks for the GPUs and another ~60 for the fans.
       | 
        | Edit: and before someone asks - no, I have not uploaded the STLs
        | anywhere because I haven't had the time, and also since this is a
        | very niche use case, though I might: the back (exhaust) bracket
        | came out brilliantly on the first try - it was a sub-millimeter
        | fit. Then I got cocky and thought I'd also nail the intake on the
        | first try, and ended up re-printing it 4 times.
        
         | JKCalhoun wrote:
         | Curious what HP workstation you have?
        
           | 9front wrote:
           | HP Z440, it's in the article.
        
         | egorfine wrote:
         | > K80 - the drivers were one of the most painful tech things
         | I've ever had to endure
         | 
         | Well, for a dedicated LLM box it might be feasible to suffer
         | with drivers a bit, no? What was your experience like with the
         | software side?
        
         | deadbabe wrote:
         | What's the most pain you've ever felt?
        
       | apples_oranges wrote:
       | Does using 2x24GB VRAM mean that the model can be fully loaded
       | into memory if it's between 24 and 48 GB in size? I somehow doubt
       | it, at least ollama wouldn't work like that I think. But does
       | anyone know?
        
         | memhole wrote:
         | No. Hopefully, someone with more knowledge can explain better.
          | But my understanding is that you need room for the KV cache.
          | You also need to factor in the size of the context window. If
          | anyone has good resources on this, that would be awesome.
          | Presently, it feels very much like a dark art to host these
          | without crashing or being massively over-provisioned.
        
           | htrp wrote:
           | The dark art is to massively overprovision hardware.
        
             | cratermoon wrote:
              | Thus you get OpenAI spending billions while DeepSeek comes
              | along with, shockingly, _actual understanding of the
              | hardware and how to optimize for it_ [1] and spends $6
              | million [2]
             | 
             | 1. https://arxiv.org/abs/2412.19437v1
             | 
              | 2. Quibble over the exact figure if you like; far less than
              | OpenAI, doing more with less.
        
         | Eisenstein wrote:
         | No, you need to have extra space for the context (which
         | requires more space the larger the model is).
         | 
          | But it should be said that judging model quality by its size in
          | GB is like judging a video by its size in GB. You can have the
          | same video be small or huge, with anywhere from negligible to
          | huge differences in quality between the two.
          | 
          | You will be running quantized model weights, which can range
          | in precision from 1 to 16 bits per parameter (the B is for
          | billion parameters in the model name). Model weights at Q8 are
          | generally their parameter count in GB (Llama 3 8B at Q8 would
          | be ~8GB). There are many different strategies for quantizing as
          | well, so this is just a rough guide.
         | 
         | So basically if you can't fit the 48GB model into your 48GB of
         | VRAM, just download a lower precision quant.
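          | A back-of-the-envelope version of that rule of thumb (weights
          | only; it ignores per-format overhead and the context/KV cache
          | mentioned above):
          | 
          |     # rough weight size: parameters * bits per weight / 8
          |     def weights_gb(params_billion, bits):
          |         return params_billion * 1e9 * bits / 8 / 1e9
          | 
          |     print(weights_gb(8, 8))    # Llama 3 8B at Q8  -> ~8 GB
          |     print(weights_gb(70, 4))   # a 70B model at Q4 -> ~35 GB
          |     print(weights_gb(70, 16))  # same at fp16      -> ~140 GB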
        
         | michaelt wrote:
         | For a great many LLMs, you can find someone on HuggingFace who
         | has produced a set of different quantised versions, with
         | approximate RAM requirements.
         | 
         | For example, if you want to run "CodeLlama 70B" from
         | https://huggingface.co/TheBloke/CodeLlama-70B-Python-GGUF
          | there's a table saying the "Q4_K_M" quantised version is a
         | 41.42 GB download and runs in 43.92 GB of memory.
        
       | brador wrote:
        | You can run an 8B model locally on the latest iPhones.
        
         | 0xEF wrote:
         | How useful is this, though? In my modest experience, these tiny
         | models aren't good for much more than tinkering with,
         | definitely not something I'd integrate into my workflow since
         | the output quality is pretty low.
         | 
         | Again, though, my experience is limited. I imagine others know
         | something I do not and would absolutely love to hear more from
         | people who are running tiny models on low-end hardware for
         | things like code assistance, since that's where the use-case
         | would lie for me.
         | 
         | At the moment, I subscribe to "cloud" models that I use for
         | various tasks and that seems to be working well enough, but it
         | would be nice to have a personal model that I could train on
         | very specific data. I'm sure I am missing something, since it's
         | also hard to keep up with all the developments in the
         | Generative AI world.
        
           | DemetriousJones wrote:
           | I tried running the 8B model on my 8GB M2 Macbook Air through
           | Ollama and it was awful. It took ages to do anything and the
           | responses were bad at best.
        
             | redman25 wrote:
              | Doesn't 8B need at least 16GB of RAM? Otherwise you're
              | swapping, I would imagine...
        
               | blebo wrote:
               | Depends on quantization selected - see
               | https://www.canirunthisllm.net/
        
       | jmyeet wrote:
       | The author mentions it but I want to expand on it: Apple is a
       | seriously good option here, specifically the M4 Mac Mini.
       | 
       | What makes Apple attractive is (as the author mentions) that RAM
       | is shared between main and video RAM whereas NVidia is quite
       | intentionally segmenting the market and charging huge premiums
       | for high VRAM cards. Here are some options:
       | 
       | 1. Base $599 Mac Mini: 16GB of RAM. Stocked in store.
       | 
       | 2. $999 Mac Mini: 24GB of RAM. Stocked in store.
       | 
       | 3. Add RAM to either of the above up to 32GB. It's not cheap at
       | $200/8GB but you can buy a Mac Mini with 32GB of shared RAM for
       | $999, substantially cheaper than the author's PC build but less
       | storage (although you can upgrade that too).
       | 
       | 4. M4 Pro: $1399 w/ 24GB of RAM. Stocked in store. You can
       | customize this all the way to 64GB of RAM for +$600 so $1999 in
       | total. That is amazing value for this kind of workload.
       | 
       | 5. The Mac Studio is really the ultimate option. Way more cores
       | and you can go all the way to 192GB of unified memory (for a
       | $6000 machine). The problem here is that the Mac Studio is old,
       | still on the M2 architecture. An M4 Ultra update is expected
       | sometime this year, possibly late this year.
       | 
       | 6. You can get into clustering these (eg [1]).
       | 
        | 7. There are various MacBook Pro options, the highest of which is
        | a 16" MacBook Pro with 128GB of unified memory for $4999.
       | 
       | But the main takeaway is the M4 Mac Mini is fantastic value.
       | 
       | Some more random thoughts:
       | 
       | - Some Mac Minis have Thunderbolt 5 ("TB5"), which is up to
       | either 80Gbps or 120Gbps bidirectional (I've seen it quoted as
       | both);
       | 
       | - Mac Minis have the option of 10GbE (+$200);
       | 
       | - The Mac Mini has 2 USB3 ports and either 3 TB4 or 3 TB5 ports.
       | 
       | [1]: https://blog.exolabs.net/day-2/
        
         | sofixa wrote:
         | The issue with Macs is that below Max/Ultra processors, the
         | memory bandwidth is pretty slow. So you need to spend a lot on
         | a high level processor and lots of memory, and the current gen
         | processor, M4, doesn't even have an Ultra, while the Max is
         | only available in a laptop form factor (so thermal
         | constraints).
         | 
         | An M4 Pro still has only 273GB/s, while even the 2 generations
         | old RTX 3090 has 935GB/s.
         | 
         | https://github.com/ggerganov/llama.cpp/discussions/4167
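          | The reason that number matters: at batch size 1, every
          | generated token has to stream roughly the whole set of weights
          | from memory, so bandwidth / model size gives a crude ceiling on
          | tokens/s (the 20GB Q4 model below is hypothetical):
          | 
          |     # crude upper bound: dense model, batch 1, ignoring
          |     # KV cache reads and compute limits
          |     def max_tok_per_s(bandwidth_gb_s, model_gb):
          |         return bandwidth_gb_s / model_gb
          | 
          |     model_gb = 20  # hypothetical ~32B model at Q4
          |     print(max_tok_per_s(273, model_gb))  # M4 Pro   ~13.7 tok/s
          |     print(max_tok_per_s(935, model_gb))  # RTX 3090 ~46.8 tok/s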
        
           | jmyeet wrote:
           | That's a good point. I checked the M2 Mac Studio and it's
           | 400GB/s for the M2 Max and 800GB/s for the M2 Ultra so the M4
           | Ultra when we get it later this year should really be a
           | beast.
           | 
           | Oh and the top end Macbook Pro 16 (the only current Mac with
           | an M4 Max) has 410GB/s memory bandwidth.
           | 
           | Obviously the Mac Studio is at a much higher price point.
           | 
           | Still, you need to spend $1500+ to get an NVidia GPU with
           | >12GB of RAM. Multiple of those starts adding up quick. Put
           | multiple in the same box and you're talking more expensive
           | case, PSU, mainboard, etc and cooling too.
           | 
           | Apple has a really interesting opportunity here with their
           | unified memory architecture and power efficiency.
        
         | diggan wrote:
         | How is the performance difference between using a dedicated GPU
         | from Nvidia for example compared to whatever Apple does?
         | 
         | So lets say we'd run a model on a Mac Mini M4 with 24GB RAM,
         | how many tokens/s are you getting? Then if we run the exact
         | same model but with a RTX 3090ti for example, how many tokens/s
         | are you getting?
         | 
         | Do these comparisons exist somewhere online already? I
         | understand it's possible to run the model on Apple hardware
         | today, with the unified memory, but how fast is that really?
        
           | redman25 wrote:
           | Not the exact same comparison but I have an M1 mac with 16gb
           | ram and can get about 10 t/s with a 3B model. The same model
           | on my 3060ti gets more than 100 t/s.
           | 
           | Needless to say, ram isn't everything.
        
             | diggan wrote:
             | Could you say what exact model+quant you're using for that
             | specific test + settings + runtime? Just so I could try to
             | compare with other numbers I come across.
        
         | iamleppert wrote:
         | The hassle of not being able to work with native CUDA isn't
         | worth it for a huge amount of AI. Good luck getting that latest
         | paper or code working quickly just to try it out, if the author
          | didn't explicitly target M4 (unlikely for all but the most
          | mainstream of stuff).
        
           | darkwater wrote:
            | In a homelab scenario, having your own AI assistant not run
            | by someone else, that is not an issue. If you want to
           | tinker/learn AI it's definitely an issue.
        
         | sethd wrote:
         | For sure and the Mac Mini M4 Pro with 64GB of RAM feels like
         | the sweet spot right now.
         | 
         | That said, the base storage option is only 512GB, and if this
         | machine is also a daily driver, you're going to want to bump
         | that up a bit. Still, it's an amazing machine for under $3K.
        
           | wolfhumble wrote:
           | It would be better/cheaper to buy an external Thunderbolt 5
           | enclosure for the NVME drive you need.
        
             | sethd wrote:
              | I looked into this a couple of months ago and external TB5
              | was still more expensive at 1-2 TB; not sure about larger
              | sizes, though.
        
         | atwrk wrote:
          | Worth pointing out that you "only" get <= 270GB/s of memory
          | bandwidth with those Macs, unless you choose the Max/Ultra
          | models.
          | 
          | If that is enough for your use case, it may make sense to wait
          | 2 months and get a Ryzen AI Max+ 395 APU, which will have the
          | same memory bandwidth, but allows for up to 128GB RAM. For
          | probably ~half the Mac's price.
         | 
         | Usual AMD driver disclaimer applies, but then again inference
         | is most often way easier to get running than training.
        
         | oofbaroomf wrote:
         | Unified memory is great because it's fast, but you can also get
         | a lot of system memory on a "conventional" machine like OP's,
          | and offload MoE layers like ktransformers does, so you can
         | run huge models with acceptable speeds. While the Mac mini may
         | have better value for anything that fits in the unified memory,
         | if you want to run Deepseek R1 or other large models, then it's
         | best to max out system RAM and get a GPU to offload.
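          | A minimal sketch of the generic offload split with
          | llama-cpp-python (path and layer count are placeholders;
          | ktransformers does its own MoE-aware placement, this just shows
          | the plain "some layers on GPU, rest in system RAM" approach):
          | 
          |     from llama_cpp import Llama
          | 
          |     llm = Llama(
          |         model_path="models/model-q4_k_m.gguf",  # placeholder
          |         n_gpu_layers=30,  # however many layers fit in VRAM
          |         n_ctx=8192,
          |     )
          |     out = llm("Why does partial offload help?", max_tokens=128)
          |     print(out["choices"][0]["text"])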
        
       | robblbobbl wrote:
       | Deep true words. I'm sorry for the author but thx for the
       | article!
        
       | gregwebs wrote:
       | The problem for me with making such an investment is that next
       | month a better model will be released. It will either require
        | more or less RAM than the current best model, making it either
       | not runnable or expensive to run on an overbuilt machine.
       | 
       | Using cloud infrastructure should help with this issue. It may
       | cost much more per run but money can be saved if usage is
       | intermittent.
       | 
       | How are HN users handling this?
        
         | diggan wrote:
         | > How are HN users handling this?
         | 
         | Combine the best of both worlds. I have a local assistant
         | (communicate via Telegram) that handles tool-calling and basic
         | calendar/todo management (running on a RTX 3090ti), but for
         | more complicated stuff, it can call out to more advanced models
          | (currently using OpenAI APIs for this), granted the request
          | itself doesn't involve personal data; otherwise it flat out
          | refuses, for better or worse.
        
         | idrathernot wrote:
         | There is also an overlooked "tail risk" with cloud services
          | that can end up costing you more than a few entire on-premise
         | rigs if you don't correctly configure services or forget to
         | shut down a high end vm instance. Yeah you can implement
         | additional scripts and services as a fail-safe, but this adds
         | another layer of complexity that isn't always trivial
         | (especially for a hobbyist).
         | 
         | I'm not saying that dumping $10k into rapidly depreciating
         | local hardware is the more economical choice, just that people
         | often discount the likelihood and cost of making mistakes in
         | the cloud during their evaluations and the time investment
         | required to ensure you have the correct safeguards in-place.
        
         | walterbell wrote:
         | _> expensive to run on an overbuilt machine_
         | 
         | There's a healthy secondary market for GPUs.
        
           | xienze wrote:
           | The price goes up dramatically once you go past 12GB though,
           | that's the problem.
        
             | JKCalhoun wrote:
             | Not on these server GPUs.
             | 
             | I'm seeing 24GB M40 cards for $200, 24GB K80 cards for $40
             | on eBay.
        
               | xienze wrote:
               | Well OK, I should have been more specific that, even for
               | server GPUs on eBay:
               | 
               | * Cheap
               | 
               | * Fast
               | 
               | * Decent amount of RAM
               | 
               | Pick two.
               | 
               | These old GPUs are as cheap as they are because they
               | don't perform well.
        
               | JKCalhoun wrote:
               | That's fair. From what I read though I think there is
               | some interplay between Fast and Decent amount of RAM. Or
               | at least there is a large falloff in performance when RAM
               | is too small.
               | 
               | So Cheap and Decent amount of RAM work for me.
        
         | JKCalhoun wrote:
         | I think the solution is already in the article and comments
         | here: go cheap. Even next year the author will still have, at
         | the very least, their P40 setup running late 2024 models.
         | 
         | I'm about to plunge in as others have to get my own homelab
         | running the current crop of models. I think there's no time
         | like the present.
        
         | 3s wrote:
         | Exactly! While I have llama running locally on RTX and it's fun
         | to tinker with, I can't use it for my workflows and don't want
         | to invest 20k+ to run a decent model locally
         | 
          | > How are HN users handling this?
          | 
          | I'm working on a startup for end-to-end confidential AI using
          | secure enclaves in the cloud (think of it like extending a
          | local+private setup to the cloud with verifiable security
          | guarantees). Live demo with DeepSeek 70B: chat.tinfoil.sh
        
         | tempoponet wrote:
         | Most of these new models release several variants, typically in
         | the 8b, 30b, and 70b range for personal use. YMMV with each,
         | but you usually use the models that fit your hardware, and the
         | models keep getting better even in the same parameter range.
         | 
         | To your point about cloud models, these are really quite cheap
         | these days, especially for inference. If you're just doing
         | conversation or tool use, you're unlikely to spend more than
         | the cost of a local server, and the price per token is a race
         | to the bottom.
         | 
         | If you're doing training or processing a ton of documents for
         | RAG setups, you can run these in batches locally overnight and
         | let them take as long as they need, only paying for power. Then
         | you can use cloud services on the resulting model or RAG for
         | quick and cheap inference.
        
         | nickthegreek wrote:
         | I plan to wait for the NVIDIA Digits release and see what the
         | token/sec is there. Ideally it will work well for at least 2-3
         | years then I can resell and upgrade if needed.
        
         | michaelt wrote:
         | Among people who are running large models at home, I think the
         | solution is basically to be rich.
         | 
         | Plenty of people in tech earn enough to support a family and
         | drive a fancy car, but choose not to. A used RTX 3090 isn't
         | cheap, but you can afford a lot of $1000 GPUs if you don't buy
         | that $40k car.
         | 
         | Other options include only running the smaller LLMs; buying
         | dated cards and praying you can get the drivers to work; or
         | just using hosted LLMs like normal people.
        
       | joshstrange wrote:
       | I'd really love to build a machine for local LLMs. I've tested
       | models on my MBP M3 Max with 128GB of ram and it's really cool
       | but I'd like a dedicated local server. I'd also like an excuse to
       | play with proxmox as I've just run raw Linux servers or UnRaid w/
       | containers in the past.
       | 
       | I have OpenWebUI and LibreChat running on my local "app server"
       | and I'm quite enjoying that but every time I price out a beefier
       | box I feel like the ROI just isn't there, especially for an
       | industry that is moving so fast.
       | 
       | Privacy is not something to ignore at all but the cost of
       | inference online is very hard to beat, especially when I'm still
       | learning how best to use LLMs.
        
         | datadrivenangel wrote:
         | You pay a premium to get the theoretical local privacy and
         | reliability of hosting your own models.
         | 
         | But to get commercially competitive models you need 5 figures
         | of hardware, and then need to actually run it securely and
         | reliably. Pay as you go with multiple vendors as fallback is a
         | better option right now if you don't need harder privacy.
        
           | rsanek wrote:
           | With something like OpenRouter, you don't even have to
           | manually integrate with multiple vendors
        
             | wkat4242 wrote:
             | Is that like LiteLLM? I have that running but never tried
             | OpenRouter. I wonder now if it's better :)
        
           | joshstrange wrote:
           | Yeah, really I'd love for my Home Assistant to be able to use
           | a local LLM/TTS/STT which I did get working but was way too
           | slow. Also it would fun to just throw some problems/ideas at
           | the wall without incurring (more) cost, that's a big part of
           | it. But each time I run the numbers I would be better off
           | using Anthropic/OpenAI/DeepSeek/other.
           | 
           | I think sooner or later I'll break down and buy a server for
           | local inference even if the ROI is upside down because it
            | would be a fun project. I also find that these things fall in
           | the "You don't know what you will do with it until you have
           | it and it starts unlocking things in your mind"-category. I'm
           | sure there are things I would have it grind on overnight just
           | to test/play with an idea which is something I'd be less
           | likely to do on a paid API.
        
             | nickthegreek wrote:
             | You shouldn't be having slow response issues with
             | LLM/TTS/STT for HA on a mbp m3 max 128gb. I'd either limit
             | the entities exposed or choose a smaller model.
        
               | joshstrange wrote:
               | Oh, I can get smaller models to run reasonably fast but
               | I'm very interested in tool calling and I'm having a hard
               | time finding a model that runs fast and is good at
               | calling tools locally (I'm sure that's due to my own
               | ignorance).
        
               | nickthegreek wrote:
                | I decided on the OpenAI API for now after setting up so
                | many different methods. The local stuff isn't up to snuff
                | yet for what I am trying to accomplish, but it's decent
                | for basic control.
        
               | joshstrange wrote:
               | I use a combo of Anthropic and OpenAI for now through my
               | bots and my chat UIs and that lets me iterate faster. My
               | hope is once I've done all my testing I could consider
               | moving to local models if it made sense.
        
             | cruffle_duffle wrote:
             | > You don't know what you will do with it until you have it
             | and it starts unlocking things in your mind
             | 
              | Exactly. Once the price and performance get to the level
              | where buying stuff for local training and inference makes
              | sense... that is when we will start to see the LLM break
              | out of its current "corporate lawyer safe" stage and really
              | begin to shake things up.
        
         | smith7018 wrote:
         | For what it's worth, looking at the benchmarks, I think the
         | machine they built is comparable to what your MBP can already
         | do. They probably have a better inference speed, though.
        
         | cwalv wrote:
         | > but every time I price out a beefier box I feel like the ROI
         | just isn't there, especially for an industry that is moving so
         | fast.
         | 
         | Same, esp. if you factor in the cost of renting. Even if you
         | run 24/7 it's hard to see it paying off in half the time it
         | will take to be obsolete
        
         | moffkalast wrote:
         | A Strix Halo minipc might be a good mid tier option once
         | they're out, though AMD still isn't clear on how much they'll
         | overprice them.
         | 
         | Core Ultra Arc iGPU boxes are pretty neat too for being
         | standalone and can be loaded up with DDR5 shared memory,
         | efficient and usable in terms of speed, though that's
         | definitely low end performance, plus SYCL and IPEX are a bit
         | eh.
        
         | whalesalad wrote:
          | The juice ain't worth the squeeze to do this locally.
         | 
         | But you should still play with proxmox, just not for this
         | purpose. My recommendation would be to get an i7 HP Elitedesk.
         | I have multiple racks in my basement, hundreds of gigs of ram,
         | multiple 2U 2x processor enterprise servers etc.... but at this
         | point all of it is turned off and a single HP Elitedesk with a
         | 2nd NIC added and 64GB of ram is doing everything I ever needed
         | and more.
        
           | joshstrange wrote:
           | Yeah, right now I'm running a tower PC (Intel Core i9-11900K,
           | 64GB Ram) with Unraid as my local "app server". I want to
           | play with Proxmox (for professional and mostly fun reasons)
           | though. Someday I'd like a rack in my basement as my homelab
           | stuff has overgrown the space it's in and I'm going to need
           | to add a new 12-bay Synology (on top of 2x12-bay) soon since
           | I'm running out of space again. For now I've been sticking
           | with consumer/prosumer equipment but my needs are slowly
           | outstripping that I think.
        
       | walterbell wrote:
       | One reason to bother with private AI: cloud AI ToS for consumers
       | may have legal clauses about usage of prompt and context data,
       | e.g. data that is not already on the Internet. Enterprise
       | customers can exclude their data from future training.
       | 
       | https://stratechery.com/2025/deep-research-and-knowledge-val...
       | 
       |  _> Unless, of course, the information that matters is not on the
       | Internet. This is why I am not sharing the Deep Research report
       | that provoked this insight: I happen to know some things about
       | the industry in question -- which is not related to tech, to be
       | clear -- because I have a friend who works in it, and it is
       | suddenly clear to me how much future economic value is wrapped up
       | in information not being public. In this case the entity in
       | question is privately held, so there aren't stock market filings,
       | public reports, barely even a webpage! And so AI is blind._
       | 
       | (edited for clarity)
        
         | icepat wrote:
         | ToS can change. Companies can (and do) act illegally. Data
         | breaches happen. Insider threats happen.
         | 
         | Why trust the good will of a company, over a box that you built
         | yourself, and have complete control over?
        
         | rovr138 wrote:
         | Cost for one
        
       | dandanua wrote:
        | I doubt it is that efficient. Even though it has 48GB of VRAM,
        | it's more than twice as slow as a single 3090 GPU.
       | 
        | In my budget AI setup I use a Ryzen 7840 based mini PC with a
        | USB4 port and connect a 3090 to it via the eGPU adapter (ADT-link
        | UT3G). It cost me about $1000 total and I can easily achieve 35
        | t/s with qwen2.5-coder-32b using ollama.
        
         | mrbonner wrote:
         | Wouldn't eGPU defeat the purpose of having fast memory
         | bandwidth? Have you tried it with stable diffusion?
        
           | dandanua wrote:
            | 40Gbps of USB4 is plenty. I've tried these PyTorch benchmarks
            | https://github.com/aime-team/pytorch-benchmarks/ and saw only
            | a 10% drop in performance. There's no drop in performance for
            | LLM inference if the model is already loaded into VRAM.
        
       | rjurney wrote:
       | A lot of people build personal deep learning machines. The
       | economics and convenience can definitely work out... I am
       | confused however by "dummy GPU" - I searched for "dummy" for an
       | explanation but didn't find one. Modern motherboards all include
       | an integrated video card, so I'm not sure what this would be for?
       | 
       | My personal DL machine has a 24 core CPU, 128GB RAM and 2 x 3060
       | GPUs and 2 x 2TB NVMe drives in a RAID 1 array. I <3 it.
        
         | T-A wrote:
         | Look under "Available Graphics" at
         | 
         | https://www.hp.com/us-en/shop/mdp/business-solutions/z440-wo...
         | 
         | No integrated graphics.
         | 
         | Author's explanation of the problem:
         | 
         |  _The Teslas are intended to crunch numbers, not to play video
          | games with. Consequently, they don't have any ports to connect
         | a monitor to. The BIOS of the HP Z440 does not like this. It
         | refuses to boot if there is no way to output a video signal._
        
       | refibrillator wrote:
       | Pay attention to IO bandwidth if you're building a machine with
       | multiple GPUs like this!
       | 
       | In this setup the model is sharded between cards so data must be
       | shuffled through a PCIe 3.0 x16 link which is limited to ~16 GB/s
       | max. For reference that's an order of magnitude lower than the
       | ~350 GB/s memory bandwidth of the Tesla P40 cards being used.
       | 
       | Author didn't mention NVLink so I'm presuming it wasn't used, but
       | I believe these cards would support it.
       | 
       | Building on a budget is really hard. In my experience 5-15 tok/s
       | is a bit too slow for use cases like coding, but I admit once
       | you've had a taste of 150 tok/s it's hard to go back (I've been
       | spoiled by RTX 4090 with vLLM).
        
         | zinccat wrote:
         | I feel that you are mistaking the two bandwidth numbers
        
         | Miraste wrote:
         | Unless you run the GPUs in parallel, which you have to go out
         | of your way to do, the IO bandwidth doesn't matter. The cards
         | hold separate layers of the model, they're not working
         | together. They're only passing a few kilobytes per second
         | between them.
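          | Rough numbers for the per-token transfer, with a hypothetical
          | model width:
          | 
          |     # what crosses the link per token in a layer-split setup:
          |     # just the hidden state at each card boundary
          |     hidden_dim = 8192       # hypothetical model width
          |     bytes_per_value = 2     # fp16 activations
          |     print(hidden_dim * bytes_per_value / 1024)  # 16.0 KB/token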
        
         | Xenograph wrote:
         | Which models do you enjoy most on your 4090? and why vLLM
         | instead of ollama?
        
         | ekianjo wrote:
         | > Author didn't mention NVLink so I'm presuming it wasn't used,
         | but I believe these cards would support it.
         | 
         | How would you setup NVLink, if the cards support it?
        
       | kamranjon wrote:
       | For the same price ($1799) you could buy a Mac Mini with 48gb of
       | unified memory and an m4 pro. It'd probably use less power and be
       | much quieter to run and likely could outperform this setup in
       | terms of tokens per second. I enjoyed the write up still, but I
       | would probably just buy a Mac in this situation.
        
         | diggan wrote:
         | > likely could outperform this setup in terms of tokens per
         | second
         | 
         | I've heard arguments both for and against this, but they always
         | lack concrete numbers.
         | 
         | I'd love something like "Here is Qwen2.5 at Q4 quantization
         | running via Ollama + these settings, and M4 24GB RAM gets X
         | tokens/s while RTX 3090ti gets Y tokens/s", otherwise we're
         | just propagating mostly anecdotes without any reality-checks.
        
           | cruffle_duffle wrote:
            | I think we are somewhat still at the "fuzzy super early
            | adopter" stage of this local LLM game and hard data is not
            | going to be easy to come by. I almost want to use the phrase
            | "hobbyist stage", where almost all of the "data" and "best
            | practice" is anecdotal, but I think we are a step above that.
            | 
            | Still, it's way too early and there are simply way too many
            | hardware and software combinations that change almost weekly
            | to establish "the best practice hardware configuration for
            | training / inferencing large language models locally".
            | 
            | Some day there will be established guides with solid data. In
            | fact someday there will be PCs that specifically target LLMs
            | and will feature all kinds of stats aimed at getting you to
            | bust out your wallet. And I even predict they'll come up with
            | metrics that all the players will chase well beyond when
            | those metrics make sense (megapixels, clock frequency,
            | etc.)... but we aren't there yet!
        
             | diggan wrote:
             | Right, but how are we supposed to be getting anywhere else
             | unless people start being more specific and stop leaning on
             | anecdotes or repeating what they've heard elsewhere?
             | 
             | Saying "Apple seems to be somewhat equal to this other
              | setup" doesn't really help someone get an accurate picture
              | of whether it is equal or not, unless we start
             | including raw numbers, even if they aren't directly
             | comparable.
             | 
             | I don't think it's too early to say "I get X tokens/second
             | with this setup + these settings" because then we can at
             | least start comparing, instead of just guessing which seems
             | to be the current SOTA.
        
               | kamranjon wrote:
                | A great thread with the type of info you're looking for
               | lives here:
               | https://github.com/ggerganov/whisper.cpp/issues/89
               | 
               | But you can likely find similar threads for the llama.cpp
                | benchmark here:
                | https://github.com/ggerganov/llama.cpp/tree/master/examples/...
               | 
               | These are good examples because the llama.cpp and
               | whisper.cpp benchmarks take full advantage of the Apple
               | hardware but also take full advantage of non-Apple
               | hardware with GPU support, AVX support etc.
               | 
                | It's been true for a while now that the memory bandwidth
                | of modern Apple systems, in tandem with the neural cores
                | and GPU, has made them very competitive with Nvidia for
                | local inference and even basic training.
        
               | diggan wrote:
               | I guess I'm mostly lamenting about how unscientific these
               | discussions are in general, on HN and elsewhere (besides
               | specific GitHub repositories). Every community is filled
               | with just anecdotal stories, or some numbers but missing
               | to specify a bunch of settings + model + runtime details
               | so people could at least compare it to something.
               | 
               | Still, thanks for the links :)
        
               | t1amat wrote:
               | In fairness it's become even more difficult now than ever
               | before.
               | 
               | * hardware spec
               | 
               | * inference engine
               | 
               | * specific model - differences to tokenizer will make
               | models faster/slower with equivalent parameter count
               | 
               | * quantization used - and you need to be aware of
               | hardware specific optimizations for particular quants
               | 
               | * kv cache settings
               | 
               | * input context size
               | 
               | * output token count
               | 
               | This is probably not a complete list either.
        
               | nickthegreek wrote:
               | Best place to get that kinda info is gonna be
               | /r/LocalLlama
        
             | motorest wrote:
             | > I think we are somewhat still at the "fuzzy super early
             | adopter" stage of this local LLM game and hard data is not
             | going to be easy to come by.
             | 
             | What's hard about it? You get the hardware, you run the
             | software, you take measurements.
        
               | GTP wrote:
                | Yes, but we don't have enough people doing that to get
                | quality data. Not many people are building this kind of
                | setup, and even fewer are publishing their results.
                | Additionally, if I just run a test a couple of times and
                | then average the results, this is still far from a solid
                | measurement.
        
               | diggan wrote:
               | > but we don't have enough people doing that to get
               | quality data
               | 
               | But how are we supposed to get enough people doing those
               | things if everyone say "There isn't enough data right now
               | for it to be useful"? We have to start somewhere
        
               | unshavedyak wrote:
               | I don't think they're saying anything counter to that.
               | The people who don't require the volume of data will run
               | these. Ie the super early adopters.
        
               | colonCapitalDee wrote:
               | We've already started, we just haven't finished yet
        
           | fkyoureadthedoc wrote:
            | On an M1 Max 64GB laptop running gemma2:27b, same prompt and
            | settings from blog post:
            | 
            |     total duration:       24.919887458s
            |     load duration:        39.315083ms
            |     prompt eval count:    37 token(s)
            |     prompt eval duration: 963.071ms
            |     prompt eval rate:     38.42 tokens/s
            |     eval count:           441 token(s)
            |     eval duration:        23.916616s
            |     eval rate:            18.44 tokens/s
           | 
           | I have a gaming PC with a 4090 I could try, but I don't think
           | this model would fit
        
             | diggan wrote:
             | > gemma2:27b
             | 
             | What quantization are you using? What's the runtime+version
             | you run this with? And the rest of the settings?
             | 
             | Edit: Turns out parent is using Q4 for their test. Doing
             | the same test with LM Studio and a 3090ti + Ryzen 5950X
             | (with 44 layers on GPU, 2 on CPU) I get ~15 tokens/second.
        
               | fkyoureadthedoc wrote:
               | Fresh install from brew, ollama version is 0.5.7
               | 
                | The only settings I changed were the ones shown in the
                | blog post:
                | 
                |     OLLAMA_FLASH_ATTENTION=1
                |     OLLAMA_KV_CACHE_TYPE=q8_0
                | 
                | Ran the model like:
                | 
                |     ollama run gemma2:27b --verbose
               | 
               | With the same prompt, "Can you write me a story about a
               | tortoise and a hare, but one that involves a race to get
               | the most tokens per second?"
        
               | diggan wrote:
               | When you run that, what quantization do you get? The
               | library website of Ollama
                | (https://ollama.com/library/gemma2:27b) isn't exactly
                | great at surfacing useful information like what
               | the default quantization is.
        
               | fkyoureadthedoc wrote:
               | not sure how to tell, but here's the full output from
               | ollama serve https://pastes.io/ollama-run-gemma2-27b
        
               | diggan wrote:
                | Thanks, that seems to indicate Q4 for the quantization;
                | you're probably able to run that on the 4090 as well,
                | FWIW, since the size of the model is just 14.55 GiB.
        
               | navbaker wrote:
               | If you hit the drop-down menu for the size of the model,
               | then tap "view all", you will see the size and hash of
               | the model you have selected and can compare it to the
               | full list below it that has the quantization specs in the
               | name.
        
               | diggan wrote:
               | Still, I don't see a way (from the web library) to see
               | the default quantization (from Ollama's POV) at all, is
               | that possible somehow?
        
               | navbaker wrote:
               | The model displayed in the drop-down when you access the
               | web library is the default that will be pulled. Compare
               | the size and hash to the more detailed model listing
               | below it and you will see what quantization you have.
               | 
               | Example: the default model weights for Llama 3.3 70b,
               | after hitting the "view all" have this hash and size
               | listed next to it - a6eb4748fd29 * 43GB
               | 
               | Now scroll down through the list and you will find the
               | one that matches that hash and size is
               | "70b-instruct-q4_K_M". That tells you that the default
               | weights for Llama 3.3 70B from Ollama are 4-bit quantized
               | (q4) while the "K_M" tells you a bit about what
               | techniques were used during quantization to balance size
               | and performance.
        
               | mkesper wrote:
               | If you leave the :27b off from that URL you'll see the
               | default size which is 9b. Ollama seems to always use Q4_0
               | even if other quants are better.
        
               | rahimnathwani wrote:
               | gemma2:27b-instruct-q4_0 (checksum 53261bc9c192)
        
             | condiment wrote:
              | On a 3090 (24gb vram), same prompt & quant, I can report
              | more than double the tokens per second, and significantly
              | faster prompt eval:
              | 
              |     total_duration:       10530451000
              |     load_duration:        54350253
              |     prompt_eval_count:    36
              |     prompt_eval_duration: 29000000
              |     prompt_token/s:       1241.38
              |     eval_count:           460
              |     eval_duration:        10445000000
              |     response_token/s:     44.04
             | 
             | Fast prompt eval is important when feeding larger contexts
             | into these models, which is required for almost anything
             | useful. GPUs have other advantages for traditional ML,
             | whisper models, vision, and image generation. There's a lot
             | of flexibility that doesn't really get discussed when folks
             | trot out the 'just buy a mac' line.
             | 
             | Anecdotally I can share my revealed preference. I have both
             | an M3 (36gb) as well as a GPU machine, and I went through
             | the trouble of putting my GPU box online because it was so
             | much faster than the mac. And doubling up the GPUs allows
             | me to run models like the deepseek-tuned llama 3.3, with
             | which I have completely replaced my use of chatgpt 4o.
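             | 
             | (For anyone puzzling over the units: Ollama reports
             | durations in nanoseconds, so the token/s figures are just
             | the counts divided by the durations. A quick sketch:)
             | 
             |     # tokens/s = count / (duration_ns / 1e9)
             |     eval_count = 460
             |     eval_duration_ns = 10_445_000_000
             |     print(eval_count / (eval_duration_ns / 1e9))  # ~44.04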
        
               | svachalek wrote:
               | Thanks for the numbers! People should include their LLM
               | runner as well, I think, as there are differences in
               | hardware optimization support. I haven't tested it
               | myself, but I've heard MLX is noticeably faster than
               | Ollama on Macs.
        
           | un_ess wrote:
           | Per the screenshot, this is DeepSeek running on a 192GB M2
           | Studio https://nitter.poast.org/ggerganov/status/188461277009
           | 384272...
           | 
           | The same on Nvidia (various models)
           | https://github.com/ggerganov/llama.cpp/issues/11474
           | 
           | [1] This is the model: https://huggingface.co/unsloth/DeepS
           | eek-R1-GGUF/tree/main/De...
        
             | diggan wrote:
             | So Apple M2 Studio does ~15 tks/second and A100-SXM4-80GB
             | does 9 tks/second?
             | 
             | I'm not sure if I'm reading the results wrong or missing
             | some vital context, but that sounds unlikely to me.
        
               | achierius wrote:
               | The studio has a lot more ram available to the GPU (up to
               | 192gb) than the a100 (80gb), and iirc at least comparable
               | memory bandwidth -- those are what matter when you're
               | doing LLM inference, so the studio tends to win out
               | there.
               | 
               | Where the a100 and other similar chips dominate is in
               | training &c, which is mostly a question of flops.
        
               | diggan wrote:
               | > and iirc at least comparable memory bandwidth
               | 
               | I don't think they do.
               | 
               | From Wikipedia:
               | 
               | > the M2 Pro, M2 Max, and M2 Ultra have approximately 200
               | GB/s, 400 GB/s, and 800 GB/s respectively
               | 
               | From techpowerup:
               | 
               | > NVIDIA A100 SXM4 80 GB - Memory bandwidth - 2.04 TB/s
               | 
               | Even against the M2 Ultra that's roughly a 2.5x gap,
               | and that's just the bandwidth.
        
           | vladgur wrote:
           | As someone who is paying $0.50 per kWh, I'd also like to
           | see kWh per 1000 tokens or something, to give me a sense of
           | the cost of ownership of these local systems.
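           | 
           | Back of the envelope, assuming the ~300 W draw mentioned
           | elsewhere in this thread and ~10 tokens/s:
           | 
           |     # Rough energy/cost per 1000 generated tokens.
           |     watts, tok_per_s, usd_per_kwh = 300, 10, 0.50
           |     hours = 1000 / tok_per_s / 3600     # ~0.028 h
           |     kwh = watts / 1000 * hours          # ~0.0083 kWh
           |     print(kwh, kwh * usd_per_kwh)       # ~$0.004 / 1k tok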
        
         | ekianjo wrote:
         | A Mac Mini will be very slow for context ingestion compared
         | to an Nvidia GPU, and the other issue is that they are not
         | usable for Stable Diffusion... So if you just want to use
         | LLMs, maybe, but if you have other interests in AI models,
         | it's probably not the right answer.
        
           | drcongo wrote:
           | I use a Mac Studio for Stable Diffusion, what's special about
           | the Mac Mini that means it won't work?
        
         | JKCalhoun wrote:
         | For this use case though, I would prefer something more modular
         | than Apple hardware -- where down the road I could upgrade the
         | GPUs, for example.
        
         | motorest wrote:
         | > For the same price ($1799) you could buy a Mac Mini with 48gb
         | of unified memory and an m4 pro.
         | 
         | Around half that price tag was attributed to the blogger
         | reusing an old workstation he had lying around. Beyond this
         | point, OP slapped two graphics cards into an old rig. A better
         | description would be something like "what buying two graphics
         | cards gets you in terms of AI".
        
           | Capricorn2481 wrote:
           | > Beyond this point, OP slapped two graphics cards into an
           | old rig
           | 
           | Meaning what? This is largely what you do on a budget since
           | RAM is such a difference maker in token generation. This is
           | what's recommended. OP could buy an a100, but that wouldn't
           | be a budget build.
        
         | oofbaroomf wrote:
         | The bottleneck for single batch inference is memory bandwidth.
         | The M4 Pro has less memory bandwidth than the P40, so it would
         | be slower. Also, the setup presented in the OP has system
         | RAM, allowing you to run larger models than what fits in
         | 48GB of VRAM (and with good speeds too if you offload with
         | something like ktransformers).
        
           | anthonyskipper wrote:
           | >>M4 Pro has less memory bandwidth than the P40, so it would
           | be slower
           | 
           | Why do you say this? I thought the p40 only had a memory
           | bandwidth of 346 Gbytes/sec. The m4 is 546 GB/s. So the
           | macbook should kick the crap out of the p40.
        
             | oofbaroomf wrote:
             | The M4 Max has up to 546 GB/s. The M4 Pro, what GP was
             | talking about, has only 273 GB/s. An M4 Max with that much
             | RAM would most likely exceed OP's budget.
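             | 
             | A rough way to sanity-check these figures: single-batch
             | decode is bounded by memory bandwidth divided by the
             | bytes read per token, which for a dense model is roughly
             | the size of the quantized weights. A sketch with the
             | numbers quoted in this thread:
             | 
             |     # Crude upper bound: bandwidth / quantized model size.
             |     model_gb = 14.55 * 2**30 / 1e9  # gemma2:27b at Q4
             |     for name, bw in [("P40", 346), ("M4 Pro", 273),
             |                      ("M4 Max", 546)]:
             |         print(name, round(bw / model_gb), "tok/s ceiling")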
        
         | UncleOxidant wrote:
         | I wish Apple would offer a 128GB option in the Mac Mini - That
         | would require an M4 Max which they don't offer in the mini. I
         | know they have a MBP with M4 Max and 128GB, but I don't need
         | another laptop.
        
           | kridsdale1 wrote:
           | I'm waiting until this summer with the M4 Ultra Studio.
        
       | zinccat wrote:
       | P40s don't support fp16 well; buy a 3090 instead.
        
         | JKCalhoun wrote:
         | Cost is about 4X for the 24 GB 3090 on eBay.
        
       | gwern wrote:
       | > Another important finding: Terry is by far the most popular
       | name for a tortoise, followed by Turbo and Toby. Harry is a
       | favorite for hares. All LLMs are loving alliteration.
       | 
       | Mode-collapse. One reason the tuned (or tuning-contaminated)
       | models are bad for creative writing: every protagonist and
       | place seems to be named the same thing.
        
         | diggan wrote:
         | Couldn't you just up the temperature/change some other
         | parameter to get it to be more random/"creative"? It wouldn't
         | be active/intentional randomness/novelty like what a human
         | would do, but at least it shouldn't generate exactly the same
         | naming.
        
           | gwern wrote:
           | No. The collapse is not simply a matter of shifting down
           | most of the logits, so ramping up the temperature does
           | little until outputs start degenerating.
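           | 
           | If you want to see this for yourself, samplers are
           | per-request options in Ollama's API, so it's cheap to sweep
           | the temperature and watch the names stay the same until the
           | output falls apart. A rough sketch, assuming a local
           | Ollama:
           | 
           |     import json, urllib.request
           |     URL = "http://localhost:11434/api/generate"
           |     def ask(temp):
           |         body = json.dumps({
           |             "model": "gemma2:27b",
           |             "prompt": "Name a pet tortoise.",
           |             "options": {"temperature": temp},
           |             "stream": False}).encode()
           |         req = urllib.request.Request(
           |             URL, data=body,
           |             headers={"Content-Type": "application/json"})
           |         return json.load(urllib.request.urlopen(req))["response"]
           |     for t in (0.2, 0.8, 1.5, 2.0):
           |         print(t, ask(t))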
        
       | brianzelip wrote:
       | Useful recent podcast about homelab LLMs,
       | https://changelog.com/friends/79
        
       | DrPhish wrote:
       | This is just a limited recreation of the ancient mikubox from
       | https://rentry.org/lmg-build-guides
       | 
       | It's funny to see people independently "discover" these
       | builds that are a year-plus old.
       | 
       | Everyone is sleeping on these guides, but I guess the stink of
       | 4chan scares people away?
        
         | cratermoon wrote:
         | "ancient" guide. Pub: 10 May 2024 21:48 UTC
        
       | miniwark wrote:
       | 2 x Nvidia Tesla P40 cards for EUR660 is not a thing I
       | consider to be "on a budget".
       | 
       | People can play with "small" or "medium" models on less
       | powerful and cheaper cards. An Nvidia GeForce RTX 3060 card
       | with "only" 12GB VRAM can be found for around EUR200-250 on
       | the second-hand market (and they are around 300~350 new).
       | 
       | In my opinion, 48GB of VRAM is overkill to call it "on a
       | budget"; for me this setup is nice, but it's for semi-
       | professional or professional usage.
       | 
       | There is of course a trade-off in using medium or small
       | models, but being "on a budget" is also about making
       | trade-offs.
        
         | whywhywhywhy wrote:
         | > An Nvidia GeForce RTX 3060 card with "only" 12GB VRAM can
         | be found for around EUR200-250 on the second-hand market
         | 
         | A 1080 Ti might even be a better option; it also has a 12GB
         | model and some reports say it even outperforms the 3060, in
         | non-RTX workloads I presume.
        
           | Eisenstein wrote:
           | CUDA compute capability is a big deal. The 1080 Ti is 6.1;
           | the 3060 is 8.6, and it also has tensor cores.
           | 
           | Note that CUDA version numbers are confusing: the compute
           | capability is a different thing than the runtime/driver
           | version.
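           | 
           | An easy way to check what a given card reports, assuming a
           | working PyTorch + CUDA install:
           | 
           |     import torch
           |     # Prints name and (major, minor) compute capability,
           |     # e.g. (6, 1) for a 1080 Ti, (8, 6) for a 3060.
           |     if torch.cuda.is_available():
           |         print(torch.cuda.get_device_name(0),
           |               torch.cuda.get_device_capability(0))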
        
           | Melatonic wrote:
           | Not sure what used prices are like these days but the Titan
           | XP (similar to the 1080 ti) is even better
        
         | mock-possum wrote:
         | Yeesh, yeah, that was my first thought too - _whose_ budget??
         | 
         | less than $500 total feels more fitting as a 'budget' build -
         | EUR1700 is more along the lines of 'enthusiast' or less
         | charitably "I am rich enough to afford expensive hobbies"
         | 
         | If it's your business and you expect to recoup the cost and
         | write it off on your taxes, that's one thing - but if you're
         | just looking to run a personal local LLM for funnies, that's
         | not an accessible price tag.
         | 
         | I suppose "or you could just buy a Mac" should have tipped me
         | off though.
        
       | cratermoon wrote:
       | From the article: In the future, I fully expect to be able to
       | have a frank and honest discussion about the Tiananmen events
       | with an American AI agent, but the only one I can afford will
       | have assumed the persona of Father Christmas who, while holding a
       | can of Coca-Cola, will intersperse the recounting of the tragic
       | events with a joyful "Ho ho ho... Didn't you know? The holidays
       | are coming!"
       | 
       | How unfortunate that people are discounting the likelihood
       | that American AI agents will avoid saying things their masters
       | think should not be said. Anyone want to take bets on when the
       | big 3 (OpenAI, Meta, and Google) will quietly remove anything
       | to do with DEI, trans people, or global warming? They'll start
       | out changing all mentions of "Gulf of Mexico" to "Gulf of
       | America", but then what?
        
         | CamperBob2 wrote:
         | It is easy to get a local R1 model to talk about Tiananmen
         | Square to your heart's content. Telling it to replace
         | problematic terms with "Smurf" or another nonsense word is very
         | effective, but with the local model you don't even have to do
         | that in many cases. (e.g., https://i.imgur.com/btcI1fN.png)
        
       | birktj wrote:
       | I was wondering if anyone here has experimented with running a
       | cluster of SBC for LLM inference? Ex. the Radxa ROCK 5C has 32GB
       | of memory and also a NPU and only costs about 300 euros. I'm not
       | super up to date on the architecture of modern LLMs, but as far
       | as I understand you should be able to split the layers between
       | multiple nodes? It is not that much data that needs to be sent
       | between them, right? I guess you won't get quite the same
       | performance as a modern mac or nvidia GPU, but it could be quite
       | acceptable and possibly a cheap way of getting a lot of memory.
       | 
       | On the other hand I am wondering about what is the state of the
       | art in CPU + GPU inference. Prompt processing is both compute and
       | memory constrained, but I think token generation afterwards is
       | mostly memory bound. Are there any tools that support loading a
       | few layers at a time into a GPU for initial prompt processing and
       | then switches to CPU inference for token generation? Last time I
       | experimented it was possible to run some layers on the GPU and
       | some on the CPU, but to me it seems more efficient to run
       | everything on the GPU initially (but a few layers at a time so
       | they fit in VRAM) and then switch to the CPU when doing the
       | memory bound token generation.
        
         | Eisenstein wrote:
         | > I was wondering if anyone here has experimented with running
         | a cluster of SBC for LLM inference? Ex. the Radxa ROCK 5C has
         | 32GB of memory and also a NPU and only costs about 300 euros.
         | 
         | Look into RPC. Llama.cpp supports it.
         | 
         | *
         | https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...
         | 
         | > Last time I experimented it was possible to run some layers
         | on the GPU and some on the CPU, but to me it seems more
         | efficient to run everything on the GPU initially (but a few
         | layers at a time so they fit in VRAM) and then switch to the
         | CPU when doing the memory bound token generation.
         | 
         | Moving layers over the PCIe bus to do this is going to be slow,
         | which seems to be the issue with that strategy. I think the
         | key is to use MoE and be smart about which layers go where.
         | This project seems to be doing that with great results:
         | 
         | * https://github.com/kvcache-
         | ai/ktransformers/blob/main/doc/en...
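         | 
         | For the simpler static split (some layers in VRAM, the rest
         | on the CPU) that llama.cpp already supports, the Python
         | bindings expose it as n_gpu_layers. A minimal sketch, with a
         | hypothetical model path:
         | 
         |     from llama_cpp import Llama
         |     # Layers up to n_gpu_layers stay in VRAM; the rest run
         |     # on the CPU. Tune the number until VRAM is full.
         |     llm = Llama(model_path="models/gemma-2-27b-q4_k_m.gguf",
         |                 n_gpu_layers=24, n_ctx=8192)
         |     out = llm("Name a pet tortoise.", max_tokens=32)
         |     print(out["choices"][0]["text"])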
        
       | hemant1041 wrote:
       | Interesting read!
        
       | ollybee wrote:
       | The middle ground is to rent a GPU VPS as needed. You can get an
       | H100 for $2/h. Not quite the same privacy as fully local offline,
       | but better than a SaaS API and good enough for me. Hopefully in a
       | year or three it will truly be cost effective to run something
       | useful locally and then I can switch.
        
         | 1shooner wrote:
         | Do you have a recommended provider or other pointers for GPU
         | rental?
        
         | anonzzzies wrote:
         | That is what I do but it costs a lot of $, more than just using
         | openrouter. I would like to have a machine so I can have a
         | model talk to itself 24/7 for a relatively fixed price. I have
         | enough solar and wind + cheap net electricity, so it would
         | basically be free after buying. Just hard to pick what to buy
         | without forking out a fortune on GPUs.
        
       | renewiltord wrote:
       | I have 6x 4090s in a rack with an Epyc driving them but tbh I am
       | selling them all to get a Mac Studio. Simpler to work with I
       | think.
        
       | _boffin_ wrote:
       | For around 1800, I was able to get myself a Dell T5820 with 2x
       | Dell 3090s. Can't complain at all.
        
       | almosthere wrote:
       | I bought an M4 Mac Mini (the cheapest one) at Costco for $559,
       | and while I don't know exactly how many tokens per second it
       | gets, it seems to generate text from Llama 3.2 (through
       | Ollama) as fast as ChatGPT.
        
       | cwoolfe wrote:
       | As others have said, a high powered Mac could be used for the
       | same purpose at a comparable price and lower power usage. Which
       | makes me wonder: why doesn't Apple get into the enterprise AI
       | chip game and compete with Nvidia? They could design their own
       | ASIC for it with all their hardware & manufacturing knowledge.
       | Maybe they already are.
        
         | gmueckl wrote:
         | The primary market for such a product would be businesses. And
         | Apple isn't particularly good at selling to companies. The
         | consumer product focus may just be too ingrained to be
         | successful with such a move.
         | 
         | A beefed-up HomePod with a local LLM-based assistant would
         | be a more typical Apple product. But they'd probably need
         | LLMs to become much, much more reliable to not ruin their
         | reputation over this.
        
           | fragmede wrote:
           | Why? Siri's still total crap but that doesn't seem to have
           | slowed down iPhone sales.
        
             | gmueckl wrote:
             | Siri mostly hit the expectations they themselves were able
             | to set through their ads when launching that product -
             | having a voice based assistant at all was huge back then.
             | With an LLM-based assistant, the market has set the
             | expectations for them and they are just unreasonably high
             | and don't mirror reality. That's a potentially big trap for
             | Apple now.
        
           | lolinder wrote:
           | > And Apple isn't particularly good at selling to companies.
           | 
           | With a big glaring exception: developer laptops are
           | overwhelmingly Apple's game right now. It seems like they
           | should be able to piggyback off of that, given that the
           | decision makers are going to be in the same branch of the
           | customer company.
        
         | jrm4 wrote:
         | For roughly the same reason Steve Jobs et al killed Hypercard;
         | too much power to the users.
        
       | lewisl9029 wrote:
       | This article is coming out at an interesting time for me.
       | 
       | We probably have different definitions for "budget", but I just
       | ordered a super janky eGPU setup for my very dated 8th gen Intel
       | NUC, with an m2->pcie adapter, a PSU, and a refurb Intel A770 for
       | about 350 all-in, not bad considering that's about the cost of a
       | proper Thunderbolt eGPU enclosure alone.
       | 
       | The overall idea: A770 seems like a really good budget LLM GPU
       | since it has more memory (16GB) and more memory bandwidth
       | (512GB/s) than a 4070, but costs a tiny fraction. The m2-pcie
       | adapter should give it a bit more bandwidth to the rest of the
       | system than Thunderbolt as well, so hopefully it'll make for a
       | decent gaming experience too.
       | 
       | If the eGPU part of the setup doesn't work out for some reason,
       | I'll probably just bite the bullet and order the rest of the PC
       | for a couple hundred more, and return the m2-pcie adapter (I got
       | it off of Amazon instead of Aliexpress specifically so I could do
       | this), looking to end up somewhere around 600 bux total. I think
       | that's probably a more reasonable price of entry for something
       | like this for most people.
       | 
       | Curious if anyone else has experience with the A770 for LLM? Been
       | looking at Intel's https://github.com/intel/ipex-llm project and
       | it looked pretty promising, that's what made me pull the trigger
       | in the end. Am I making a huge mistake?
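       | 
       | In case it helps anyone weighing the same route: ipex-llm's
       | basic flow is a transformers-style API plus an "xpu" device. A
       | sketch from memory, so treat the exact imports and kwargs as
       | assumptions and check the ipex-llm docs:
       | 
       |     from ipex_llm.transformers import AutoModelForCausalLM
       |     from transformers import AutoTokenizer
       |     mid = "meta-llama/Llama-3.2-3B-Instruct"  # arbitrary pick
       |     model = AutoModelForCausalLM.from_pretrained(
       |         mid, load_in_4bit=True).to("xpu")
       |     tok = AutoTokenizer.from_pretrained(mid)
       |     ids = tok("Tell me a tortoise joke.",
       |               return_tensors="pt").input_ids.to("xpu")
       |     out = model.generate(ids, max_new_tokens=64)
       |     print(tok.decode(out[0], skip_special_tokens=True))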
        
         | UncleOxidant wrote:
         | > refurb Intel A770 for about 350
         | 
         | I'm seeing A770s for about $500-$550. Where did you find a
         | refurb one for $350 (or less, since you're also including
         | other parts of the system)?
        
           | lewisl9029 wrote:
           | I got this one from Acer's Ebay for $220:
           | https://www.ebay.com/itm/266390922629
           | 
           | It's out of stock now unfortunately, but it does seem to pop
           | up again from time to time according to Slickdeals: https://s
           | lickdeals.net/newsearch.php?q=a770&pp=20&sort=newes...
           | 
           | I would probably just watch the listing and/or set up a deal
           | alert on Slickdeals and wait. If you're in a hurry though,
           | you can probably find a used one on Ebay for not too much
           | more.
        
       | jll29 wrote:
       | Can we just call it "PC to run ML models on" on a budget?
       | 
       | "AI computer" sounds pretentious and misleading to outsiders.
        
       | whalesalad wrote:
       | How much credit does this get you with any cloud provider, or
       | in OpenAI/Anthropic API credits? For ~$1700 I can accomplish
       | way more without local hardware. Don't get me wrong, I enjoy
       | tinkering and building projects like this - but it doesn't
       | make financial sense to me here. Unless of course you live
       | 100% off-grid and have Stallman-level privacy concerns.
       | 
       | Of course I do want my own local GPU compute setup, but the juice
       | just isn't worth the squeeze.
        
       | gytisgreitai wrote:
       | 1.7kEUR and 300w for a playground. Man this world is getting
       | crazy and I'm getting f-kin old by not understanding it.
        
       | asasidh wrote:
       | You can run 32B and even 70B (a bit slowly) models on an M4
       | Pro Mac Mini with 48 GB RAM, out of the box using Ollama. If
       | you enjoy putting together a desktop, that's understandable.
       | 
       | https://deepgains.substack.com/p/running-deepseek-locally-fo...
        
       ___________________________________________________________________
       (page generated 2025-02-11 23:00 UTC)