[HN Gopher] Building a personal, private AI computer on a budget
___________________________________________________________________
Building a personal, private AI computer on a budget
Author : marban
Score : 296 points
Date : 2025-02-10 11:59 UTC (1 day ago)
(HTM) web link (ewintr.nl)
(TXT) w3m dump (ewintr.nl)
| hexomancer wrote:
| Isn't the fact the P40 has horrible fp16 performance a deal
| breaker for local setups?
| behohippy wrote:
| You probably won't be running fp16 anything locally. We
| typically run Q5 or Q6 quants to maximize the size of the model
| and context length we can run with the VRAM we have available.
| The quality loss is negligible at Q6.
| Eisenstein wrote:
| But the inference doesn't necessarily run at the quant
| precision.
| wkat4242 wrote:
| As far as I understand it does if you quantize the K/V
| cache as well (the context). And that's pretty standard now
| because it can increase maximum context size a lot.
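| A rough back-of-the-envelope for why that helps, assuming a
| Llama-3-70B-style config (80 layers, 8 KV heads via GQA, head
| dim 128) -- placeholder numbers, not taken from the article:
|
|     # K and V each store n_kv_heads * head_dim values per layer per token
|     def kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
|                     context=32768, bytes_per_elem=2):
|         return (2 * n_layers * n_kv_heads * head_dim
|                 * context * bytes_per_elem) / 1e9
|
|     print(kv_cache_gb(bytes_per_elem=2))  # fp16 cache: ~10.7 GB
|     print(kv_cache_gb(bytes_per_elem=1))  # q8_0 cache: ~5.4 GB
|
| Roughly halving the cache means roughly double the context in the
| same VRAM, which is where the big gains in maximum context come from.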
| numpad0 wrote:
| Is it even cheaper in $/GB than a used Vega 56 (8GB HBM2)?
| There are mining boards with a bunch of x1 slots that could
| probably run half a dozen of them for the same 48GB.
| 42lux wrote:
| Would take a bunch of time just to load the model...
| magicalhippo wrote:
| AFAIK this doesn't really work for interactive use, as LLMs
| process data serially. So your request needs to pass through
| all of the cards for each token, one at a time. Thus a lot of
| PCIe traffic and hence latency. Better than nothing, but only
| really useful if you can batch requests so you can keep each
| GPU working all the time, rather than just one at a time.
| Havoc wrote:
| As best I can tell, most of the disadvantages relate to larger
| batches. And for home use you're likely running a batch of 1
| anyway.
| reacharavindh wrote:
| The thing is though... the locally hosted models on such
| hardware are cute as toys, and sure, they write funny jokes and,
| importantly, perform private tasks that I would never consider
| passing to non-selfhosted models, but they pale in comparison to
| the models accessible over APIs (Claude 3.5 Sonnet, OpenAI, etc.).
| If I could run deepseek-r1-671b locally without breaking the bank,
| I would. But, for now, opex > capex at a consumer level.
| walterbell wrote:
| 200+ comments, https://news.ycombinator.com/item?id=42897205
|
| _> This runs the 671B model in Q4 quantization at 3.5-4.25 TPS
| for $2K on a single socket Epyc server motherboard using 512GB
| of RAM._
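| Those numbers are roughly consistent with a simple bandwidth
| estimate, assuming DeepSeek R1 activates ~37B of its 671B
| parameters per token (it's an MoE) and the board gets ~200 GB/s
| from 8 channels of DDR4 -- assumed figures, not from the linked
| thread:
|
|     active_params = 37e9        # ~37B active params per token (MoE)
|     bytes_per_param = 4.5 / 8   # Q4-ish quantization, ~4.5 bits/weight
|     bandwidth = 200e9           # single-socket Epyc, 8ch DDR4, ~200 GB/s
|
|     bytes_per_token = active_params * bytes_per_param  # ~21 GB read/token
|     print(bandwidth / bytes_per_token)                 # ~9.6 tok/s ceiling
|
| Real-world 3.5-4.25 TPS sits a factor of 2-3 below that ceiling,
| which is about what you'd expect once overhead is included.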
| elorant wrote:
| "Runs" is an overstatement though. With 4 tokens/second you
| can't use it in production.
| deadbabe wrote:
| Isn't 4 tps good enough for local use by a single user,
| which is the point of a personal AI computer?
| ErikBjare wrote:
| I tend to get impatient at less than 10tok/s: If the
| answer is 600tok (normal for me) that's a minute.
| JKCalhoun wrote:
| It is for me. I'm happy to switch over to another task
| (and maybe that task is refilling my coffee) and come
| back when the answer is fully formed.
| walterbell wrote:
| Some LLM use cases are async, e.g. agents, "deep research"
| clones.
| diggan wrote:
| Not to mention even simpler things, like wanting to tag
| all of your local notes based on the content: basically a
| bash loop you can run indefinitely, where speed doesn't
| matter much as long as it eventually finishes.
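| A minimal sketch of that kind of loop (in Python rather than bash,
| against Ollama's local HTTP API; the model name and notes folder
| are placeholders, not something the commenter specified):
|
|     import glob, requests
|
|     for path in glob.glob("notes/*.md"):
|         text = open(path).read()
|         r = requests.post("http://localhost:11434/api/generate", json={
|             "model": "llama3.2",
|             "prompt": "Suggest 3 short tags for this note:\n\n" + text,
|             "stream": False,
|         })
|         # Append the suggested tags to the note; speed barely matters here.
|         with open(path, "a") as f:
|             f.write("\n<!-- tags: " + r.json()["response"].strip() + " -->\n")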
| mechagodzilla wrote:
| I have a similar setup running at about 1.5 tokens/second,
| and it's perfectly usable for the sorts of difficult tasks
| one needs a frontier model like this for - give it a prompt
| and come back an hour or two later. You interact with it
| like e-mailing a coworker. If I need an answer back in
| seconds, it's probably not a very complicated question, and
| a much smaller model will do.
| xienze wrote:
| I get where you're coming from, but the problem with LLMs
| is that you very regularly need a lot of back-and-forth
| with them to tease out the information you're looking
| for. A more apt analogy might be a coworker that you have
| to follow up with three or four times, at an hour per.
| Not so appealing anymore. Doubly so when you have to
| stand up $2k+ of hardware for the privilege. If I'm
| paying good money to host something locally, I want
| decent performance.
| unshavedyak wrote:
| Agreed. Furthermore, for some tasks like large-context
| code assistant windows I want really fast responses. I've
| not found a UX I'm happy with yet, but for anything I care
| about I'd want very fast token responses. Small blocks of
| code which instantly autocomplete, basically.
| MonkeyClub wrote:
| > If I'm paying good money to host something locally
|
| The thing is, however, that at 2k one is not paying good
| money, one is paying near the least amount possible. TFA
| specifically is about building a machine on a budget, and
| as such cuts corners to save costs, e.g. by buying older
| cards.
|
| Just because 2k is not a negligible amount in itself,
| that doesn't also automatically make it adequate for the
| purpose. Look for example at the 15k, 25k, and 40k price
| range tinyboxes:
|
| https://tinygrad.org/#tinybox
|
| It's like buying a 2k-worth used car, and expecting it to
| perform as well as a 40k one.
| Aurornis wrote:
| > give it a prompt and come back an hour or two later.
|
| This is the problem.
|
| If your use case is getting a small handful of non-urgent
| responses per day then it's not a problem. That's not how
| most people use LLMs, though.
| CamperBob2 wrote:
| What I'd like to know is how well those dual-Epyc machines
| run the 1.58 bit dynamic quant model. It really does seem
| to be almost as good as the full Q8.
| Cascais wrote:
| I agree with elorant. Indirectly, some YouTubers ended up
| demonstrating that it's difficult to run the best models
| for less than $7k, even if NVIDIA hardware is very
| efficient.
|
| In the future, I expect this to not be the case, because
| models will be far more efficient. At this pace, maybe even
| 6 months can make a difference.
| vanillax wrote:
| Huh? Toys? You can run DeepSeek 70B on a 36GB RAM MacBook Pro.
| You can run Phi-4, Qwen2.5, or Llama 3.3. They work great for
| coding tasks.
| 3s wrote:
| Yeah but as one of the replies points out the resulting
| tokens/second would be unusable in production environments
| CamperBob2 wrote:
| The 1.58-bit DeepSeek R1 dynamic quant model from Unsloth is no
| joke. It just needs a lot of RAM and some patience.
| jaggs wrote:
| There seems to be a LOT of work going on to optimize the
| 1.58-bit option in terms of hardware and add-ons. I get the
| feeling that someone from Unsloth is going to have a genuine
| breakthrough shortly, and the rig/compute costs are going to
| plummet. Hope I'm not being naive or over-confident.
| cratermoon wrote:
| This is not because the models are better. These services have
| unknown and opaque levels of shadow prompting[1] to tweak the
| behavior. The subject article even mentions "tweaking their
| outputs to the liking of whoever pays the most". The more I
| play with LLMs locally, the more I realize how much prompting
| going on under the covers is shaping the results from the big
| tech services.
|
| 1 https://www.techpolicy.press/shining-a-light-on-shadow-
| promp...
| dgrabla wrote:
| Great breakdown! The "own your own AI" at home is a terrific
| hobby if you like to tinker, but you are going to spend a ton of
| time and money on hardware that will be underutilized most of the
| time. If you want to go nuts check out Mitko Vasilev's dream
| machine. It makes no sense if you don't have a very clear use
| case that only requires small models or really slow token
| generation speeds.
|
| If the goal however is not to tinker but to really build and
| learn AI, it is going to be financially better to rent those
| GPUs/TPUs as needs arise.
| memhole wrote:
| This is correct. The cost makes no sense outside of hobby and
| interest. You're far better off renting. I think there is some
| merit to having a local inference server if you're doing
| development. You can manage models and have a little more
| control over your infra as the main benefits.
| theshrike79 wrote:
| Any M-series Mac is "good enough" for home LLMs. Just grab LM
| studio and a model that fits in memory.
|
| Yes, it will not rival OpenAI, but it's 100% local with no
| monthly fees and depending on the model no censoring or limits
| on what you can do with it.
| lioeters wrote:
| > spend a ton of time and money
|
| Not necessarily. For non-professional purposes, I've spent zero
| dollars (no additional memory or GPU) and I'm running a local
| language model that's good enough to help with many kinds of
| tasks including writing, coding, and translation.
|
| It's a personal, private, budget AI that requires no network
| connection or third-party servers.
| ImPostingOnHN wrote:
| on what hardware (and how much did you spend on it)?
| JKCalhoun wrote:
| Terrific hobby? Sign me up!
| jrm4 wrote:
| For what purpose? I'm asking this as someone who threw in one of
| the cheap $500 Nvidia cards with 16GB of VRAM and I'm already
| overwhelmed with what I can do with Ollama, Krita+ComfyUI,
| etc.
| rcarmo wrote:
| Given the power and noise involved, a Mac Mini M4 seems like a
| much nicer approach, although the RAM requirements will drive up
| the price.
| axegon_ wrote:
| I did something similar but using a K80 and M40 I dug up from
| eBay for pennies. Be advised though, stay as far away as possible
| from the K80 - the drivers were one of the most painful tech
| things I've ever had to endure, even if 24GB of VRAM for 50 bucks
| sounds incredibly appealing. That said, I had a decent-ish HP
| workstation lying around with a 1200 watt power supply, so I had
| somewhere to put those two. The one thing to note here is that
| these types of GPUs do not have cooling of their own. My
| solution was to 3D print a bunch of brackets, attach several
| Noctua fans, and have them blow at full speed 24/7. Surprisingly
| it worked way better than I expected - I've never gone above 60
| degrees. As a side effect, the CPUs are also benefiting from
| this hack: at idle, they are in the mid-20 degrees range. Mind
| you, the Noctua fans are located on the front and the back of the
| case: the ones on the front act as intake, the ones on the
| back as exhaust, and there are two more inside the case
| stuck in front of the GPUs.
|
| The refurbished workstation was just over 600 bucks, plus
| another 120 bucks for the GPUs and another ~60 for the fans.
|
| Edit: and before someone asks - no, I have not uploaded the STLs
| anywhere because I haven't had the time, and also since this is a
| very niche use case, though I might: the back (exhaust) bracket
| came out brilliantly on the first try - it was a sub-millimeter
| fit. Then I got cocky, thought I'd also nail the intake on the
| first try, and ended up re-printing it 4 times.
| JKCalhoun wrote:
| Curious what HP workstation you have?
| 9front wrote:
| HP Z440, it's in the article.
| egorfine wrote:
| > K80 - the drivers were one of the most painful tech things
| I've ever had to endure
|
| Well, for a dedicated LLM box it might be feasible to suffer
| with drivers a bit, no? What was your experience like with the
| software side?
| deadbabe wrote:
| What's the most pain you've ever felt?
| apples_oranges wrote:
| Does using 2x24GB VRAM mean that the model can be fully loaded
| into memory if it's between 24 and 48 GB in size? I somehow doubt
| it, at least ollama wouldn't work like that I think. But does
| anyone know?
| memhole wrote:
| No. Hopefully someone with more knowledge can explain better,
| but my understanding is that you also need room for the KV
| cache. You also need to factor in the size of the context
| window. If anyone has good resources on this, that would be
| awesome. Presently, it feels very much like a dark art to host
| these without crashing or being massively over-provisioned.
| htrp wrote:
| The dark art is to massively overprovision hardware.
| cratermoon wrote:
| Thus you get OpenAI spending billions while DeepSeek comes
| along with, shocking, _actual understanding of the hardware
| and how to optimize for it_ [1] and spends $6 million [2]
|
| 1. https://arxiv.org/abs/2412.19437v1
|
| 2. Quibble over the exact figure. Far less than OpenAI,
| doing more with less.
| Eisenstein wrote:
| No, you need to have extra space for the context (which
| requires more space the larger the model is).
|
| But it should be said that judging model quality by its size in
| GB is like judging a video by its size in GB. You can have the
| same video be small or huge, with anywhere from negligible to
| huge differences in quality between the two.
|
| You will be running quantized model weights, which can range
| in precision from 1 to 16 bits per parameter (the B is for
| billions of parameters in the model name). At Q8, the weights
| generally take about as many GB as the parameter count in
| billions (Llama 3 8B at Q8 would be ~8GB). There are many
| different strategies for quantizing as well, so this is just a
| rough guide.
|
| So basically if you can't fit the 48GB model into your 48GB of
| VRAM, just download a lower precision quant.
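| As a rough rule of thumb, a sketch of that estimate (the ~10%
| fudge factor for context and runtime overhead is my own
| assumption, not a hard number):
|
|     def approx_vram_gb(params_billion, bits_per_weight, overhead=1.1):
|         # weights take params * bits/8 bytes, plus some headroom for
|         # the KV cache and runtime buffers
|         return params_billion * bits_per_weight / 8 * overhead
|
|     print(approx_vram_gb(70, 4))   # 70B at Q4 -> ~38.5 GB
|     print(approx_vram_gb(70, 8))   # 70B at Q8 -> ~77 GB
|     print(approx_vram_gb(8, 8))    # 8B at Q8  -> ~8.8 GB
|
| So a 70B model at Q4 fits in the 48GB of this build, while at Q8 it
| does not.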
| michaelt wrote:
| For a great many LLMs, you can find someone on HuggingFace who
| has produced a set of different quantised versions, with
| approximate RAM requirements.
|
| For example, if you want to run "CodeLlama 70B" from
| https://huggingface.co/TheBloke/CodeLlama-70B-Python-GGUF
| there's a table saying the "Q4_K_M" quantised version is a
| 41.42 GB download and runs in 43.92 GB of memory.
| brador wrote:
| You can run an 8B model locally on the latest iPhones.
| 0xEF wrote:
| How useful is this, though? In my modest experience, these tiny
| models aren't good for much more than tinkering with,
| definitely not something I'd integrate into my workflow since
| the output quality is pretty low.
|
| Again, though, my experience is limited. I imagine others know
| something I do not and would absolutely love to hear more from
| people who are running tiny models on low-end hardware for
| things like code assistance, since that's where the use-case
| would lie for me.
|
| At the moment, I subscribe to "cloud" models that I use for
| various tasks and that seems to be working well enough, but it
| would be nice to have a personal model that I could train on
| very specific data. I'm sure I am missing something, since it's
| also hard to keep up with all the developments in the
| Generative AI world.
| DemetriousJones wrote:
| I tried running the 8B model on my 8GB M2 Macbook Air through
| Ollama and it was awful. It took ages to do anything and the
| responses were bad at best.
| redman25 wrote:
| Doesn't 8B need at least 16GB of RAM? Otherwise you're
| swapping, I would imagine...
| blebo wrote:
| Depends on quantization selected - see
| https://www.canirunthisllm.net/
| jmyeet wrote:
| The author mentions it but I want to expand on it: Apple is a
| seriously good option here, specifically the M4 Mac Mini.
|
| What makes Apple attractive is (as the author mentions) that RAM
| is shared between main and video RAM whereas NVidia is quite
| intentionally segmenting the market and charging huge premiums
| for high VRAM cards. Here are some options:
|
| 1. Base $599 Mac Mini: 16GB of RAM. Stocked in store.
|
| 2. $999 Mac Mini: 24GB of RAM. Stocked in store.
|
| 3. Add RAM to either of the above up to 32GB. It's not cheap at
| $200/8GB but you can buy a Mac Mini with 32GB of shared RAM for
| $999, substantially cheaper than the author's PC build but less
| storage (although you can upgrade that too).
|
| 4. M4 Pro: $1399 w/ 24GB of RAM. Stocked in store. You can
| customize this all the way to 64GB of RAM for +$600 so $1999 in
| total. That is amazing value for this kind of workload.
|
| 5. The Mac Studio is really the ultimate option. Way more cores
| and you can go all the way to 192GB of unified memory (for a
| $6000 machine). The problem here is that the Mac Studio is old,
| still on the M2 architecture. An M4 Ultra update is expected
| sometime this year, possibly late this year.
|
| 6. You can get into clustering these (eg [1]).
|
| 7. There are various MacBook Pro options, the highest of which is
| a 16" MacBook Pro with 128GB of unified memory for $4999.
|
| But the main takeaway is the M4 Mac Mini is fantastic value.
|
| Some more random thoughts:
|
| - Some Mac Minis have Thunderbolt 5 ("TB5"), which is up to
| either 80Gbps or 120Gbps bidirectional (I've seen it quoted as
| both);
|
| - Mac Minis have the option of 10GbE (+$200);
|
| - The Mac Mini has 2 USB3 ports and either 3 TB4 or 3 TB5 ports.
|
| [1]: https://blog.exolabs.net/day-2/
| sofixa wrote:
| The issue with Macs is that below Max/Ultra processors, the
| memory bandwidth is pretty slow. So you need to spend a lot on
| a high level processor and lots of memory, and the current gen
| processor, M4, doesn't even have an Ultra, while the Max is
| only available in a laptop form factor (so thermal
| constraints).
|
| An M4 Pro still has only 273GB/s, while even the 2 generations
| old RTX 3090 has 935GB/s.
|
| https://github.com/ggerganov/llama.cpp/discussions/4167
| jmyeet wrote:
| That's a good point. I checked the M2 Mac Studio and it's
| 400GB/s for the M2 Max and 800GB/s for the M2 Ultra so the M4
| Ultra when we get it later this year should really be a
| beast.
|
| Oh and the top end Macbook Pro 16 (the only current Mac with
| an M4 Max) has 410GB/s memory bandwidth.
|
| Obviously the Mac Studio is at a much higher price point.
|
| Still, you need to spend $1500+ to get an NVidia GPU with
| >12GB of RAM. Multiple of those starts adding up quick. Put
| multiple in the same box and you're talking more expensive
| case, PSU, mainboard, etc and cooling too.
|
| Apple has a really interesting opportunity here with their
| unified memory architecture and power efficiency.
| diggan wrote:
| How is the performance difference between using a dedicated GPU
| from Nvidia for example compared to whatever Apple does?
|
| So let's say we'd run a model on a Mac Mini M4 with 24GB RAM,
| how many tokens/s are you getting? Then if we run the exact
| same model but with a RTX 3090ti for example, how many tokens/s
| are you getting?
|
| Do these comparisons exist somewhere online already? I
| understand it's possible to run the model on Apple hardware
| today, with the unified memory, but how fast is that really?
| redman25 wrote:
| Not the exact same comparison but I have an M1 mac with 16gb
| ram and can get about 10 t/s with a 3B model. The same model
| on my 3060ti gets more than 100 t/s.
|
| Needless to say, ram isn't everything.
| diggan wrote:
| Could you say what exact model+quant you're using for that
| specific test + settings + runtime? Just so I could try to
| compare with other numbers I come across.
| iamleppert wrote:
| The hassle of not being able to work with native CUDA isn't
| worth it for a huge amount of AI. Good luck getting that latest
| paper or code working quickly just to try it out, if the author
| didn't explicitly target the M4 (unlikely for all but the most
| mainstream of stuff).
| darkwater wrote:
| In a homelab scenario, having your own AI assistant not run
| by someone else, that is not an issue. If you want to
| tinker/learn AI it's definitely an issue.
| sethd wrote:
| For sure and the Mac Mini M4 Pro with 64GB of RAM feels like
| the sweet spot right now.
|
| That said, the base storage option is only 512GB, and if this
| machine is also a daily driver, you're going to want to bump
| that up a bit. Still, it's an amazing machine for under $3K.
| wolfhumble wrote:
| It would be better/cheaper to buy an external Thunderbolt 5
| enclosure for the NVME drive you need.
| sethd wrote:
| I looked into this a couple months ago and external TB5 was
| still more expensive at 1-2 TB; not sure about larger sizes,
| though.
| atwrk wrote:
| Worth pointing out that you "only" get <= 270GB/s of memory
| bandwidth with those Macs, unless you choose the Max/Ultra
| models.
|
| If that is enough for your use case, it may make sense to wait
| 2 months and get a Ryzen AI Max+ 395 APU, which will have the
| same memory bandwidth, but allows for up to 128GB RAM. For
| probably ~half the Mac's price.
|
| Usual AMD driver disclaimer applies, but then again inference
| is most often way easier to get running than training.
| oofbaroomf wrote:
| Unified memory is great because it's fast, but you can also get
| a lot of system memory on a "conventional" machine like OP's,
| and offload MOE layers like what Ktransformers did, so you can
| run huge models with acceptable speeds. While the Mac mini may
| have better value for anything that fits in the unified memory,
| if you want to run Deepseek R1 or other large models, then it's
| best to max out system RAM and get a GPU to offload.
| robblbobbl wrote:
| Deep true words. I'm sorry for the author but thx for the
| article!
| gregwebs wrote:
| The problem for me with making such an investment is that next
| month a better model will be released. It will either require
| more or less RAM than the current best model, making it either
| not runnable or expensive to run on an overbuilt machine.
|
| Using cloud infrastructure should help with this issue. It may
| cost much more per run but money can be saved if usage is
| intermittent.
|
| How are HN users handling this?
| diggan wrote:
| > How are HN users handling this?
|
| Combine the best of both worlds. I have a local assistant
| (communicating via Telegram) that handles tool-calling and basic
| calendar/todo management (running on an RTX 3090ti), but for
| more complicated stuff it can call out to more advanced models
| (currently OpenAI APIs), provided the request itself doesn't
| involve personal data; otherwise it flat out refuses, for better
| or worse.
| idrathernot wrote:
| There is also an overlooked "tail risk" with cloud services
| that can end up costing you more than a few entire on-premise
| rigs if you don't correctly configure services or forget to
| shut down a high-end VM instance. Yeah, you can implement
| additional scripts and services as a fail-safe, but this adds
| another layer of complexity that isn't always trivial
| (especially for a hobbyist).
|
| I'm not saying that dumping $10k into rapidly depreciating
| local hardware is the more economical choice, just that people
| often discount the likelihood and cost of making mistakes in
| the cloud during their evaluations and the time investment
| required to ensure you have the correct safeguards in-place.
| walterbell wrote:
| _> expensive to run on an overbuilt machine_
|
| There's a healthy secondary market for GPUs.
| xienze wrote:
| The price goes up dramatically once you go past 12GB though,
| that's the problem.
| JKCalhoun wrote:
| Not on these server GPUs.
|
| I'm seeing 24GB M40 cards for $200, 24GB K80 cards for $40
| on eBay.
| xienze wrote:
| Well OK, I should have been more specific that, even for
| server GPUs on eBay:
|
| * Cheap
|
| * Fast
|
| * Decent amount of RAM
|
| Pick two.
|
| These old GPUs are as cheap as they are because they
| don't perform well.
| JKCalhoun wrote:
| That's fair. From what I read though I think there is
| some interplay between Fast and Decent amount of RAM. Or
| at least there is a large falloff in performance when RAM
| is too small.
|
| So Cheap and Decent amount of RAM work for me.
| JKCalhoun wrote:
| I think the solution is already in the article and comments
| here: go cheap. Even next year the author will still have, at
| the very least, their P40 setup running late 2024 models.
|
| I'm about to plunge in as others have to get my own homelab
| running the current crop of models. I think there's no time
| like the present.
| 3s wrote:
| Exactly! While I have llama running locally on RTX and it's fun
| to tinker with, I can't use it for my workflows and don't want
| to invest 20k+ to run a decent model locally
|
| > How are HN users handling this?
|
| I'm working on a startup for end-to-end confidential AI using
| secure enclaves in the cloud (think of it like extending a
| local+private setup to the cloud with verifiable security
| guarantees). Live demo with DeepSeek 70B: chat.tinfoil.sh
| tempoponet wrote:
| Most of these new models release several variants, typically in
| the 8b, 30b, and 70b range for personal use. YMMV with each,
| but you usually use the models that fit your hardware, and the
| models keep getting better even in the same parameter range.
|
| To your point about cloud models, these are really quite cheap
| these days, especially for inference. If you're just doing
| conversation or tool use, you're unlikely to spend more than
| the cost of a local server, and the price per token is a race
| to the bottom.
|
| If you're doing training or processing a ton of documents for
| RAG setups, you can run these in batches locally overnight and
| let them take as long as they need, only paying for power. Then
| you can use cloud services on the resulting model or RAG for
| quick and cheap inference.
| nickthegreek wrote:
| I plan to wait for the NVIDIA Digits release and see what the
| token/sec is there. Ideally it will work well for at least 2-3
| years then I can resell and upgrade if needed.
| michaelt wrote:
| Among people who are running large models at home, I think the
| solution is basically to be rich.
|
| Plenty of people in tech earn enough to support a family and
| drive a fancy car, but choose not to. A used RTX 3090 isn't
| cheap, but you can afford a lot of $1000 GPUs if you don't buy
| that $40k car.
|
| Other options include only running the smaller LLMs; buying
| dated cards and praying you can get the drivers to work; or
| just using hosted LLMs like normal people.
| joshstrange wrote:
| I'd really love to build a machine for local LLMs. I've tested
| models on my MBP M3 Max with 128GB of ram and it's really cool
| but I'd like a dedicated local server. I'd also like an excuse to
| play with proxmox as I've just run raw Linux servers or UnRaid w/
| containers in the past.
|
| I have OpenWebUI and LibreChat running on my local "app server"
| and I'm quite enjoying that but every time I price out a beefier
| box I feel like the ROI just isn't there, especially for an
| industry that is moving so fast.
|
| Privacy is not something to ignore at all but the cost of
| inference online is very hard to beat, especially when I'm still
| learning how best to use LLMs.
| datadrivenangel wrote:
| You pay a premium to get the theoretical local privacy and
| reliability of hosting your own models.
|
| But to get commercially competitive models you need 5 figures
| of hardware, and then need to actually run it securely and
| reliably. Pay as you go with multiple vendors as fallback is a
| better option right now if you don't need harder privacy.
| rsanek wrote:
| With something like OpenRouter, you don't even have to
| manually integrate with multiple vendors
| wkat4242 wrote:
| Is that like LiteLLM? I have that running but never tried
| OpenRouter. I wonder now if it's better :)
| joshstrange wrote:
| Yeah, really I'd love for my Home Assistant to be able to use
| a local LLM/TTS/STT which I did get working but was way too
| slow. Also it would be fun to just throw some problems/ideas at
| the wall without incurring (more) cost; that's a big part of
| it. But each time I run the numbers I would be better off
| using Anthropic/OpenAI/DeepSeek/other.
|
| I think sooner or later I'll break down and buy a server for
| local inference even if the ROI is upside down because it
| would be a fun project. I also find that these things fall in
| the "You don't know what you will do with it until you have
| it and it starts unlocking things in your mind"-category. I'm
| sure there are things I would have it grind on overnight just
| to test/play with an idea which is something I'd be less
| likely to do on a paid API.
| nickthegreek wrote:
| You shouldn't be having slow response issues with
| LLM/TTS/STT for HA on an MBP M3 Max 128GB. I'd either limit
| the entities exposed or choose a smaller model.
| joshstrange wrote:
| Oh, I can get smaller models to run reasonably fast but
| I'm very interested in tool calling and I'm having a hard
| time finding a model that runs fast and is good at
| calling tools locally (I'm sure that's due to my own
| ignorance).
| nickthegreek wrote:
| I decided on the OpenAI API for now after setting up so many
| different methods. The local stuff isn't up to snuff yet
| for what I am trying to accomplish, but it's decent for basic
| control.
| joshstrange wrote:
| I use a combo of Anthropic and OpenAI for now through my
| bots and my chat UIs and that lets me iterate faster. My
| hope is once I've done all my testing I could consider
| moving to local models if it made sense.
| cruffle_duffle wrote:
| > You don't know what you will do with it until you have it
| and it starts unlocking things in your mind
|
| Exactly. Once the price and performance get to the level
| where buying stuff for local training and inferencing makes sense...
| that is when we will start to see the LLM break out of its
| current "corporate lawyer safe" stage and really begin to
| shake things up.
| smith7018 wrote:
| For what it's worth, looking at the benchmarks, I think the
| machine they built is comparable to what your MBP can already
| do. They probably have a better inference speed, though.
| cwalv wrote:
| > but every time I price out a beefier box I feel like the ROI
| just isn't there, especially for an industry that is moving so
| fast.
|
| Same, esp. if you factor in the cost of renting. Even if you
| run 24/7 it's hard to see it paying off in half the time it
| will take to be obsolete
| moffkalast wrote:
| A Strix Halo minipc might be a good mid tier option once
| they're out, though AMD still isn't clear on how much they'll
| overprice them.
|
| Core Ultra Arc iGPU boxes are pretty neat too for being
| standalone and can be loaded up with DDR5 shared memory,
| efficient and usable in terms of speed, though that's
| definitely low end performance, plus SYCL and IPEX are a bit
| eh.
| whalesalad wrote:
| The juice ain't worth the squeeze to do this locally.
|
| But you should still play with proxmox, just not for this
| purpose. My recommendation would be to get an i7 HP Elitedesk.
| I have multiple racks in my basement, hundreds of gigs of ram,
| multiple 2U 2x processor enterprise servers etc.... but at this
| point all of it is turned off and a single HP Elitedesk with a
| 2nd NIC added and 64GB of ram is doing everything I ever needed
| and more.
| joshstrange wrote:
| Yeah, right now I'm running a tower PC (Intel Core i9-11900K,
| 64GB Ram) with Unraid as my local "app server". I want to
| play with Proxmox (for professional and mostly fun reasons)
| though. Someday I'd like a rack in my basement as my homelab
| stuff has overgrown the space it's in and I'm going to need
| to add a new 12-bay Synology (on top of 2x12-bay) soon since
| I'm running out of space again. For now I've been sticking
| with consumer/prosumer equipment but my needs are slowly
| outstripping that I think.
| walterbell wrote:
| One reason to bother with private AI: cloud AI ToS for consumers
| may have legal clauses about usage of prompt and context data,
| e.g. data that is not already on the Internet. Enterprise
| customers can exclude their data from future training.
|
| https://stratechery.com/2025/deep-research-and-knowledge-val...
|
| _> Unless, of course, the information that matters is not on the
| Internet. This is why I am not sharing the Deep Research report
| that provoked this insight: I happen to know some things about
| the industry in question -- which is not related to tech, to be
| clear -- because I have a friend who works in it, and it is
| suddenly clear to me how much future economic value is wrapped up
| in information not being public. In this case the entity in
| question is privately held, so there aren't stock market filings,
| public reports, barely even a webpage! And so AI is blind._
|
| (edited for clarity)
| icepat wrote:
| ToS can change. Companies can (and do) act illegally. Data
| breaches happen. Insider threats happen.
|
| Why trust the good will of a company, over a box that you built
| yourself, and have complete control over?
| rovr138 wrote:
| Cost for one
| dandanua wrote:
| I doubt it is that efficient. Even though it has 48GB of VRAM,
| it's less than half as fast as a single 3090 GPU.
|
| In my budget AI setup I use a 7840 Ryzen based miniPC with a USB4
| port and connect a 3090 to it via an eGPU adapter (ADT-Link UT3G).
| It cost me about $1000 total and I can easily achieve 35 t/s
| with qwen2.5-coder-32b using ollama.
| mrbonner wrote:
| Wouldn't eGPU defeat the purpose of having fast memory
| bandwidth? Have you tried it with stable diffusion?
| dandanua wrote:
| 40Gbps of USB4 is plenty. I've tried these pytorch benchmarks
| https://github.com/aime-team/pytorch-benchmarks/ and saw only
| a 10% drop in performance. No drop in performance for LLM
| inference, if the model is already loaded into VRAM.
| rjurney wrote:
| A lot of people build personal deep learning machines. The
| economics and convenience can definitely work out... I am
| confused however by "dummy GPU" - I searched for "dummy" for an
| explanation but didn't find one. Modern motherboards all include
| an integrated video card, so I'm not sure what this would be for?
|
| My personal DL machine has a 24 core CPU, 128GB RAM and 2 x 3060
| GPUs and 2 x 2TB NVMe drives in a RAID 1 array. I <3 it.
| T-A wrote:
| Look under "Available Graphics" at
|
| https://www.hp.com/us-en/shop/mdp/business-solutions/z440-wo...
|
| No integrated graphics.
|
| Author's explanation of the problem:
|
| _The Teslas are intended to crunch numbers, not to play video
| games with. Consequently, they don 't have any ports to connect
| a monitor to. The BIOS of the HP Z440 does not like this. It
| refuses to boot if there is no way to output a video signal._
| refibrillator wrote:
| Pay attention to IO bandwidth if you're building a machine with
| multiple GPUs like this!
|
| In this setup the model is sharded between cards so data must be
| shuffled through a PCIe 3.0 x16 link which is limited to ~16 GB/s
| max. For reference that's an order of magnitude lower than the
| ~350 GB/s memory bandwidth of the Tesla P40 cards being used.
|
| Author didn't mention NVLink so I'm presuming it wasn't used, but
| I believe these cards would support it.
|
| Building on a budget is really hard. In my experience 5-15 tok/s
| is a bit too slow for use cases like coding, but I admit once
| you've had a taste of 150 tok/s it's hard to go back (I've been
| spoiled by RTX 4090 with vLLM).
| zinccat wrote:
| I feel that you are mistaking the two bandwidth numbers
| Miraste wrote:
| Unless you run the GPUs in parallel, which you have to go out
| of your way to do, the IO bandwidth doesn't matter. The cards
| hold separate layers of the model, they're not working
| together. They're only passing a few kilobytes per second
| between them.
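| A quick per-token estimate of what actually crosses the bus when
| layers are split across cards, assuming a Llama-70B-like hidden
| size of 8192 and fp16 activations (placeholder numbers):
|
|     hidden_size = 8192
|     bytes_per_activation = 2                    # fp16
|     per_token = hidden_size * bytes_per_activation
|     print(per_token / 1024, "KB per token per card boundary")  # ~16 KB
|
| Even at tens of tokens per second that is well under a megabyte per
| second, which is negligible next to even a PCIe 3.0 x1 link.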
| Xenograph wrote:
| Which models do you enjoy most on your 4090? and why vLLM
| instead of ollama?
| ekianjo wrote:
| > Author didn't mention NVLink so I'm presuming it wasn't used,
| but I believe these cards would support it.
|
| How would you setup NVLink, if the cards support it?
| kamranjon wrote:
| For the same price ($1799) you could buy a Mac Mini with 48gb of
| unified memory and an m4 pro. It'd probably use less power and be
| much quieter to run and likely could outperform this setup in
| terms of tokens per second. I enjoyed the write up still, but I
| would probably just buy a Mac in this situation.
| diggan wrote:
| > likely could outperform this setup in terms of tokens per
| second
|
| I've heard arguments both for and against this, but they always
| lack concrete numbers.
|
| I'd love something like "Here is Qwen2.5 at Q4 quantization
| running via Ollama + these settings, and M4 24GB RAM gets X
| tokens/s while RTX 3090ti gets Y tokens/s", otherwise we're
| just propagating mostly anecdotes without any reality-checks.
| cruffle_duffle wrote:
| I think we are somewhat still at the "fuzzy super early
| adopter" stage of this local LLM game and hard data is not
| going to be easy to come by. I almost want to use the phrase
| "hobbyist stage", where almost all of the "data" and "best
| practice" is anecdotal, but I think we are a step above that.
|
| Still, it's way too early and there are simply way too many
| hardware and software combinations that change almost weekly
| to establish "the best practice hardware configuration for
| training / inferencing large language models locally".
|
| Some day there will be established guides with solid data. In
| fact someday there will be PCs that specifically target LLMs
| and will feature all kinds of stats aimed at getting you to
| bust out your wallet. And I even predict they'll come up with
| metrics that all the players will chase well beyond when
| those metrics make sense (megapixels, clock frequency,
| etc)... but we aren't there yet!
| diggan wrote:
| Right, but how are we supposed to be getting anywhere else
| unless people start being more specific and stop leaning on
| anecdotes or repeating what they've heard elsewhere?
|
| Saying "Apple seems to be somewhat equal to this other
| setup" doesn't really contribute to someone getting an
| accurate picture if it is equal or not, unless we start
| including raw numbers, even if they aren't directly
| comparable.
|
| I don't think it's too early to say "I get X tokens/second
| with this setup + these settings" because then we can at
| least start comparing, instead of just guessing which seems
| to be the current SOTA.
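| For anyone who wants to report a number, Ollama's API already
| returns the raw timings, so a comparison script is only a few
| lines (a sketch; the model and prompt here are arbitrary):
|
|     import requests
|
|     r = requests.post("http://localhost:11434/api/generate", json={
|         "model": "gemma2:27b",
|         "prompt": "Write a story about a tortoise and a hare.",
|         "stream": False,
|     }).json()
|
|     # durations are reported in nanoseconds
|     print("generation:", r["eval_count"] / (r["eval_duration"] / 1e9), "tok/s")
|     print("prompt eval:", r["prompt_eval_count"] / (r["prompt_eval_duration"] / 1e9), "tok/s")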
| kamranjon wrote:
| A great thread with the type of info you're looking for
| lives here:
| https://github.com/ggerganov/whisper.cpp/issues/89
|
| But you can likely find similar threads for the llama.cpp
| benchmark here:
| https://github.com/ggerganov/llama.cpp/tree/master/examples/...
|
| These are good examples because the llama.cpp and
| whisper.cpp benchmarks take full advantage of the Apple
| hardware but also take full advantage of non-Apple
| hardware with GPU support, AVX support etc.
|
| It's been true for a while now that the memory bandwidth
| of modern Apple systems, in tandem with the neural cores
| and GPU, has made them very competitive with Nvidia for
| local inference and even basic training.
| diggan wrote:
| I guess I'm mostly lamenting how unscientific these
| discussions are in general, on HN and elsewhere (besides
| specific GitHub repositories). Every community is filled
| with anecdotal stories, or numbers that are missing the
| settings + model + runtime details that would let people
| at least compare them to something.
|
| Still, thanks for the links :)
| t1amat wrote:
| In fairness it's become even more difficult now than ever
| before.
|
| * hardware spec
|
| * inference engine
|
| * specific model - differences to tokenizer will make
| models faster/slower with equivalent parameter count
|
| * quantization used - and you need to be aware of
| hardware specific optimizations for particular quants
|
| * kv cache settings
|
| * input context size
|
| * output token count
|
| This is probably not a complete list either.
| nickthegreek wrote:
| Best place to get that kinda info is gonna be
| /r/LocalLlama
| motorest wrote:
| > I think we are somewhat still at the "fuzzy super early
| adopter" stage of this local LLM game and hard data is not
| going to be easy to come by.
|
| What's hard about it? You get the hardware, you run the
| software, you take measurements.
| GTP wrote:
| Yes, but we don't have enough people doing that to get
| quality data. Not many people are building this kind of
| setup, and even fewer are publishing their results.
| Additionally, if I just run a test a couple of times and
| then average the results, that is still far from a solid
| measurement.
| diggan wrote:
| > but we don't have enough people doing that to get
| quality data
|
| But how are we supposed to get enough people doing those
| things if everyone says "There isn't enough data right now
| for it to be useful"? We have to start somewhere
| unshavedyak wrote:
| I don't think they're saying anything counter to that.
| The people who don't require the volume of data will run
| these. Ie the super early adopters.
| colonCapitalDee wrote:
| We've already started, we just haven't finished yet
| fkyoureadthedoc wrote:
| On an M1 Max 64GB laptop running gemma2:27b, same prompt and
| settings as the blog post:
|
|     total duration:       24.919887458s
|     load duration:        39.315083ms
|     prompt eval count:    37 token(s)
|     prompt eval duration: 963.071ms
|     prompt eval rate:     38.42 tokens/s
|     eval count:           441 token(s)
|     eval duration:        23.916616s
|     eval rate:            18.44 tokens/s
|
| I have a gaming PC with a 4090 I could try, but I don't think
| this model would fit
| diggan wrote:
| > gemma2:27b
|
| What quantization are you using? What's the runtime+version
| you run this with? And the rest of the settings?
|
| Edit: Turns out parent is using Q4 for their test. Doing
| the same test with LM Studio and a 3090ti + Ryzen 5950X
| (with 44 layers on GPU, 2 on CPU) I get ~15 tokens/second.
| fkyoureadthedoc wrote:
| Fresh install from brew, ollama version is 0.5.7
|
| The only settings I changed were the ones shown in the blog post:
| OLLAMA_FLASH_ATTENTION=1
| OLLAMA_KV_CACHE_TYPE=q8_0
|
| Ran the model like:
| ollama run gemma2:27b --verbose
|
| With the same prompt, "Can you write me a story about a
| tortoise and a hare, but one that involves a race to get
| the most tokens per second?"
| diggan wrote:
| When you run that, what quantization do you get? The
| library website of Ollama
| (https://ollama.com/library/gemma2:27b) isn't exactly great
| at surfacing useful information like what the default
| quantization is.
| fkyoureadthedoc wrote:
| not sure how to tell, but here's the full output from
| ollama serve https://pastes.io/ollama-run-gemma2-27b
| diggan wrote:
| Thanks, that seems to indicate Q4 for the quantization;
| you're probably able to run that on the 4090 as well.
| FWIW, the size of the model is just 14.55 GiB.
| navbaker wrote:
| If you hit the drop-down menu for the size of the model,
| then tap "view all", you will see the size and hash of
| the model you have selected and can compare it to the
| full list below it that has the quantization specs in the
| name.
| diggan wrote:
| Still, I don't see a way (from the web library) to see
| the default quantization (from Ollama's POV) at all, is
| that possible somehow?
| navbaker wrote:
| The model displayed in the drop-down when you access the
| web library is the default that will be pulled. Compare
| the size and hash to the more detailed model listing
| below it and you will see what quantization you have.
|
| Example: the default model weights for Llama 3.3 70b,
| after hitting the "view all" have this hash and size
| listed next to it - a6eb4748fd29 * 43GB
|
| Now scroll down through the list and you will find the
| one that matches that hash and size is
| "70b-instruct-q4_K_M". That tells you that the default
| weights for Llama 3.3 70B from Ollama are 4-bit quantized
| (q4) while the "K_M" tells you a bit about what
| techniques were used during quantization to balance size
| and performance.
| mkesper wrote:
| If you leave the :27b off from that URL you'll see the
| default size which is 9b. Ollama seems to always use Q4_0
| even if other quants are better.
| rahimnathwani wrote:
| gemma2:27b-instruct-q4_0 (checksum 53261bc9c192)
| condiment wrote:
| On a 3090 (24GB VRAM), same prompt & quant, I can report more
| than double the tokens per second, and significantly faster
| prompt eval.
|
|     total_duration:       10530451000
|     load_duration:        54350253
|     prompt_eval_count:    36
|     prompt_eval_duration: 29000000
|     prompt_token/s:       1241.38
|     eval_count:           460
|     eval_duration:        10445000000
|     response_token/s:     44.04
|
| Fast prompt eval is important when feeding larger contexts
| into these models, which is required for almost anything
| useful. GPUs have other advantages for traditional ML,
| whisper models, vision, and image generation. There's a lot
| of flexibility that doesn't really get discussed when folks
| trot out the 'just buy a mac' line.
|
| Anecdotally I can share my revealed preference. I have both
| an M3 (36gb) as well as a GPU machine, and I went through
| the trouble of putting my GPU box online because it was so
| much faster than the mac. And doubling up the GPUs allows
| me to run models like the deepseek-tuned llama 3.3, with
| which I have completely replaced my use of chatgpt 4o.
| svachalek wrote:
| Thanks for numbers! People should include their LLM
| runner as well I think, as there are differences in
| hardware optimization support. Like I haven't tested it
| but I've heard MLX is noticeably faster than Ollama on
| Macs.
| un_ess wrote:
| Per the screenshot, this is DeepSeek running on a 192GB M2
| Studio:
| https://nitter.poast.org/ggerganov/status/188461277009384272...
|
| The same on Nvidia (various models):
| https://github.com/ggerganov/llama.cpp/issues/11474
|
| [1] this is the model:
| https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/De...
| diggan wrote:
| So Apple M2 Studio does ~15 tks/second and A100-SXM4-80GB
| does 9 tks/second?
|
| I'm not sure I'm reading the results wrong or missing some
| vital context, but that sounds unlikely to me.
| achierius wrote:
| The studio has a lot more ram available to the GPU (up to
| 192gb) than the a100 (80gb), and iirc at least comparable
| memory bandwidth -- those are what matter when you're
| doing LLM inference, so the studio tends to win out
| there.
|
| Where the a100 and other similar chips dominate is in
| training &c, which is mostly a question of flops.
| diggan wrote:
| > and iirc at least comparable memory bandwidth
|
| I don't think they do.
|
| From Wikipedia:
|
| > the M2 Pro, M2 Max, and M2 Ultra have approximately 200
| GB/s, 400 GB/s, and 800 GB/s respectively
|
| From techpowerup:
|
| > NVIDIA A100 SXM4 80 GB - Memory bandwidth - 2.04 TB/s
|
| Seems to be a magnitude of difference, and that's just
| the bandwidth.
| vladgur wrote:
| As someone who is paying $0.50 per kWh, I'd also like to see
| kWh per 1000 tokens or something, to give me a sense of the
| cost of ownership of these local systems.
| ekianjo wrote:
| Mac Mini will be very slow for context ingestion compared to
| nvidia GPU, and the other issue is that they are not usable for
| Stable Diffusion... So if you just want to use LLMs, maybe, but
| if you have other interests in AI models, probably not the
| right answer.
| drcongo wrote:
| I use a Mac Studio for Stable Diffusion, what's special about
| the Mac Mini that means it won't work?
| JKCalhoun wrote:
| For this use case though, I would prefer something more modular
| than Apple hardware -- where down the road I could upgrade the
| GPUs, for example.
| motorest wrote:
| > For the same price ($1799) you could buy a Mac Mini with 48gb
| of unified memory and an m4 pro.
|
| Around half that price tag was attributed to the blogger
| reusing an old workstation he had lying around. Beyond this
| point, OP slapped two graphics cards into an old rig. A better
| description would be something like "what buying two graphics
| cards gets you in terms of AI".
| Capricorn2481 wrote:
| > Beyond this point, OP slapped two graphics cards into an
| old rig
|
| Meaning what? This is largely what you do on a budget since
| RAM is such a difference maker in token generation. This is
| what's recommended. OP could buy an a100, but that wouldn't
| be a budget build.
| oofbaroomf wrote:
| The bottleneck for single batch inference is memory bandwidth.
| The M4 Pro has less memory bandwidth than the P40, so it would
| be slower. Also, the setup presented in the OP has system RAM,
| allowing you to run models larger than what fits in 48GB of VRAM (and
| with good speeds too if you offload with something like
| ktransformers).
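| The usual back-of-envelope for single-batch decoding is tokens/s
| <= memory bandwidth / bytes read per token, since every active
| weight is touched once per token. A sketch with the figures
| mentioned in this thread (P40 ~346 GB/s, M4 Pro ~273 GB/s) and a
| hypothetical 24 GB quantized model:
|
|     model_bytes = 24e9
|     for name, bw in [("P40", 346e9), ("M4 Pro", 273e9)]:
|         print(name, round(bw / model_bytes, 1), "tok/s ceiling")
|     # P40    14.4 tok/s ceiling
|     # M4 Pro 11.4 tok/s ceiling
|
| Real numbers land below those ceilings, but the ordering tends to
| hold, which is the point above.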
| anthonyskipper wrote:
| >>M4 Pro has less memory bandwidth than the P40, so it would
| be slower
|
| Why do you say this? I thought the p40 only had a memory
| bandwidth of 346 Gbytes/sec. The m4 is 546 GB/s. So the
| macbook should kick the crap out of the p40.
| oofbaroomf wrote:
| The M4 Max has up to 546 GB/s. The M4 Pro, what GP was
| talking about, has only 273 GB/s. An M4 Max with that much
| RAM would most likely exceed OP's budget.
| UncleOxidant wrote:
| I wish Apple would offer a 128GB option in the Mac Mini - That
| would require an M4 Max which they don't offer in the mini. I
| know they have a MBP with M4 Max and 128GB, but I don't need
| another laptop.
| kridsdale1 wrote:
| I'm waiting until this summer with the M4 Ultra Studio.
| zinccat wrote:
| The P40 doesn't support fp16 well; buy a 3090 instead.
| JKCalhoun wrote:
| Cost is about 4X for the 24 GB 3090 on eBay.
| gwern wrote:
| > Another important finding: Terry is by far the most popular
| name for a tortoise, followed by Turbo and Toby. Harry is a
| favorite for hares. All LLMs are loving alliteration.
|
| Mode-collapse. One reason that the tuned (or tuning-contaminated
| models) are bad for creative writing: every protagonist and place
| seems to be named the same thing.
| diggan wrote:
| Couldn't you just up the temperature/change some other
| parameter to get it to be more random/"creative"? It wouldn't
| be active/intentional randomness/novelty like what a human
| would do, but at least it shouldn't generate exactly the same
| naming.
| gwern wrote:
| No. The collapse is not as simple as simply shifting down
| most of the logits, so ramping up the temperature does little
| until outputs start degenerating.
| brianzelip wrote:
| Useful recent podcast about homelab LLMs,
| https://changelog.com/friends/79
| DrPhish wrote:
| This is just a limited recreation of the ancient mikubox from
| https://rentry.org/lmg-build-guides
|
| It's funny to see people independently "discover" these builds
| that are a year plus old.
|
| Everyone is sleeping on these guides, but I guess the stink of
| 4chan scares people away?
| cratermoon wrote:
| "ancient" guide. Pub: 10 May 2024 21:48 UTC
| miniwark wrote:
| 2 x Nvidia Tesla P40 cards for EUR660 is not a thing I consider to
| be "on a budget".
|
| People can play with "small" or "medium" models on less powerful
| and cheaper cards. An Nvidia GeForce RTX 3060 card with "only"
| 12GB VRAM can be found for around EUR200-250 on the second hand
| market (and they are around 300~350 new).
|
| In my opinion, 48GB of VRAM is overkill to call it "on a budget";
| for me this setup is nice but it's for semi-professional or
| professional usage.
|
| There is of course a trade-off in using medium or small models, but
| being "on a budget" is also about making trade-offs.
| whywhywhywhy wrote:
| > A Nvidia Geforce RTX 3060 card with "only" 12Gb VRAM can be
| found around EUR200-250 on second hand market
|
| A 1080 Ti might even be a better option; it also has a 12GB model
| and some reports say it even outperforms the 3060, in non-RTX
| workloads I presume.
| Eisenstein wrote:
| CUDA compute capability is a big deal. The 1080 Ti is 6.1; the
| 3060 is 8.6, and it also has tensor cores.
|
| Note that CUDA version numbers are confusing: the compute
| capability is a different thing than the runtime/driver version.
| Melatonic wrote:
| Not sure what used prices are like these days but the Titan
| XP (similar to the 1080 ti) is even better
| mock-possum wrote:
| Yeesh, yeah, that was my first thought too - _whose_ budget??
|
| less than $500 total feels more fitting as a 'budget' build -
| EUR1700 is more along the lines of 'enthusiast' or less
| charitably "I am rich enough to afford expensive hobbies"
|
| If it's your business and you expect to recoup the cost and
| write off the cost on your taxes, that's one thing - but if
| you're just looking to run a personal local LLM for funnies,
| that's not an accessible price tag.
|
| I suppose "or you could just buy a Mac" should have tipped me
| off though.
| cratermoon wrote:
| From the article: In the future, I fully expect to be able to
| have a frank and honest discussion about the Tiananmen events
| with an American AI agent, but the only one I can afford will
| have assumed the persona of Father Christmas who, while holding a
| can of Coca-Cola, will intersperse the recounting of the tragic
| events with a joyful "Ho ho ho... Didn't you know? The holidays
| are coming!"
|
| How unfortunate that people are discounting the likelihood that
| American AI agents will avoid saying things their masters think
| should not be said. Anyone want to take bets on when the big 3
| (Open AI, Meta, and Google) will quietly remove anything to do
| with DEI, trans people, or global warming? They'll start out
| changing all mentions of "Gulf of Mexico" to "Gulf of America",
| but then what?
| CamperBob2 wrote:
| It is easy to get a local R1 model to talk about Tiananmen
| Square to your heart's content. Telling it to replace
| problematic terms with "Smurf" or another nonsense word is very
| effective, but with the local model you don't even have to do
| that in many cases. (e.g., https://i.imgur.com/btcI1fN.png)
| birktj wrote:
| I was wondering if anyone here has experimented with running a
| cluster of SBC for LLM inference? Ex. the Radxa ROCK 5C has 32GB
| of memory and also a NPU and only costs about 300 euros. I'm not
| super up to date on the architecture on modern LLMs, but as far
| as I understand you should be able to split the layers between
| multiple nodes? It is not that much data that needs to be sent
| between them, right? I guess you won't get quite the same
| performance as a modern mac or nvidia GPU, but it could be quite
| acceptable and possibly a cheap way of getting a lot of memory.
|
| On the other hand I am wondering about what is the state of the
| art in CPU + GPU inference. Prompt processing is both compute and
| memory constrained, but I think token generation afterwards is
| mostly memory bound. Are there any tools that support loading a
| few layers at a time into a GPU for initial prompt processing and
| then switching to CPU inference for token generation? Last time I
| experimented it was possible to run some layers on the GPU and
| some on the CPU, but to me it seems more efficient to run
| everything on the GPU initially (but a few layers at a time so
| they fit in VRAM) and then switch to the CPU when doing the
| memory bound token generation.
| Eisenstein wrote:
| > I was wondering if anyone here has experimented with running
| a cluster of SBC for LLM inference? Ex. the Radxa ROCK 5C has
| 32GB of memory and also a NPU and only costs about 300 euros.
|
| Look into RPC. Llama.cpp supports it.
|
| *
| https://www.reddit.com/r/LocalLLaMA/comments/1cyzi9e/llamacp...
|
| > Last time I experimented it was possible to run some layers
| on the GPU and some on the CPU, but to me it seems more
| efficient to run everything on the GPU initially (but a few
| layers at a time so they fit in VRAM) and then switch to the
| CPU when doing the memory bound token generation.
|
| Moving layers over the PCIe bus to do this is going to be slow,
| which seems to be the issue with that strategy. I think the
| key is to use MoE and be smart about which layers go where.
| This project seems to be doing that with great results:
|
| * https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...
| hemant1041 wrote:
| Interesting read!
| ollybee wrote:
| The middle ground is to rent a GPU VPS as needed. You can get an
| H100 for $2/h. Not quite the same privacy as fully local offline,
| but better than a SaaS API and good enough for me. Hopefully in a
| year or three it will truly be cost effective to run something
| useful locally and then I can switch.
| 1shooner wrote:
| Do you have a recommended provider or other pointers for GPU
| rental?
| anonzzzies wrote:
| That is what I do, but it costs a lot of $, more than just using
| OpenRouter. I would like to have a machine so I can have a
| model talk to itself 24/7 for a relatively fixed price. I have
| enough solar and wind + cheap net electricity so it would
| basically be free after buying. It's just hard to pick what to
| buy without forking out a fortune on GPUs.
| renewiltord wrote:
| I have 6x 4090s in a rack with an Epyc driving them but tbh I am
| selling them all to get a Mac Studio. Simpler to work with I
| think.
| _boffin_ wrote:
| For around 1800, I was able to get myself a Dell T5820 with 2x
| Dell 3090s. Can't complain at all.
| almosthere wrote:
| I bought an M4 Mac Mini (the cheapest one) at Costco for $559 and,
| while I don't know exactly how many tokens per second it gets, it
| seems to generate text from Llama 3.2 (through Ollama) as fast as
| ChatGPT.
| cwoolfe wrote:
| As others have said, a high powered Mac could be used for the
| same purpose at a comparable price and lower power usage. Which
| makes me wonder: why doesn't Apple get into the enterprise AI
| chip game and compete with Nvidia? They could design their own
| ASIC for it with all their hardware & manufacturing knowledge.
| Maybe they already are.
| gmueckl wrote:
| The primary market for such a product would be businesses. And
| Apple isn't particularly good at selling to companies. The
| consumer product focus may just be too ingrained to be
| successful with such a move.
|
| A beefed up home pod with a local LLM-based assistant would be
| a more typical Apple product. But they'd probably need LLMs to
| become much, much more reliable to not ruin their reputation
| over this.
| fragmede wrote:
| Why? Siri's still total crap but that doesn't seem to have
| slowed down iPhone sales.
| gmueckl wrote:
| Siri mostly hit the expectations they themselves were able
| to set through their ads when launching that product -
| having a voice based assistant at all was huge back then.
| With an LLM-based assistant, the market has set the
| expectations for them and they are just unreasonably high
| and don't mirror reality. That's a potentially big trap for
| Apple now.
| lolinder wrote:
| > And Apple isn't particularly good at selling to companies.
|
| With a big glaring exception: developer laptops are
| overwhelmingly Apple's game right now. It seems like they
| should be able to piggyback off of that, given that the
| decision makers are going to be in the same branch of the
| customer company.
| jrm4 wrote:
| For roughly the same reason Steve Jobs et al killed Hypercard;
| too much power to the users.
| lewisl9029 wrote:
| This article is coming out at an interesting time for me.
|
| We probably have different definitions for "budget", but I just
| ordered a super janky eGPU setup for my very dated 8th gen Intel
| NUC, with a m2->pcie adapter, a PSU, and a refurb Intel A770 for
| about 350 all-in, not bad considering that's about the cost of a
| proper Thunderbolt eGPU enclosure alone.
|
| The overall idea: A770 seems like a really good budget LLM GPU
| since it has more memory (16GB) and more memory bandwidth
| (512GB/s) than a 4070, but costs a tiny fraction. The m2-pcie
| adapter should give it a bit more bandwidth to the rest of the
| system than Thunderbolt as well, so hopefully it'll make for a
| decent gaming experience too.
|
| If the eGPU part of the setup doesn't work out for some reason,
| I'll probably just bite the bullet and order the rest of the PC
| for a couple hundred more, and return the m2-pcie adapter (I got
| it off of Amazon instead of Aliexpress specifically so I could do
| this), looking to end up somewhere around 600 bux total. I think
| that's probably a more reasonable price of entry for something
| like this for most people.
|
| Curious if anyone else has experience with the A770 for LLM? Been
| looking at Intel's https://github.com/intel/ipex-llm project and
| it looked pretty promising, that's what made me pull the trigger
| in the end. Am I making a huge mistake?
| UncleOxidant wrote:
| > refurb Intel A770 for about 350
|
| I'm seeing A770s for about $500 - $550. Where did you find a
| refurb one for $350 (or less since you're also including other
| parts of the system)
| lewisl9029 wrote:
| I got this one from Acer's Ebay for $220:
| https://www.ebay.com/itm/266390922629
|
| It's out of stock now unfortunately, but it does seem to pop
| up again from time to time according to Slickdeals:
| https://slickdeals.net/newsearch.php?q=a770&pp=20&sort=newes...
|
| I would probably just watch the listing and/or set up a deal
| alert on Slickdeals and wait. If you're in a hurry though,
| you can probably find a used one on Ebay for not too much
| more.
| jll29 wrote:
| Can we just call it "PC to run ML models on" on a budget?
|
| "AI computer" sounds pretentious and misleading to outsiders.
| whalesalad wrote:
| How many credits does this get you in any cloud or via
| OpenAI/Anthropic API credits? For ~$1700 I can accomplish way
| more without local hardware. Don't get me wrong, I enjoy
| tinkering and building projects like this - but it doesn't make
| financial sense to me here. Unless of course you live 100%
| off-grid and have Stallman-level privacy concerns.
|
| Of course I do want my own local GPU compute setup, but the juice
| just isn't worth the squeeze.
| gytisgreitai wrote:
| 1.7kEUR and 300w for a playground. Man this world is getting
| crazy and I'm getting f-kin old by not understanding it.
| asasidh wrote:
| You can run 32B and even 70B (a bit slow) models on an M4 Pro Mac
| Mini with 48 GB RAM, out of the box using Ollama. If you enjoy
| putting together a desktop, that's understandable.
|
| https://deepgains.substack.com/p/running-deepseek-locally-fo...
___________________________________________________________________
(page generated 2025-02-11 23:00 UTC)