[HN Gopher] Local LLM inference - impressive but too hard to wor...
___________________________________________________________________
Local LLM inference - impressive but too hard to work with
Author : aazo11
Score : 50 points
Date : 2025-04-21 16:42 UTC (6 hours ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| aazo11 wrote:
| I spent a couple of weeks trying out local inference solutions
| for a project. Wrote up my thoughts with some performance
| benchmarks in a blog.
|
| TLDR -- What these frameworks can do on off-the-shelf laptops
| is astounding. However, it is very difficult to find and
| deploy a task-specific model, and the models themselves (even
| with quantization) are so large that the download would kill
| the UX for most applications.
| codelion wrote:
| There are ways to improve the performance of local LLMs with
| inference-time techniques. You can try optillm
| (https://github.com/codelion/optillm); it is possible to match
| the performance of larger models on narrow tasks by doing more
| work at inference time.
| ranger_danger wrote:
| I thought llamafile was supposed to be the solution to "too hard
| to work with"?
|
| https://github.com/Mozilla-Ocho/llamafile
| archerx wrote:
| Llamafile is great and I love it. I run all my models using it
| and it's super portable; I have tested it on Windows and Linux,
| on a powerful PC and an SBC. It worked great without too many
| issues.
|
| It takes about a month for features from llama.cpp to trickle
| in. Also, figuring out the best mix of context length, VRAM
| size, and desired speed takes a while before it becomes
| intuitive.
| rzzzt wrote:
| I thought it was "docker model" (and OCI artifacts).
| dust42 wrote:
| llamafile is a multiplatform executable that wraps the model
| and a slightly modified version of llama.cpp. IIRC funded by
| Moz.
| antirez wrote:
| Download the model in the background. Serve the client with an
| LLM vendor API just for the first requests, or even with that
| same local LLM installed on your own servers (likely cheaper).
| By doing so, in the long run the inference cost is near zero,
| which allows you to use LLMs in otherwise impossible business
| models (like freemium).
| aazo11 wrote:
| Exactly. Why does this not exist yet?
| byyoung3 wrote:
| it's an if statement on whether the model has downloaded or
| not
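|
| Something like this minimal sketch (the paths, URL and both
| backends here are placeholders, not anything from the
| article):
|
|     import os
|     import threading
|     import urllib.request
|
|     MODEL_PATH = "models/small.gguf"   # hypothetical local path
|     MODEL_URL = "https://example.com/small.gguf"  # placeholder
|
|     def download_model_in_background() -> None:
|         # fetch the model once, without blocking requests
|         def _fetch():
|             os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
|             urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
|         threading.Thread(target=_fetch, daemon=True).start()
|
|     def run_local(prompt: str) -> str:
|         # stand-in for llama.cpp / ollama / LM Studio inference
|         return f"[local answer to: {prompt}]"
|
|     def call_cloud(prompt: str) -> str:
|         # stand-in for any vendor API
|         return f"[cloud answer to: {prompt}]"
|
|     def generate(prompt: str) -> str:
|         # the "if statement": local once the download finished,
|         # cloud fallback until then
|         if os.path.exists(MODEL_PATH):
|             return run_local(prompt)
|         return call_cloud(prompt)
|
|     download_model_in_background()
|     print(generate("hello"))   # cloud at first, local later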
| aazo11 wrote:
| Wouldn't a better solution be to train/finetune the smaller
| model from the responses of the larger model, and only push
| inference to the edge once the smaller model is performant
| and the hardware specs can handle the workload?
| monoid73 wrote:
| yeah, that'd be nice, some kind of self-bootstrapping
| system where you start with a strong cloud model, then
| fine-tune a smaller local one over time until it's good
| enough to take over. tricky part is managing quality
| drift and deciding when it's 'good enough' without
| tanking UX. edge hardware's catching up though, so feels
| more feasible by the day.
| manmal wrote:
| Personally, I only use locally run models when I absolutely
| can't have the prompt/context uploaded to a cloud. For anything
| else, I just use one of the commercial cloud hosted models. The
| ones I'm using are way faster and better in _every_ way except
| privacy. E.g. if you are OK with spending more, you can get
| blazing fast DeepSeek V3 or R1 via OpenRouter. Or rather cheap
| Claude Sonnet via Copilot (the pre-release also has Gemini 2.5
| Pro, btw).
|
| I've gotten carried away - I meant to express that using cloud
| as a fallback for local models is something I absolutely don't
| want or need, because privacy is the whole and only point to
| local models.
| ijk wrote:
| There are two general categories of local inference:
|
| - You're running a personal hosted instance. Good for
| experimentation and personal use, though there's a tradeoff
| versus renting a cloud server.
|
| - You want to run LLM inference on client machines (i.e., you
| aren't directly supervising it while it is running).
|
| I'd say that the article is mostly talking about the second one.
| Doing the first one will get you familiar enough with the
| ecosystem to handle some of the issues he ran into when
| attempting the second (e.g., exactly which model to use). But the
| second has a bunch of unique constraints--you want things to just
| work for your users, after all.
|
| I've done in-browser neural network stuff in the past (back when
| using TensorFlow.js was a reasonable default choice) and based on
| the way LLM trends are going I'd guess that edge device LLM will
| be relatively reasonable soon; I'm not quite sure that I'd deploy
| it in production this month but ask me again in a few.
|
| Relatively tightly constrained applications are going to benefit
| more than general-purpose chatbots; pick a small model that's
| relatively good at your task and train it on enough of your data
| and you can get a 1B or 3B model that has acceptable performance,
| let alone the 7B ones being discussed here. It absolutely won't
| replace ChatGPT (though we're getting closer to replacing ChatGPT
| 3.5 with small models). But if you've got a specific use case
| that will hold still enough to deploy a model it can definitely
| give you the edge versus relying on the APIs.
|
| I expect games to be one of the first to try this: per-player-
| action API costs murder per-user revenue, most of the gaming
| devices have some form of GPU already, and most games are shipped
| as apps so bundling a few more GB in there is, if not reasonable,
| at least not unprecedented.
| aazo11 wrote:
| Very interesting. I had not thought about gaming at all but
| that makes a lot of sense.
|
| I also agree the goal should not be to replace ChatGPT. I think
| ChatGPT is way overkill for a lot of the workloads it is
| handling. A good solution should probably use the cloud LLM
| outputs to train a smaller model to deploy in the background.
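|
| E.g. by logging every cloud response as a prompt/completion
| pair you can fine-tune on later (a rough sketch; the JSONL
| layout is just a common convention, not any particular
| product's format):
|
|     import json
|
|     LOG_PATH = "distill_dataset.jsonl"  # hypothetical log file
|
|     def log_for_distillation(prompt: str, response: str) -> None:
|         # append each cloud interaction as a fine-tuning example
|         record = {"prompt": prompt, "completion": response}
|         with open(LOG_PATH, "a", encoding="utf-8") as f:
|             f.write(json.dumps(record, ensure_ascii=False) + "\n")
|
|     # later: fine-tune a 1B-3B model on this file and push it
|     # to the edge once it is good enough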
| CharlieRuan wrote:
| Curious what are some examples of "per-player-action API costs"
| for games?
| ivape wrote:
| What if I charge "whales" in games to talk to an anime girl?
| Maybe I'll only let you talk to her once a day unless you pay
| me like a kissing booth for every convo. There's going to be
| some predatory stuff out there, I can see what the GP is
| talking about with games.
| kevingadd wrote:
| For a while basically any mobile or browser freemium game you
| tried would have progress timers for building things or
| upgrading things and they'd charge you Actual Money to skip
| the wait. That's kind of out of fashion now though some games
| still do it.
| ijk wrote:
| Inference using an API costs money. Not a lot of money per
| million tokens, but it adds up if you have a lot of
| tokens...and some of the obvious game uses really chew
| through them, like chatting with a character or having an NPC
| make decisions via a reasoning model.
|
| Games, on the other hand, are mostly funded via up-front
| purchase (so you get the money once and then have to keep the
| servers running) or free to play, which very carefully tracks
| user acquisition costs versus revenue. Most F2P games make a
| tiny amount per player; they make up the difference via
| volume (and whales). So even a handful of queries per day per
| player can bankrupt you if you have a million players and no
| way to recoup the inference cost.
|
| Now, you can obviously add a subscription or ongoing charge
| to offset it, but that's not how the industry is mostly set
| up at the moment. I expect that the funding model _will_
| change, but meanwhile having a model on the edge device is
| the only currently realistic way to afford adding an LLM to a
| big single player RPG, for example.
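|
| Back of the envelope, with made-up but plausible numbers (the
| price and usage figures below are assumptions, not quotes):
|
|     players = 1_000_000        # daily active players (assumed)
|     queries_per_day = 5        # NPC chats per player (assumed)
|     tokens_per_query = 2_000   # prompt + completion (assumed)
|     usd_per_1m_tokens = 1.0    # blended API price (assumed)
|
|     daily_tokens = players * queries_per_day * tokens_per_query
|     daily_cost = daily_tokens / 1_000_000 * usd_per_1m_tokens
|     print(f"{daily_cost:,.0f} USD/day")        # ~10,000 USD/day
|     print(f"{30 * daily_cost:,.0f} USD/month") # ~300,000 USD/month
|     # brutal when revenue per daily player is a few cents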
| thot_experiment wrote:
| Yikes what's the bar for dead simple these days? Even my totally
| non-technical gamer friends are messing around with ollama
| because I just have to give them one command to get any of the
| popular LLMs up and running.
|
| Now of course "non technical" here is still a pc gamer that's had
| to fix drivers once or twice and messaged me to ask "hey how do i
| into LLM, Mr. AI knower", but I don't think twice these days
| about showing any pc owner how to use ollama because I know I
| probably won't be on the hook for much technical support. My
| sysadmin friends are easily writing clever scripts against
| ollama's JSON output to do log analysis and other stuff.
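|
| In the spirit of (a minimal sketch against ollama's local HTTP
| API; the model name and prompt are placeholders - check the
| ollama docs for the full set of options):
|
|     import json
|     import urllib.request
|
|     def ask_ollama(prompt: str, model: str = "llama3") -> str:
|         # ask a locally running ollama server, non-streaming
|         payload = json.dumps({
|             "model": model,
|             "prompt": prompt,
|             "stream": False,
|         }).encode("utf-8")
|         req = urllib.request.Request(
|             "http://localhost:11434/api/generate",
|             data=payload,
|             headers={"Content-Type": "application/json"},
|         )
|         with urllib.request.urlopen(req) as resp:
|             return json.loads(resp.read())["response"]
|
|     print(ask_ollama("Summarize the errors in this log: ..."))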
| aazo11 wrote:
| By "too hard" I do not mean getting started with them to run
| inference on a prompt. Ollama especially makes that quite easy.
| But as an application developer, I feel these platforms are too
| hard to build around. The main issues are finding the right
| small-enough, task-specific model and how long it takes to
| download these models for the end user.
| thot_experiment wrote:
| I guess it depends on expectations; if your expectation is a
| CRUD app that opens in 5 seconds, then sure, it's definitely
| tedious. People do _install_ things though, the companion app
| for DJI action cameras is 700mb (which is an abomination, but
| still). Modern games are > 100gb on the high side,
| downloading 8-16gb of tensors one time is nbd. You mentioned
| that there are 663 different models of dsr1-7b on
| huggingface, sure, but if you want that model on ollama it's
| just `ollama run deepseek-r1`
|
| As a developer the amount of effort I'm likely to spend on
| the infra side of getting the model onto the user's computer
| and getting it running is now FAR FAR below the amount of
| time I'll spend developing the app itself or getting together
| a dataset to tune the model I want etc. Inference is solved
| enough. "getting the correct small enough model" is something
| that I would spend the day or two thinking about/testing when
| building something regardless. It's not hard to check how
| much VRAM someone has and get the right model, the decision
| tree for that will have like 4 branches. It's just so little
| effort compared to everything else you're going to have to do
| to deliver something of value to someone. Especially in the
| set of users that have a good reason to run locally.
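|
| For example (a sketch: the nvidia-smi query only covers NVIDIA
| GPUs, and the cut-offs/model names are made up, not
| recommendations):
|
|     import subprocess
|
|     def detect_vram_gb() -> float:
|         # total VRAM of the first GPU, 0 if nvidia-smi is absent
|         try:
|             out = subprocess.check_output(
|                 ["nvidia-smi", "--query-gpu=memory.total",
|                  "--format=csv,noheader,nounits"],
|                 text=True,
|             )
|             return int(out.splitlines()[0].strip()) / 1024
|         except (OSError, subprocess.CalledProcessError, ValueError):
|             return 0.0
|
|     def pick_model(vram_gb: float) -> str:
|         # the four-ish branches
|         if vram_gb >= 24:
|             return "some-32b-q4"
|         if vram_gb >= 12:
|             return "some-14b-q4"
|         if vram_gb >= 6:
|             return "some-7b-q4"
|         return "some-3b-q4"   # CPU / small-GPU fallback
|
|     print(pick_model(detect_vram_gb()))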
| bionhoward wrote:
| LM Studio seems pretty good at making local models easier to use
| jasonjmcghee wrote:
| they made it so easy to do specdec, that alone sold it for me
|
| Some models even have a 0.5B draft model. The speed increase is
| incredible.
| aazo11 wrote:
| They look awesome. Will try it out.
| ivape wrote:
| Here is another:
|
| https://msty.app/
| resource_waste wrote:
| I'm genuinely afraid it's going to do telemetry one day.
|
| I'm sure someone is watching their internet traffic, but I
| don't.
|
| I take the risk now, but I ask questions about myself,
| relationships, conversations, etc... Stuff I don't exactly want
| Microsoft/ChatGPT to have.
| ivape wrote:
| Local inferencing is synonymous with privacy for me. As it
| stands now, there is no universe in which your online LLM
| usage is private until laws get put into effect. I suspect
| most of these companies are going to put in a Microsoft
| Clippy-style assistant soon that will act as a
| recommendation/ad engine, and this of course requires parsing
| every convo you've ever had. The paid tier may remove Clippy,
| but boy oh boy the free tier (which most people will use)
| won't.
|
| Clippy is coming back guys, and we have to be ready for it.
| manmal wrote:
| I've configured Little Snitch to only allow it access to
| huggingface. I think for updates I need to reset LS to "ask
| for each connection" or something like that.
| manmal wrote:
| A less known feature of LM Studio I really like is speculative
| decoding: https://lmstudio.ai/blog/lmstudio-v0.3.10
|
| Basically you let a very small model speculate on the next few
| tokens, and the large model then blesses/rejects those
| predictions. Depending on how well the small model performs,
| you get massive speedups that way.
|
| The small model has to be as close to the big model as possible
| - I tried this with models from different vendors and it slowed
| generation down by 3x or so. So you need to use a small Qwen
| 2.5 with a big Qwen 2.5, etc.
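|
| For the curious, the core loop looks roughly like this (a toy
| greedy version with stub "models", not LM Studio's actual
| implementation, which does proper probabilistic acceptance):
|
|     from typing import Callable, List
|
|     Token = str
|     Model = Callable[[List[Token]], Token]  # greedy next token
|
|     def speculative_decode(draft: Model, target: Model,
|                            prompt: List[Token], k: int = 4,
|                            max_new: int = 16) -> List[Token]:
|         out = list(prompt)
|         while len(out) - len(prompt) < max_new:
|             # 1. the small draft model guesses k tokens cheaply
|             guesses, ctx = [], list(out)
|             for _ in range(k):
|                 ctx.append(draft(ctx))
|                 guesses.append(ctx[-1])
|             # 2. the big model verifies the guesses in order
|             accepted = 0
|             for i, g in enumerate(guesses):
|                 if target(out + guesses[:i]) == g:
|                     accepted += 1
|                 else:
|                     break
|             out += guesses[:accepted]
|             # 3. the big model adds one token of its own
|             out.append(target(out))
|         return out
|
|     # toy models: the draft agrees with the target most of the time
|     def target_model(ctx): return "ab"[len(ctx) % 2]
|     def draft_model(ctx):
|         return "x" if len(ctx) % 7 == 0 else target_model(ctx)
|
|     print("".join(speculative_decode(draft_model, target_model, ["a"])))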
| resource_waste wrote:
| They are using a Mac and complaining about how slow it is... This
| is an Id10t error, not a problem with LLMs.
|
| If you know, you know. CPU for LLMs is bad. No amount of Apple
| Marketing can change that.
|
| Even my $700 laptop with a 3050 produces near instant results
| with 12B models.
|
| I'm not sure what to tell you... Look at the corporations who
| are doing local LLMs and see what they are buying: they aren't
| buying Apple, they are buying Nvidia.
| zellyn wrote:
| Weird to give MacBook Pro specs and omit RAM. Or did I miss it
| somehow? That's one of the most important factors.
| manmal wrote:
| Using a 7B model on an M2 Max also isn't quite the most
| impressive way to locally run an LLM. Why not use QwQ-32 and
| let it give some commercial non-reasoning models a run for
| their money?
| zellyn wrote:
| Exactly. You want to come close to maxing out your RAM for
| model+context. I've run Gemma on a 64GB M1 and it was pretty
| okay, although that was before the Quantization-Aware
| Training version was released last week, so it might be even
| better now.
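|
| Rough sizing arithmetic for the weights alone (ignoring KV
| cache and runtime overhead, so these are lower bounds):
|
|     def weights_gib(b_params: float, bits: float) -> float:
|         # billions of params * bits per weight, in GiB
|         return b_params * 1e9 * bits / 8 / 2**30
|
|     print(round(weights_gib(32, 4.5), 1))  # ~32B at ~4.5 bit: ~16.8
|     print(round(weights_gib(7, 16), 1))    # 7B at fp16: ~13.0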
| aazo11 wrote:
| Thanks for calling that out. It was 32GB. I updated the post as
| well.
| larodi wrote:
| Having done my masters on the topic of grammar-assisted text2sql
| let me add some additional context here:
|
| - first of all, local inference can never beat cloud inference,
| for the very simple reason that costs go down with batching. it
| took me two years to actually understand what batching is - the
| tensors flowing through the transformer layers have a dimension
| designed specifically for processing data in parallel, so
| whether you process 1 sequence or 128 sequences the cost is
| roughly the same (see the sketch at the end of this comment).
| i've read very few articles that state this clearly, so bear in
| mind - this is the primary blocker keeping local inference from
| competing with cloud inference.
|
| - second, and this is not a light one to take - LLM-assisted
| text2sql is not trivial, not at all. you may think it is, you
| may expect cutting-edge models to do it right, but there are
| plenty of reasons models fail so badly at this seemingly
| trivial task. you may start with an arbitrary article such as
| https://arxiv.org/pdf/2408.14717 and dig through the
| references; sooner or later you will stumble on one of the
| dozens of overview papers, mostly by Chinese researchers (such
| as https://arxiv.org/abs/2407.10956), where the approaches are
| summarized. Caution: you may either feel inspired that AI will
| not take over your job, or you may feel miserable at how much
| effort is spent on this task and how badly everything fails in
| real-world scenarios.
|
| - finally, something I agreed on with a professor advising a
| doctorate candidate whose thesis, surprisingly, was on the same
| topic: LLMs do much better with GraphQL and other structured
| formats such as JSON than with the complex grammar of SQL,
| which is not a regular grammar but a context-free one, so it
| takes more complex machinery to parse and very often requires
| recursion.
|
| - which brings us to the most important question - why
| commercial GPTs fare so much better at this than local models.
| well, it is presumed the top players not only use MoEs but also
| employ beam search, perhaps speculative inference, and all
| sorts of optimizations at the hardware level. while all of this
| is not beyond comprehension for a casual researcher at a casual
| university (like myself), you don't get to easily run it all
| locally. I have not written an inference engine myself, but I
| imagine MoE plus beam search is super complex, as beam search
| basically means you fork the whole LLM execution state and go
| back and forth. Not sure how this even works together with
| batching.
|
| So basically - this is too expensive. Besides, atm (to my
| knowledge) only vLLM (the engine) has some sort of reasonably
| working local beam search. I would've loved to see llama.cpp's
| beam search get a rewrite, but it stalled. Trying to get beam
| search working with current Python libs is nearly impossible
| on commodity hardware, even if you have 48 gigs of VRAM, which
| already means a very powerful GPU.
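|
| The batching sketch referenced above - shapes only; on a GPU
| the wall clock for batch=128 is nowhere near 128x the batch=1
| time, because the same weights serve every sequence and the
| hardware is badly underutilized at batch=1:
|
|     import numpy as np
|
|     d_model, seq_len = 1024, 256
|     W = np.random.randn(d_model, d_model).astype(np.float32)
|
|     def forward(batch_size: int) -> np.ndarray:
|         # activations carry an explicit batch dimension:
|         # (batch, seq, d_model) @ (d_model, d_model)
|         x = np.random.randn(batch_size, seq_len, d_model)
|         return x.astype(np.float32) @ W
|
|     print(forward(1).shape)     # (1, 256, 1024)
|     print(forward(128).shape)   # (128, 256, 1024)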
| ianand wrote:
| Sounds like an interesting masters thesis. Is your masters
| thesis available online somewhere?
| ijk wrote:
| There are local applications of parallel processing; your average
| chatbot wouldn't use it, but a research bot with multiple
| simultaneous queries will, for example.
|
| Better local beamsearch would be really nice to have, though.
| ijk wrote:
| I do wonder if recursion is particularly hard for LLMs, given
| that they have a hard limit on how much they can loop for a
| given token. (Absent beam search, reasoning models, and other
| trickery.)
___________________________________________________________________
(page generated 2025-04-21 23:00 UTC)