[HN Gopher] Local LLM inference - impressive but too hard to work with
       ___________________________________________________________________
        
       Local LLM inference - impressive but too hard to work with
        
       Author : aazo11
       Score  : 50 points
       Date   : 2025-04-21 16:42 UTC (6 hours ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | aazo11 wrote:
       | I spent a couple of weeks trying out local inference solutions
       | for a project. Wrote up my thoughts with some performance
       | benchmarks in a blog.
       | 
        | TLDR -- What these frameworks can do on off-the-shelf laptops is
        | astounding. However, it is very difficult to find and deploy a
        | task-specific model, and the models themselves (even with
        | quantization) are so large that the download would kill UX for
        | most applications.
        
         | codelion wrote:
          | There are ways to improve the performance of local LLMs with
          | inference-time techniques. You can try optillm -
          | https://github.com/codelion/optillm - it is possible to match
          | the performance of larger models on narrow tasks by doing more
          | work at inference time.
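          | 
          | To make that concrete, here is a minimal sketch of one such
          | technique (self-consistency, i.e. majority voting over several
          | samples). generate() is a hypothetical stand-in for whatever
          | local runtime you use; optillm implements this and more:
          | 
          |     from collections import Counter
          | 
          |     def generate(prompt, temperature=0.8):
          |         # Hypothetical: call your local inference server here
          |         # (ollama, llama.cpp server, ...) and return its answer.
          |         raise NotImplementedError
          | 
          |     def self_consistency(prompt, n_samples=8):
          |         # Sample several answers and keep the most common one.
          |         answers = [generate(prompt).strip()
          |                    for _ in range(n_samples)]
          |         return Counter(answers).most_common(1)[0][0]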
        
       | ranger_danger wrote:
       | I thought llamafile was supposed to be the solution to "too hard
       | to work with"?
       | 
       | https://github.com/Mozilla-Ocho/llamafile
        
         | archerx wrote:
          | Llamafile is great and I love it. I run all my models using it
          | and it's super portable; I have tested it on Windows and Linux,
          | on a powerful PC and on an SBC. It worked great without too
          | many issues.
          | 
          | It takes about a month for features from llama.cpp to trickle
          | in. Also, figuring out the best mix of context length, VRAM
          | size, and desired speed takes a while before it gets intuitive.
        
         | rzzzt wrote:
          | I thought it was "docker model" (and OCI artifacts).
        
           | dust42 wrote:
           | llamafile is a multiplatform executable that wraps the model
           | and a slightly modified version of llama.cpp. IIRC funded by
           | Moz.
        
       | antirez wrote:
        | Download the model in the background. Serve the client with an
        | LLM vendor API just for the first requests, or even with that
        | same local LLM installed on your own servers (likely cheaper). By
        | doing so, in the long run the inference cost is near-zero, which
        | lets you use LLMs in otherwise impossible business models (like
        | freemium).
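        | 
        | A minimal sketch of that routing, with hypothetical
        | cloud_complete()/local_complete() wrappers and a made-up model
        | path (not any particular library's API):
        | 
        |     import os
        | 
        |     MODEL_PATH = os.path.expanduser("~/.cache/myapp/model.gguf")
        | 
        |     def cloud_complete(prompt):
        |         raise NotImplementedError  # LLM vendor API call goes here
        | 
        |     def local_complete(prompt):
        |         raise NotImplementedError  # local runtime call goes here
        | 
        |     def complete(prompt):
        |         # Serve early requests from the cloud; switch over once
        |         # the background download has landed on disk.
        |         if os.path.exists(MODEL_PATH):
        |             return local_complete(prompt)
        |         return cloud_complete(prompt)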
        
         | aazo11 wrote:
         | Exactly. Why does this not exist yet?
        
           | byyoung3 wrote:
            | it's an if statement on whether the model has downloaded or
           | not
        
             | aazo11 wrote:
              | A better solution would train/finetune the smaller model
              | from the responses of the larger model and only push
              | inference to the edge once the smaller model is performant
              | and the hardware specs can handle the workload?
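              | 
              | Roughly, the collection side could be as simple as logging
              | every cloud response as a training pair (hypothetical
              | sketch; file name and format are made up):
              | 
              |     import json
              | 
              |     def log_for_distillation(prompt, cloud_response,
              |                              path="distill.jsonl"):
              |         # One chat-style example per line, ready for
              |         # fine-tuning a smaller model later.
              |         record = {"messages": [
              |             {"role": "user", "content": prompt},
              |             {"role": "assistant",
              |              "content": cloud_response},
              |         ]}
              |         with open(path, "a") as f:
              |             f.write(json.dumps(record) + "\n")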
        
               | monoid73 wrote:
                | yeah, that'd be nice, some kind of self-bootstrapping
               | system where you start with a strong cloud model, then
               | fine-tune a smaller local one over time until it's good
               | enough to take over. tricky part is managing quality
               | drift and deciding when it's 'good enough' without
               | tanking UX. edge hardware's catching up though, so feels
               | more feasible by the day.
        
         | manmal wrote:
         | Personally, I only use locally run models when I absolutely
         | can't have the prompt/context uploaded to a cloud. For anything
         | else, I just use one of the commercial cloud hosted models. The
         | ones I'm using are way faster and better in _every_ way except
         | privacy. Eg if you are ok to spend more, you can get blazing
         | fast DeepSeek v3 or R1 via OpenRouter. Or, rather cheap Claude
         | Sonnet via Copilot (pre-release also has Gemini 2.5 Pro btw).
         | 
         | I've gotten carried away - I meant to express that using cloud
         | as a fallback for local models is something I absolutely don't
         | want or need, because privacy is the whole and only point to
         | local models.
        
       | ijk wrote:
        | There are two general categories of local inference:
        | 
        | - You're running a personal hosted instance. Good for
        | experimentation and personal use, though there's a tradeoff
        | versus renting a cloud server.
       | 
       | - You want to run LLM inference on client machines (i.e., you
       | aren't directly supervising it while it is running).
       | 
       | I'd say that the article is mostly talking about the second one.
       | Doing the first one will get you familiar enough with the
       | ecosystem to handle some of the issues he ran into when
       | attempting the second (e.g., exactly which model to use). But the
       | second has a bunch of unique constraints--you want things to just
       | work for your users, after all.
       | 
       | I've done in-browser neural network stuff in the past (back when
       | using TensorFlow.js was a reasonable default choice) and based on
       | the way LLM trends are going I'd guess that edge device LLM will
       | be relatively reasonable soon; I'm not quite sure that I'd deploy
       | it in production this month but ask me again in a few.
       | 
       | Relatively tightly constrained applications are going to benefit
       | more than general-purpose chatbots; pick a small model that's
       | relatively good at your task and train it on enough of your data
       | and you can get a 1B or 3B model that has acceptable performance,
       | let alone the 7B ones being discussed here. It absolutely won't
       | replace ChatGPT (though we're getting closer to replacing ChatGPT
       | 3.5 with small models). But if you've got a specific use case
       | that will hold still enough to deploy a model it can definitely
       | give you the edge versus relying on the APIs.
       | 
       | I expect games to be one of the first to try this: per-player-
       | action API costs murder per-user revenue, most of the gaming
       | devices have some form of GPU already, and most games are shipped
       | as apps so bundling a few more GB in there is, if not reasonable,
       | at least not unprecedented.
        
         | aazo11 wrote:
         | Very interesting. I had not thought about gaming at all but
         | that makes a lot of sense.
         | 
         | I also agree the goal should not be to replace ChatGPT. I think
         | ChatGPT is way overkill for a lot of the workloads it is
         | handling. A good solution should probably use the cloud LLM
         | outputs to train a smaller model to deploy in the background.
        
         | CharlieRuan wrote:
         | Curious what are some examples of "per-player-action API costs"
         | for games?
        
           | ivape wrote:
           | What if I charge "whales" in games to talk to an anime girl?
           | Maybe I'll only let you talk to her once a day unless you pay
           | me like a kissing booth for every convo. There's going to be
           | some predatory stuff out there, I can see what the GP is
           | talking about with games.
        
           | kevingadd wrote:
           | For a while basically any mobile or browser freemium game you
           | tried would have progress timers for building things or
           | upgrading things and they'd charge you Actual Money to skip
           | the wait. That's kind of out of fashion now though some games
           | still do it.
        
           | ijk wrote:
           | Inference using an API costs money. Not a lot of money, per
           | million tokens, but it adds up if you have a lot of
            | tokens...and some of the obvious game uses really chew
            | through the tokens, like chatting with a character or having
            | an NPC make decisions via a reasoning model. That can easily
            | make the tokens add up.
           | 
           | Games, on the other hand, are mostly funded via up-front
           | purchase (so you get the money once and then have to keep the
           | servers running) or free to play, which very carefully tracks
           | user acquisition costs versus revenue. Most F2P games make a
           | tiny amount per player; they make up the difference via
           | volume (and whales). So even a handful of queries per day per
           | player can bankrupt you if you have a million players and no
           | way to recoup the inference cost.
           | 
           | Now, you can obviously add a subscription or ongoing charge
           | to offset it, but that's not how the industry is mostly set
           | up at the moment. I expect that the funding model _will_
           | change, but meanwhile having a model on the edge device is
           | the only currently realistic way to afford adding an LLM to a
           | big single player RPG, for example.
        
       | thot_experiment wrote:
       | Yikes what's the bar for dead simple these days? Even my totally
       | non-technical gamer friends are messing around with ollama
       | because I just have to give them one command to get any of the
       | popular LLMs up and running.
       | 
       | Now of course "non technical" here is still a pc gamer that's had
       | to fix drivers once or twice and messaged me to ask "hey how do i
       | into LLM, Mr. AI knower", but I don't think twice these days
       | about showing any pc owner how to use ollama because I know I
       | probably won't be on the hook for much technical support. My
       | sysadmin friends are easily writing clever scripts against
       | ollama's JSON output to do log analysis and other stuff.
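        | 
        | For a flavor of those scripts: a rough sketch against ollama's
        | local HTTP API (assuming the default port 11434 and a model
        | you've already pulled; adjust names to taste):
        | 
        |     import json
        |     import requests
        | 
        |     def summarize_log(log_text, model="llama3.1"):
        |         resp = requests.post(
        |             "http://localhost:11434/api/generate",
        |             json={
        |                 "model": model,
        |                 "prompt": "Summarize the errors in this log as "
        |                           "JSON with keys 'errors' and "
        |                           "'severity':\n" + log_text,
        |                 "format": "json",  # constrain output to JSON
        |                 "stream": False,
        |             },
        |             timeout=120,
        |         )
        |         resp.raise_for_status()
        |         return json.loads(resp.json()["response"])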
        
         | aazo11 wrote:
         | By "too hard" I do not mean getting started with them to run
         | inference on a prompt. Ollama especially makes that quite easy.
         | But as an application developer, I feel these platforms are too
         | hard to build around. The main issues being: getting the
         | correct small enough task specific model and how long it takes
         | to download these models for the end user.
        
           | thot_experiment wrote:
            | I guess it depends on expectations. If your expectation is a
            | CRUD app that opens in 5 seconds, then sure, it's definitely
            | tedious. People do _install_ things though; the companion app
            | for DJI action cameras is 700mb (which is an abomination, but
            | still). Modern games are > 100gb on the high side, so
            | downloading 8-16gb of tensors one time is nbd. You mentioned
            | that there are 663 different models of dsr1-7b on
            | huggingface, sure, but if you want that model on ollama it's
            | just `ollama run deepseek-r1`
           | 
           | As a developer the amount of effort I'm likely to spend on
           | the infra side of getting the model onto the user's computer
           | and getting it running is now FAR FAR below the amount of
           | time I'll spend developing the app itself or getting together
           | a dataset to tune the model I want etc. Inference is solved
           | enough. "getting the correct small enough model" is something
           | that I would spend the day or two thinking about/testing when
           | building something regardless. It's not hard to check how
           | much VRAM someone has and get the right model, the decision
           | tree for that will have like 4 branches. It's just so little
           | effort compared to everything else you're going to have to do
           | to deliver something of value to someone. Especially in the
           | set of users that have a good reason to run locally.
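            | 
            | The decision tree really is about that small (thresholds and
            | model names here are just illustrative, not recommendations):
            | 
            |     def pick_model(vram_gb):
            |         # Map available VRAM to a quantized model tag.
            |         if vram_gb >= 24:
            |             return "qwen2.5:32b-instruct-q4_K_M"
            |         if vram_gb >= 12:
            |             return "qwen2.5:14b-instruct-q4_K_M"
            |         if vram_gb >= 8:
            |             return "qwen2.5:7b-instruct-q4_K_M"
            |         return "qwen2.5:3b-instruct-q4_K_M"  # low-VRAM/CPU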
        
       | bionhoward wrote:
       | LM Studio seems pretty good at making local models easier to use
        
         | jasonjmcghee wrote:
         | they made it so easy to do specdec, that alone sold it for me
         | 
          | Some models even have a 0.5B draft model. The speed increase is
         | incredible.
        
         | aazo11 wrote:
         | They look awesome. Will try it out.
        
         | ivape wrote:
         | Here is another:
         | 
         | https://msty.app/
        
         | resource_waste wrote:
          | I'm genuinely afraid it's going to do telemetry one day.
         | 
         | I'm sure someone is watching their internet traffic, but I
         | don't.
         | 
         | I take the risk now, but I ask questions about myself,
         | relationships, conversations, etc... Stuff I don't exactly want
         | Microsoft/ChatGPT to have.
        
           | ivape wrote:
            | Local inferencing is synonymous with privacy for me. Until
            | laws get put into effect, there is no universe where your LLM
            | usage online is private as it stands now. I suspect most of
            | these companies are going to put in a Microsoft Clippy-style
            | assistant soon that will act as a recommendation/ad engine,
            | and this of course requires parsing every convo you've ever
            | had. The paid tier may remove Clippy, but boy oh boy the free
            | tier (which most people will use) won't.
           | 
           | Clippy is coming back guys, and we have to be ready for it.
        
           | manmal wrote:
            | I've configured Little Snitch to only allow it access to
            | huggingface. I think for updates I need to reset LS to "ask
            | for each connection" or something like that.
        
         | manmal wrote:
         | A less known feature of LM Studio I really like is speculative
         | decoding: https://lmstudio.ai/blog/lmstudio-v0.3.10
         | 
         | Basically you let a very small model speculate on the next few
         | tokens, and the large model then blesses/rejects those
         | predictions. Depending on how well the small model performs,
         | you get massive speedups that way.
         | 
          | The small model has to be as close to the big model as possible
          | - I tried this with models from different vendors and it slowed
          | generation down by 3x or so. So you need to use a small Qwen
          | 2.5 with a big Qwen 2.5, etc.
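          | 
          | The accept/reject rule itself is simple. A toy sketch of one
          | step (draft_generate() and target_argmax() are hypothetical
          | stand-ins for the two models; real engines score all draft
          | positions in a single batched forward pass of the big model):
          | 
          |     def draft_generate(prefix, k):
          |         raise NotImplementedError  # small model drafts k tokens
          | 
          |     def target_argmax(prefix):
          |         raise NotImplementedError  # big model's greedy next token
          | 
          |     def speculative_step(prefix, k=4):
          |         accepted = []
          |         for tok in draft_generate(prefix, k):
          |             target_tok = target_argmax(prefix + accepted)
          |             if target_tok == tok:
          |                 accepted.append(tok)         # draft token blessed
          |             else:
          |                 accepted.append(target_tok)  # big model overrides
          |                 break
          |         return accepted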
        
       | resource_waste wrote:
       | They are using a Mac and complaining about how slow it is... This
       | is an Id10t error, not a problem with LLMs.
       | 
       | If you know, you know. CPU for LLMs is bad. No amount of Apple
       | Marketing can change that.
       | 
       | Even my $700 laptop with a 3050 produces near instant results
       | with 12B models.
       | 
        | I'm not sure what to tell you... Look at corporations who are
        | doing local LLMs and see what they are buying: they aren't buying
        | Apple, they are buying Nvidia.
        
       | zellyn wrote:
       | Weird to give MacBook Pro specs and omit RAM. Or did I miss it
       | somehow? That's one of the most important factors.
        
         | manmal wrote:
          | Using a 7B model on an M2 Max also isn't quite the most
         | impressive way to locally run an LLM. Why not use QwQ-32 and
         | let it give some commercial non-reasoning models a run for
         | their money?
        
           | zellyn wrote:
           | Exactly. You want to come close to maxing out your RAM for
           | model+context. I've run Gemma on a 64GB M1 and it was pretty
            | okay, although that was before the Quantization-Aware
            | Training version was released last week, so it might be even
            | better now.
        
         | aazo11 wrote:
         | Thanks for calling that out. It was 32GB. I updated the post as
         | well.
        
       | larodi wrote:
        | Having done my masters on the topic of grammar-assisted text2sql,
        | let me add some additional context here:
        | 
        | - first of all, local inference can never beat cloud inference,
        | for the very simple reason that costs go down with batching. it
        | took me two years to actually understand what batching is - the
        | LLM tensors flowing through the transformer layers have a
        | dimension designed specifically for processing data in parallel,
        | so whether you process 1 sequence or 128 sequences the cost is
        | roughly the same (see the toy example at the end of this
        | comment). i've seen very few articles stressing this, so bear in
        | mind - this is the primary blocker for local inference competing
        | with cloud inference.
       | 
        | - second, and this is not a light one to take - LLM-assisted
        | text2sql is not trivial, not at all. you may think it is, you may
        | expect cutting-edge models to do it right, but there are plenty
        | of reasons models fail so badly at this seemingly trivial task.
        | you may start with an arbitrary article such as
        | https://arxiv.org/pdf/2408.14717 and dig through the references;
        | sooner or later you will stumble on one of dozens of overview
        | papers, mostly by Chinese researchers (such as
        | https://arxiv.org/abs/2407.10956), where the approaches are
        | summarized. Caution: you may either feel inspired that AI will
        | not take over your job, or feel miserable at how much effort is
        | spent on this task and how badly everything fails in real-world
        | scenarios.
       | 
        | - finally, something I agreed on with a professor advising a
        | doctorate candidate whose thesis, surprisingly, was on the same
        | topic: LLMs handle GraphQL and other structured formats such as
        | JSON much better than the complex grammar of SQL, which is not a
        | regular grammar but a context-free one, and so takes more complex
        | machines to parse and very often recursion.
       | 
        | - which brings us to the most important question - why do
        | commercial GPTs fare so much better on it than local models?
        | well, it is presumed the top players not only use MoEs but also
        | employ beam search, perhaps speculative inference, and all sorts
        | of optimizations at the hardware level. while all this is not
        | beyond comprehension for a casual researcher at a casual
        | university (like myself), you don't get to easily run it all
        | locally. I have not written an inference engine myself, but I
        | imagine MoE plus beam search is super complex, as beam search
        | basically means you fork the whole LLM execution state and go
        | back and forth. Not sure how this even works together with
        | batching.
       | 
        | So basically - this is too expensive. Besides, atm (to my
        | knowledge) only vLLM (the engine) has some sort of reasonably
        | working local beam search. I would've loved to see llama.cpp's
        | beam search get a rewrite, but it stalled. Trying to get beam
        | search working with current python libs is nearly impossible on
        | commodity hardware, even if you have 48 gigs of VRAM, which
        | already means a very powerful GPU.
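        | 
        | Toy example of the batch dimension mentioned above: the same
        | weights are applied whether 1 or 128 sequences share the batch,
        | which is why the per-sequence cost collapses once a provider can
        | batch many users' requests together:
        | 
        |     import numpy as np
        | 
        |     d_model, seq_len = 1024, 128
        |     W = np.random.randn(d_model, d_model).astype(np.float32)
        | 
        |     for batch in (1, 128):
        |         x = np.random.randn(batch, seq_len, d_model)
        |         x = x.astype(np.float32)
        |         y = x @ W  # batch is just a leading dimension
        |         print(batch, y.shape)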
        
         | ianand wrote:
         | Sounds like an interesting masters thesis. Is your masters
         | thesis available online somewhere?
        
         | ijk wrote:
          | There are local applications of parallel processing; your average
         | chatbot wouldn't use it, but a research bot with multiple
         | simultaneous queries will, for example.
         | 
         | Better local beamsearch would be really nice to have, though.
        
         | ijk wrote:
         | I do wonder if recursion is particularly hard for LLMs, given
         | that they have a hard limit on how much they can loop for a
         | given token. (Absent beam search, reasoning models, and other
         | trickery.)
        
       ___________________________________________________________________
       (page generated 2025-04-21 23:00 UTC)