[HN Gopher] I want everything local - Building my offline AI wor...
       ___________________________________________________________________
        
       I want everything local - Building my offline AI workspace
        
       Author : mkagenius
       Score  : 340 points
       Date   : 2025-08-08 18:19 UTC (4 hours ago)
        
 (HTM) web link (instavm.io)
 (TXT) w3m dump (instavm.io)
        
       | shaky wrote:
       | This is something that I think about quite a bit and am grateful
       | for this write-up. The amount of friction to get privacy today is
       | astounding.
        
         | sneak wrote:
         | This writeup has nothing of the sort and is not helpful toward
         | that goal.
        
           | frank_nitti wrote:
           | I'd assume they are referring to being able to run your own
            | workloads in a home-built system, rather than surrendering
           | that ownership to the tech giants alone
        
             | Imustaskforhelp wrote:
              | Also you get a sort of complete privacy: the data never
              | leaves your home, whereas at best you would have to
              | trust the AI cloud providers that they are not training
              | on or storing that data.
              | 
              | It's just more freedom and privacy in that regard.
        
               | doctorpangloss wrote:
               | The entire stack involved sends so much telemetry.
        
               | frank_nitti wrote:
                | This, in particular, is a big motivator and rewarding
                | factor in getting a local setup working. Turning off
                | the internet and seeing everything run end to end is a
                | joy.
        
       | noelwelsh wrote:
        | It's the hardware more than the software that is the limiting
        | factor at the moment, no? Hardware to run a good LLM locally
        | starts around $2000 (e.g. Strix Halo / AI Max 395). I think a
        | few Strix Halo iterations will make it considerably easier.
        
         | colecut wrote:
         | This is rapidly improving
         | 
         | https://simonwillison.net/2025/Jul/29/space-invaders/
        
           | Imustaskforhelp wrote:
            | I hope it keeps improving at this steady rate! Let's just
            | hope there is still room to pack even more improvements
            | into such LLMs, which can help the home-labbing community
            | in general.
        
         | ramesh31 wrote:
         | >Hardware to run a good LLM locally starts around $2000 (e.g.
         | Strix Halo / AI Max 395) I think a few Strix Halo iterations
         | will make it considerably easier.
         | 
         | And "good" is still questionable. The thing that makes this
         | stuff useful is when it works instantly like magic. Once you
         | find yourself fiddling around with subpar results at slower
         | speeds, essentially all of the value is gone. Local models have
         | come a long way but there is still nothing even close to Claude
         | levels when it comes to coding. I just tried taking the latest
         | Qwen and GLM models for a spin through OpenRouter with Cline
         | recently and they feel roughly on par with Claude 3.0.
         | Benchmarks are one thing, but reality is a completely different
         | story.
        
       | ahmedbaracat wrote:
       | Thanks for sharing. Note that the GitHub at the end of the
       | article is not working...
        
         | mkagenius wrote:
         | Thanks for the heads up. It's fixed now -
         | 
         | Coderunner-UI: https://github.com/instavm/coderunner-ui
         | 
         | Coderunner: https://github.com/instavm/coderunner
        
       | navbaker wrote:
        | Open WebUI is a great alternative for a chat interface. You
        | can point it at an OpenAI-compatible API like vLLM or use the
        | native Ollama integration, and it has cool features like being
        | able to say something like "generate code for an HTML and
        | JavaScript pong game" and have it display the running code
        | inline with the chat for testing.
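        | 
        | A minimal sketch of that wiring (assuming Ollama's OpenAI-
        | compatible endpoint on its default port 11434; a vLLM server
        | would just use a different base_url such as
        | http://localhost:8000/v1):
        | 
        |     from openai import OpenAI
        | 
        |     # Point the standard OpenAI client at a local server
        |     # instead of the cloud. Ollama serves an OpenAI-compatible
        |     # API under /v1; the key is unused but must be non-empty.
        |     client = OpenAI(base_url="http://localhost:11434/v1",
        |                     api_key="ollama")
        | 
        |     resp = client.chat.completions.create(
        |         model="llama3.1",  # any model you've pulled locally
        |         messages=[{"role": "user", "content":
        |                    "Generate an HTML and JavaScript pong game."}],
        |     )
        |     print(resp.choices[0].message.content)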
        
       | dmezzetti wrote:
       | I built TxtAI with this philosophy in mind:
       | https://github.com/neuml/txtai
        
       | pyman wrote:
       | Mr Stallman? Richard, is that you?
        
       | tcdent wrote:
       | I'm constantly tempted by the idealism of this experience, but
       | when you factor in the performance of the models you have access
       | to, and the cost of running them on-demand in a cloud, it's
       | really just a fun hobby instead of a viable strategy to benefit
       | your life.
       | 
        | As the hardware continues to iterate at a rapid pace, anything
        | you pick up second-hand will still depreciate at that pace,
        | making any real investment in hardware unjustifiable.
       | 
       | Coupled with the dramatically inferior performance of the weights
       | you would be running in a local environment, it's just not worth
       | it.
       | 
       | I expect this will change in the future, and am excited to invest
       | in a local inference stack when the weights become available.
       | Until then, you're idling a relatively expensive, rapidly
       | depreciating asset.
        
         | braooo wrote:
         | Running LLMs at home is a repeat of the mess we make with "run
         | a K8s cluster at home" thinking
         | 
         | You're not OpenAI or Google. Just use pytorch, opencv, etc to
         | build the small models you need.
         | 
          | You don't even need Docker! You can share over a simple
          | code-based HTTP router app and pre-shared certs with friends.
         | 
         | You're recreating the patterns required to manage a massive
         | data center in 2-3 computers in your closet. That's insane.
        
           | frank_nitti wrote:
            | For me, this is essential. On principle, I won't pay money
            | to be a software engineer.
           | 
           | I never paid for cloud infrastructure out of pocket, but
           | still became the go-to person and achieved lead architecture
           | roles for cloud systems, because learning the FOSS/local
           | tooling "the hard way" put me in a better position to
           | understand what exactly my corporate employers can leverage
           | with the big cash they pay the CSPs.
           | 
           | The same is shaping up in this space. Learning the nuts and
           | bolts of wiring systems together locally with whatever Gen AI
           | workloads it can support, and tinkering with parts of the
           | process, is the only thing that can actually keep me
           | interested and able to excel on this front relative to my
           | peers who just fork out their own money to the fat cats that
           | own billions worth of compute.
           | 
           | I'll continue to support efforts to keep us on the track of
           | engineers still understanding and able to 'own' their
           | technology from the ground up, if only at local tinkering
           | scale
        
             | jtbaker wrote:
             | Self hosting my own LLM setup in the homelab was what
             | really helped me learn the fundamentals of K8s. If nothing
             | else I'm grateful for that!
        
           | Imustaskforhelp wrote:
            | So I love Linux and would like to learn devops in its
            | entirety one day, enough to be an expert who can actually
            | comment on the whole post, but
            | 
            | I feel like they actually used Docker just for the
            | isolation part, as a sandbox (technically they didn't use
            | Docker but something similar to it for Mac: Apple
            | containers). I don't think it has anything to do with k8s
            | or scalability or pre-shared certs or HTTP routers :/
        
         | jeremyjh wrote:
         | I expect it will never change. In two years if there is a local
         | option as good as GPT-5 there will be a much better cloud
         | option and you'll have the same tradeoffs to make.
        
           | c-hendricks wrote:
           | Why would AI be one of the few areas where locally-hosted
           | options can't reach "good enough"?
        
             | hombre_fatal wrote:
             | For some use-cases, like making big complex changes to big
             | complex important code or doing important research, you're
             | pretty much always going to prefer the best model rather
             | than leave intelligence on the table.
             | 
             | For other use-cases, like translations or basic queries,
             | there's a "good enough".
        
               | kelnos wrote:
               | That depends on what you value, though. If local control
               | is that important to you for whatever reason (owning your
               | own destiny, privacy, whatever), you might find that
               | trade off acceptable.
               | 
               | And I expect that over time the gap will narrow. Sure,
               | it's likely that commercially-built LLMs will be a step
               | ahead of the open models, but -- just to make up numbers
               | -- say today the commercially-built ones are 50% better.
               | I could see that narrowing to 5% or something like that,
               | after some number of years have passed. Maybe 5% is a
               | reasonable trade-off for some people to make, depending
               | on what they care about.
               | 
               | Also consider that OpenAI, Anthropic, et al. are all
               | burning through VC money like nobody's business. That
               | money isn't going to last forever. Maybe at some point
               | Anthropic's Pro plan becomes $100/mo, and Max becomes
               | $500-$1000/mo. Building and maintaining your own
               | hardware, and settling for the not-quite-the-best models
               | might be very much worth it.
        
               | m11a wrote:
               | Agree, for now.
               | 
               | But the foundation models will eventually hit a limit,
               | and the open-source ecosystem, which trails by around a
               | year or two, will catch up.
        
           | bbarnett wrote:
           | I grew up in a time when listening to an mp3 was too
           | computationally expensive and nigh impossible for the average
           | desktop. Now tiny phones can decode high def video realtime
           | due to CPU extensions.
           | 
           | And my phone uses a tiny, tiny amount of power,
           | comparatively, to do so.
           | 
           | CPU extensions and other improvements will make AI a simple,
           | tiny task. Many of the improvements will come from robotics.
        
             | oblio wrote:
              | At a certain point Moore's Law died, and that point was
              | about 20 years ago; fortunately for MP3s, it happened
              | after MP3 became easily usable. There's no point in
              | comparing anything before 2005 or so from that
              | perspective.
             | 
             | We have long entered an era where computing is becoming
             | more expensive and power hungry, we're just lucky regular
             | computer usage has largely plateaued at a level where the
             | already obtained performance is good enough.
             | 
             | But major leaps are a lot more costly these days.
        
           | pfannkuchen wrote:
           | It might change once the companies switch away from lighting
           | VC money on fire mode and switch to profit maximizing mode.
           | 
           | I remember Uber and AirBnB used to seem like unbelievably
           | good deals, for example. That stopped eventually.
        
             | jeremyjh wrote:
             | This I could see.
        
             | oblio wrote:
             | AirBNB is so good that it's half the size of Booking.com
             | these days.
             | 
             | And Uber is still big but about 30% of the time in places I
             | go to, in Europe, it's just another website/app to call
             | local taxis from (medallion and all). And I'm fairly sure
             | locals generally just use the website/app of the local
             | company, directly, and Uber is just a frontend for
             | foreigners unfamiliar with that.
        
               | pfannkuchen wrote:
                | Right, but if you wanted to start a competitor it
                | would be a lot easier today vs back then. And running
                | one for yourself doesn't really apply to these
                | examples, but in terms of the order-of-magnitude
                | difference in spend it's the same idea.
        
           | duxup wrote:
            | Maybe, but my phone has become a "good enough" computer
            | for most tasks compared to a desktop or my laptop.
           | 
           | Seems plausible the same goes for AI.
        
           | kasey_junk wrote:
           | I'd be surprised by that outcome. At one point databases were
           | cutting edge tech with each engine leap frogging each other
            | in capability. Still, the proprietary DBs often have
            | features that aren't matched elsewhere.
            | 
            | But the open DBs got good enough that you need to justify
            | not using them with specific reasons why.
           | 
           | That seems at least as likely an outcome for models as they
           | continue to improve infinitely into the stars.
        
           | victorbjorklund wrote:
            | Next two years, probably. But at some point we will either
            | hit scales where you really don't need anything better
            | (let's say cloud is 10000 token/s and local is 5000
            | token/s; makes no difference for most individual users) or
            | we will hit some wall where AI doesn't get smarter but the
            | cost of hardware continues to fall.
        
           | kvakerok wrote:
            | What is even the point of having a self-hosted GPT-5
            | equivalent that doesn't have petabytes of knowledge?
        
           | zwnow wrote:
            | You know there's a ceiling to all this with the current
            | LLM approaches, right? They won't become that much better;
            | it's even more likely they will degrade. There are cases
            | of bad actors attacking LLMs by feeding them false
            | information and propaganda. I don't see this changing in
            | the future.
        
           | Aurornis wrote:
           | There will always be something better on big data center
           | hardware.
           | 
           | However, small models are continuing to improve at the same
           | time that large RAM capacity computing hardware is becoming
           | cheaper. These two will eventually intersect at a point where
           | local performance is good enough and fast enough.
        
             | kingo55 wrote:
              | If you've tried gpt-oss:120b and Moonshot AI's Kimi Dev,
              | it feels like this is getting closer to reality. Mac
              | Studios, while expensive, now offer 512GB of usable RAM
              | as well. The tooling available for running local models
              | is also becoming more accessible than even just a year
              | ago.
        
         | meta_ai_x wrote:
         | This is especially true since AI is a large multiplicative
         | factor to your productivity.
         | 
          | If cloud LLMs have 10 IQ points on a local LLM, within a
          | month you'll notice you're struggling behind the dude who
          | just used the cloud LLM.
          | 
          | LocalLlama is for hobbyists, or for when your job depends on
          | running local models.
         | 
         | This is not one-time upfront setup cost vs payoff later
         | tradeoff. It is a tradeoff you are making every query which
         | compounds pretty quickly.
         | 
         | Edit : I expect nothing better than downvotes from this crowd.
         | How HN has fallen on AI will be a case study for the ages
        
         | bigyabai wrote:
         | > anything you pick up second-hand will still deprecate at that
         | pace
         | 
         | Not really? The people who do local inference most (from what
         | I've seen) are owners of Apple Silicon and Nvidia hardware.
          | Apple Silicon has ~7 years of decent enough LLM support under
          | its belt, and Nvidia is _only now_ starting to deprecate
          | 11-year-old GPU hardware in drivers.
         | 
         | If you bought a decently powerful inference machine 3 or 5
         | years ago, it's probably still plugging away with great tok/s.
         | Maybe even faster inference because of MoE architectures or
         | improvements in the backend.
        
           | Aurornis wrote:
           | > If you bought a decently powerful inference machine 3 or 5
           | years ago, it's probably still plugging away with great
           | tok/s.
           | 
           | I think this is the difference between people who embrace
           | hobby LLMs and people who don't:
           | 
           | The token/s output speed on affordable local hardware for
           | large models is not great for me. I already wish the cloud
           | hosted solutions were several times faster. Any time I go to
           | a local model it feels like I'm writing e-mails back and
           | forth to an LLM, not working with it.
           | 
           | And also, the first Apple M1 chip was released less than 5
           | years ago, not 7.
        
             | bigyabai wrote:
             | > Any time I go to a local model it feels like I'm writing
             | e-mails back and forth
             | 
             | Do you have a good accelerator? If you're offloading to a
             | powerful GPU it shouldn't feel like that at all. I've
             | gotten ChatGPT speeds from a 4060 running the OSS 20B and
             | Qwen3 30B models, both of which are competitive with
             | OpenAI's last-gen models.
             | 
             | > the first Apple M1 chip was released less than 5 years
             | ago
             | 
             | Core ML has been running on Apple-designed silicon for 8
             | years now, if we really want to get pedantic. But sure,
              | _actual_ LLM/transformer use is a more recent phenomenon.
        
           | Uehreka wrote:
           | People on HN do a lot of wishful thinking when it comes to
           | the macOS LLM situation. I feel like most of the people
           | touting the Mac's ability to run LLMs are either impressed
           | that they run at all, are doing fairly simple tasks, or just
           | have a toy model they like to mess around with and it doesn't
           | matter if it messes up.
           | 
           | And that's fine! But then people come into the conversation
           | from Claude Code and think there's a way to run a coding
           | assistant on Mac, saying "sure it won't be as good as Claude
           | Sonnet, but if it's even half as good that'll be fine!"
           | 
           | And then they realize that the heavvvvily quantized models
           | that you can run on a mac (that isn't a $6000 beast) can't
           | invoke tools properly, and try to "bridge the gap" by
           | hallucinating tool outputs, and it becomes clear that the
           | models that are small enough to run locally aren't "20-50% as
           | good as Claude Sonnet", they're like toddlers by comparison.
           | 
           | People need to be more clear about what they mean when they
           | say they're running models locally. If you want to build an
           | image-captioner, fine, go ahead, grab Gemma 7b or something.
           | If you want an assistant you can talk to that will give you
           | advice or help you with arbitrary tasks for work, that's not
           | something that's on the menu.
        
             | bigyabai wrote:
              | I agree completely. My larger point is that Apple and
              | Nvidia's hardware has depreciated more slowly, because
              | they've been shipping highly dense chips for a while now.
             | Apple's software situation is utterly derelict and it
             | cannot be seriously compared to CUDA in the same sentence.
             | 
             | For inference purposes, though, compute shaders have worked
             | fine for all 3 manufacturers. It's really only Nvidia users
             | that benefit from the wealth of finetuning/training
             | programs that are typically CUDA-native.
        
         | motorest wrote:
         | > As the hardware continues to iterate at a rapid pace,
         | anything you pick up second-hand will still deprecate at that
         | pace, making any real investment in hardware unjustifiable.
         | 
         | Can you explain your rationale? It seems that the worst case
         | scenario is that your setup might not be the most performant
         | ever, but it will still work and run models just as it always
         | did.
         | 
         | This sounds like a classical and very basic opex vs capex
         | tradeoff analysis, and these are renowned for showing that on
         | financial terms cloud providers are a preferable option only in
         | a very specific corner case: short-term investment to jump-
         | start infrastructure when you do not know your scaling needs.
         | This is not the case for LLMs.
         | 
         | OP seems to have invested around $600. This is around 3 months
         | worth of an equivalent EC2 instance. Knowing this, can you
         | support your rationale with numbers?
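          | 
          | A rough back-of-envelope version of that comparison (the
          | hourly rate below is an assumption chosen to match the "3
          | months" figure, not a quoted price; plug in whatever
          | instance you consider equivalent):
          | 
          |     # Back-of-envelope capex vs. opex comparison.
          |     homelab_capex = 600.0    # one-time hardware cost, USD
          |     ec2_hourly = 0.28        # assumed on-demand rate, USD/hr
          |     hours_per_month = 730    # running 24/7
          | 
          |     monthly_cloud = ec2_hourly * hours_per_month
          |     breakeven_months = homelab_capex / monthly_cloud
          |     print(f"cloud: ~${monthly_cloud:.0f}/month, "
          |           f"break-even after ~{breakeven_months:.1f} months")
          |     # With these numbers: ~$204/month, break-even in ~2.9
          |     # months. Electricity and depreciation are ignored here.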
        
           | tcdent wrote:
            | When considering used hardware you have to take
            | quantization into account; gpt-oss-120b, for example, ships
            | in the very new MXFP4 format, which will take far more than
            | 80GB once mapped onto the floating-point types available on
            | older hardware or Apple silicon.
           | 
           | Open models are trained on modern hardware and will continue
           | to take advantage of cutting edge numeric types, and older
           | hardware will continue to suffer worse performance and larger
           | memory requirements.
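            | 
            | Rough weight-only memory math for that example (parameter
            | count and per-weight sizes here are ballpark assumptions;
            | real usage is higher once you add KV cache and runtime
            | overhead):
            | 
            |     # Approximate memory for the weights of a ~120B-param
            |     # model at different storage formats.
            |     params = 120e9
            |     bytes_per_weight = {
            |         "fp16/bf16": 2.0,
            |         "int8": 1.0,
            |         "mxfp4 (~4.25 bits with block scales)": 4.25 / 8,
            |     }
            |     for fmt, b in bytes_per_weight.items():
            |         print(f"{fmt:>38}: {params * b / 1e9:.0f} GB")
            |     # fp16 ~240 GB, int8 ~120 GB, mxfp4 ~64 GB -- which is
            |     # why falling back to fp16/bf16 on hardware without
            |     # native FP4 support blows well past 80 GB.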
        
             | motorest wrote:
              | You're using a lot of words to say "I believe yesterday's
              | hardware might not run models as fast as today's
              | hardware."
             | 
             | That's fine. The point is that yesterday's hardware is
             | quite capable of running yesterday's models, and obviously
             | it will also run tomorrow's models.
             | 
             | So the question is cost. Capex vs opex. The fact is that
             | buying your own hardware is proven to be far more cost-
             | effective than paying cloud providers to rent some cycles.
             | 
             | I brought data to the discussion: for the price tag of OP's
             | home lab, you only afford around 3 months worth of an
             | equivalent EC2 instance. What's your counter argument?
        
               | kelnos wrote:
               | Not the GP, but my take on this:
               | 
               | You're right about the cost question, but I think the
               | added dimension that people are worried about is the
               | current pace of change.
               | 
               | To abuse the idiom a bit, yesterday's hardware should be
               | able to run tomorrow's models, as you say, but it might
               | not be able to run next month's models (acceptably or at
               | all).
               | 
               | Fast-forward some number of years, as the pace slows.
               | Then-yesterday's hardware might still be able to run
               | next-next year's models acceptably, and someone might
               | find that hardware to be a better, safer, longer-term
               | investment.
               | 
               | I think of this similarly to how the pace of mobile phone
               | development has changed over time. In 2010 it was
               | somewhat reasonable to want to upgrade your smartphone
               | every two years or so: every year the newer flagship
               | models were actually significantly faster than the
               | previous year, and you could tell that the new OS
               | versions would run slower on your not-quite-new-anymore
               | phone, and even some apps might not perform as well. But
               | today in 2025? I expect to have my current phone for 6-7
               | years (as long as Google keeps releasing updates for it)
               | before upgrading. LLM development over time may follow at
               | least a superficially similar curve.
               | 
               | Regarding the equivalent EC2 instance, I'm not comparing
               | it to the cost of a homelab, I'm comparing it to the cost
               | of an Anthropic Pro or Max subscription. I can't justify
               | the cost of a homelab (the capex, plus the opex of
               | electricity, which is expensive where I live), when in a
               | year that hardware might be showing its age, and in two
               | years might not meet my (future) needs. And if I can't
               | justify spending the homelab cost every two years, I
               | certainly can't justify spending that same amount in 3
               | months for EC2.
        
               | motorest wrote:
               | > Fast-forward some number of years (...)
               | 
               | I repeat: OP's home server costs as much as a few months
               | of a cloud provider's infrastructure.
               | 
               | To put it another way, OP can buy brand new hardware a
               | few times per year and still save money compared with
               | paying a cloud provider for equivalent hardware.
               | 
               | > Regarding the equivalent EC2 instance, I'm not
               | comparing it to the cost of a homelab, I'm comparing it
               | to the cost of an Anthropic Pro or Max subscription.
               | 
               | OP stated quite clearly their goal was to run models
               | locally.
        
               | ac29 wrote:
               | > OP stated quite clearly their goal was to run models
               | locally.
               | 
                | Fair, but at the point where you trust Amazon to host
                | your "local" LLM, it's not a huge reach to just use
                | Amazon Bedrock or something.
        
               | tcdent wrote:
               | I incorporated the quantization aspect because it's not
               | that simple.
               | 
               | Yes, old hardware will be slower, but you will also need
               | a significant amount more of it to even operate.
               | 
               | RAM is the expensive part. You need lots of it. You need
               | even more of it for older hardware which has less
               | efficient float implementations.
               | 
                | https://developer.nvidia.com/blog/floating-point-8-an-introd...
        
         | Aurornis wrote:
         | I think the local LLM scene is very fun and I enjoy following
         | what people do.
         | 
         | However every time I run local models on my MacBook Pro with a
         | ton of RAM, I'm reminded of the gap between local hosted models
         | and the frontier models that I can get for $20/month or nominal
         | price per token from different providers. The difference in
         | speed and quality is massive.
         | 
         | The current local models are very impressive, but they're still
         | a big step behind the SaaS frontier models. I feel like the
         | benchmark charts don't capture this gap well, presumably
         | because the models are trained to perform well on those
         | benchmarks.
         | 
         | I already find the frontier models from OpenAI and Anthropic to
         | be slow and frequently error prone, so dropping speed and
         | quality even further isn't attractive.
         | 
         | I agree that it's fun as a hobby or for people who can't or
         | won't take any privacy risks. For me, I'd rather wait and see
         | what an M5 or M6 MacBook Pro with 128GB of RAM can do before I
         | start trying to put together another dedicated purchase for
         | LLMs.
        
           | Uehreka wrote:
           | I was talking about this in another comment, and I think the
           | big issue at the moment is that a lot of the local models
           | seem to really struggle with tool calling. Like, just
           | straight up can't do it even though they're advertised as
           | being able to. Most of the models I've tried with Goose
           | (models which say they can do tool calls) will respond to my
           | questions about a codebase with "I don't have any ability to
           | read files, sorry!"
           | 
           | So that's a real brick wall for a lot of people. It doesn't
           | matter how smart a local model is if it can't put that
           | smartness to work because it can't touch anything. The
           | difference between manually copy/pasting code from LM Studio
           | and having an assistant that can read and respond to errors
           | in log files is light years. So until this situation changes,
           | this asterisk needs to be mentioned every time someone says
           | "You can run coding models on a MacBook!"
        
             | jauntywundrkind wrote:
              | Agreed that this is a huge limit. There are actually a
              | lot of examples of "tool calling", but it's all bespoke
              | code-it-yourself: very few of these systems have MCP
              | integration.
              | 
              | I have a ton of respect for SGLang as a runtime. I'm
              | hoping something can be done there:
              | https://github.com/sgl-project/sglang/discussions/4461
              | As noted in that thread, it is _really_ great that
              | Qwen3-Coder has a tool-parser built-in: hopefully it can
              | be some kind of useful reference/start.
              | https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/b...
        
             | mxmlnkn wrote:
             | This resonates. I have finally started looking into local
             | inference a bit more recently.
             | 
             | I have tried Cursor a bit, and whatever it used worked
             | somewhat alright to generate a starting point for a feature
             | and for a large refactor and break through writer's blocks.
             | It was fun to see it behave similarly to my workflow by
             | creating step-by-step plans before doing work, then
             | searching for functions to look for locations and change
             | stuff. I feel like one could learn structured thinking
             | approaches from looking at these agentic AI logs. There
             | were lots of issues with both of these tasks, though, e.g.,
             | many missed locations for the refactor and spuriously
             | deleted or indented code, but it was a starting point and
             | somewhat workable with git. The refactoring usage caused me
             | to reach free token limits in only two days. Based on the
             | usage, it used millions of tokens in minutes, only rarely
             | less than 100K tokens per request, and therefore probably
             | needs a similarly large context length for best
             | performance.
             | 
             | I wanted to replicate this with VSCodium and Cline or
             | Continue because I want to use it without exfiltrating all
             | my data to megacorps as payment and use it to work on non-
             | open-source projects, and maybe even use it offline. Having
             | Cursor start indexing everything, including possibly
             | private data, in the project folder as soon as it starts,
             | left a bad taste, as useful as it is. But, I quickly ran
             | into context length problems with Cline, and Continue does
             | not seem to work very well. Some models did not work at
             | all, DeepSeek was thinking for hours in loops (default
             | temperature too high, should supposedly be <0.5). And even
             | after getting tool use to work somewhat with qwen qwq 32B
             | Q4, it feels like it does not have a full view of the
             | codebase, even though it has been indexed. For one refactor
             | request mentioning names from the project, it started by
             | doing useless web searches. It might also be a context
             | length issue. But larger contexts really eat up memory.
             | 
             | I am also contemplating a new system for local AI, but it
             | is really hard to decide. You have the choice between fast
             | GPU inference, e.g., RTX 5090 if you have money, or 1-2
             | used RTX 3090, or slow, but qualitatively better CPU /
             | unified memory integrated GPU inference with systems such
             | as the DGX Spark, the Framework Desktop AMD Ryzen AI Max,
              | or the Mac Pro systems. Neither is ideal (nor cheap).
             | Although my problems with context length and low-performing
             | agentic models seem to indicate that going for the slower
             | but more helpful models on a large unified memory seems to
             | be better for my use case. My use case would mostly be
             | agentic coding. Code completion does not seem to fit me
             | because I find it distracting, and I don't require much
             | boilerplating.
             | 
             | It also feels like the GPU is wasted, and local inference
             | might be a red herring altogether. Looking at how a batch
             | size of 1 is one of the worst cases for GPU computation and
             | how it would only be used in bursts, any cloud solution
             | will be easily an order of magnitude or two more efficient
             | because of these, if I understand this correctly. Maybe
             | local inference will therefore never fully take off,
             | barring even more specialized hardware or hard requirements
             | on privacy, e.g., for companies. To solve that, it would
             | take something like computing on encrypted data, which
             | seems impossible.
             | 
             | Then again, if the batch size of 1 is indeed so bad as I
             | think it to be, then maybe simply generate a batch of
             | results in parallel and choose the best of the answers?
             | Maybe this is not a thing because it would increase memory
             | usage even more.
        
             | com2kid wrote:
             | > Like, just straight up can't do it even though they're
             | advertised as being able to. Most of the models I've tried
             | with Goose (models which say they can do tool calls) will
             | respond to my questions about a codebase with "I don't have
             | any ability to read files, sorry!"
             | 
              | I'm working on solving this problem in two steps. The
              | first is a library, prefilled-json, that lets small
              | models properly fill out JSON objects.
             | library called Ultra Small Tool Call that presents tools in
             | a way that small models can understand, and basically walks
             | the model through filling out the tool call with the help
             | of prefilled-json. It'll combine a number of techniques,
             | including tool call RAG (pulls in tool definitions using
             | RAG) and, honestly, just not throwing entire JSON schemas
             | at the model but instead using context engineering to keep
             | the model focused.
             | 
             | IMHO the better solution for local on device workflows
             | would be if someone trained a custom small parameter model
             | that just determined if a tool call was needed and if so
             | which tool.
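              | 
              | A toy illustration of the "walk the model through it"
              | idea (this is not the prefilled-json API, just a
              | hypothetical sketch; `complete` stands in for any local
              | completion call):
              | 
              |     TOOLS = {
              |         "read_file": ["path"],
              |         "search_web": ["query"],
              |     }
              | 
              |     def call_tool(user_msg, complete):
              |         # Step 1: a forced one-word decision on whether a
              |         # tool is needed at all.
              |         need = complete(
              |             f"Task: {user_msg}\n"
              |             "Answer only 'yes' or 'no': is a tool needed?")
              |         if not need.strip().lower().startswith("yes"):
              |             return None
              |         # Step 2: pick a tool name from a short fixed menu.
              |         name = complete(
              |             f"Tools: {', '.join(TOOLS)}\n"
              |             f"Task: {user_msg}\nBest tool name:").strip()
              |         if name not in TOOLS:
              |             return None
              |         # Step 3: fill arguments one at a time, prefixing
              |         # the JSON ourselves so the model only ever emits
              |         # a short string value.
              |         args = {}
              |         for field in TOOLS[name]:
              |             args[field] = complete(
              |                 f'Task: {user_msg}\nComplete the value:\n'
              |                 f'{{"tool": "{name}", "{field}": "'
              |             ).split('"')[0]
              |         return {"tool": name, "arguments": args}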
        
           | 1oooqooq wrote:
            | More interesting is the extent to which Apple convinced
            | people a laptop can replace a desktop or server. Mind-
            | blowing reality distortion field (as will be proven by some
            | twenty comments telling me I'm wrong in 3... 2... 1).
        
             | bionsystem wrote:
             | I'm a desktop guy, considering the switch to a laptop-only
             | setup, what would I miss ?
        
               | kelipso wrote:
               | For $10k, you too can get the power of a $2k desktop, and
               | enjoy burning your lap everyday, or something like that.
               | If I were to do local compute and wanted to use my
               | laptop, I would only consider a setup where I ssh in to
               | my desktop. So I guess only difference from saas llm
               | would be privacy and the cool factor. And rate limits,
               | and paying more if you go over, etc.
        
               | com2kid wrote:
                | $2k laptops nowadays come with 16 cores. They are
                | thermally limited, but they are going to get you 60-80%
                | of the perf of their desktop counterparts.
               | 
               | The real limit is on the Nvidia cards. They are cut down
               | a fair bit, often with less VRAM until you really go up
               | in price point.
               | 
               | They also come with NPUs but the docs are bad and none of
               | the local LLM inference engines seem to use the NPU, even
               | though they could in theory be happy running smaller
               | models.
        
               | baobun wrote:
               | Upgradability, repairability, thermals (translating into
               | widely different performance for the same specs), I/O,
               | connectivity.
        
           | jauntywundrkind wrote:
           | I agree and disagree. Many of the best models are open
           | source, just _too big_ to run for most people.
           | 
            | And there are plenty of ways to fit these models! A Mac
            | Studio M3 Ultra with 512GB of unified memory has huge
            | capacity and a decent chunk of bandwidth (800GB/s; compare
            | vs a 5090's ~1800GB/s). $10k is a lot of money, but that
            | ability to fit these very large models & get quality
            | results is very impressive. Performance is even lower, but
            | a single AMD Turin chip with its 12 channels of DDR5-6000
            | can get you to almost 600GB/s: a 12x 64GB (768GB) build is
            | gonna be $4000+ in RAM costs, plus $4800 for, for example,
            | a 48-core Turin to go with it. (But if you go to older
            | generations, affordability goes way up! Special part, but
            | the 48-core 7R13 is <$1000.)
           | 
           | Still, those costs come to $5000 at the low end. And come
           | with much less token/s. The "grid compute" "utility compute"
           | "cloud compute" model of getting work done on a hot gpu with
           | a model already on it by someone else is very very direct &
           | clear. And are very big investments. It's just not likely any
           | of us will have anything but burst demands for GPUs, so
           | structurally it makes sense. But it really feels like there's
           | only small things getting in the way of running big models at
           | home!
           | 
           | Strix Halo is kind of close. 96GB usable memory isn't quite
           | enough to really do the thing though (and only 256GB/s). Even
           | if/when they put the new 64GB DDR5 onto the platform (for
           | 256GB, lets say 224 usable), one still has to sacrifice
           | quality some to fit 400B+ models. Next gen Medusa Halo is not
           | coming for a while, but goes from 4->6 channels, so 384GB
           | total: not bad.
           | 
            | (It sucks that PCIe is so slow. PCIe 5.0 is only 64GB/s in
            | one direction. Compared to the need here, it's nowhere near
            | enough to have a big-memory host and a smaller-memory GPU.)
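            | 
            | The bandwidth figures above are just channel math (peak
            | theoretical numbers; sustained bandwidth is lower):
            | 
            |     # Peak bandwidth = channels * transfer rate * bytes per
            |     # 64-bit channel.
            |     def mem_bw_gbs(channels, mt_per_s, bus_bytes=8):
            |         return channels * mt_per_s * bus_bytes / 1e3
            | 
            |     print(mem_bw_gbs(12, 6000))  # 12ch DDR5-6000 -> 576
            |     print(mem_bw_gbs(16, 6400))  # 1024-bit LPDDR5x -> ~819,
            |                                  # roughly the M3 Ultra's
            |                                  # quoted ~800GB/s
            |     # PCIe 5.0 x16 is 32 GT/s * 16 lanes / 8 bits ~= 64 GB/s
            |     # per direction (before encoding overhead), hence the
            |     # pain of splitting a big model across host and GPU.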
        
             | jstummbillig wrote:
             | > Many of the best models are open source, just too big to
             | run for most people
             | 
             | I don't think that's a likely future, when you consider all
             | the big players doing enormous infrastructure projects and
             | the money that this increasingly demands. Powerful LLMs are
             | simply not a great open source candidate. The models are
             | not a by-product of the bigger thing you do. They are the
             | bigger thing. Open sourcing a LLM means you are essentially
             | investing money to just give it away. That simply does not
             | make a lot of sense from a business perspective. You can do
             | that in a limited fashion for a limited time, for example
             | when you are scaling or it's not really your core business
             | and you just write it off as expenses, while you try to
             | figure yet another thing out (looking at you Meta).
             | 
             | But with the current paradigm, one thing seems to be very
              | clear: building and running ever-bigger LLMs is a money-
              | burning machine the likes of which we have rarely if ever
              | seen, and operating that machine at a loss will make you
              | run out of any amount of money really, really fast.
        
             | esseph wrote:
              | https://pcisig.com/pci-sig-announces-pcie-80-specification-t...
             | 
             | From 2003-2016, 13 years, we had PCIE 1,2,3.
             | 
             | 2017 - PCIE 4.0
             | 
             | 2019 - PCIE 5.0
             | 
             | 2022 - PCIE 6.0
             | 
             | 2025 - PCIE 7.0
             | 
             | 2028 - PCIE 8.0
             | 
             | Manufacturing and vendors are having a hard time keeping
             | up. And the PCIE 5.0 memory is.. not always the most
             | stable.
        
               | dcrazy wrote:
               | Are you conflating GDDR5x with PCIe 5.0?
        
               | esseph wrote:
               | No.
               | 
               | I'm saying we're due for faster memory but seem to be
               | having trouble scaling bus speeds as well (in production)
               | and reliable memory. And the network is changing a lot,
               | too.
               | 
               | It's a neverending cycle I guess.
        
             | Aurornis wrote:
             | > Many of the best models are open source, just too big to
             | run for most people.
             | 
             | You can find all of the open models hosted across different
             | providers. You can pay per token to try them out.
             | 
             | I just don't see the open models as being at the same
             | quality level as the best from Anthropic and OpenAI.
              | They're _good_, but in my experience they're not as good
              | as the benchmarks would suggest.
             | 
             | > $10k is a lot of money, but that ability to fit these
             | very large models & get quality results is very impressive.
             | 
             | This is why I only appreciate the local LLM scene from a
             | distance.
             | 
             | It's really cool that this can be done, but $10K to run
             | lower quality models at slower speeds is a hard sell. I can
             | rent a lot of hours on an on-demand cloud server for a lot
             | less than that price or I can pay $20-$200/month and get
             | great performance and good quality from Anthropic.
             | 
             | I think the local LLM scene is fun where it intersects with
             | hardware I would buy anyway (MacBook Pro with a lot of RAM)
             | but spending $10K to run open models locally is a very
             | expensive hobby.
        
         | kelnos wrote:
         | > _I expect this will change in the future_
         | 
         | I'm really hoping for that too. As I've started to adopt Claude
         | Code more and more into my workflow, I don't want to depend on
         | a company for day-to-day coding tasks. I don't want to have to
         | worry about rate limits or API spend, or having to put up
         | $100-$200/mo for this. I don't want everything I do to be
         | potentially monitored or mined by the AI company I use.
         | 
         | To me, this is very similar to why all of the smart-home stuff
         | I've purchased all must have local control, and why I run my
         | own smart-home software, and self-host the bits that let me
         | access it from outside my home. I don't want any of this or
         | that tied to some company that could disappear tomorrow, jack
         | up their pricing, or sell my data to third parties. Or even use
         | my data for their own purposes.
         | 
         | But yeah, I can't see myself trying to set any LLMs up for my
         | own use right now, either on hardware I own, or in a VPS I
         | manage myself. The cost is very high (I'm only paying Anthropic
         | $20/mo right now, and I'm very happy with what I get for that
         | price), and it's just too fiddly and requires too much
         | knowledge to set up and maintain, knowledge that I'm not all
         | that interested in acquiring right now. Some people enjoy doing
         | that, but that's not me. And the current open models and
         | tooling around them just don't seem to be in the same class as
         | what you can get from Anthropic et al.
         | 
         | But yes, I hope and expect this will change!
        
         | cyanydeez wrote:
          | Anything you build in the LLM cloud will be, must be, rug-
          | pulled: either via lock-in success, utter bankruptcy, or
          | just a model context prompt change.
          | 
          | Unless you're a billionaire with pull, you're building tools
          | you can't control, can't own, and that are ephemeral wisps.
          | 
          | That's even if you can trust these large models to be
          | consistent.
        
         | ActorNightly wrote:
         | >but when you factor in the performance of the models you have
         | access to, and the cost of running them on-demand in a cloud,
         | it's really just a fun hobby instead of a viable strategy to
         | benefit your life.
         | 
          | It's because people are thinking too linearly about this,
          | equating model size with usability.
          | 
          | Without going into too much detail, because this may be a
          | viable business plan for me: I have had very good success
          | with a Gemma QAT model that runs quite well on a 3090,
          | wrapped up in a very custom agent format that goes beyond
          | simple prompt->response use. It can do things that even the
          | full-size large language models fail to do.
        
         | alliao wrote:
          | It really depends on whether a local model satisfies your
          | own usage, right? If it works well enough locally, just
          | package it up and be content? As long as it's providing
          | value now, at least it's local...
        
       | sneak wrote:
       | Halfway through he gives up and uses remote models. The basic
       | premise here is false.
       | 
       | Also, the term "remote code execution" in the beginning is
       | misused. Ironically, remote code execution refers to execution of
       | code locally - by a remote attacker. Claude Code does in fact
       | have that, but I'm not sure if that's what they're referring to.
        
         | thepoet wrote:
          | The blog is more about keeping the user's data private; the
          | remote models in that context are operating blind. I am not
          | sure why you are nitpicking; almost nobody reading the blog
          | would read "remote code execution" that way.
        
           | vunderba wrote:
           | The MCP aspect (for code/tool execution) is completely
           | orthogonal to the issue of data privacy.
           | 
            | If you put a remote LLM in the chain then it is 100% going
            | to inadvertently send user data up to them at some point.
           | 
           | e.g. if I attach a PDF to my context that contains private
           | data, it _WILL_ be sent to the LLM. I have no idea what
           | "operating blind" means in this context. Connecting to a
           | remote LLM means your outgoing requests are tied to a
           | specific authenticated API key.
        
       | mark_l_watson wrote:
        | That is fairly cool. I was talking about this on X yesterday.
        | Another angle, however: I use a local web scraper and search
        | engine (via Meilisearch) for the main tech web sites I am
        | interested in. For my personal research I use three web search
        | APIs, but there is some latency. Having a big chunk of the web
        | that I am interested in available locally with close to zero
        | latency is nice when running local models, my own MCP services
        | that might need web search, etc.
        
       | luke14free wrote:
        | You might want to check out what we built -> https://inference.sh
        | It supports most major open-source/open-weight models: Wan 2.2
        | video, Qwen Image, Flux, most LLMs, Hunyuan 3D, etc. It works
        | in a containerized way locally by letting you bring your own
        | GPU as an engine (fully free), or lets you rent a remote GPU /
        | pool from a common cloud in case you want to run more complex
        | models. For each model we tried to add quantized/GGUF versions
        | so that even Wan 2.2 / Qwen Image / Gemma become possible to
        | run with as little as 8GB of VRAM. MCP support is coming soon
        | in our chat interface so it can access other apps from the
        | ecosystem.
        
       | rshemet wrote:
       | if you ever end up trying to take this in the mobile direction,
       | consider running on-device AI with Cactus -
       | 
       | https://cactuscompute.com/
       | 
        | Blazing-fast, cross-platform, and supports nearly all recent
        | open-source models.
        
       | xt00 wrote:
        | Yeah, in an ideal world there would be a legal construct around
        | AI agents in the cloud doing something on your behalf that
        | could not be blocked by various stakeholders deciding they
        | don't like the thing you are doing, even if it's totally legal.
        | Things that would be considered fair use, or merely annoying to
        | certain companies, should not be easy for companies to
        | wholesale block by leveraging business relationships. Barring
        | that, then yeah, a local AI setup is the way to go.
        
       | sabareesh wrote:
       | Here is my rig, running GLM 4.5 Air. Very impressed by this model
       | 
       | https://sabareesh.com/posts/llm-rig/
       | 
       | https://huggingface.co/zai-org/GLM-4.5
        
       | mkummer wrote:
       | Super cool and well thought out!
       | 
       | I'm working on something similar focused on being able to easily
       | jump between the two (cloud and fully local) using a Bring Your
       | Own [API] Key model - all data/config/settings/prompts are fully
       | stored locally and provider API calls are routed directly (never
       | pass through our servers). Currently using mlc-llm for models &
       | inference fully local in the browser (Qwen3-1.7b has been working
       | great)
       | 
       | [1] https://hypersonic.chat/
        
       | Imustaskforhelp wrote:
        | I think I still prefer local, but I feel like that's because
        | most AI inference is kinda slow or comparable to local. I
        | recently tried out Cerebras (and I have heard about Groq too),
        | and honestly, once you try things at 1000 tk/s or similar, your
        | mental model really shifts and you become quite impatient.
        | Cerebras does say that they don't log your data or anything in
        | general, and you would have to trust me when I say I am not
        | sponsored by them (wish I was, though). It's just that they are
        | kinda nice.
        | 
        | But I still hope that we can someday have some meaningful
        | improvements in speed locally too. Diffusion models seem to be
        | really fast architecturally.
        
       | retrocog wrote:
        | It's all about context and purpose, isn't it? For certain
        | lightweight use cases, especially those concerning sensitive
        | user data, a local implementation may make a lot of sense.
        
       | kaindume wrote:
        | Self-hosted and offline AI systems would be great for privacy,
        | but the hardware and electricity costs are much too high for
        | most users. I am hoping for a P2P decentralized solution that
        | runs on distributed hardware not controlled by a single
        | corporation.
        
         | user3939382 wrote:
         | I'd settle for homomorphic encryption but that's a long way off
         | if ever
        
       | woadwarrior01 wrote:
       | > LLMs: Ollama for local models (also private models for now)
       | 
        | Incidentally, I decided to try the Ollama macOS app yesterday,
        | and the first thing it tries to do upon launch is connect to
        | some Google domain. Not very private.
       | 
       | https://imgur.com/a/7wVHnBA
        
         | Aurornis wrote:
         | Automatic update checks
         | https://github.com/ollama/ollama/blob/main/docs/faq.md
        
         | abtinf wrote:
          | Yep, and I've noticed the same thing in VS Code with both
          | the Cline plugin and the Copilot plugin.
          | 
          | I configure them both to use local Ollama, block their
          | outbound connections via Little Snitch, and they just flat
          | out don't work without the ability to phone home or talk to
          | PostHog.
         | 
         | Super disappointing that Cline tries to do so much outbound
         | comms, even after turning off telemetry in the settings.
        
         | eric-burel wrote:
          | But it can be audited, which I'd buy every day. It's probably
          | not too hard to find network calls in a codebase if this task
          | must be automated on each update.
        
       | adsharma wrote:
       | https://github.com/adsharma/ask-me-anything
       | 
       | Supports MLX on Apple silicon. Electron app.
       | 
       | There is a CI to build downloadable binaries. Looking to make a
       | v0.1 release.
        
       | andylizf wrote:
       | This is fantastic work. The focus on a local, sandboxed execution
       | layer is a huge piece of the puzzle for a private AI workspace.
       | The `coderunner` tool looks incredibly useful.
       | 
       | A complementary challenge is the knowledge layer: making the AI
       | aware of your personal data (emails, notes, files) via RAG. As
       | soon as you try this on a large scale, storage becomes a massive
       | bottleneck. A vector database for years of emails can easily
       | exceed 50GB.
       | 
       | (Full disclosure: I'm part of the team at Berkeley that tackled
       | this). We built LEANN, a vector index that cuts storage by ~97%
       | by not storing the embeddings at all. It makes indexing your
       | entire digital life locally actually feasible.
       | 
       | Combining a local execution engine like this with a hyper-
       | efficient knowledge index like LEANN feels like the real path to
       | a true "local Jarvis."
       | 
       | Code: https://github.com/yichuan-w/LEANN
       | Paper: https://arxiv.org/abs/2405.08051
        
         | sebmellen wrote:
         | I know next to nothing about embeddings.
         | 
         | Are there projects that implement this same "pruned graph"
         | approach for cloud embeddings?
        
         | doctoboggan wrote:
         | > A vector database for years of emails can easily exceed 50GB.
         | 
         | In 2025 I would consider this a relatively meager requirement.
        
           | andylizf wrote:
           | Yeah, that's a fair point at first glance. 50GB might not
           | sound like a huge burden for a modern SSD.
           | 
           | However, the 50GB figure was just a starting point for
            | emails. A true "local Jarvis" would need to index
           | everything: all your code repositories, documents, notes, and
           | chat histories. That raw data can easily be hundreds of
           | gigabytes.
           | 
           | For a 200GB text corpus, a traditional vector index can swell
           | to >500GB. At that point, it's no longer a "meager"
           | requirement. It becomes a heavy "tax" on your primary drive,
           | which is often non-upgradable on modern laptops.
           | 
           | The goal for practical local AI shouldn't just be that it's
           | possible, but that it's also lightweight and sustainable.
           | That's the problem we focused on: making a comprehensive
           | local knowledge base feasible without forcing users to
           | dedicate half their SSD to a single index.
        
         | oblio wrote:
         | It feels weird that the search index is bigger than the
         | underlying data. Weren't search indexes supposed to be
         | efficient formats that give fast access to the underlying
         | data?
        
           | andylizf wrote:
           | Exactly. That's because instead of just mapping keywords,
           | vector search stores the meaning of the text as dense,
           | high-dimensional vectors, which are massive data
           | structures. LEANN is our answer to that paradoxical
           | inefficiency.
        
           | yichuan wrote:
           | I guess for semantic search (rather than keyword search),
           | the index is larger than the text because we need to embed
           | it into a huge semantic space, which makes sense to me.
        
         | wfn wrote:
         | Thank you for the pointer to LEANN! I've been experimenting
         | with RAGs and missed this one.
         | 
         | I am particularly excited about using RAG as the knowledge
         | layer for LLM agents/pipelines/execution engines _to make it
         | feasible for LLMs to work with large codebases_. It seems like
         | the current solution is _already_ worth a try. It really makes
         | it easier that your RAG solution already has Claude Code
         | integration![1]
         | 
         | Has anyone tried the above challenge (RAG + some LLM for
         | working with large codebases)? I'm very curious how it goes
         | (thinking it may require some careful system-prompting to
         | push the agent to make heavy use of the RAG index/graph/KB,
         | but that is fine).
         | 
         | I think I'll give it a try later (using a cloud frontier
         | model for the LLM though, for now...).
         | 
         | [1]:
         | https://github.com/yichuan-w/LEANN/blob/main/packages/leann-...
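         |
         | The loop I have in mind is roughly the following (search()
         | and ask_llm() are placeholders for whatever index and model
         | you wire in, not LEANN's actual API):
         |
         |     def answer_about_codebase(question, search, ask_llm, k=8):
         |         # search(question, k) -> top-k code/doc chunks
         |         # ask_llm(prompt)     -> completion from a local or
         |         #                        cloud model
         |         chunks = search(question, k)
         |         context = "\n\n".join(
         |             f"[{i+1}] {c}" for i, c in enumerate(chunks)
         |         )
         |         prompt = (
         |             "Answer using only the retrieved code context "
         |             "below and cite the chunk numbers you relied on."
         |             f"\n\n{context}\n\nQuestion: {question}"
         |         )
         |         return ask_llm(prompt)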
        
       | com2kid wrote:
       | > Even with help from the "world's best" LLMs, things didn't go
       | quite as smoothly as we had expected. They hallucinated steps,
       | missed platform-specific quirks, and often left us worse off.
       | 
       | This shows how little native app training data is even available.
       | 
       | People rarely write blog posts about designing native apps,
       | long-winded Medium tutorials don't exist, and heck, even the
       | number of open-source projects for native desktop apps is a
       | small fraction of what exists for mobile and web apps.
       | 
       | Historically, Microsoft paid some of the best technical writers
       | in the world to write amazing books on how to code for Windows
       | (see: Charles Petzold), but nowadays that entire industry is
       | almost dead.
       | 
       | These types of holes in training data are going to be a larger
       | and larger problem.
       | 
       | Although this is just representative of software engineering in
       | general: few people want to write native desktop apps because
       | it is a career dead end. Back in the 90s, knowing how to write
       | Windows desktop apps was _great_; it practically promised a
       | middle-class lifestyle, with a pretty large barrier to entry
       | (C/C++ programming was hard, and the Windows APIs were not easy
       | to learn, even though MS dumped tons of money into training
       | programs), but things have changed a lot. Outside of the OS
       | vendors themselves (Microsoft, Apple) and a few legacy app
       | teams (Adobe, Autodesk, etc.), very few jobs exist for writing
       | desktop apps.
        
         | thorncorona wrote:
         | I mean outside of HPC why would you when the browser is the
         | world's most ubiquitous VM?
        
       | gen2brain wrote:
       | People are talking about AI everywhere, but where can we find
       | documentation, examples, and proof of how it works? It all ends
       | with chat: which chat is better and cheaper? This "local" story
       | is just using some publicly available model, only downloaded.
       | When is this going to stop?
        
       | bling1 wrote:
       | On a similar vibe, we developed app.czero.cc to run an LLM
       | inside your Chrome browser on your machine's hardware without
       | installation (you do have to download the models). It's hard to
       | run big models, but it doesn't get more local than that without
       | having to install anything.
        
       | btbuildem wrote:
       | I didn't see any mention of the hardware OP is planning to run
       | this on -- any hints?
        
       | vunderba wrote:
       | Infra notwithstanding - I'd be interested in hearing how much
       | success they actually had using a locally hosted MCP-capable LLM
       | (and which ones in particular) because the E2E tests in the
       | article seem to be against remote models like Claude.
        
       | ruler88 wrote:
       | At least you won't be needing a heater for the winter
        
       | mikeyanderson1 wrote:
       | We have this in closed alpha right now and are getting ready to
       | roll out to our most active builders in the coming weeks at
       | ThinkAgents.ai.
        
       | LastTrain wrote:
       | I get it but I can't get over the irony that you are using a tool
       | that only works precisely because people don't do this.
        
       | eric-burel wrote:
       | An LLM on your computer is a fun hobby; an LLM in your SME of
       | 10 people is a business idea. There are not enough resources on
       | this topic at all, and the need is growing extremely fast.
       | Local LLMs are needed for many use cases and businesses where
       | the cloud is not an option.
        
       | nenadg wrote:
       | Did this by running models in a chroot.
        
       ___________________________________________________________________
       (page generated 2025-08-08 23:00 UTC)