[HN Gopher] I want everything local - Building my offline AI wor...
___________________________________________________________________
I want everything local - Building my offline AI workspace
Author : mkagenius
Score : 340 points
Date : 2025-08-08 18:19 UTC (4 hours ago)
(HTM) web link (instavm.io)
(TXT) w3m dump (instavm.io)
| shaky wrote:
| This is something that I think about quite a bit and am grateful
| for this write-up. The amount of friction to get privacy today is
| astounding.
| sneak wrote:
| This writeup has nothing of the sort and is not helpful toward
| that goal.
| frank_nitti wrote:
| I'd assume they are referring to being able to run your own
| workloads on a home-built system, rather than surrendering
| that ownership to the tech giants alone
| Imustaskforhelp wrote:
| Also you get a sort of complete privacy, in that the data
| never leaves your home, whereas at best you would have to
| trust the AI cloud providers that they are not training on or
| storing that data.
|
| It's just more freedom and privacy in that regard.
| doctorpangloss wrote:
| The entire stack involved sends so much telemetry.
| frank_nitti wrote:
| This, in particular, is a big motivator and rewarding
| factor in getting local setup and working. Turning off
| the internet and seeing everything run end to end is a
| joy
| noelwelsh wrote:
| It's the hardware more than the software that is the limiting
| factor at the moment, no? Hardware to run a good LLM locally
| starts around $2000 (e.g. Strix Halo / AI Max 395). I think a
| few Strix Halo iterations will make it considerably easier.
| colecut wrote:
| This is rapidly improving
|
| https://simonwillison.net/2025/Jul/29/space-invaders/
| Imustaskforhelp wrote:
| I hope it keeps improving at such a steady rate! Let's just
| hope there is still room to pack even more improvements into
| such LLMs, which would help the home-labbing community in
| general.
| ramesh31 wrote:
| >Hardware to run a good LLM locally starts around $2000 (e.g.
| Strix Halo / AI Max 395) I think a few Strix Halo iterations
| will make it considerably easier.
|
| And "good" is still questionable. The thing that makes this
| stuff useful is when it works instantly like magic. Once you
| find yourself fiddling around with subpar results at slower
| speeds, essentially all of the value is gone. Local models have
| come a long way but there is still nothing even close to Claude
| levels when it comes to coding. I just tried taking the latest
| Qwen and GLM models for a spin through OpenRouter with Cline
| recently and they feel roughly on par with Claude 3.0.
| Benchmarks are one thing, but reality is a completely different
| story.
| ahmedbaracat wrote:
| Thanks for sharing. Note that the GitHub link at the end of the
| article is not working...
| mkagenius wrote:
| Thanks for the heads up. It's fixed now -
|
| Coderunner-UI: https://github.com/instavm/coderunner-ui
|
| Coderunner: https://github.com/instavm/coderunner
| navbaker wrote:
| Open Web UI is a great alternative for a chat interface. You can
| point it at an OpenAI-compatible API like vLLM or use the native
| Ollama integration, and it has cool features like being able to
| say something like "generate code for an HTML and JavaScript
| pong game" and have it display the running code inline with the
| chat for testing.
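|
| A minimal sketch of that local-endpoint setup, assuming
| Ollama's OpenAI-compatible endpoint on localhost:11434 (vLLM
| works the same way on its own port) and a model tag you have
| already pulled; the model name below is just a placeholder:
|
|   # any OpenAI-compatible client can target a local server
|   from openai import OpenAI
|
|   client = OpenAI(
|       base_url="http://localhost:11434/v1",  # local endpoint
|       api_key="unused-locally",  # SDK needs a value; ignored
|   )
|   resp = client.chat.completions.create(
|       model="qwen2.5-coder:7b",  # placeholder model tag
|       messages=[{"role": "user",
|                  "content": "Generate an HTML+JS pong game."}],
|   )
|   print(resp.choices[0].message.content)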
| dmezzetti wrote:
| I built TxtAI with this philosophy in mind:
| https://github.com/neuml/txtai
| pyman wrote:
| Mr Stallman? Richard, is that you?
| tcdent wrote:
| I'm constantly tempted by the idealism of this experience, but
| when you factor in the performance of the models you have access
| to, and the cost of running them on-demand in a cloud, it's
| really just a fun hobby instead of a viable strategy to benefit
| your life.
|
| As the hardware continues to iterate at a rapid pace, anything
| you pick up second-hand will still depreciate at that pace,
| making any real investment in hardware unjustifiable.
|
| Coupled with the dramatically inferior performance of the weights
| you would be running in a local environment, it's just not worth
| it.
|
| I expect this will change in the future, and am excited to invest
| in a local inference stack when the weights become available.
| Until then, you're idling a relatively expensive, rapidly
| depreciating asset.
| braooo wrote:
| Running LLMs at home is a repeat of the mess we make with "run
| a K8s cluster at home" thinking
|
| You're not OpenAI or Google. Just use pytorch, opencv, etc to
| build the small models you need.
|
| You don't even need Docker! You can share it with friends over
| a simple code-based HTTP router app and pre-shared certs.
|
| You're recreating the patterns required to manage a massive
| data center in 2-3 computers in your closet. That's insane.
| frank_nitti wrote:
| For me, this is essential. On principle, I won't pay money to
| be a software engineer.
|
| I never paid for cloud infrastructure out of pocket, but
| still became the go-to person and achieved lead architecture
| roles for cloud systems, because learning the FOSS/local
| tooling "the hard way" put me in a better position to
| understand what exactly my corporate employers can leverage
| with the big cash they pay the CSPs.
|
| The same is shaping up in this space. Learning the nuts and
| bolts of wiring systems together locally with whatever Gen AI
| workloads it can support, and tinkering with parts of the
| process, is the only thing that can actually keep me
| interested and able to excel on this front relative to my
| peers who just fork out their own money to the fat cats that
| own billions worth of compute.
|
| I'll continue to support efforts to keep us on the track of
| engineers still understanding and able to 'own' their
| technology from the ground up, if only at local tinkering
| scale
| jtbaker wrote:
| Self hosting my own LLM setup in the homelab was what
| really helped me learn the fundamentals of K8s. If nothing
| else I'm grateful for that!
| Imustaskforhelp wrote:
| So, I love Linux and would like to learn DevOps in its
| entirety one day, enough to be an expert who could properly
| comment on the whole post, but
|
| I feel like they actually used Docker just for the isolation
| part, as a sandbox (technically they didn't use Docker but
| something similar for Mac: Apple containers). I don't think it
| has anything to do with k8s or scalability or pre-shared certs
| or HTTP routers :/
| jeremyjh wrote:
| I expect it will never change. In two years if there is a local
| option as good as GPT-5 there will be a much better cloud
| option and you'll have the same tradeoffs to make.
| c-hendricks wrote:
| Why would AI be one of the few areas where locally-hosted
| options can't reach "good enough"?
| hombre_fatal wrote:
| For some use-cases, like making big complex changes to big
| complex important code or doing important research, you're
| pretty much always going to prefer the best model rather
| than leave intelligence on the table.
|
| For other use-cases, like translations or basic queries,
| there's a "good enough".
| kelnos wrote:
| That depends on what you value, though. If local control
| is that important to you for whatever reason (owning your
| own destiny, privacy, whatever), you might find that
| trade off acceptable.
|
| And I expect that over time the gap will narrow. Sure,
| it's likely that commercially-built LLMs will be a step
| ahead of the open models, but -- just to make up numbers
| -- say today the commercially-built ones are 50% better.
| I could see that narrowing to 5% or something like that,
| after some number of years have passed. Maybe 5% is a
| reasonable trade-off for some people to make, depending
| on what they care about.
|
| Also consider that OpenAI, Anthropic, et al. are all
| burning through VC money like nobody's business. That
| money isn't going to last forever. Maybe at some point
| Anthropic's Pro plan becomes $100/mo, and Max becomes
| $500-$1000/mo. Building and maintaining your own
| hardware, and settling for the not-quite-the-best models
| might be very much worth it.
| m11a wrote:
| Agree, for now.
|
| But the foundation models will eventually hit a limit,
| and the open-source ecosystem, which trails by around a
| year or two, will catch up.
| bbarnett wrote:
| I grew up in a time when listening to an MP3 was too
| computationally expensive and nigh impossible for the average
| desktop. Now tiny phones can decode high-def video in real
| time due to CPU extensions.
|
| And my phone uses a tiny, tiny amount of power,
| comparatively, to do so.
|
| CPU extensions and other improvements will make AI a simple,
| tiny task. Many of the improvements will come from robotics.
| oblio wrote:
| At a certain point Moore's Law died, and that point was
| about 20 years ago; fortunately for MP3s, it happened
| after MP3 became easily usable. There's no point in
| comparing anything before 2005 or so from that perspective.
|
| We have long entered an era where computing is becoming
| more expensive and power hungry, we're just lucky regular
| computer usage has largely plateaued at a level where the
| already obtained performance is good enough.
|
| But major leaps are a lot more costly these days.
| pfannkuchen wrote:
| It might change once the companies switch away from lighting
| VC money on fire mode and switch to profit maximizing mode.
|
| I remember Uber and AirBnB used to seem like unbelievably
| good deals, for example. That stopped eventually.
| jeremyjh wrote:
| This I could see.
| oblio wrote:
| AirBNB is so good that it's half the size of Booking.com
| these days.
|
| And Uber is still big but about 30% of the time in places I
| go to, in Europe, it's just another website/app to call
| local taxis from (medallion and all). And I'm fairly sure
| locals generally just use the website/app of the local
| company, directly, and Uber is just a frontend for
| foreigners unfamiliar with that.
| pfannkuchen wrote:
| Right, but if you wanted to start a competitor it would be
| a lot easier today vs back then. And running one for
| yourself doesn't really apply to those examples, but in
| terms of the order-of-magnitude difference in spend it's
| the same idea.
| duxup wrote:
| Maybe, but my phone has become a "good enough" computer
| for most tasks compared to a desktop or my laptop.
|
| Seems plausible the same goes for AI.
| kasey_junk wrote:
| I'd be surprised by that outcome. At one point databases were
| cutting edge tech, with each engine leapfrogging the others
| in capability. Still, the proprietary DBs often have features
| that aren't matched elsewhere.
|
| But the open db got good enough that you need to justify not
| using them with specific reasons why.
|
| That seems at least as likely an outcome for models as they
| continue to improve infinitely into the stars.
| victorbjorklund wrote:
| Next two years, probably. But at some point we will either hit
| scales where you really don't need anything better (let's say
| cloud is 10000 tokens/s and local is 5000 tokens/s; makes no
| difference for most individual users), or we will hit some
| wall where AI doesn't get smarter but the cost of hardware
| continues to fall.
| kvakerok wrote:
| What is even the point of having a self-hosted GPT-5
| equivalent that isn't backed by petabytes of knowledge?
| zwnow wrote:
| You know there's a ceiling to all this with the current LLM
| approaches, right? They won't become that much better; it's
| even more likely they will degrade. There are cases of bad
| actors attacking LLMs by feeding them false information and
| propaganda. I don't see this changing in the future.
| Aurornis wrote:
| There will always be something better on big data center
| hardware.
|
| However, small models are continuing to improve at the same
| time that large RAM capacity computing hardware is becoming
| cheaper. These two will eventually intersect at a point where
| local performance is good enough and fast enough.
| kingo55 wrote:
| If you've tried gpt-oss:120b and Moonshot AI's Kimi Dev, it
| feels like this is getting closer to reality. Mac Studios,
| while expensive, now offer 512GB of usable RAM as well. The
| tooling available for running local models is also more
| accessible than it was even a year ago.
| meta_ai_x wrote:
| This is especially true since AI is a large multiplicative
| factor on your productivity.
|
| If cloud LLMs have 10 more IQ points than a local LLM, within
| a month you'll notice you're struggling behind the dude who
| just used the cloud LLM.
|
| LocalLlama is for hobbies, or for when your job depends on
| running local models.
|
| This is not one-time upfront setup cost vs payoff later
| tradeoff. It is a tradeoff you are making every query which
| compounds pretty quickly.
|
| Edit : I expect nothing better than downvotes from this crowd.
| How HN has fallen on AI will be a case study for the ages
| bigyabai wrote:
| > anything you pick up second-hand will still depreciate at
| that pace
|
| Not really? The people who do local inference most (from what
| I've seen) are owners of Apple Silicon and Nvidia hardware.
| Apple Silicon has ~7 years of decent enough LLM support under
| its belt, and Nvidia is _only now_ starting to deprecate
| 11-year-old GPU hardware in its drivers.
|
| If you bought a decently powerful inference machine 3 or 5
| years ago, it's probably still plugging away with great tok/s.
| Maybe even faster inference because of MoE architectures or
| improvements in the backend.
| Aurornis wrote:
| > If you bought a decently powerful inference machine 3 or 5
| years ago, it's probably still plugging away with great
| tok/s.
|
| I think this is the difference between people who embrace
| hobby LLMs and people who don't:
|
| The token/s output speed on affordable local hardware for
| large models is not great for me. I already wish the cloud
| hosted solutions were several times faster. Any time I go to
| a local model it feels like I'm writing e-mails back and
| forth to an LLM, not working with it.
|
| And also, the first Apple M1 chip was released less than 5
| years ago, not 7.
| bigyabai wrote:
| > Any time I go to a local model it feels like I'm writing
| e-mails back and forth
|
| Do you have a good accelerator? If you're offloading to a
| powerful GPU it shouldn't feel like that at all. I've
| gotten ChatGPT speeds from a 4060 running the OSS 20B and
| Qwen3 30B models, both of which are competitive with
| OpenAI's last-gen models.
|
| > the first Apple M1 chip was released less than 5 years
| ago
|
| Core ML has been running on Apple-designed silicon for 8
| years now, if we really want to get pedantic. But sure,
| _actual_ LLM/transformer use is a more recent phenomenon.
| Uehreka wrote:
| People on HN do a lot of wishful thinking when it comes to
| the macOS LLM situation. I feel like most of the people
| touting the Mac's ability to run LLMs are either impressed
| that they run at all, are doing fairly simple tasks, or just
| have a toy model they like to mess around with and it doesn't
| matter if it messes up.
|
| And that's fine! But then people come into the conversation
| from Claude Code and think there's a way to run a coding
| assistant on Mac, saying "sure it won't be as good as Claude
| Sonnet, but if it's even half as good that'll be fine!"
|
| And then they realize that the heavvvvily quantized models
| that you can run on a mac (that isn't a $6000 beast) can't
| invoke tools properly, and try to "bridge the gap" by
| hallucinating tool outputs, and it becomes clear that the
| models that are small enough to run locally aren't "20-50% as
| good as Claude Sonnet", they're like toddlers by comparison.
|
| People need to be more clear about what they mean when they
| say they're running models locally. If you want to build an
| image-captioner, fine, go ahead, grab Gemma 7b or something.
| If you want an assistant you can talk to that will give you
| advice or help you with arbitrary tasks for work, that's not
| something that's on the menu.
| bigyabai wrote:
| I agree completely. My larger point is that Apple and
| Nvidia's hardware has depreciated more slowly, because
| they've been shipping highly dense chips for a while now.
| Apple's software situation is utterly derelict and it
| cannot be seriously compared to CUDA in the same sentence.
|
| For inference purposes, though, compute shaders have worked
| fine for all 3 manufacturers. It's really only Nvidia users
| that benefit from the wealth of finetuning/training
| programs that are typically CUDA-native.
| motorest wrote:
| > As the hardware continues to iterate at a rapid pace,
| anything you pick up second-hand will still depreciate at that
| pace, making any real investment in hardware unjustifiable.
|
| Can you explain your rationale? It seems that the worst case
| scenario is that your setup might not be the most performant
| ever, but it will still work and run models just as it always
| did.
|
| This sounds like a classic and very basic opex vs capex
| tradeoff analysis, and these are renowned for showing that in
| financial terms cloud providers are a preferable option only
| in a very specific corner case: a short-term investment to
| jump-start infrastructure when you do not know your scaling
| needs. This is not the case for LLMs.
|
| OP seems to have invested around $600. This is around 3 months
| worth of an equivalent EC2 instance. Knowing this, can you
| support your rationale with numbers?
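|
| Back-of-the-envelope, with the hourly rate as an assumption
| for illustration rather than a quote for any particular EC2
| instance type:
|
|   hardware_capex = 600.00   # one-time purchase (OP's figure)
|   hourly_rate = 0.27        # assumed $/h for a comparable instance
|   hours_per_month = 730
|
|   monthly_opex = hourly_rate * hours_per_month  # ~$197/month
|   breakeven = hardware_capex / monthly_opex     # ~3 months
|   print(f"~{monthly_opex:.0f} $/mo, breakeven ~{breakeven:.1f} mo")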
| tcdent wrote:
| When considering used hardware you have to take quantization
| into account; gpt-oss-120b, for example, ships in the very new
| MXFP4 format, which will take far more than 80GB once unpacked
| into the floating-point types available on older hardware or
| Apple silicon.
|
| Open models are trained on modern hardware and will continue
| to take advantage of cutting-edge numeric types, and older
| hardware will continue to suffer worse performance and larger
| memory requirements.
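|
| Rough weight-only arithmetic for a ~120B-parameter model (KV
| cache and activations come on top; the per-parameter sizes are
| the usual figures for these formats):
|
|   params = 120e9  # ~120B parameters
|   bytes_per_param = {
|       "MXFP4 (~4-bit)": 0.5,  # the released format
|       "int8 / fp8": 1.0,      # fallback on some older stacks
|       "fp16 / bf16": 2.0,     # fallback without fp4/fp8 support
|   }
|   for fmt, b in bytes_per_param.items():
|       print(f"{fmt:>16}: ~{params * b / 1e9:.0f} GB")
|   # ~60 GB at 4-bit vs ~120-240 GB once widened on older hardware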
| motorest wrote:
| You're using a lot of words to say "I believe yesterday's
| hardware might not run models as fast as today's
| hardware."
|
| That's fine. The point is that yesterday's hardware is
| quite capable of running yesterday's models, and obviously
| it will also run tomorrow's models.
|
| So the question is cost. Capex vs opex. The fact is that
| buying your own hardware is proven to be far more cost-
| effective than paying cloud providers to rent some cycles.
|
| I brought data to the discussion: for the price tag of OP's
| home lab, you only afford around 3 months worth of an
| equivalent EC2 instance. What's your counter argument?
| kelnos wrote:
| Not the GP, but my take on this:
|
| You're right about the cost question, but I think the
| added dimension that people are worried about is the
| current pace of change.
|
| To abuse the idiom a bit, yesterday's hardware should be
| able to run tomorrow's models, as you say, but it might
| not be able to run next month's models (acceptably or at
| all).
|
| Fast-forward some number of years, as the pace slows.
| Then-yesterday's hardware might still be able to run
| next-next year's models acceptably, and someone might
| find that hardware to be a better, safer, longer-term
| investment.
|
| I think of this similarly to how the pace of mobile phone
| development has changed over time. In 2010 it was
| somewhat reasonable to want to upgrade your smartphone
| every two years or so: every year the newer flagship
| models were actually significantly faster than the
| previous year, and you could tell that the new OS
| versions would run slower on your not-quite-new-anymore
| phone, and even some apps might not perform as well. But
| today in 2025? I expect to have my current phone for 6-7
| years (as long as Google keeps releasing updates for it)
| before upgrading. LLM development over time may follow at
| least a superficially similar curve.
|
| Regarding the equivalent EC2 instance, I'm not comparing
| it to the cost of a homelab, I'm comparing it to the cost
| of an Anthropic Pro or Max subscription. I can't justify
| the cost of a homelab (the capex, plus the opex of
| electricity, which is expensive where I live), when in a
| year that hardware might be showing its age, and in two
| years might not meet my (future) needs. And if I can't
| justify spending the homelab cost every two years, I
| certainly can't justify spending that same amount in 3
| months for EC2.
| motorest wrote:
| > Fast-forward some number of years (...)
|
| I repeat: OP's home server costs as much as a few months
| of a cloud provider's infrastructure.
|
| To put it another way, OP can buy brand new hardware a
| few times per year and still save money compared with
| paying a cloud provider for equivalent hardware.
|
| > Regarding the equivalent EC2 instance, I'm not
| comparing it to the cost of a homelab, I'm comparing it
| to the cost of an Anthropic Pro or Max subscription.
|
| OP stated quite clearly their goal was to run models
| locally.
| ac29 wrote:
| > OP stated quite clearly their goal was to run models
| locally.
|
| Fair, but at the point you trust Amazon to host your
| "local" LLM, it's not a huge reach to just use Amazon
| Bedrock or something
| tcdent wrote:
| I incorporated the quantization aspect because it's not
| that simple.
|
| Yes, old hardware will be slower, but you will also need
| a significant amount more of it to even operate.
|
| RAM is the expensive part. You need lots of it. You need
| even more of it for older hardware which has less
| efficient float implementations.
|
| https://developer.nvidia.com/blog/floating-point-8-an-introd...
| Aurornis wrote:
| I think the local LLM scene is very fun and I enjoy following
| what people do.
|
| However every time I run local models on my MacBook Pro with a
| ton of RAM, I'm reminded of the gap between local hosted models
| and the frontier models that I can get for $20/month or nominal
| price per token from different providers. The difference in
| speed and quality is massive.
|
| The current local models are very impressive, but they're still
| a big step behind the SaaS frontier models. I feel like the
| benchmark charts don't capture this gap well, presumably
| because the models are trained to perform well on those
| benchmarks.
|
| I already find the frontier models from OpenAI and Anthropic to
| be slow and frequently error prone, so dropping speed and
| quality even further isn't attractive.
|
| I agree that it's fun as a hobby or for people who can't or
| won't take any privacy risks. For me, I'd rather wait and see
| what an M5 or M6 MacBook Pro with 128GB of RAM can do before I
| start trying to put together another dedicated purchase for
| LLMs.
| Uehreka wrote:
| I was talking about this in another comment, and I think the
| big issue at the moment is that a lot of the local models
| seem to really struggle with tool calling. Like, just
| straight up can't do it even though they're advertised as
| being able to. Most of the models I've tried with Goose
| (models which say they can do tool calls) will respond to my
| questions about a codebase with "I don't have any ability to
| read files, sorry!"
|
| So that's a real brick wall for a lot of people. It doesn't
| matter how smart a local model is if it can't put that
| smartness to work because it can't touch anything. The
| difference between manually copy/pasting code from LM Studio
| and having an assistant that can read and respond to errors
| in log files is light years. So until this situation changes,
| this asterisk needs to be mentioned every time someone says
| "You can run coding models on a MacBook!"
| jauntywundrkind wrote:
| Agreed that this is a huge limit. There are a lot of examples
| of "tool calling" out there, but it's all bespoke code-it-
| yourself: very few of these systems have MCP integration.
|
| I have a ton of respect for SGLang as a runtime. I'm hoping
| something can be done there:
| https://github.com/sgl-project/sglang/discussions/4461 . As
| noted in that thread, it is _really_ great that Qwen3-Coder
| has a tool-parser built-in: hopefully that can be some kind
| of useful reference/start.
| https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/b...
| mxmlnkn wrote:
| This resonates. I have finally started looking into local
| inference a bit more recently.
|
| I have tried Cursor a bit, and whatever it used worked
| somewhat alright to generate a starting point for a feature
| and for a large refactor and break through writer's blocks.
| It was fun to see it behave similarly to my workflow by
| creating step-by-step plans before doing work, then
| searching for functions to look for locations and change
| stuff. I feel like one could learn structured thinking
| approaches from looking at these agentic AI logs. There
| were lots of issues with both of these tasks, though, e.g.,
| many missed locations for the refactor and spuriously
| deleted or indented code, but it was a starting point and
| somewhat workable with git. The refactoring usage caused me
| to reach free token limits in only two days. Based on the
| usage, it used millions of tokens in minutes, only rarely
| less than 100K tokens per request, and therefore probably
| needs a similarly large context length for best
| performance.
|
| I wanted to replicate this with VSCodium and Cline or
| Continue because I want to use it without exfiltrating all
| my data to megacorps as payment and use it to work on non-
| open-source projects, and maybe even use it offline. Having
| Cursor start indexing everything, including possibly
| private data, in the project folder as soon as it starts,
| left a bad taste, as useful as it is. But, I quickly ran
| into context length problems with Cline, and Continue does
| not seem to work very well. Some models did not work at
| all, DeepSeek was thinking for hours in loops (default
| temperature too high, should supposedly be <0.5). And even
| after getting tool use to work somewhat with qwen qwq 32B
| Q4, it feels like it does not have a full view of the
| codebase, even though it has been indexed. For one refactor
| request mentioning names from the project, it started by
| doing useless web searches. It might also be a context
| length issue. But larger contexts really eat up memory.
|
| I am also contemplating a new system for local AI, but it
| is really hard to decide. You have the choice between fast
| GPU inference, e.g., RTX 5090 if you have money, or 1-2
| used RTX 3090, or slow, but qualitatively better CPU /
| unified memory integrated GPU inference with systems such
| as the DGX Spark, the Framework Desktop AMD Ryzen AI Max,
| or the Mac Pro systems. Neither is ideal (or cheap).
| Although my problems with context length and low-performing
| agentic models seem to indicate that going for the slower
| but more helpful models on a large unified memory seems to
| be better for my use case. My use case would mostly be
| agentic coding. Code completion does not seem to fit me
| because I find it distracting, and I don't require much
| boilerplating.
|
| It also feels like the GPU is wasted, and local inference
| might be a red herring altogether. Looking at how a batch
| size of 1 is one of the worst cases for GPU computation and
| how it would only be used in bursts, any cloud solution
| will be easily an order of magnitude or two more efficient
| because of these, if I understand this correctly. Maybe
| local inference will therefore never fully take off,
| barring even more specialized hardware or hard requirements
| on privacy, e.g., for companies. To solve that, it would
| take something like computing on encrypted data, which
| seems impossible.
|
| Then again, if the batch size of 1 is indeed so bad as I
| think it to be, then maybe simply generate a batch of
| results in parallel and choose the best of the answers?
| Maybe this is not a thing because it would increase memory
| usage even more.
| com2kid wrote:
| > Like, just straight up can't do it even though they're
| advertised as being able to. Most of the models I've tried
| with Goose (models which say they can do tool calls) will
| respond to my questions about a codebase with "I don't have
| any ability to read files, sorry!"
|
| I'm working on solving this problem in two steps. The first
| is a library, prefilled-json, that lets small models
| properly fill out JSON objects. The second is an unpublished
| library called Ultra Small Tool Call that presents tools in
| a way that small models can understand, and basically walks
| the model through filling out the tool call with the help
| of prefilled-json. It combines a number of techniques,
| including tool call RAG (pulling in tool definitions using
| RAG) and, honestly, just not throwing entire JSON schemas
| at the model, instead using context engineering to keep
| the model focused.
|
| IMHO the better solution for local on device workflows
| would be if someone trained a custom small parameter model
| that just determined if a tool call was needed and if so
| which tool.
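|
| A hypothetical sketch of that two-step idea (not the actual
| prefilled-json or Ultra Small Tool Call APIs): first ask the
| small model only "which tool, if any?", then let it fill one
| argument at a time into a JSON shell that code, not the
| model, keeps well-formed. `llm` is any callable str -> str
| wrapping a local model:
|
|   TOOLS = {"read_file": ["path"], "web_search": ["query"]}
|
|   def pick_tool(llm, task):
|       # tiny classification step: output a tool name or NONE
|       ans = llm(f"Tools: {', '.join(TOOLS)}. Task: {task}\n"
|                 "Answer with one tool name or NONE:").strip()
|       return ans if ans in TOOLS else None
|
|   def fill_call(llm, tool, task):
|       # code owns the JSON shape; the model only supplies values
|       args = {f: llm(f"Task: {task}\nValue for '{f}' of {tool}:").strip()
|               for f in TOOLS[tool]}
|       return {"tool": tool, "arguments": args}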
| 1oooqooq wrote:
| More interesting is the extent to which Apple convinced people
| a laptop can replace a desktop or server. Mind-blowing reality
| distortion field (as will be proven by some twenty comments
| telling me I'm wrong in 3... 2... 1).
| bionsystem wrote:
| I'm a desktop guy, considering the switch to a laptop-only
| setup, what would I miss ?
| kelipso wrote:
| For $10k, you too can get the power of a $2k desktop, and
| enjoy burning your lap every day, or something like that.
| If I were to do local compute and wanted to use my
| laptop, I would only consider a setup where I ssh into
| my desktop. So I guess the only differences from a SaaS
| LLM would be privacy and the cool factor. And rate limits,
| and paying more if you go over, etc.
| com2kid wrote:
| $2k laptops these days come with 16 cores. They are
| thermally limited, but they are going to get you 60-80% of
| the perf of their desktop counterparts.
|
| The real limit is on the Nvidia cards. They are cut down
| a fair bit, often with less VRAM until you really go up
| in price point.
|
| They also come with NPUs but the docs are bad and none of
| the local LLM inference engines seem to use the NPU, even
| though they could in theory be happy running smaller
| models.
| baobun wrote:
| Upgradability, repairability, thermals (translating into
| widely different performance for the same specs), I/O,
| connectivity.
| jauntywundrkind wrote:
| I agree and disagree. Many of the best models are open
| source, just _too big_ to run for most people.
|
| And there are plenty of ways to fit these models! A Mac
| Studio M3 Ultra with 512GB of unified memory has huge
| capacity, and a decent chunk of bandwidth (800GB/s; compare
| vs a 5090's ~1800GB/s). $10k is a lot of money, but that
| ability to fit these very large models & get quality results
| is very impressive. Performance is lower still, but a single
| AMD Turin chip with its 12 channels of DDR5-6000 can get you
| to almost 600GB/s: a 12x 64GB (768GB) build is gonna be
| $4000+ in RAM costs, plus $4800 for, say, a 48-core Turin to
| go with it. (But if you go to older generations,
| affordability goes way up! Special part, but the 48-core
| 7R13 is <$1000).
|
| Still, those costs come to $5000 at the low end, and come
| with far fewer tokens/s. The "grid compute" / "utility
| compute" / "cloud compute" model of getting work done on a
| hot GPU, with a model already loaded, by someone else is
| very direct & clear. And those are very big investments.
| It's just not likely any of us will have anything but burst
| demands for GPUs, so structurally it makes sense. But it
| really feels like only small things are getting in the way
| of running big models at home!
|
| Strix Halo is kind of close. 96GB usable memory isn't quite
| enough to really do the thing though (and only 256GB/s). Even
| if/when they put the new 64GB DDR5 onto the platform (for
| 256GB, let's say 224 usable), one still has to sacrifice
| quality some to fit 400B+ models. Next gen Medusa Halo is not
| coming for a while, but goes from 4->6 channels, so 384GB
| total: not bad.
|
| (It sucks that PCIe is so slow. PCIe 5.0 x16 is only ~64GB/s
| in one direction. Compared to the need here, it's nowhere
| near enough to pair a big-memory host with a smaller-memory
| GPU.)
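|
| The ~600GB/s figure falls out of simple channel math, and it
| also gives a crude decode-speed ceiling, since every generated
| token has to stream the active weights through memory once
| (the model size and quantization below are assumptions):
|
|   channels, bytes_per_xfer, mt_per_s = 12, 8, 6.0e9  # DDR5-6000
|   bandwidth = channels * bytes_per_xfer * mt_per_s   # 576 GB/s
|
|   active_params = 35e9    # e.g. a ~35B-active MoE (assumption)
|   bytes_per_param = 0.5   # 4-bit quantized weights
|   ceiling = bandwidth / (active_params * bytes_per_param)
|   print(f"{bandwidth/1e9:.0f} GB/s -> at most ~{ceiling:.0f} tok/s")
|   # real decode rates land well below this upper bound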
| jstummbillig wrote:
| > Many of the best models are open source, just too big to
| run for most people
|
| I don't think that's a likely future, when you consider all
| the big players doing enormous infrastructure projects and
| the money that this increasingly demands. Powerful LLMs are
| simply not a great open source candidate. The models are
| not a by-product of the bigger thing you do. They are the
| bigger thing. Open sourcing a LLM means you are essentially
| investing money to just give it away. That simply does not
| make a lot of sense from a business perspective. You can do
| that in a limited fashion for a limited time, for example
| when you are scaling or it's not really your core business
| and you just write it off as expenses, while you try to
| figure yet another thing out (looking at you Meta).
|
| But with the current paradigm, one thing seems to be very
| clear: building and running ever bigger LLMs is a money-
| burning machine the likes of which we have rarely if ever
| seen, and operating that machine at a loss will make you
| run out of any amount of money really, really fast.
| esseph wrote:
| https://pcisig.com/pci-sig-announces-pcie-80-specification-t...
|
| From 2003-2016, 13 years, we had PCIE 1,2,3.
|
| 2017 - PCIE 4.0
|
| 2019 - PCIE 5.0
|
| 2022 - PCIE 6.0
|
| 2025 - PCIE 7.0
|
| 2028 - PCIE 8.0
|
| Manufacturing and vendors are having a hard time keeping
| up. And the PCIe 5.0 memory is... not always the most
| stable.
| dcrazy wrote:
| Are you conflating GDDR5x with PCIe 5.0?
| esseph wrote:
| No.
|
| I'm saying we're due for faster memory but seem to be
| having trouble scaling bus speeds as well (in production)
| and reliable memory. And the network is changing a lot,
| too.
|
| It's a neverending cycle I guess.
| Aurornis wrote:
| > Many of the best models are open source, just too big to
| run for most people.
|
| You can find all of the open models hosted across different
| providers. You can pay per token to try them out.
|
| I just don't see the open models as being at the same
| quality level as the best from Anthropic and OpenAI.
| They're _good_ but in my experience they're not as good as
| the benchmarks would suggest.
|
| > $10k is a lot of money, but that ability to fit these
| very large models & get quality results is very impressive.
|
| This is why I only appreciate the local LLM scene from a
| distance.
|
| It's really cool that this can be done, but $10K to run
| lower quality models at slower speeds is a hard sell. I can
| rent a lot of hours on an on-demand cloud server for a lot
| less than that price or I can pay $20-$200/month and get
| great performance and good quality from Anthropic.
|
| I think the local LLM scene is fun where it intersects with
| hardware I would buy anyway (MacBook Pro with a lot of RAM)
| but spending $10K to run open models locally is a very
| expensive hobby.
| kelnos wrote:
| > _I expect this will change in the future_
|
| I'm really hoping for that too. As I've started to adopt Claude
| Code more and more into my workflow, I don't want to depend on
| a company for day-to-day coding tasks. I don't want to have to
| worry about rate limits or API spend, or having to put up
| $100-$200/mo for this. I don't want everything I do to be
| potentially monitored or mined by the AI company I use.
|
| To me, this is very similar to why all of the smart-home stuff
| I've purchased all must have local control, and why I run my
| own smart-home software, and self-host the bits that let me
| access it from outside my home. I don't want any of this or
| that tied to some company that could disappear tomorrow, jack
| up their pricing, or sell my data to third parties. Or even use
| my data for their own purposes.
|
| But yeah, I can't see myself trying to set any LLMs up for my
| own use right now, either on hardware I own, or in a VPS I
| manage myself. The cost is very high (I'm only paying Anthropic
| $20/mo right now, and I'm very happy with what I get for that
| price), and it's just too fiddly and requires too much
| knowledge to set up and maintain, knowledge that I'm not all
| that interested in acquiring right now. Some people enjoy doing
| that, but that's not me. And the current open models and
| tooling around them just don't seem to be in the same class as
| what you can get from Anthropic et al.
|
| But yes, I hope and expect this will change!
| cyanydeez wrote:
| Anything you build in the LLM cloud will be, must be, rug-
| pulled: via lock-in once it succeeds, utter bankruptcy, or
| just a model, context, or prompt change.
|
| Unless you're a billionaire with pull, you're building tools
| you can't control and can't own; they are ephemeral wisps.
|
| That's if you can even trust these large models to be
| consistent.
| ActorNightly wrote:
| >but when you factor in the performance of the models you have
| access to, and the cost of running them on-demand in a cloud,
| it's really just a fun hobby instead of a viable strategy to
| benefit your life.
|
| It's because people are thinking too linearly about this,
| equating model size with usability.
|
| Without going into too much detail because this may be a viable
| business plan for me, but I have had very good success with
| Gemma QAT model that runs quite well on a 3090 wrapped up in a
| very custom agent format that goes beyond simple
| prompt->response use. It can do things that even the full size
| large language models fail to do.
| alliao wrote:
| It really depends on whether a local model satisfies your own
| usage, right? If it works locally well enough, just package it
| up and be content? As long as it's providing value now, at
| least it's local...
| sneak wrote:
| Halfway through he gives up and uses remote models. The basic
| premise here is false.
|
| Also, the term "remote code execution" in the beginning is
| misused. Ironically, remote code execution refers to execution of
| code locally - by a remote attacker. Claude Code does in fact
| have that, but I'm not sure if that's what they're referring to.
| thepoet wrote:
| The blog says more about keeping the user data private. The
| remote models in the context are operating blind. I am not sure
| why you are nitpicking, almost nobody reading the blog would
| take remote code execution in that context.
| vunderba wrote:
| The MCP aspect (for code/tool execution) is completely
| orthogonal to the issue of data privacy.
|
| If you put a remote LLM in the chain then it is 100% going to
| inadvertently send user data up to them at some point.
|
| e.g. if I attach a PDF to my context that contains private
| data, it _WILL_ be sent to the LLM. I have no idea what
| "operating blind" means in this context. Connecting to a
| remote LLM means your outgoing requests are tied to a
| specific authenticated API key.
| mark_l_watson wrote:
| That is fairly cool. I was talking about this on X yesterday.
| Another angle, however: I use a local web scraper and a search
| engine via Meilisearch over the main tech web sites I am
| interested in. For my personal research I use three web search
| APIs, but there is some latency. Having a big chunk of the web
| that I am interested in available locally with close to zero
| latency is nice when running local models, my own MCP services
| that might need web search, etc.
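|
| A minimal sketch of that local-index idea with the official
| Meilisearch Python client (assuming a server already running
| on 127.0.0.1:7700; the scraper that produces the documents is
| elided and the key is a placeholder):
|
|   import meilisearch
|
|   client = meilisearch.Client("http://127.0.0.1:7700", "masterKey")
|   index = client.index("tech_sites")
|
|   # documents emitted by the local scraper
|   index.add_documents([
|       {"id": 1, "url": "https://example.com/post",
|        "title": "Some post", "body": "full text here"},
|   ])
|
|   # local models / MCP services can now search with ~zero latency
|   hits = index.search("local llm inference")["hits"]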
| luke14free wrote:
| You might want to check out what we built -> https://inference.sh
| It supports most major open source/weight models, from Wan 2.2
| video, Qwen Image, and Flux to most LLMs, Hunyuan 3D, etc. It
| works in a containerized way locally by letting you bring your
| own GPU as an engine (fully free), or lets you rent a remote
| GPU/pool from a common cloud in case you want to run more
| complex models. For each model we tried to add quantized/GGUF
| versions so that even Wan 2.2/Qwen Image/Gemma become possible
| to execute with as little as 8GB of VRAM. MCP support is coming
| soon in our chat interface so it can access other apps from the
| ecosystem.
| rshemet wrote:
| if you ever end up trying to take this in the mobile direction,
| consider running on-device AI with Cactus -
|
| https://cactuscompute.com/
|
| Blazing-fast, cross-platform, and supports nearly all recent OS
| models.
| xt00 wrote:
| Yea, in an ideal world there would be a legal construct around
| AI agents in the cloud doing something on your behalf that
| could not be blocked by various stakeholders deciding they
| don't like the thing you are doing, even if it's totally legal.
| Things that would be considered fair use, or that are maybe
| annoying to certain companies, should not be easy for companies
| to wholesale block by leveraging business relationships.
| Barring that, then yea, a local AI setup is the way to go.
| sabareesh wrote:
| Here is my rig, running GLM 4.5 Air. Very impressed by this model
|
| https://sabareesh.com/posts/llm-rig/
|
| https://huggingface.co/zai-org/GLM-4.5
| mkummer wrote:
| Super cool and well thought out!
|
| I'm working on something similar focused on being able to easily
| jump between the two (cloud and fully local) using a Bring Your
| Own [API] Key model - all data/config/settings/prompts are fully
| stored locally and provider API calls are routed directly (never
| pass through our servers). Currently using mlc-llm for models &
| inference fully local in the browser (Qwen3-1.7b has been working
| great)
|
| [1] https://hypersonic.chat/
| Imustaskforhelp wrote:
| I think I still prefer local, but I feel like that's because
| most AI inference is kinda slow or comparable to local. But I
| recently tried out Cerebras (I have heard about Groq too), and
| honestly, when you try things at 1000 tk/s or similar, your
| mental model really shifts and you become quite impatient.
| Cerebras does say that they don't log your data or anything in
| general, and you would have to trust me when I say that I am
| not sponsored by them (wish I was tho); it's just that they
| are kinda nice.
|
| But I still hope that we can someday get some meaningful
| improvements in local speed too. Diffusion models seem to be a
| really fast architecture.
| retrocog wrote:
| It's all about context and purpose, isn't it? For certain
| lightweight use cases, especially those concerning sensitive
| user data, a local implementation may make a lot of sense.
| kaindume wrote:
| Self-hosted and offline AI systems would be great for privacy,
| but the hardware and electricity costs are much too high for
| most users. I am hoping for a P2P decentralized solution that
| runs on distributed hardware not controlled by a single
| corporation.
| user3939382 wrote:
| I'd settle for homomorphic encryption but that's a long way off
| if ever
| woadwarrior01 wrote:
| > LLMs: Ollama for local models (also private models for now)
|
| Incidentally, I decided to try the Ollama macOS app yesterday,
| and the first thing it tries to do upon launch is connect to
| some Google domain. Not very private.
|
| https://imgur.com/a/7wVHnBA
| Aurornis wrote:
| Automatic update checks
| https://github.com/ollama/ollama/blob/main/docs/faq.md
| abtinf wrote:
| Yep, and I've noticed the same thing in VS Code with both
| the Cline plugin and the Copilot plugin.
|
| I configure them both to use local ollama, block their outbound
| connections via little snitch, and they just flat out don't
| work without the ability to phone home or posthog.
|
| Super disappointing that Cline tries to do so much outbound
| comms, even after turning off telemetry in the settings.
| eric-burel wrote:
| But it can be audited, which I'd buy every day. It's probably
| not too hard to find network calls in a codebase if this check
| must be automated on each update.
| adsharma wrote:
| https://github.com/adsharma/ask-me-anything
|
| Supports MLX on Apple silicon. Electron app.
|
| There is a CI to build downloadable binaries. Looking to make a
| v0.1 release.
| andylizf wrote:
| This is fantastic work. The focus on a local, sandboxed execution
| layer is a huge piece of the puzzle for a private AI workspace.
| The `coderunner` tool looks incredibly useful.
|
| A complementary challenge is the knowledge layer: making the AI
| aware of your personal data (emails, notes, files) via RAG. As
| soon as you try this on a large scale, storage becomes a massive
| bottleneck. A vector database for years of emails can easily
| exceed 50GB.
|
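| For a sense of where that number comes from, raw float32
| embeddings alone dominate the cost (all counts below are
| illustrative assumptions):
|
|   docs = 500_000       # a few years of mail, threads, attachments
|   chunks_per_doc = 16  # after splitting for retrieval
|   dims = 1536          # embedding dimension
|   bytes_each = 4       # float32
|
|   gb = docs * chunks_per_doc * dims * bytes_each / 1e9
|   print(f"embeddings alone: ~{gb:.0f} GB")  # ~49 GB, before
|   # graph links and metadata -- hence the appeal of an index
|   # that avoids storing the embeddings at all
|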
| (Full disclosure: I'm part of the team at Berkeley that tackled
| this). We built LEANN, a vector index that cuts storage by ~97%
| by not storing the embeddings at all. It makes indexing your
| entire digital life locally actually feasible.
|
| Combining a local execution engine like this with a hyper-
| efficient knowledge index like LEANN feels like the real path to
| a true "local Jarvis."
|
| Code: https://github.com/yichuan-w/LEANN
| Paper: https://arxiv.org/abs/2405.08051
| sebmellen wrote:
| I know next to nothing about embeddings.
|
| Are there projects that implement this same "pruned graph"
| approach for cloud embeddings?
| doctoboggan wrote:
| > A vector database for years of emails can easily exceed 50GB.
|
| In 2025 I would consider this a relatively meager requirement.
| andylizf wrote:
| Yeah, that's a fair point at first glance. 50GB might not
| sound like a huge burden for a modern SSD.
|
| However, the 50GB figure was just a starting point for
| emails. A true "local Jarvis," would need to index
| everything: all your code repositories, documents, notes, and
| chat histories. That raw data can easily be hundreds of
| gigabytes.
|
| For a 200GB text corpus, a traditional vector index can swell
| to >500GB. At that point, it's no longer a "meager"
| requirement. It becomes a heavy "tax" on your primary drive,
| which is often non-upgradable on modern laptops.
|
| The goal for practical local AI shouldn't just be that it's
| possible, but that it's also lightweight and sustainable.
| That's the problem we focused on: making a comprehensive
| local knowledge base feasible without forcing users to
| dedicate half their SSD to a single index.
| oblio wrote:
| It feels weird that the search index is bigger than the
| underlying data, weren't search indexes supposed to be
| efficient formats giving fast access to the underlying data?
| andylizf wrote:
| Exactly. That's because instead of just mapping keywords,
| vector search stores the rich meaning of the text as massive
| data structures, and LEANN is our solution to that
| paradoxical inefficiency.
| yichuan wrote:
| I guess for semantic search (rather than keyword search), the
| index is larger than the text because we need to embed it
| into a huge semantic space, which makes sense to me.
| wfn wrote:
| Thank you for the pointer to LEANN! I've been experimenting
| with RAGs and missed this one.
|
| I am particularly excited about using RAG as the knowledge
| layer for LLM agents/pipelines/execution engines _to make it
| feasible for LLMs to work with large codebases_. It seems like
| the current solution is _already_ worth a try. It really makes
| it easier that your RAG solution already has Claude Code
| integration![1]
|
| Has anyone tried the above challenge (RAG + some LLM for
| working with large codebases)? I'm very curious how it goes
| (thinking it may require some careful system-prompting to push
| agent to make heavy use of RAG index/graph/KB, but that is
| fine).
|
| I think I'll give it a try later (using cloud frontier model
| for LLM though, for now...)
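|
| In case it helps anyone trying this: the core loop is just
| retrieve-then-prompt. A generic sketch (the `search` callable
| is hypothetical, standing in for whatever index you build; it
| is not the LEANN API):
|
|   def answer_about_codebase(llm, search, question, k=8):
|       # search(query, k) -> list of (file_path, snippet)
|       snippets = search(question, k)
|       context = "\n\n".join(f"# {p}\n{s}" for p, s in snippets)
|       prompt = ("Use only the code excerpts below.\n\n"
|                 f"{context}\n\nQuestion: {question}\nAnswer:")
|       return llm(prompt)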
|
| [1]:
| https://github.com/yichuan-w/LEANN/blob/main/packages/leann-...
| com2kid wrote:
| > Even with help from the "world's best" LLMs, things didn't go
| quite as smoothly as we had expected. They hallucinated steps,
| missed platform-specific quirks, and often left us worse off.
|
| This shows how little native app training data is even available.
|
| People rarely write blog posts about designing native apps,
| long-winded Medium tutorials don't exist, and heck, even the
| number of open source projects for native desktop apps is a
| small percentage compared to mobile and web apps.
|
| Historically, Microsoft paid some of the best technical
| writers in the world to write amazing books on how to code for
| Windows (see: Charles Petzold), but nowadays that entire
| industry is almost dead.
|
| These types of holes in training data are going to be a larger
| and larger problem.
|
| Although this is just representative of software engineering in
| general - few people want to write native desktop apps because it
| is a career dead end. Back in the 90s knowing how to write
| Windows desktop apps was _great_ , it was pretty much a promised
| middle class lifestyle with a pretty large barrier to entry
| (C/C++ programming was hard, the Windows APIs were not easy to
| learn, even though MS dumped tons of money into training
| programs), but things have changed a lot. Outside of the OS
| vendors themselves (Microsoft, Apple) and a few legacy app teams
| (Adobe, Autodesk, etc), very few jobs exist for writing desktop
| apps.
| thorncorona wrote:
| I mean outside of HPC why would you when the browser is the
| world's most ubiquitous VM?
| gen2brain wrote:
| People are talking about AI everywhere, but where can we find
| documentation, examples, and proof of how it works? It all ends
| with chat. Which chat is better and cheaper? This local story is
| just using some publicly available model, but downloaded? When is
| this going to stop?
| bling1 wrote:
| On a similar vibe, we developed app.czero.cc to run an LLM
| inside your Chrome browser on your machine's hardware without
| installation (you do have to download the models). It's hard
| to run big models, but it doesn't get more local than that
| without having to install anything.
| btbuildem wrote:
| I didn't see any mention of the hardware OP is planning to run
| this on -- any hints?
| vunderba wrote:
| Infra notwithstanding - I'd be interested in hearing how much
| success they actually had using a locally hosted MCP-capable LLM
| (and which ones in particular) because the E2E tests in the
| article seem to be against remote models like Claude.
| ruler88 wrote:
| At least you won't be needing a heater for the winter
| mikeyanderson1 wrote:
| We have this in closed alpha right now getting ready to roll out
| to our most active builders in the coming weeks at ThinkAgents.ai
| LastTrain wrote:
| I get it but I can't get over the irony that you are using a tool
| that only works precisely because people don't do this.
| eric-burel wrote:
| An LLM on your computer is a fun hobby; an LLM in your
| 10-person SME is a business idea. There are not enough
| resources on this topic at all, and the need is growing
| extremely fast. Local LLMs are needed for many use cases and
| businesses where cloud is not possible.
| nenadg wrote:
| I did this by running models in a chroot.
___________________________________________________________________
(page generated 2025-08-08 23:00 UTC)