[HN Gopher] Ask HN: What is the best LLM for consumer grade hard...
       ___________________________________________________________________
        
       Ask HN: What is the best LLM for consumer grade hardware?
        
       I have a 5060ti with 16GB VRAM. I'm looking for a model that can
       hold basic conversations, no physics or advanced math required.
       Ideally something that can run reasonably fast, near real time.
        
       Author : VladVladikoff
       Score  : 191 points
       Date   : 2025-05-30 11:02 UTC (11 hours ago)
        
       | btreecat wrote:
        | I only have 8GB of VRAM to work with currently, but I'm running
        | OpenWebUI as a frontend to Ollama and I have a very easy time
        | loading up multiple models and letting them duke it out, either
        | at the same time or in a round robin.
       | 
       | You can even keep track of the quality of the answers over time
       | to help guide your choice.
       | 
       | https://openwebui.com/
        
         | rthnbgrredf wrote:
          | Be aware of the recent license change of "Open"WebUI. It is
          | no longer open source.
        
           | lolinder wrote:
           | Thanks, somehow I missed that.
           | 
           | https://docs.openwebui.com/license/
        
         | nicholasjarnold wrote:
          | AMD 6700XT owner here (12GB VRAM) - Can confirm.
         | 
         | Once I figured out my local ROCm setup Ollama was able to run
         | with GPU acceleration no problem. Connecting an OpenWebUI
         | docker instance to my local Ollama server is as easy as a
         | docker run command where you specify the OLLAMA_BASE_URL env
         | var value. This isn't a production setup, but it works nicely
         | for local usages like what the immediate parent is describing.
        
       | benterix wrote:
        | I'm afraid that 1) you are not going to get a definite answer,
        | 2) an objective answer is very hard to give, and 3) you really
        | need to try a few of the most recent models on your own and
        | give them the tasks that seem most useful/meaningful to you.
        | There is a drastic difference in output quality depending on
        | the task type.
        
       | kouteiheika wrote:
       | If you want to run LLMs locally then the localllama community is
       | your friend: https://old.reddit.com/r/LocalLLaMA/
       | 
       | In general there's no "best" LLM model, all of them will have
       | some strengths and weaknesses. There are a bunch of good picks;
       | for example:
       | 
       | > DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-
       | ai/DeepSeek-R1-0528-Qwen3-8B
       | 
       | Released today; probably the best reasoning model in 8B size.
       | 
       | > Qwen3 -
       | https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...
       | 
        | Recently released. Hybrid thinking/non-thinking models with
        | really great performance and a plethora of sizes for every kind
        | of hardware. The Qwen3-30B-A3B can even run on a CPU at
        | acceptable speeds. Even the tiny 0.6B one is somewhat coherent,
        | which is crazy.
        
         | ignoramous wrote:
          | > _DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-
          | ai/DeepSeek-R1-0528-Qwen3-8B ... Released today; probably the
          | best reasoning model in 8B size._
          | 
          | > ... we distilled the chain-of-thought from DeepSeek-R1-0528
          | to post-train Qwen3-8B Base, obtaining DeepSeek-R1-0528-Qwen3-8B
          | ... on AIME 2024, surpassing Qwen3-8B by +10.0% & matching the
          | performance of Qwen3-235B-thinking.
         | 
         | Wild how effective distillation is turning out to be. No
         | wonder, most shops have begun to "hide" CoT now:
         | https://news.ycombinator.com/item?id=41525201
        
           | bn-l wrote:
           | > Beyond its improved reasoning capabilities, this version
           | also offers a reduced hallucination rate, enhanced support
           | for function calling, and better experience for vibe coding.
           | 
           | Thank you for thinking of the vibe coders.
        
         | luke-stanley wrote:
         | > Released today; probably the best reasoning model in 8B size.
         | 
         | Actually DeepSeek-R1-0528-Qwen3-8B was uploaded Thursday
         | (yesterday) at 11 AM UTC / 7 PM CST. I had to check if a new
         | version came out since! I am waiting for the other sizes! ;D
        
         | ivape wrote:
         | I'd also recommend you go with something like 8b, so you can
         | have the other 8GB of vram for a decent sized context window.
         | There's tons of good 8b ones, as mentioned above. If you go for
         | the largest model you can fit, you'll have slower inference (as
         | you pass in more tokens) and smaller context.
        
           | svachalek wrote:
            | 8B is the number of parameters. The most common quant is 4
            | bits per parameter, so 8B params is roughly 4GB of VRAM
            | (typically more like 4.5GB in practice).
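            | 
            | A rough back-of-the-envelope in Python (the ~10% overhead
            | factor is my own assumption; real GGUF sizes vary by quant
            | recipe):
            | 
            |     def est_vram_gb(params_b: float, bits: float,
            |                     overhead: float = 1.1) -> float:
            |         # params_b billions of weights, `bits` per weight,
            |         # plus a fudge factor for non-quantized tensors.
            |         return params_b * bits / 8 * overhead
            | 
            |     print(est_vram_gb(8, 4))   # ~4.4 GB for an 8B model at Q4
            |     print(est_vram_gb(8, 16))  # ~17.6 GB unquantized (fp16)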
        
             | rhdunn wrote:
              | The number of quantized bits is a trade-off between size
              | and quality. Ideally you should be aiming for a 6-bit or
             | 5-bit model. I've seen some models be unstable at 4-bit
             | (where they will either repeat words or start generating
             | random words).
             | 
             | Anything below 4-bits is usually not worth it unless you
             | want to experiment with running a 70B+ model -- though I
             | don't have any experience of doing that, so I don't know
             | how well the increased parameter size balances the
             | quantization.
             | 
              | See https://github.com/ggml-org/llama.cpp/pull/1684 and
              | https://gist.github.com/Artefact2/b5f810600771265fc1e3944228...
              | for comparisons between quantization levels.
        
               | kouteiheika wrote:
               | > The number of quantized bits is a trade off between
               | size and quality. Ideally you should be aiming for a
               | 6-bit or 5-bit model. I've seen some models be unstable
               | at 4-bit (where they will either repeat words or start
               | generating random words).
               | 
               | Note that that's a skill issue of whoever quantized the
               | model. In general quantization even as low as 3-bit can
                | be almost lossless when you do quantization-aware
               | finetuning[1] (and apparently you don't even need that
               | many training tokens), but even if you don't want to do
               | any extra training you can be smart as to which parts of
               | the model you're quantizing and by how much to minimize
               | the damage (e.g. in the worst case over-quantizing even a
               | _single_ weight can have disastrous consequences[2])
               | 
                | Some time ago I ran an experiment where I finetuned a
                | small model while quantizing parts of it to 2-bits to see
                | which parts are most sensitive (the numbers are the final
                | loss; lower is better):
                | 
                |     1.5275   mlp.downscale
                |     1.5061   mlp.upscale
                |     1.4665   mlp.gate
                |     1.4531   lm_head
                |     1.3998   attn.out_proj
                |     1.3962   attn.v_proj
                |     1.3794   attn.k_proj
                |     1.3811   input_embedding
                |     1.3662   attn.q_proj
                |     1.3397   unquantized baseline
               | 
               | So as you can see quantizing some parts of the model
               | affects it more strongly. The downprojection in the MLP
               | layers is the most sensitive part of the model (which
               | also matches with what [2] found), so it makes sense to
               | quantize this part of the model less and instead quantize
                | other parts more strongly. But if you just do the naive
                | "quantize everything in 4-bit" then sure, you might
               | get broken models.
               | 
               | [1] - https://arxiv.org/pdf/2502.02631 [2] -
               | https://arxiv.org/pdf/2411.07191
        
               | rhdunn wrote:
               | Interesting. I was aware of using an imatrix for the
               | i-quants but didn't know you could use them for k-quants.
               | I've not experimented with using imatrices in my local
               | setup yet.
               | 
               | And it's not a skill issue... it's the default
               | behaviour/logic when using k-quants to quantize a model
               | with llama.cpp.
        
           | diggan wrote:
           | I think your recommendation falls within
           | 
           | > all of them will have some strengths and weaknesses
           | 
           | Sometimes a higher parameter model with less quantization and
           | low context will be the best, sometimes lower parameter model
           | with some quantization and huge context will be the best,
           | sometimes high parameter count + lots of quantization +
           | medium context will be the best.
           | 
           | It's really hard to say one model is better than another in a
           | general way, since it depends on so many things like your use
           | case, the prompts, the settings, quantization, quantization
           | method and so on.
           | 
           | If you're building/trying to build stuff depending on LLMs in
           | any capacity, the first step is coming up with your own
           | custom benchmark/evaluation that you can run with your
           | specific use cases being put under test. Don't share this
           | publicly (so it doesn't end up in the training data) and run
           | it in order to figure out what model is best for that
           | specific problem.
        
           | evilduck wrote:
            | With a 16GB GPU you can comfortably run models like Qwen3
            | 14B or Mistral Small 24B at Q4 to Q6 and still have plenty
            | of context space and get much better abilities than an 8B
            | model.
        
           | bee_rider wrote:
           | I'm curious (as someone who knows nothing about this
           | stuff!)--the context window is basically a record of the
           | conversation so far and other info that isn't part of the
           | model, right?
           | 
           | I'm a bit surprised that 8GB is useful as a context window if
           | that is the case--it just seems like you could fit a ton of
           | research papers, emails, and textbooks in 2GB, for example.
           | 
           | But, I'm commenting from a place of ignorance and curiosity.
           | Do models blow up the info in the context window, maybe do
           | some processing to pre-digest it?
        
             | hexomancer wrote:
              | Yes, every token is expanded into a vector that can have
              | many thousands of dimensions. The vectors are stored for
              | every token and every layer.
              | 
              | You absolutely cannot fit even a single research paper in
              | 2 GB, much less an entire book.
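              | 
              | As a rough illustration in Python (the layer/head numbers
              | below are a Llama-3-8B-style config with an fp16 cache;
              | treat them as assumptions, other models differ):
              | 
              |     n_layers, n_kv_heads, head_dim = 32, 8, 128
              |     bytes_per_elem = 2  # fp16 K/V cache
              |     # K and V are stored per token, per layer, per KV head
              |     per_token = (2 * n_layers * n_kv_heads * head_dim
              |                  * bytes_per_elem)
              |     print(per_token / 1024)          # ~128 KiB per token
              |     print(per_token * 8192 / 2**30)  # ~1 GiB for 8k context
              | 
              | So a long paper's worth of tokens can already eat a
              | gigabyte or more of cache, which is why several GB "just
              | for context" isn't as strange as it sounds.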
        
           | datameta wrote:
           | Can system RAM be used for context (albeit at lower parsing
           | speeds)?
        
         | moffkalast wrote:
          | Yes, at this point it's starting to become almost a matter of
          | how much you like the model's personality, since they're all
          | fairly decent. OP just has to start downloading and trying them
         | out. With 16GB one can do partial DDR5 offloading with
         | llama.cpp and run anything up to about 30B (even dense) or even
         | more at a "reasonable" speed for chat purposes. Especially with
         | tensor offload.
         | 
         | I wouldn't count Qwen as that much of a conversationalist
         | though. Mistral Nemo and Small are pretty decent. All of Llama
         | 3.X are still very good models even by today's standards. Gemma
         | 3s are great but a bit unhinged. And of course QwQ when you
         | need GPT4 at home. And probably lots of others I'm forgetting.
        
         | diggan wrote:
         | > If you want to run LLMs locally then the localllama community
         | is your friend: https://old.reddit.com/r/LocalLLaMA/
         | 
         | For folks new to reddit, it's worth noting that LocalLlama,
         | just like the rest of the internet but especially reddit, is
         | filled with misinformed people spreading incorrect "facts" as
         | truth, and you really can't use the upvote/downvote count as an
         | indicator of quality or how truthful something is there.
         | 
         | Something that is more accurate but put in a boring way will
         | often be downvoted, while straight up incorrect but
         | funny/emotional/"fitting the group think" comments usually get
         | upvoted.
         | 
          | For those of us who've spent a lot of time on the web, this
          | sort of bullshit detector is basically built in at this
          | point, but if you're new to places where the groupthink is as
          | heavy as it is on reddit, it's worth being careful about
          | taking anything at face value.
        
           | ddoolin wrote:
           | This is entirely why I can't bring myself to use it. The
           | groupthink and virtue signaling is _intense_ , when it's not
           | just extremely low effort crud that rises to the top. And
           | yes, before anyone says, I know, "curate." No, thank you.
        
             | dingnuts wrote:
             | Friend, this website is EXACTLY the same
        
               | rafterydj wrote:
               | I understand that the core similarities are there, but I
               | disagree. The comparisons have been around since I
               | started browsing HN years ago. The moderation on this
               | site, for one, emphasizes constructive conversation and
               | discussion in a way that most subreddits can only dream
               | of.
               | 
               | It also helps that the target audience has been filtered
               | with that moderation, so over time this site (on average)
               | skews more technical and informed.
        
               | ffsm8 wrote:
               | Frankly, no. As an obvious example that can be stated
               | nowadays: musk has always been an over-promising liar.
               | 
               | Eg just look at the 2012+ videos of thunderf00t.
               | 
               | Yet people were literally banned here just for pointing
               | out that he hasn't actually delivered on anything in the
               | capacity he promised until he did the salute.
               | 
                | It's pointless to list other examples, as this page is,
                | as dingnuts pointed out, exactly the same and most
                | people aren't actually willing to change their opinion
               | based on arguments. They're set in their opinions and
               | think everyone else is dumb.
        
               | lcnPylGDnU4H9OF wrote:
               | I don't see how that example refutes their point. It can
               | be true both that there have been disagreeable bans and
               | that the bans, in general, tend to result in higher
               | quality discussions. The disagreeable bans seem to be
               | outliers.
               | 
               | > They're set in their opinions and think everyone else
               | is dumb.
               | 
               | Well, anyway, I read and post comments here because
               | commenters here think critically about discussion topics.
               | It's not a perfect community with perfect moderation but
               | the discussions are of a quality that's hard to find
               | elsewhere, let alone reddit.
        
               | chucksmash wrote:
               | > Yet people were literally banned here just for pointing
               | out that he hasn't actually delivered on anything in the
               | capacity he promised until he did the salute.
               | 
                | I'd be shocked if they (you?) were banned _just_ for
                | critiquing Musk. So please link the post. I'm prepared
                | to be shocked.
               | 
               | I'm also pretty sure that I could make a throwaway
               | account that only posted critiques of Musk (or about any
               | single subject for that matter) and manage to keep it
               | alive by making the critiques timely, on-topic and
               | thoughtful or get it banned by being repetitive and
               | unconstructive. So would you say I was banned for talking
               | about <topic>? Or would you say I was banned for my
               | behavior while talking about <topic>?
        
               | janalsncm wrote:
               | Aside from the fact that I highly doubt anyone was banned
               | as you describe, EM's _stories_ have gotten more and more
               | grandiose. So it's not the same.
               | 
               | Today he's pitching moonshot projects as core to Tesla.
               | 
               | 10 years ago he was saying self-driving was easy, but he
               | was also selling by far the best electric vehicle on the
               | market. So lying about self driving and Tesla semis
               | mattered less.
               | 
               | Fwiw I've been subbed to tf00t since his 50 part
               | creationist videos in early 2010s.
        
               | Freedom2 wrote:
                | This site's commenters attempt to apply technical
                | solutions to social problems, then pat themselves on the
                | back despite their comments being entirely inappropriate
                | to the problem space.
               | 
                | There's also no actual constructive discussion when it
                | comes to future-looking tech. The Cybertruck, Vision
                | Pro, and LLMs are some of the most recent items that the
                | most popular comments called completely wrong, and the
                | reasoning behind those predictions had no actual
                | substance.
        
               | yieldcrv wrote:
                | And the crypto asset discussions are very nontechnical
                | here, veering into elementary and inaccurate
                | philosophical discussions, despite this being a great
                | forum to talk about technical aspects. Every network has
                | pull requests and governance proposals worth discussing,
                | and the deepest discussion here is a take resurrected
                | from 2012 about the entire concept not having a licit
                | use case that the poster could imagine.
        
               | Maxatar wrote:
                | HackerNews isn't exactly like reddit, sure, but it's not
                | much better. People are much better behaved, but still
                | spread a great deal of misinformation.
               | 
               | One way to gauge this property of a community is whether
               | people who are known experts in a respective field
               | participate in it, and unfortunately there are very few
               | of them on HackerNews (this was not always the case).
               | I've had some opportunities to meet with people who are
               | experts, usually at conferences/industry events, and
               | while many of them tend to be active on Twitter... they
               | all say the same things about this site, namely that it's
               | simply full of bad information and the amount of effort
               | needed to dispel that information is significantly higher
               | than the amount of effort needed to spread it.
               | 
               | Next time someone posts an article about a topic you are
               | intimately familiar with, like top 1% subject matter
               | expert in... review the comment section for it and you'll
                | find just heaps of misconceptions, superficial
                | knowledge, and, my favorite, the contrarians who take
                | very strong opinions on a subject they have only passing
                | knowledge of but state their contrarian opinion with
                | such a high degree of confidence.
               | 
               | One issue is you may not actually be a subject matter
               | expert on a topic that comes up a lot on HackerNews, so
               | you won't recognize that this happens... but while people
               | here are a lot more polite and the moderation policies do
               | encourage good behavior... moderation policies don't do a
               | lot to stop the spread of bad information from poorly
               | informed people.
        
               | SamPatt wrote:
               | One of the things I appreciate most about HN is the fact
               | that experts are often found in the comments.
               | 
               | Perhaps we are defining experts differently?
        
               | bb88 wrote:
               | There was a lot of pseudo science being published and
               | voted up in the comments with Ivermectin/HCQ/etc and
               | Covid, when those people weren't experts and before the
               | Ivermectin paper got serious scrutiny.
               | 
                | The other aspect is that people on here think that if
                | they are an expert in one thing, they instantly become
                | an expert in another thing.
        
               | tuwtuwtuwtuw wrote:
               | Do you have any sources to back up those claims?
        
               | ddoolin wrote:
               | It happens in degrees, and the degree here is much lower.
        
               | turtlesdown11 wrote:
                | It's actually the reverse; Dunning-Kruger is off the
                | charts on Hacker News.
        
               | ddoolin wrote:
               | I don't think there's a lot of groupthink or virtue
               | signaling here, and those are the things that irritate me
               | the most. If people here overestimate their knowledge or
               | abilities, that's okay because I don't treat things
               | people say as gospel/fact/truth unless I have clear and
               | compelling reasons to do so. This is the internet after
               | all.
               | 
               | Personally I also think the submissions that make it to
               | the front page(s) are much better than any subreddit.
        
               | bigyabai wrote:
               | I disagree. Reddit users are out to impress nobody but
               | themselves, but the other day I saw someone submit a
               | "Show HN" with AI-generated testimonials.
               | 
               | HN has an active grifter culture reinforced by the VC
               | funding cycles. Reddit can only dream about lying as well
               | as HN does.
        
               | rafaelmn wrote:
               | That's a tangential problem.
               | 
               | HN tends to push up grifter hype slop, and there are a
               | lot of those people around cause VC, but you can still
               | see comments pushing back.
               | 
                | Reading reddit reminds me of high school forum arguments
                | I had 20 years ago, but lower quality because of
               | population selection. It's just too mainstream at this
               | point and shows you what the middle of the bell curve
               | looks like.
        
               | janalsncm wrote:
               | Strongly disagree.
               | 
               | Scroll to the bottom of comment sections on HN, you'll
               | find the kind of low-effort drive-by comments that are
               | usually at the top of Reddit comment sections.
               | 
               | In other words, it helps to have real moderators.
        
               | k__ wrote:
                | While the tone on HN is much more civil than on Reddit,
                | it's still quite the echo chamber.
        
               | mountainriver wrote:
               | Strong disagree as well, this is one of the few places on
               | the Internet which avoids this. I wish there were more
        
               | adolph wrote:
               | > > . . . The groupthink and virtue signaling is intense
               | . . .
               | 
               | > Friend, this website is EXACTLY the same
               | 
               | And it gnows it:
               | https://news.ycombinator.com/item?id=4881042
        
           | ivape wrote:
            | Well, the unfortunate truth is HN has been behind the curve
            | on local LLM discussions, so localllama has been the only
            | one picking up the slack. There are just waaaaaaaay too many
            | "ai is just hype" people here and the grassroots
            | hardware/local LLM discussions have been quite scant.
           | 
           | Like, we're fucking two years in and only now do we have a
           | thread about something like this? The whole crowd here needs
           | to speed up to catch up.
        
             | saltcured wrote:
             | There are people who think LLMs are the future and a
             | sweeping change you must embrace or be left behind.
             | 
             | There are others wondering if this is another hype
             | juggernaut like CORBA, J2EE, WSDL, XML, no-SQL, or who-
             | knows-what. A way to do things that some people treated as
             | the new One True Way, but others could completely bypass
             | for their entire, successful career and look at it now in
             | hindsight with a chuckle.
        
           | Der_Einzige wrote:
           | Lol this is true but also a TON of sampling innovations that
           | are getting their love right now from the AI community (see
           | min_p oral at ICLR 2025) came right from r/localllama so
           | don't be a hater!!!
        
             | rahimnathwani wrote:
             | Poster: https://iclr.cc/media/PosterPDFs/ICLR%202025/30358.
             | png?t=174...
        
           | ijk wrote:
           | LocalLlama is good for:
           | 
           | - Learning basic terms and concepts.
           | 
           | - Learning how to run local inference.
           | 
           | - Inference-level considerations (e.g., sampling).
           | 
           | - Pointers to where to get other information.
           | 
           | - Getting the vibe of where things are.
           | 
           | - Healthy skepticism about benchmarks.
           | 
           | - Some new research; there have been a number of significant
           | discoveries that either originated in LocalLlama or got
           | popularized there.
           | 
           | LocalLlama is bad because:
           | 
           | - Confusing information about finetuning; there's a lot of
           | myths from early experiments that get repeated uncritically.
           | 
           | - Lots of newbie questions get repeated.
           | 
            | - Endless complaints that it's been too long since a new
            | model was released.
           | 
           | - Most new research; sometimes a paper gets posted but most
           | of the audience doesn't have enough background to evaluate
           | the implications of things even if they're highly relevant.
           | I've seen a lot of cutting edge stuff get overlooked because
           | there weren't enough upvoters who understood what they were
           | looking at.
        
           | drillsteps5 wrote:
            | I use it as a discovery tool. If anybody mentions something
            | interesting I go and research it, install it, and start
            | playing with it. I couldn't care less whether they like it
            | or not; I'll form my own opinion.
            | 
            | For example, I find all the comments about model X being
            | more "friendly" or "chatty" and model Y being more
            | "unhinged" or whatever to be mostly BS. There's a gazillion
            | ways a conversation can go and I don't find model X or Y to
            | be consistently chatty or unhinged or creative or whatever
            | every time.
        
         | nico wrote:
         | What do you recommend for coding with aider or roo?
         | 
         | Sometimes it's hard to find models that can effectively use
         | tools
        
           | cchance wrote:
            | I haven't found a good one locally. I use DeepSeek R1 0528;
            | it's slow but free and really good at coding (OpenRouter has
            | it free currently).
        
         | xtracto wrote:
          | There was this great post the other day [1] showing that with
          | llama.cpp you can offload some specific tensors to the CPU
          | and maintain good performance. That's a good way to use
          | large(ish) models on commodity hardware.
          | 
          | Normally with llama.cpp you specify how many (full) layers
          | you want to put on the GPU (-ngl). But CPU-offloading specific
          | tensors that don't require heavy computation saves GPU space
          | without affecting speed that much.
          | 
          | I've also read a paper on loading only "hot" neurons onto the
          | GPU [2]. The future of home AI looks so cool!
         | 
         | [1]
         | https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_of...
         | 
         | [2] https://arxiv.org/abs/2312.12456
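          | 
          | For the simpler layer-level version of this, a minimal Python
          | sketch via llama-cpp-python (the model path and layer count
          | are placeholders; the per-tensor override from [1] is a
          | llama.cpp CLI feature and isn't shown here):
          | 
          |     from llama_cpp import Llama
          | 
          |     llm = Llama(
          |         model_path="qwen3-30b-a3b-q4_k_m.gguf",  # placeholder
          |         n_gpu_layers=24,  # layers on the GPU; rest stay on CPU
          |         n_ctx=8192,
          |     )
          |     out = llm("Explain tensor offloading in one line.",
          |               max_tokens=64)
          |     print(out["choices"][0]["text"])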
        
       | srazzaque wrote:
        | Agree with what others have said: you need to try a few out.
        | But I'd put Qwen3-14B on your list of things to try.
        
       | emson wrote:
        | Good question. I've had some success with Qwen2.5-Coder 14B; I
        | used the quantised version:
        | huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct-GGUF:latest
        | It worked well on my MacBook Pro M1 32GB. It does get a bit hot
        | on a laptop though.
        
       | ProllyInfamous wrote:
       | VEGA64 (8GB) is pretty much obsolete _for this AI stuff, right_
       | (compared to e.g. M2Pro (16GB))?
       | 
       | I'll give Qwen2.5 a try on the Apple Silicon, thanks.
        
       | kekePower wrote:
       | I have an RTX 3070 with 8GB VRAM and for me Qwen3:30B-A3B is fast
       | enough. It's not lightning fast, but more than adequate if you
       | have a _little_ patience.
       | 
        | I've found that Qwen3 is generally really good at following
        | instructions, and you can very easily toggle the reasoning:
        | adding "/no_think" to the prompt turns it off.
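        | 
        | For example, something like this against Ollama's local REST
        | API (the model tag and default port are assumptions; adjust to
        | whatever you've pulled):
        | 
        |     import requests
        | 
        |     r = requests.post(
        |         "http://localhost:11434/api/generate",
        |         json={
        |             "model": "qwen3:30b-a3b",  # assumed tag
        |             "prompt": "Summarise MoE in two sentences. /no_think",
        |             "stream": False,
        |         },
        |     )
        |     print(r.json()["response"])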
       | 
       | The reason Qwen3:30B works so well is because it's a MoE. I have
       | tested the 14B model and it's noticeably slower because it's a
       | dense model.
        
         | tedivm wrote:
         | How are you getting Qwen3:30B-A3B running with 8GB? On my
         | system it takes 20GB of VRAM to launch it.
        
           | fennecfoxy wrote:
            | Probably offload to regular RAM, I'd wager. Or really,
            | really, reaaaaaaally quantized to absolute fuck.
            | Qwen3:30B-A3B Q1 with a 1k Q4 context uses 5.84GB of VRAM.
        
           | kekePower wrote:
            | It offloads to system memory, but since there are "only" 3
            | billion active parameters, it works surprisingly well. I've
            | been able to run models that are up to 29GB in size, albeit
            | very, very slowly, on my system with 32GB RAM.
        
       | dcminter wrote:
        | I think you'll find that on that card most models approaching
        | the 16GB memory size will be more than fast enough and
        | sufficient for chat. You're in the happy position of needing
        | steeper requirements rather than faster hardware! :D
       | 
       | Ollama is the easiest way to get started trying things out IMO:
       | https://ollama.com/
        
         | giorgioz wrote:
          | I found LM Studio so much easier than ollama given it has a
          | UI: https://lmstudio.ai/ Did you know about LM Studio? Why is
          | ollama still recommended given it's just a CLI with worse UX?
        
           | ekianjo wrote:
            | LM Studio is closed source.
        
             | prophesi wrote:
              | Any FOSS solutions that let you browse models and
              | guesstimate for you whether you have enough VRAM to fully
              | load the model? That's the only selling point of LM
              | Studio for me.
             | 
             | Ollama's default context length is frustratingly short in
             | the era of 100k+ context windows.
             | 
             | My solution so far has been to boot up LM Studio to check
             | if a model will work well on my machine, manually download
             | the model myself through huggingface, run llama.cpp, and
             | hook it up to open-webui. Which is less than ideal, and LM
             | Studio's proprietary code has access to my machine specs.
        
               | nickthegreek wrote:
               | https://huggingface.co/docs/accelerate/v0.32.0/en/usage_g
               | uid...
        
               | prophesi wrote:
               | Thanks! That's really helpful.
        
           | dcminter wrote:
           | I recommended ollama because IMO that is the easiest way to
           | get started (as I said).
        
       | spacecadet wrote:
        | Look for something in the 500M-3B parameter range. 3B might
        | push it...
       | 
       | SmolVLM is pretty useful.
       | https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct
        
         | rhdunn wrote:
         | It is feasible to run 7B, 8B models with q6_0 in 8GB VRAM, or
         | q5_k_m/q4_k_m if you have to or want to free up some VRAM for
         | other things. With q4_k_m you can run 10B and even 12B models.
        
       | vladslav wrote:
       | I use Gemma3:12b on a Mac M3 Pro, basically like Grammarly.
        
       | giorgioz wrote:
        | Try out some models with LM Studio: https://lmstudio.ai/ It has
        | a UI, so it's very easy to download models and query them in an
        | interface similar to the ChatGPT app.
        
       | redman25 wrote:
       | Gemma-3-12b-qat https://huggingface.co/google/gemma-3-12b-it-
       | qat-q4_0-gguf
       | 
       | Qwen_Qwen3-14B-IQ4_XS.gguf
       | https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF
       | 
       | Gemma3 is a good conversationalist but tends to hallucinate.
       | Qwen3 is very smart but also very stubborn (not very steerable).
        
       | m4r71n wrote:
       | What is everyone using their local LLMs for primarily? Unless you
       | have a beefy machine, you'll never approach the level of quality
       | of proprietary models like Gemini or Claude, but I'm guessing
       | these smaller models still have their use cases, just not sure
       | what those are.
        
         | ativzzz wrote:
         | I think that the future of local LLMs is delegation. You give
         | it a prompt and it very quickly identifies what should be used
         | to solve the prompt.
         | 
         | Can it be solved locally with locally running MCPs? Or maybe
         | it's a system API - like reading your calendar or checking your
         | email. Otherwise it identifies the best cloud model and sends
         | the prompt there.
         | 
         | Basically Siri if it was good
        
         | Rotundo wrote:
         | Not everyone is comfortable with sending their data and/or
         | questions and prompts to an external party.
        
           | DennisP wrote:
           | Especially now that a court has ordered OpenAI to keep
           | records of it all.
           | 
           | https://www.adweek.com/media/a-federal-judge-ordered-
           | openai-...
        
         | diggan wrote:
          | I'm currently experimenting with Devstral for my own local
          | coding agent that I've slowly been building. It's in many
          | ways nicer than Codex in that 1) it has full access to my
          | hardware, so it can start VMs, make network requests and
          | everything else I can do, which Codex cannot, and 2) it's way
          | faster, both in initial setup, in working through things and
          | in creating a patch.
          | 
          | Of course, it still isn't at the same level as Codex itself;
          | the model Codex is using is just way better, so of course
          | it'll get better results. But Devstral (as I currently use
          | it) is able to make smaller changes and refactors, and I
          | think if I evolve the software a bit more, it can start
          | making larger changes too.
        
           | brandall10 wrote:
           | Why are you comparing it to Codex and not Claude Code, which
           | can do all those things?
           | 
            | And why not just use OpenHands, which it was designed
            | around and which I presume can also do all those things?
        
         | cratermoon wrote:
         | I have a large repository of notes, article drafts, and
         | commonplace book-type stuff. I experimented a year or so ago
         | with a system using RAG to "ask myself" what I have to say
         | about various topics. (I suppose nowadays I would use MCP
         | instead of RAG?) I was not especially impressed by the results
         | with the models I was able to run: long-winded responses full
         | of slop and repetition, irrelevant information pulled in from
         | notes that had some semantically similar ideas, and such. I'm
         | certainly not going to feed the contents of my private
         | notebooks to any of the AI companies.
        
           | notfromhere wrote:
           | You'd still use RAG, just use MCP to more easily connect an
           | LLM to your RAG pipeline
        
             | cratermoon wrote:
             | To clarify: what I was doing was first querying for the
             | documents via a standard document database query and then
             | feeding the best matching documents to the LLM. My
             | understanding is that with MCP I'd delegate the document
             | query from the LLM to the tool.
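              | 
              | Roughly, a minimal sketch of that "query first, then stuff
              | the best matches into the prompt" flow in Python (the
              | naive keyword scoring is just a stand-in for whatever
              | document database or vector store you actually use):
              | 
              |     notes = {
              |         "composting.md": "Notes on soil pH and compost",
              |         "rag-ideas.md": "Draft: RAG over my notebooks",
              |     }
              | 
              |     def retrieve(query: str, k: int = 2) -> list[str]:
              |         q = set(query.lower().split())
              |         ranked = sorted(
              |             notes.items(),
              |             key=lambda kv: len(q & set(kv[1].lower().split())),
              |             reverse=True,
              |         )
              |         return [f"# {n}\n{t}" for n, t in ranked[:k]]
              | 
              |     def build_prompt(query: str) -> str:
              |         ctx = "\n\n".join(retrieve(query))
              |         return (f"Answer using only these notes:\n\n{ctx}"
              |                 f"\n\nQuestion: {query}")
              | 
              |     print(build_prompt("what did I draft about RAG?"))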
        
               | longtimelistnr wrote:
                | As a beginner, I haven't had much luck with
                | embedding-based vector queries either. Firstly, setting
                | it up was a major pain in the ass and I couldn't even
                | get it to ingest anything beyond .txt files. Second,
                | maybe it was my AI system prompt or the lack of outside
                | search capabilities, but unless I was very specific
                | with my query the response was essentially "can't find
                | what you're looking for".
        
         | moffkalast wrote:
         | > unless you have a beefy machine
         | 
         | The average person in r/locallama has a machine that would make
         | r/pcmasterrace users blush.
        
           | rollcat wrote:
           | An Apple M1 is decent enough for LMs. My friend wondered why
           | I got so excited about it when it came out five years ago. It
           | wasn't that it was particularly powerful - it's decent. What
           | it did was to set a new bar for "low end".
        
             | vel0city wrote:
             | A new Mac is easily starting around $1k and quickly goes up
             | from there if you want a storage or RAM upgrade, especially
             | for enough memory to really run some local models. Insane
             | that a $1,000 computer is called "decent" and "low end". My
             | daily driver personal laptop brand new was $300.
        
               | moffkalast wrote:
               | That's fun to hear given that low end laptops are now
               | $800, mid range is like $1.5k and upper end is $3k+ even
               | for non-Apple vendors. Inflation makes fools of us all.
        
               | vel0city wrote:
               | Low end laptops can still easily be found for far less
               | than $800.
               | 
               | https://www.microcenter.com/product/676305/acer-
               | aspire-3-a31...
        
               | Mr-Frog wrote:
               | The first IBM PC in 1981 cost $1,565, which is comparable
               | to $5,500 after inflation.
        
               | evilduck wrote:
               | An M1 Mac is about 5 years old at this point and can be
               | had for far less than a grand.
               | 
               | A brand new Mac Mini M4 is only $499.
        
               | vel0city wrote:
               | Ah, I was focusing on the laptops, my bad. But still its
               | more than $499. Just looked on the Apple store website,
               | Mac Mini M4 starting at $599 (not $499), with only 256GB
               | of storage.
               | 
               | https://www.apple.com/shop/buy-mac/mac-mini/m4
        
               | nickthegreek wrote:
               | microcenter routinely sells that system for $450.
               | 
               | https://www.microcenter.com/product/688173/Mac_mini_MU9D3
               | LL-...
        
               | fennecfoxy wrote:
               | You're right - memory size and then bandwidth is
               | imperative for LLMs. Apple currently lacks great memory
               | bandwidth with their unified memory. But it's not a bad
               | option if you can find one for a good price. The prices
               | for new are just bonkers.
        
               | rollcat wrote:
               | Of course it depends on what you consider "low end" -
               | it's relative to your expectations. I have a G4 TiBook,
               | _the_ definition of a high-end laptop, by 2002 standards.
                | If you consider a $300 laptop a good daily driver, I'll
               | one-up you with this: <https://www.chrisfenton.com/diy-
               | laptop-v2/>
        
               | vel0city wrote:
               | My $300 laptop is a few years old. It has a Ryzen 3 3200U
               | CPU, it has a 14" 1080p display, backlit keyboard. It
               | came with 8GB of RAM and a 128GB SSD, I upgraded to 16GB
               | from RAM acquired by a dumpster dive and a 256GB SSD for
               | like $10 on clearance at Microcenter. I upgraded the WiFi
               | to an Intel AX210 6e for about another $10 off Amazon. It
                | gets 6-8 hours of battery life doing browsing and text
                | editing kinds of workloads.
               | 
               | The only thing that is itching me to get a new machine is
               | it needs a 19V power supply. Luckily it's a pretty common
               | barrel size, I already had several power cables laying
               | around that work just fine. I'd prefer to just have all
               | my portable devices to run off USB-C though.
        
               | pram wrote:
               | I know I speak for everyone that your dumpster laptop is
               | very impressive, give yourself a big pat on the back. You
               | deserve it.
        
         | ozim wrote:
         | You still can get decent stuff out of local ones.
         | 
          | Mostly I use it for testing tools and integrations via API,
          | so as not to spend money on subscriptions. When I see
          | something working I switch to a proprietary one to get the
          | best results.
        
           | nomel wrote:
           | If you're comfortable with the API, all the services provide
           | pay-as-you-go API access that can be much cheaper. I've tried
           | local, but the _time_ cost of getting it to spit out
            | something reasonable wasn't worth the _literal pennies_ the
           | answers from the flagship would cost.
        
             | qingcharles wrote:
             | This. The APIs are so cheap and they are up and running
             | _right now_ with 10x better quality output. Unless whatever
             | you are doing is Totally Top Secret or completely
             | nefarious, then send your prompts to an API.
        
         | barnabee wrote:
         | I generally try a local model first for most prompts. It's good
         | enough surprisingly often (over 50% for sure). Every time I
         | avoid using a cloud service is a win.
        
         | mixmastamyk wrote:
          | Shouldn't the mixture of experts (MoE) approach allow one to
          | conserve memory by working on a specific problem type at a
          | time?
         | 
         | > (MoE) divides an AI model into separate sub-networks (or
         | "experts"), each specializing in a subset of the input data, to
         | jointly perform a task.
        
           | ijk wrote:
           | Sort of, but the "experts" aren't easily divisible in a
           | conceptually interpretable way so the naive understanding of
           | MoE is misleading.
           | 
           | What you typically end up with in memory constrained
           | environments is that the core shared layers are in fast
           | memory (VRAM, ideally) and the rest are in slower memory
           | (system RAM or even a fast SSD).
           | 
           | MoE models are typically very shallow-but-wide in comparison
           | with the dense models, so they end up being faster than an
           | equivalent dense model, because they're ultimately running
           | through fewer layers each token.
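            | 
            | A quick back-of-the-envelope for the Qwen3-30B-A3B case
            | (rounded numbers; the Q4 sizing and the DDR5 bandwidth are
            | assumptions for illustration only):
            | 
            |     total_b, active_b, bits = 30, 3, 4
            |     weights_gb = total_b * bits / 8   # ~15 GB just to store
            |     touched_gb = active_b * bits / 8  # ~1.5 GB read per token
            |     ddr5_gb_s = 50                    # assumed dual-channel
            |     print(weights_gb, touched_gb, ddr5_gb_s / touched_gb)
            | 
            | i.e. roughly 30 tokens/s as an upper bound from system RAM,
            | versus ~3 tokens/s if all 30B were dense and had to be read
            | for every token. That's the gap people are describing.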
        
         | qingcharles wrote:
         | If you look on localllama you'll see most of the people there
         | are really just trying to do NSFW or other questionable or
         | unethical things with it.
         | 
         | The stuff you can run on reasonable home hardware (e.g. a
         | single GPU) isn't going to blow your mind. You can get pretty
         | close to GPT3.5, but it'll feel dated and clunky compared to
         | what you're used to.
         | 
         | Unless you have already spent big $$ on a GPU for gaming, I
         | really don't think buying GPUs for home makes sense,
         | considering the hardware and running costs, when you can go to
         | a site like vast.ai and borrow one for an insanely cheap amount
         | to try it out. You'll probably get bored and be glad you didn't
         | spend your kids' college fund on a rack of H100s.
        
         | teleforce wrote:
          | This is an excellent example of a local LLM application [1].
         | 
         | It's an AI-driven chat system designed to support students in
         | the Introduction to Computing course (ECE 120) at UIUC,
         | offering assistance with course content, homework, or
         | troubleshooting common problems.
         | 
         | It serves as an educational aid integrated into the course's
         | learning environment using UIUC Illinois Chat system [2].
         | 
          | Personally I've found it really useful that it provides the
          | relevant portions of the course study materials, for example
          | the slides directly related to the discussion, so the
          | students can check the sources and veracity of the answers
          | provided by the LLM.
          | 
          | It seems to me that RAG is the killer feature for local LLMs
          | [3]. It directly addresses the main pain point of LLM
          | hallucinations and helps LLMs stick to the facts.
         | 
         | [1] Introduction to Computing course (ECE 120) Chatbot:
         | 
         | https://www.uiuc.chat/ece120/chat
         | 
         | [2] UIUC Illinois Chat:
         | 
         | https://uiuc.chat/
         | 
         | [3] Retrieval-augmented generation [RAG]:
         | 
         | https://en.wikipedia.org/wiki/Retrieval-augmented_generation
        
           | staticcaucasian wrote:
            | Does this actually need to be local? Since the chatbot is
            | open to the public and I assume the course material used
            | for RAG on this page
            | (https://canvas.illinois.edu/courses/54315/pages/exam-
            | schedul...) all stays freely accessible - I clicked a few
            | links without being a student - I assume a pre-prompted
            | larger non-local LLM would outperform the local instance.
            | Though you can imagine an equivalent course with all of its
            | content ACL-gated/'paywalled' could benefit from local RAG,
            | I guess.
        
         | ijk wrote:
         | General local inference strengths:
         | 
         | - Experiments with inference-level control; can't do the
         | Outlines / Instructor stuff with most API services, can't do
         | the advanced sampling strategies, etc. (They're catching up but
         | they're 12 months behind what you can do locally.)
         | 
         | - Small, fast, finetuned models; _if you know what your domain
         | is sufficiently to train a model you can outperform everything
         | else_. General models _usually_ win, if only due to ease of
         | prompt engineering, but not always.
         | 
         | - Control over which model is being run. Some drift is
         | inevitable as your real-world data changes, but when your model
         | is also changing underneath you it can be harder to build
         | something sustainable.
         | 
         | - More control over costs; this is the classic on-prem versus
         | cloud decision. Most cases you just want to pay for the
         | cloud...but we're not in ZIRP anymore and having a predictable
         | power bill can trump sudden unpredictable API bills.
         | 
         | In general, the move to cloud services was originally a cynical
         | OpenAI move to keep GPT-3 locked away. They've built up a bunch
         | of reasons to prefer the in-cloud models (heavily subsidized
         | fast inference, the biggest and most cutting edge models, etc.)
         | so if you need the latest and greatest right now and are
         | willing to pay, it's probably the right business move for most
         | businesses.
         | 
         | This is likely to change as we get models that can reasonably
         | run on edge devices; right now it's hard to build an app or a
         | video game that incidentally uses LLM tech because user revenue
         | is unlikely to exceed inference costs without a lot of careful
         | planning or a subscription. Not impossible, but definitely adds
         | business challenges. Small models running on end-user devices
         | opens up an entirely new level of applications in terms of
         | cost-effectiveness.
         | 
         | If you need the right answer, sometimes only the biggest cloud
         | API model is acceptable. If you've got some wiggle room on
         | accuracy and can live with sometimes getting a substandard
         | response, then you've got a lot more options. The trick is that
         | the things that an LLM is best at are always going to be things
         | where less than five nines of reliability are acceptable, so
          | even though the biggest models have more reliability on
          | average, there are many tasks where you might be just fine with
         | a small fast model that you have more control over.
        
         | drillsteps5 wrote:
         | I avoid using cloud whenever I can on principle. For instance,
         | OpenAI recently indicated that they are working on some social
         | network-like service for ChatGPT users to share their chats.
         | 
         | Running it locally helps me understand how these things work
         | under the hood, which raises my value on the job market. I also
         | play with various ideas which have LLM on the backend (think
         | LLM-powered Web search, agents, things of that nature), I don't
         | have to pay cloud providers, and I already had a gaming rig
         | when LLaMa was released.
        
       | jaggs wrote:
       | hf.co/bartowski/deepseek-ai_DeepSeek-R1-0528-Qwen3-8B-GGUF:Q6_K
       | is a decent performing model, if you're not looking for blinding
       | speed. It definitely ticks all the boxes in terms of model
       | quality. Try a smaller quant if you need more speed.
        
       | cowpig wrote:
       | Ollama[0] has a collection of models that are either already
       | small or quantized/distilled, and come with hyperparameters that
       | are pretty reasonable, and they make it easy to try them out. I
       | recommend you install it and just try a bunch because they all
       | have different "personalities", different strengths and
       | weaknesses. My personal go-tos are:
       | 
       | Qwen3 family from Alibaba seem to be the best reasoning models
       | that fit on local hardware right now. Reasoning models on local
       | hardware are annoying in contexts where you just want an
       | immediate response, but vastly outperform non-reasoning models on
       | things where you want the model to be less naive/foolish.
       | 
       | Gemma3 from google is really good at intuition-oriented stuff,
       | but with an obnoxious HR Boy Scout personality where you
       | basically have to add "please don't add any disclaimers" to the
       | system prompt for it to function. Like, just tell me how long you
       | think this sprain will take to heal, I already know you are not a
       | medical professional, jfc.
       | 
       | Devstral from Mistral performs the best on my command line
       | utility where I describe the command I want and it executes that
       | for me (e.g. give me a 1-liner to list the dotfiles in this
       | folder and all subfolders that were created in the last month).
       | 
        | Nemo from Mistral, I have heard (but not tested), is really
        | good for routing-type jobs, where you need something to make a
        | simple multiple-choice decision competently with low latency,
        | and it is easy to fine-tune if you want to get that
        | sophisticated.
       | 
       | [0] https://ollama.com/search
        
       | unethical_ban wrote:
       | Phi-4 is scared to talk about anything controversial, as if
       | they're being watched.
       | 
       | I asked it a question about militias. It thought for a few pages
       | about the answer and whether to tell me, then came back with "I
       | cannot comply".
       | 
        | Nidum is the name of an uncensored Gemma; it does a good job
        | most of the time.
        
       | emmelaich wrote:
        | Generally speaking, how can you tell how much VRAM a model will
        | take? It seems to be a valuable bit of data which is missing
        | from downloadable model (GGUF) files.
        
         | omneity wrote:
          | Very roughly, you can consider the Bs of a model as GBs of
          | memory; then it depends on the quantization level. Say for an
          | 8B model:
         | - FP16: 2x 8GB = 16GB
         | 
         | - Q8: 1x 8GB
         | 
         | - Q4: 0.5x 8GB = 4GB
         | 
         | It doesn't 100% neatly map like this but this gives you a rough
          | measure. On top of this you need some more memory depending on
         | the context length and some other stuff.
         | 
         | Rationale for the calculation above: a model is basically
         | billions of variables, each holding a floating-point value. So
         | the size of a model roughly maps to the number of variables
         | (weights) x the precision of each variable (4, 8, 16 bits...).
         | 
         | You don't have to quantize all layers to the same precision,
         | which is why you sometimes see fractional quantizations like
         | 1.58 bits.
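         | 
         | A rough sketch of that back-of-the-envelope estimate in
         | Python (the flat overhead number is just a guess; real usage
         | also depends on context length, KV cache and the runtime):
         | 
         |   def estimate_vram_gb(params_billions, bits_per_weight,
         |                        overhead_gb=1.5):
         |       # weights only: 1B params at 8 bits is ~1 GB
         |       weights_gb = params_billions * bits_per_weight / 8
         |       return weights_gb + overhead_gb
         | 
         |   # e.g. an 8B model at FP16, Q8 and Q4
         |   for bits in (16, 8, 4):
         |       print(f"{bits}-bit: ~{estimate_vram_gb(8, bits):.1f} GB")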
        
           | rhdunn wrote:
           | The 1.58bit quantization is using 3 values -- -1, 0, 1. The
           | bits number comes from log_2(3) = 1.58....
           | 
           | For that level you can pack 4 weights into a byte, using 2
           | bits per weight. However, one of the four bit configurations
           | in each 2-bit slot is unused.
           | 
           | More complex packing arrangements are done by grouping
           | weights together (e.g. a group of 3) and assigning a bit
           | configuration to each combination of values in a lookup
           | table. This allows greater compression, closer to the
           | theoretical 1.58 bits.
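           | 
           | A minimal illustration of the simple 2-bits-per-weight
           | packing (not what real ternary kernels do, but it shows the
           | unused code point):
           | 
           |   def pack4(w):
           |       # pack 4 ternary weights (-1, 0, 1) into one byte
           |       assert len(w) == 4 and all(v in (-1, 0, 1) for v in w)
           |       b = 0
           |       for i, v in enumerate(w):
           |           b |= (v + 1) << (2 * i)  # -1/0/1 -> 0/1/2; code 3 unused
           |       return b
           | 
           |   def unpack4(b):
           |       return [((b >> (2 * i)) & 0b11) - 1 for i in range(4)]
           | 
           |   print(unpack4(pack4([-1, 0, 1, 1])))  # [-1, 0, 1, 1]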
        
         | fennecfoxy wrote:
         | Depends on quantization etc. But there are good calculators
         | that will account for your KV cache etc. as well:
         | https://apxml.com/tools/vram-calculator.
        
       | binary132 wrote:
       | I've had awesome results with Qwen3-30B-A3B compared to other
       | local LMs I've tried. Still not crazy good but a lot better and
       | very fast. I have 24GB of VRAM though.
        
       | tiahura wrote:
       | What about for a 5090?
        
         | sgt wrote:
         | Comes with 32GB VRAM right?
         | 
         | Speaking of, would a Ryzen 9 12 core be nice for a 5090 setup?
         | 
         | Or should one really go dual 5090?
        
       | mrbonner wrote:
       | Has anyone had a chance to try a local LLaMA on the new AMD AI
       | Max+ with 128GB of unified RAM?
        
       | nosecreek wrote:
       | Related question: what is everyone using to run a local LLM? I'm
       | using Jan.ai and it's been okay. I also see OpenWebUI mentioned
       | quite often.
        
         | op00to wrote:
         | LMStudio, and sometimes AnythingLLM.
        
         | Havoc wrote:
         | LM Studio if you just want an app. OpenWebUI is just a front
         | end; you'd need either llama.cpp or vllm behind it to serve
         | the model.
        
         | fennecfoxy wrote:
         | KoboldCPP + SillyTavern, has worked the best for me.
        
       | Havoc wrote:
       | It's a bit like asking what flavour of icecream is the best. Try
       | a few and see.
       | 
       | For 16GB and speed you could try Qwen3-30B-A3B with some offload
       | to system RAM, or use a dense model, probably a 14B quant.
        
       | fennecfoxy wrote:
       | Basic conversations are essentially RP, I suppose. You can look
       | at the KoboldCPP or SillyTavern subreddits.
       | 
       | I was trying Patricide unslop mell and some of the Qwen ones
       | recently. Up to a point more params is better than worrying about
       | quantization. But eventually you'll hit a compute wall with high
       | params.
       | 
       | KV cache quantization is awesome (I use q4 for a 32k context with
       | a 1080ti!) and context shifting is also awesome for long
       | conversations/stories/games. I was using ooba but found recently
       | that KoboldCPP not only runs faster for the same model/settings
       | but also Kobold's context shifting works much more consistently
       | than Ooba's "streaming_llm" option, which almost always re-
       | evaluates the prompt when hooked up to something like ST.
        
       | sabareesh wrote:
       | This is what I have: https://sabareesh.com/posts/llm-rig/ ("All
       | You Need is 4x 4090 GPUs to Train Your Own Model")
        
         | yapyap wrote:
         | Four 4090s is easily $8,000, nothing to scoff at IMO.
        
           | ge96 wrote:
           | Imagine some SLI 16 x 1080
        
         | dr_kiszonka wrote:
         | Could you explain what your use case is for training 1B
         | models? Learning, or perhaps fine-tuning?
        
           | sabareesh wrote:
           | Learning and prototyping, then scaling it into the cloud. It
           | can also be used as an inference engine to train another
           | model if you are using a model as a judge for RL.
        
       | arh68 wrote:
       | Wow, a 5060 Ti. 16GB VRAM + I'm guessing >=32GB RAM. And here I
       | am spinning Ye Olde RX 570 4GB + 32GB.
       | 
       | I'd like to know how many tokens you can get out of the larger
       | models especially (using Ollama + Open WebUI on Docker Desktop,
       | or LM Studio, whatever). I'm probably not upgrading GPU this
       | year, but I'd appreciate an anecdotal benchmark.
       | 
       |   - gemma3:12b
       |   - phi4:latest (14b)
       |   - qwen2.5:14b [I get ~3 t/s on all these small models,
       |     acceptably slow]
       |   - qwen2.5:32b [this is about my machine's limit; verrry slow,
       |     ~1 t/s]
       |   - qwen2.5:72b [beyond my machine's limit, but maybe not yours]
        
         | diggan wrote:
         | I'm guessing you probably also want to include the quantization
         | levels you're using, as otherwise there'll be a huge variance
         | in your comparisons with others :)
        
           | arh68 wrote:
           | True, true. All Q4_K_M unless I'm mistaken. Thanks
        
       | reissbaker wrote:
       | At 16GB a Q4 quant of Mistral Small 3.1, or Qwen3-14B at FP8,
       | will probably serve you best. You'd be cutting it a little close
       | on context length due to the VRAM usage... If you want longer
       | context, a Q4 quant of Qwen3-14B will be a bit dumber than FP8
       | but will leave you more breathing room. Mistral Small can take
       | images as input, and Qwen3 will be a bit better at math/coding;
       | YMMV otherwise.
       | 
       | Going below Q4 isn't worth it IMO. If you want significantly more
       | context, probably drop down to a Q4 quant of Qwen3-8B rather than
       | continuing to lobotomize the 14B.
       | 
       | Some folks have been recommending Qwen3-30B-A3B, but I think 16GB
       | of VRAM is probably not quite enough for that: at Q4 you'd be
       | looking at 15GB for the weights alone. Qwen3-14B should be pretty
       | similar in practice though despite being lower in param count,
       | since it's a dense model rather than a sparse one: dense models
       | are generally smarter-per-param than sparse models, but somewhat
       | slower. Your 5060 should be plenty fast enough for the 14B as
       | long as you keep everything on-GPU and stay away from CPU
       | offloading.
       | 
       | Since you're on a Blackwell-generation Nvidia chip, using LLMs
       | quantized to NVFP4 specifically will provide some speed
       | improvements at some quality cost compared to FP8 (and will be
       | faster than Q4 GGUF, although ~equally dumb). Ollama doesn't
       | support NVFP4 yet, so you'd need to use vLLM (which isn't too
       | hard, and will give better token throughput anyway). Finding pre-
       | quantized models at NVFP4 will be more difficult since there's
       | less-broad support, but you can use llmcompressor [1] to
       | statically compress any FP16 LLM to NVFP4 locally -- you'll
       | probably need to use accelerate to offload params to CPU during
       | the one-time compression process, which llmcompressor has
       | documentation for.
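       | 
       | For reference, serving whatever NVFP4 checkpoint you end up with
       | in vLLM is only a few lines; a minimal sketch (the model path is
       | a placeholder, and vLLM picks up the quantization scheme from
       | the checkpoint's own config):
       | 
       |   from vllm import LLM, SamplingParams
       | 
       |   # placeholder path to a checkpoint compressed with llmcompressor
       |   llm = LLM(model="./Qwen3-14B-NVFP4", max_model_len=8192)
       |   params = SamplingParams(temperature=0.7, max_tokens=256)
       |   out = llm.generate(["Explain KV cache quantization."], params)
       |   print(out[0].outputs[0].text)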
       | 
       | I wouldn't reach for this particular power tool until you've
       | decided on an LLM already, and just want faster perf, since it's
       | a bit more involved than just using ollama and the initial
       | quantization process will be slow due to CPU offload during
       | compression (albeit it's only a one-time cost). But if you land
       | on a Q4 model, it's not a bad choice once you have a favorite.
       | 
       | 1: https://github.com/vllm-project/llm-compressor
        
       | FuriouslyAdrift wrote:
       | I've had good luck with GPT4All (Nomic) and either reason v1
       | (Qwen 2.5 - Coder 7B) or Llama 3 8B Instruct.
        
       | DiabloD3 wrote:
       | I'd suggest buying a better GPU, only because all the models you
       | want need a 24GB card. Nvidia... more or less robbed you.
       | 
       | That said, try Unsloth's version of Qwen3 30B, running via
       | llama.cpp (don't waste your time with any other inference
       | engine), with the following arguments (documented in Unsloth's
       | docs, but sometimes hard to find): `--threads (number of threads
       | your CPU has) --ctx-size 16384 --n-gpu-layers 99
       | -ot ".ffn_.*_exps.=CPU" --seed 3407 --prio 3 --temp 0.6
       | --min-p 0.0 --top-p 0.95 --top-k 20` along with the other
       | arguments you need.
       | 
       | Qwen3 30B: https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF
       | (since you have 16GB, grab Q3_K_XL: it fits in VRAM and leaves
       | about 3-4GB for the other apps on your desktop and the other
       | allocations llama.cpp needs to make).
       | 
       | Also, why 30B and not the full fat 235B? You don't have 120-240GB
       | of VRAM. The 14B and smaller ones are also not what you want: more
       | parameters are better, parameter precision is vastly less
       | important (which is why Unsloth has their specially crafted
       | <=2bit versions that are 85%+ as good, yet are ridiculously tiny
       | in comparison to their originals).
       | 
       | Full Qwen3 writeup here: https://unsloth.ai/blog/qwen3
        
         | bigyabai wrote:
         | > only because all the models you want need a 24GB card
         | 
         | ???
         | 
         | Just run a q4 quant of the same model and it will fit no
         | problem.
        
           | DiabloD3 wrote:
           | Q4_K_M is the "default" for a lot of models on HF, and they
           | generally require ~20GB of VRAM to run. It will not fit
            | entirely on a 16GB card. You want about 3-4GB of VRAM on top
            | of what the model requires.
           | 
           | A back of the envelope estimate of specifically
           | unsloth/Qwen3-30B-A3B-128K-GGUF is 18.6GB for Q4_K_M.
        
       | antirez wrote:
       | The largest Gemma 3 and Qwen 3 you can run. Offload to RAM as
       | many layers as you can.
        
       | lxe wrote:
       | There's a new one every day it seems. Follow
       | https://x.com/reach_vb from huggingface.
        
       | PhilippGille wrote:
       | Haven't seen Mozilla's LocalScore [1] mentioned in the comments
       | yet. It's exactly made for the purpose of finding out how well
       | different models run on different hardware.
       | 
       | [1] https://www.localscore.ai/
        
       | janalsncm wrote:
       | People ask this question a lot and annoyingly the answer is:
       | there are many definitions of "best". Speed, capabilities (e.g.
       | do you want it to be able to handle images or just text?),
       | quality, etc.
       | 
       | It's like asking what the best pair of shoes is.
       | 
       | Go on Ollama and look at the most popular models. You can decide
       | for yourself what you value.
       | 
       | And start small, these things are GBs in size so you don't want
       | to wait an hour for a download only to find out a model runs at 1
       | token / second.
        
       | speedgoose wrote:
       | My personal preference this month is the biggest Gemma3 you can
       | fit on your hardware.
        
       | arnaudsm wrote:
       | Qwen3 family (and the R1 qwen3-8b distill) is #1 in programming
       | and reasoning.
       | 
       | However it's heavily censored on political topics because of its
       | Chinese origin. For world knowledge, I'd recommend Gemma3.
       | 
       | This post will be outdated in a month. Check https://livebench.ai
       | and https://aider.chat/docs/leaderboards/ for up-to-date
       | benchmarks.
        
         | the_sleaze_ wrote:
         | > This post will be outdated in a month
         | 
         | The pace of change is mind boggling. Not only for the models
         | but even the tools to put them to use. Routers, tools, MCP,
         | streaming libraries, SDKs...
         | 
         | Do you have any advice for someone who is interested,
         | developing alone without coworkers or meetups around, and who
         | wants to be able to do discovery and stay up to date?
        
       | nickdothutton wrote:
       | I find Ollama + TypingMind (or similar interface) to work well
       | for me. As for which models, I think this is changing from one
       | month to the next (perhaps not quite that fast). We are in that
       | kind of period. You'll need to make sure the model layers fit in
       | VRAM.
        
       | nsxwolf wrote:
       | I'm running llama3.2 out of the box on my 2013 Mac Pro, the low
       | end quad core Xeon one, with 64GB of RAM.
       | 
       | It's slow-ish but still useful, getting 5-10 tokens per second.
        
       | nurettin wrote:
       | Pretty much all Q4 models on huggingface fit in consumer-grade
       | cards.
        
       | depingus wrote:
       | Captain Eris Violet 12B fits those requirements.
        
       | drillsteps5 wrote:
       | I concur with the LocalLLaMA subreddit recommendation. Not in
       | terms of choosing "the best model", but to answer questions, find
       | guides, the latest news and gossip, names of the tools, various
       | models and how they stack up against each other, etc.
       | 
       | There's no one "best" model, you just try a few and play with
       | parameters and see which one fits your needs the best.
       | 
       | Since you're on HN, I'd recommend skipping Ollama and LM Studio.
       | They might restrict access to the latest models, and you
       | typically only choose from the ones they tested with. Besides,
       | what kind of fun is that when you don't get to peek under the
       | hood?
       | 
       | llamacpp can do a lot itself, and you can run most recently
       | released models (when changes are needed, they adjust literally
       | within a few days). You can get models from huggingface,
       | obviously. I prefer the GGUF format; it saves me some memory (you
       | can use lower quantization; I find most 6-bit quants somewhat
       | satisfactory).
       | 
       | I find that the size of the model's GGUF file will roughly tell
       | me if it'll fit in my VRAM. For example, a 24GB GGUF model will
       | NOT fit in 16GB, whereas a 12GB one likely will. However, the
       | more context you add, the more RAM will be needed.
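       | 
       | That heuristic is easy to script; a crude sketch (the headroom
       | number is just a guess, and the path is a placeholder):
       | 
       |   import os
       | 
       |   def fits_in_vram(gguf_path, vram_gb, headroom_gb=2.0):
       |       # file size plus some headroom for context/overheads
       |       size_gb = os.path.getsize(gguf_path) / 1e9
       |       return size_gb + headroom_gb <= vram_gb
       | 
       |   # e.g. fits_in_vram("model-Q6_K.gguf", vram_gb=16)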
       | 
       | Keep in mind that models are trained with a certain context
       | window. If a model has an 8K context (like most older models do)
       | and you load it with a 32K context, it won't be much help.
       | 
       | You can run llamacpp on Linux, Windows, or macOS; you can get the
       | binaries or compile it locally. It can split the model between
       | VRAM and RAM (if the model doesn't fit in your 16GB). It even has
       | a simple React front-end (llamacpp-server). The same module
       | provides a REST service with a protocol similar to (but simpler
       | than) OpenAI's and all the other "big" guys'.
       | 
       | Since it implements the OpenAI REST API, it also works with a lot
       | of front-end tools if you want more functionality (e.g. oobabooga,
       | aka text-generation-webui).
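       | 
       | For example, a minimal sketch of talking to a local llamacpp
       | server with the openai Python client (assuming the default port
       | 8080; adjust base_url to wherever your server listens):
       | 
       |   from openai import OpenAI
       | 
       |   # the server doesn't check the key unless you configure one
       |   client = OpenAI(base_url="http://localhost:8080/v1",
       |                   api_key="local")
       | 
       |   resp = client.chat.completions.create(
       |       model="local",  # most local servers ignore this field
       |       messages=[{"role": "user", "content": "Say hi briefly."}],
       |   )
       |   print(resp.choices[0].message.content)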
       | 
       | Koboldcpp is another backend you can try if you find llamacpp to
       | be too raw (I believe it's still llamacpp under the hood).
        
       ___________________________________________________________________
       (page generated 2025-05-30 23:01 UTC)