[HN Gopher] Ask HN: What is the best LLM for consumer grade hard...
___________________________________________________________________
Ask HN: What is the best LLM for consumer grade hardware?
I have a 5060ti with 16GB VRAM. I'm looking for a model that can
hold basic conversations, no physics or advanced math required.
Ideally something that can run reasonably fast, near real time.
Author : VladVladikoff
Score : 191 points
Date : 2025-05-30 11:02 UTC (11 hours ago)
| btreecat wrote:
| I only have 8gb of vram to work with currently, but I'm running
 | OpenWebUI as a frontend to Ollama and I have a very easy time
| loading up multiple models and letting them duke it out either at
| the same time or in a round robin.
|
| You can even keep track of the quality of the answers over time
| to help guide your choice.
|
| https://openwebui.com/
| rthnbgrredf wrote:
   | Be aware of the recent license change of "Open"WebUI. It is
| no longer open source.
| lolinder wrote:
| Thanks, somehow I missed that.
|
| https://docs.openwebui.com/license/
| nicholasjarnold wrote:
| AMD 6700XT owner here (12Gb VRAM) - Can confirm.
|
| Once I figured out my local ROCm setup Ollama was able to run
| with GPU acceleration no problem. Connecting an OpenWebUI
| docker instance to my local Ollama server is as easy as a
| docker run command where you specify the OLLAMA_BASE_URL env
| var value. This isn't a production setup, but it works nicely
| for local usages like what the immediate parent is describing.
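   |
   | If you'd rather script against that same local Ollama server
   | instead of (or alongside) OpenWebUI, a minimal sketch in Python
   | looks like the following. It assumes Ollama's default port 11434
   | and a model you've already pulled (llama3.2 here is just an
   | example):
   |
   |     # Query a local Ollama server over its HTTP API.
   |     import requests
   |
   |     resp = requests.post(
   |         "http://localhost:11434/api/chat",
   |         json={
   |             "model": "llama3.2",  # any model you've pulled locally
   |             "messages": [{"role": "user",
   |                           "content": "Say hello in one sentence."}],
   |             "stream": False,  # one JSON response instead of a stream
   |         },
   |         timeout=120,
   |     )
   |     print(resp.json()["message"]["content"])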
| benterix wrote:
| I'm afraid that 1) you are not going to get a definite answer, 2)
| an objective answer is very hard to give, 3) you really need to
| try a few most recent models on your own and give them the tasks
 | that seem most useful/meaningful to you. There is a drastic
 | difference in output quality depending on the task type.
| kouteiheika wrote:
| If you want to run LLMs locally then the localllama community is
| your friend: https://old.reddit.com/r/LocalLLaMA/
|
| In general there's no "best" LLM model, all of them will have
| some strengths and weaknesses. There are a bunch of good picks;
| for example:
|
| > DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-
| ai/DeepSeek-R1-0528-Qwen3-8B
|
| Released today; probably the best reasoning model in 8B size.
|
| > Qwen3 -
| https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...
|
| Recently released. Hybrid thinking/non-thinking models with
| really great performance and plethora of sizes for every
| hardware. The Qwen3-30B-A3B can even run on CPU with acceptable
| speeds. Even the tiny 0.6B one is somewhat coherent, which is
| crazy.
| ignoramous wrote:
   | > _DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-
| ai/DeepSeek-R1-0528-Qwen3-8B ... Released today; probably the
| best reasoning model in 8B size._ ... we
| distilled the chain-of-thought from DeepSeek-R1-0528 to post-
| train Qwen3-8B Base, obtaining DeepSeek-R1-0528-Qwen3-8B ... on
| AIME 2024, surpassing Qwen3-8B by +10.0% & matching the
| performance of Qwen3-235B-thinking.
|
| Wild how effective distillation is turning out to be. No
| wonder, most shops have begun to "hide" CoT now:
| https://news.ycombinator.com/item?id=41525201
| bn-l wrote:
| > Beyond its improved reasoning capabilities, this version
| also offers a reduced hallucination rate, enhanced support
| for function calling, and better experience for vibe coding.
|
| Thank you for thinking of the vibe coders.
| luke-stanley wrote:
| > Released today; probably the best reasoning model in 8B size.
|
| Actually DeepSeek-R1-0528-Qwen3-8B was uploaded Thursday
| (yesterday) at 11 AM UTC / 7 PM CST. I had to check if a new
| version came out since! I am waiting for the other sizes! ;D
| ivape wrote:
| I'd also recommend you go with something like 8b, so you can
| have the other 8GB of vram for a decent sized context window.
| There's tons of good 8b ones, as mentioned above. If you go for
| the largest model you can fit, you'll have slower inference (as
| you pass in more tokens) and smaller context.
| svachalek wrote:
     | 8B is the number of parameters. The most common quant is 4
     | bits per parameter, so 8B params is roughly 4GB of VRAM.
| (Typically more like 4.5GB)
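     |
     | As a quick back-of-the-envelope check (the overhead on top is a
     | rough assumption, not an exact figure):
     |
     |     params = 8e9  # an "8B" model
     |     bits = 4      # typical Q4-style quantization
     |     print(params * bits / 8 / 1e9, "GB for weights alone")  # 4.0
     |     # real GGUF files run a bit larger due to quant metadata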
| rhdunn wrote:
| The number of quantized bits is a trade off between size
| and quality. Ideally you should be aiming for a 6-bit or
| 5-bit model. I've seen some models be unstable at 4-bit
| (where they will either repeat words or start generating
| random words).
|
| Anything below 4-bits is usually not worth it unless you
| want to experiment with running a 70B+ model -- though I
| don't have any experience of doing that, so I don't know
| how well the increased parameter size balances the
| quantization.
|
       | See https://github.com/ggml-org/llama.cpp/pull/1684 and
       | https://gist.github.com/Artefact2/b5f810600771265fc1e3944228...
       | for comparisons between quantization levels.
| kouteiheika wrote:
| > The number of quantized bits is a trade off between
| size and quality. Ideally you should be aiming for a
| 6-bit or 5-bit model. I've seen some models be unstable
| at 4-bit (where they will either repeat words or start
| generating random words).
|
| Note that that's a skill issue of whoever quantized the
| model. In general quantization even as low as 3-bit can
         | be almost lossless when you do quantization-aware
| finetuning[1] (and apparently you don't even need that
| many training tokens), but even if you don't want to do
| any extra training you can be smart as to which parts of
| the model you're quantizing and by how much to minimize
| the damage (e.g. in the worst case over-quantizing even a
| _single_ weight can have disastrous consequences[2])
|
| Some time ago I ran an experiment where I finetuned a
| small model while quantizing parts of it to 2-bits to see
| which parts are most sensitive (the numbers are the final
         | loss; lower is better):
         |
         |     1.5275  mlp.downscale
         |     1.5061  mlp.upscale
         |     1.4665  mlp.gate
         |     1.4531  lm_head
         |     1.3998  attn.out_proj
         |     1.3962  attn.v_proj
         |     1.3794  attn.k_proj
         |     1.3811  input_embedding
         |     1.3662  attn.q_proj
         |     1.3397  unquantized baseline
|
| So as you can see quantizing some parts of the model
| affects it more strongly. The downprojection in the MLP
| layers is the most sensitive part of the model (which
| also matches with what [2] found), so it makes sense to
| quantize this part of the model less and instead quantize
| other parts more strongly. But if you'll just do the
| naive "quantize everything in 4-bit" then sure, you might
| get broken models.
|
         | [1] - https://arxiv.org/pdf/2502.02631
         |
         | [2] - https://arxiv.org/pdf/2411.07191
| rhdunn wrote:
| Interesting. I was aware of using an imatrix for the
| i-quants but didn't know you could use them for k-quants.
| I've not experimented with using imatrices in my local
| setup yet.
|
| And it's not a skill issue... it's the default
| behaviour/logic when using k-quants to quantize a model
| with llama.cpp.
| diggan wrote:
| I think your recommendation falls within
|
| > all of them will have some strengths and weaknesses
|
| Sometimes a higher parameter model with less quantization and
| low context will be the best, sometimes lower parameter model
| with some quantization and huge context will be the best,
| sometimes high parameter count + lots of quantization +
| medium context will be the best.
|
| It's really hard to say one model is better than another in a
| general way, since it depends on so many things like your use
| case, the prompts, the settings, quantization, quantization
| method and so on.
|
| If you're building/trying to build stuff depending on LLMs in
| any capacity, the first step is coming up with your own
| custom benchmark/evaluation that you can run with your
| specific use cases being put under test. Don't share this
| publicly (so it doesn't end up in the training data) and run
| it in order to figure out what model is best for that
| specific problem.
| evilduck wrote:
| With a 16GB GPU you can comfortably run like Qwen3 14B or
| Mistral Small 24B models at Q4 to Q6 and still have plenty of
| context space and get much better abilities than an 8B model.
| bee_rider wrote:
| I'm curious (as someone who knows nothing about this
| stuff!)--the context window is basically a record of the
| conversation so far and other info that isn't part of the
| model, right?
|
| I'm a bit surprised that 8GB is useful as a context window if
| that is the case--it just seems like you could fit a ton of
| research papers, emails, and textbooks in 2GB, for example.
|
| But, I'm commenting from a place of ignorance and curiosity.
| Do models blow up the info in the context window, maybe do
| some processing to pre-digest it?
| hexomancer wrote:
| Yes, every token is expanded into a vector that can be many
       | thousands of dimensions. The vectors are stored for every
| token and every layer.
|
       | You absolutely cannot fit even a single research paper in
       | 2 GB, much less an entire book.
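       |
       | To put rough numbers on it, here's a sketch using an assumed
       | Llama-3-8B-style config (32 layers, 8 KV heads, head dim 128,
       | fp16 cache); the exact figures vary by model and by KV-cache
       | quantization:
       |
       |     # Rough KV-cache estimate; config values are assumptions.
       |     layers, kv_heads, head_dim = 32, 8, 128
       |     bytes_per_value = 2  # fp16 cache; q8/q4 caches shrink this
       |     per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
       |     paper_tokens = 15_000  # a longish paper, very roughly
       |     print(per_token / 1024, "KiB per token")           # 128.0
       |     print(per_token * paper_tokens / 1e9, "GB total")  # ~2.0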
| datameta wrote:
| Can system RAM be used for context (albeit at lower parsing
| speeds)?
| moffkalast wrote:
| Yes at this point it's starting to become almost a matter of
| how much you like the model's personality since they're all
| fairly decent. OP just has to start downloading and trying them
| out. With 16GB one can do partial DDR5 offloading with
| llama.cpp and run anything up to about 30B (even dense) or even
| more at a "reasonable" speed for chat purposes. Especially with
| tensor offload.
|
| I wouldn't count Qwen as that much of a conversationalist
| though. Mistral Nemo and Small are pretty decent. All of Llama
| 3.X are still very good models even by today's standards. Gemma
| 3s are great but a bit unhinged. And of course QwQ when you
| need GPT4 at home. And probably lots of others I'm forgetting.
| diggan wrote:
| > If you want to run LLMs locally then the localllama community
| is your friend: https://old.reddit.com/r/LocalLLaMA/
|
| For folks new to reddit, it's worth noting that LocalLlama,
| just like the rest of the internet but especially reddit, is
| filled with misinformed people spreading incorrect "facts" as
| truth, and you really can't use the upvote/downvote count as an
| indicator of quality or how truthful something is there.
|
| Something that is more accurate but put in a boring way will
| often be downvoted, while straight up incorrect but
| funny/emotional/"fitting the group think" comments usually get
| upvoted.
|
| For us who've spent a lot of time on the web, this sort of
| bullshit detector is basically built-in at this point, but if
| you're new to places where the group think is so heavy as on
| reddit, it's worth being careful taking anything at face value.
| ddoolin wrote:
| This is entirely why I can't bring myself to use it. The
     | groupthink and virtue signaling is _intense_, when it's not
| just extremely low effort crud that rises to the top. And
| yes, before anyone says, I know, "curate." No, thank you.
| dingnuts wrote:
| Friend, this website is EXACTLY the same
| rafterydj wrote:
| I understand that the core similarities are there, but I
| disagree. The comparisons have been around since I
| started browsing HN years ago. The moderation on this
| site, for one, emphasizes constructive conversation and
| discussion in a way that most subreddits can only dream
| of.
|
| It also helps that the target audience has been filtered
| with that moderation, so over time this site (on average)
| skews more technical and informed.
| ffsm8 wrote:
| Frankly, no. As an obvious example that can be stated
| nowadays: musk has always been an over-promising liar.
|
| Eg just look at the 2012+ videos of thunderf00t.
|
| Yet people were literally banned here just for pointing
| out that he hasn't actually delivered on anything in the
| capacity he promised until he did the salute.
|
           | It's pointless to list other examples, as this page is -
           | as dingnuts pointed out - exactly the same and most
| people aren't actually willing to change their opinion
| based on arguments. They're set in their opinions and
| think everyone else is dumb.
| lcnPylGDnU4H9OF wrote:
| I don't see how that example refutes their point. It can
| be true both that there have been disagreeable bans and
| that the bans, in general, tend to result in higher
| quality discussions. The disagreeable bans seem to be
| outliers.
|
| > They're set in their opinions and think everyone else
| is dumb.
|
| Well, anyway, I read and post comments here because
| commenters here think critically about discussion topics.
| It's not a perfect community with perfect moderation but
| the discussions are of a quality that's hard to find
| elsewhere, let alone reddit.
| chucksmash wrote:
| > Yet people were literally banned here just for pointing
| out that he hasn't actually delivered on anything in the
| capacity he promised until he did the salute.
|
| I'd be shocked if they (you?) were banned _just_ for
             | critiquing Musk. So please link the post. I'm prepared
| to be shocked.
|
| I'm also pretty sure that I could make a throwaway
| account that only posted critiques of Musk (or about any
| single subject for that matter) and manage to keep it
| alive by making the critiques timely, on-topic and
| thoughtful or get it banned by being repetitive and
| unconstructive. So would you say I was banned for talking
| about <topic>? Or would you say I was banned for my
| behavior while talking about <topic>?
| janalsncm wrote:
| Aside from the fact that I highly doubt anyone was banned
| as you describe, EM's _stories_ have gotten more and more
| grandiose. So it's not the same.
|
| Today he's pitching moonshot projects as core to Tesla.
|
| 10 years ago he was saying self-driving was easy, but he
| was also selling by far the best electric vehicle on the
| market. So lying about self driving and Tesla semis
| mattered less.
|
| Fwiw I've been subbed to tf00t since his 50 part
| creationist videos in early 2010s.
| Freedom2 wrote:
           | This site's commenters attempt to apply technical
           | solutions to social problems, then pat themselves on the
| back despite their comments being entirely inappropriate
| to the problem space.
|
| There's also no actual constructive discussion when it
| comes to future looking tech. The Cybertruck, Vision Pro,
| LLMs are some of the most recent items that were
| absolutely inaccurately called by the most popular
| comments. And their reasoning for their prediction had no
| actual substance in their comments.
| yieldcrv wrote:
| And the crypto asset discussions are very nontechnical
| here, veering into elementary and inaccurate
| philosophical discussions, despite this being a great
| forum to talk about technical aspects. every network has
| pull requests and governance proposals worth discussing,
| and the deepest discussion here is resurrected from 2012
| about the entire concept not having a licit use case that
| the poster could imagine
| Maxatar wrote:
         | HackerNews isn't exactly like reddit, sure, but it's
| not much better. People are much better behaved, but
| still spread a great deal of misinformation.
|
| One way to gauge this property of a community is whether
| people who are known experts in a respective field
| participate in it, and unfortunately there are very few
| of them on HackerNews (this was not always the case).
| I've had some opportunities to meet with people who are
| experts, usually at conferences/industry events, and
| while many of them tend to be active on Twitter... they
| all say the same things about this site, namely that it's
| simply full of bad information and the amount of effort
| needed to dispel that information is significantly higher
| than the amount of effort needed to spread it.
|
| Next time someone posts an article about a topic you are
| intimately familiar with, like top 1% subject matter
| expert in... review the comment section for it and you'll
| find just heaps of misconceptions, superficial knowledge,
| and my favorite are the contrarians who take these very
| strong opinions on a subject they have some passing
| knowledge about but talk about their contrarian opinion
| with such a high degree of confidence.
|
| One issue is you may not actually be a subject matter
| expert on a topic that comes up a lot on HackerNews, so
| you won't recognize that this happens... but while people
| here are a lot more polite and the moderation policies do
| encourage good behavior... moderation policies don't do a
| lot to stop the spread of bad information from poorly
| informed people.
| SamPatt wrote:
| One of the things I appreciate most about HN is the fact
| that experts are often found in the comments.
|
| Perhaps we are defining experts differently?
| bb88 wrote:
| There was a lot of pseudo science being published and
| voted up in the comments with Ivermectin/HCQ/etc and
| Covid, when those people weren't experts and before the
| Ivermectin paper got serious scrutiny.
|
             | The other aspect is that people on here think that if
             | they are an expert in one thing, they instantly
| become an expert in another thing.
| tuwtuwtuwtuw wrote:
| Do you have any sources to back up those claims?
| ddoolin wrote:
| It happens in degrees, and the degree here is much lower.
| turtlesdown11 wrote:
           | It's actually the reverse, Dunning-Kruger is off the
| charts on hacker news
| ddoolin wrote:
| I don't think there's a lot of groupthink or virtue
| signaling here, and those are the things that irritate me
| the most. If people here overestimate their knowledge or
| abilities, that's okay because I don't treat things
| people say as gospel/fact/truth unless I have clear and
| compelling reasons to do so. This is the internet after
| all.
|
| Personally I also think the submissions that make it to
| the front page(s) are much better than any subreddit.
| bigyabai wrote:
| I disagree. Reddit users are out to impress nobody but
| themselves, but the other day I saw someone submit a
| "Show HN" with AI-generated testimonials.
|
| HN has an active grifter culture reinforced by the VC
| funding cycles. Reddit can only dream about lying as well
| as HN does.
| rafaelmn wrote:
| That's a tangential problem.
|
| HN tends to push up grifter hype slop, and there are a
| lot of those people around cause VC, but you can still
| see comments pushing back.
|
| Reading reddit reminds me of highschool forum arguments
| I've had 20 years ago, but lower quality because of
| population selection. It's just too mainstream at this
| point and shows you what the middle of the bell curve
| looks like.
| janalsncm wrote:
| Strongly disagree.
|
| Scroll to the bottom of comment sections on HN, you'll
| find the kind of low-effort drive-by comments that are
| usually at the top of Reddit comment sections.
|
| In other words, it helps to have real moderators.
| k__ wrote:
         | While the tone on HN is much more civil than on Reddit,
         | it's still quite the echo chamber.
| mountainriver wrote:
| Strong disagree as well, this is one of the few places on
| the Internet which avoids this. I wish there were more
| adolph wrote:
| > > . . . The groupthink and virtue signaling is intense
| . . .
|
| > Friend, this website is EXACTLY the same
|
| And it gnows it:
| https://news.ycombinator.com/item?id=4881042
| ivape wrote:
| Well the unfortunate truth is HN has been behind the curve on
| local llm discussions so localllama has been the only one
     | picking up the slack. There are just waaaaaaaay too many "ai
| is just hype" people here and the grassroots
| hardware/localllm discussions have been quite scant.
|
| Like, we're fucking two years in and only now do we have a
| thread about something like this? The whole crowd here needs
| to speed up to catch up.
| saltcured wrote:
| There are people who think LLMs are the future and a
| sweeping change you must embrace or be left behind.
|
| There are others wondering if this is another hype
| juggernaut like CORBA, J2EE, WSDL, XML, no-SQL, or who-
| knows-what. A way to do things that some people treated as
| the new One True Way, but others could completely bypass
| for their entire, successful career and look at it now in
| hindsight with a chuckle.
| Der_Einzige wrote:
| Lol this is true but also a TON of sampling innovations that
| are getting their love right now from the AI community (see
| min_p oral at ICLR 2025) came right from r/localllama so
| don't be a hater!!!
| rahimnathwani wrote:
| Poster: https://iclr.cc/media/PosterPDFs/ICLR%202025/30358.
| png?t=174...
| ijk wrote:
| LocalLlama is good for:
|
| - Learning basic terms and concepts.
|
| - Learning how to run local inference.
|
| - Inference-level considerations (e.g., sampling).
|
| - Pointers to where to get other information.
|
| - Getting the vibe of where things are.
|
| - Healthy skepticism about benchmarks.
|
| - Some new research; there have been a number of significant
| discoveries that either originated in LocalLlama or got
| popularized there.
|
| LocalLlama is bad because:
|
| - Confusing information about finetuning; there's a lot of
| myths from early experiments that get repeated uncritically.
|
| - Lots of newbie questions get repeated.
|
   | - Endless complaints that it's been too long since a new
| model was released.
|
| - Most new research; sometimes a paper gets posted but most
| of the audience doesn't have enough background to evaluate
| the implications of things even if they're highly relevant.
| I've seen a lot of cutting edge stuff get overlooked because
| there weren't enough upvoters who understood what they were
| looking at.
| drillsteps5 wrote:
| I use it as a discovery tool. Like if anybody mentions
| something interesting I go and research install/start playing
   | with it. I couldn't care less whether they like it or not; I'll
   | form my own opinion.
|
   | For example I find all comments about model X to be more
| "friendly" or "chatty" and model Y being more "unhinged" or
| whatever to be mostly BS. Like there's gazillion ways a
| conversation can go and I don't find model X or Y to be
| consistently chatty or unhinged or creative or whatever every
| time.
| nico wrote:
| What do you recommend for coding with aider or roo?
|
| Sometimes it's hard to find models that can effectively use
| tools
| cchance wrote:
     | I haven't found a good one locally. I use DeepSeek R1 0528;
     | it's slow but free and really good at coding (OpenRouter has
     | it free currently).
| xtracto wrote:
| There was this great post the other day [1] showing that with
| llama-cpp you could offload some specific tensors to the CPU
| and maintain good performance. That's a good way to use
   | large(ish) models on commodity hardware.
|
   | Normally with llama-cpp you specify how many (full) layers you
   | want to put on the GPU (-ngl). But CPU-offloading specific tensors
   | that don't require heavy computation saves GPU space without
| affecting speed that much.
|
   | I've also read a paper on loading only "hot" neurons onto the
   | GPU [2]. The future of home AI looks so cool!
|
| [1]
| https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_of...
|
| [2] https://arxiv.org/abs/2312.12456
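   |
   | If you'd rather drive this from Python than the raw CLI, the
   | llama-cpp-python bindings expose the plain per-layer offload
   | (the -ngl knob); the per-tensor override from the linked post is
   | the CLI's -ot/--override-tensor flag. A minimal sketch (the model
   | path and layer count are just placeholders):
   |
   |     from llama_cpp import Llama
   |
   |     llm = Llama(
   |         model_path="./some-model-Q4_K_M.gguf",  # any local GGUF
   |         n_gpu_layers=28,  # layers on GPU; the rest stay on CPU
   |         n_ctx=8192,       # context window to allocate
   |     )
   |     out = llm.create_chat_completion(
   |         messages=[{"role": "user", "content": "One sentence, please."}],
   |         max_tokens=64,
   |     )
   |     print(out["choices"][0]["message"]["content"])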
| srazzaque wrote:
| Agree with what others have said: you need to try a few out. But
| I'd put Qwen3-14B on your list of things to try out.
| emson wrote:
| Good question. I've had some success with Qwen2.5-Coder 14B, I
| did use the quantised version:
| huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct-GGUF:latest It
| worked well on my MacBook Pro M1 32Gb. It does get a bit hot on a
| laptop though.
| ProllyInfamous wrote:
| VEGA64 (8GB) is pretty much obsolete _for this AI stuff, right_
| (compared to e.g. M2Pro (16GB))?
|
| I'll give Qwen2.5 a try on the Apple Silicon, thanks.
| kekePower wrote:
| I have an RTX 3070 with 8GB VRAM and for me Qwen3:30B-A3B is fast
| enough. It's not lightning fast, but more than adequate if you
| have a _little_ patience.
|
| I've found that Qwen3 is generally really good at following
| instructions and you can also very easily turn on or off the
| reasoning by adding "/no_think" in the prompt to turn it off.
|
| The reason Qwen3:30B works so well is because it's a MoE. I have
| tested the 14B model and it's noticeably slower because it's a
| dense model.
| tedivm wrote:
| How are you getting Qwen3:30B-A3B running with 8GB? On my
| system it takes 20GB of VRAM to launch it.
| fennecfoxy wrote:
| Probably offload to regular ram I'd wager. Or really, really,
| reaaaaaaally quantized to absolute fuck. Qwen3:30B-A3B Q1
| with a 1k Q4 context uses 5.84GB of vram.
| kekePower wrote:
| It offloads to system memory, but since there are "only" 3
| Billion active parameters, it works surprisingly well. I've
| been able to run models that are up to 29GB in size, albeit
| very, very slow on my system with 32GB RAM.
| dcminter wrote:
| I think you'll find that on that card most models that are
| approaching the 16G memory size will be more than fast enough and
| sufficient for chat. You're in the happy position of needing
| steeper requirements rather than faster hardware! :D
|
| Ollama is the easiest way to get started trying things out IMO:
| https://ollama.com/
| giorgioz wrote:
   | I found LM Studio so much easier than ollama given it has a
| UI: https://lmstudio.ai/ Did you know about LM Studio? Why is
| ollama still recommended given it's just a CLI with worse UX?
| ekianjo wrote:
     | LM Studio is closed source
| prophesi wrote:
| Any FOSS solutions that let you browse models and
       | guesstimate for you whether you have enough VRAM to
| fully load the model? That's the only selling point to LM
| Studio for me.
|
| Ollama's default context length is frustratingly short in
| the era of 100k+ context windows.
|
| My solution so far has been to boot up LM Studio to check
| if a model will work well on my machine, manually download
| the model myself through huggingface, run llama.cpp, and
| hook it up to open-webui. Which is less than ideal, and LM
| Studio's proprietary code has access to my machine specs.
| nickthegreek wrote:
| https://huggingface.co/docs/accelerate/v0.32.0/en/usage_g
| uid...
| prophesi wrote:
| Thanks! That's really helpful.
| dcminter wrote:
| I recommended ollama because IMO that is the easiest way to
| get started (as I said).
| spacecadet wrote:
 | Look for something in the 500m-3b parameter range. 3B might push
| it...
|
| SmolVLM is pretty useful.
| https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct
| rhdunn wrote:
| It is feasible to run 7B, 8B models with q6_0 in 8GB VRAM, or
| q5_k_m/q4_k_m if you have to or want to free up some VRAM for
| other things. With q4_k_m you can run 10B and even 12B models.
| vladslav wrote:
| I use Gemma3:12b on a Mac M3 Pro, basically like Grammarly.
| giorgioz wrote:
| Try out some models with LM Studio: https://lmstudio.ai/ It has a
| UI so it's very easy to download the model and have a UI similar
| to the chatGPT app to query that model.
| redman25 wrote:
| Gemma-3-12b-qat https://huggingface.co/google/gemma-3-12b-it-
| qat-q4_0-gguf
|
| Qwen_Qwen3-14B-IQ4_XS.gguf
| https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF
|
| Gemma3 is a good conversationalist but tends to hallucinate.
| Qwen3 is very smart but also very stubborn (not very steerable).
| m4r71n wrote:
| What is everyone using their local LLMs for primarily? Unless you
| have a beefy machine, you'll never approach the level of quality
| of proprietary models like Gemini or Claude, but I'm guessing
| these smaller models still have their use cases, just not sure
| what those are.
| ativzzz wrote:
| I think that the future of local LLMs is delegation. You give
| it a prompt and it very quickly identifies what should be used
| to solve the prompt.
|
| Can it be solved locally with locally running MCPs? Or maybe
| it's a system API - like reading your calendar or checking your
| email. Otherwise it identifies the best cloud model and sends
| the prompt there.
|
| Basically Siri if it was good
| Rotundo wrote:
| Not everyone is comfortable with sending their data and/or
| questions and prompts to an external party.
| DennisP wrote:
| Especially now that a court has ordered OpenAI to keep
| records of it all.
|
| https://www.adweek.com/media/a-federal-judge-ordered-
| openai-...
| diggan wrote:
| I'm currently experimenting with Devstral for my own local
| coding agent I've slowly built together. It's in many ways
| nicer than Codex in that 1) full access to my hardware so can
| start VMs, make network requests and everything else I can do,
| which Codex cannot and 2) it's way faster both in initial
| setup, working through things and creating a patch.
|
| Of course, it still isn't at the same level as Codex itself,
| the model Codex is using is just way better so of course it'll
| get better results. But Devstral (as I currently use it) is
| able to make smaller changes and refactors, and I think if I
| evolve the software a bit more, can start making larger changes
| too.
| brandall10 wrote:
| Why are you comparing it to Codex and not Claude Code, which
| can do all those things?
|
| And why not just use Openhands, which it was designed around
| which I presume can also do all those things?
| cratermoon wrote:
| I have a large repository of notes, article drafts, and
| commonplace book-type stuff. I experimented a year or so ago
| with a system using RAG to "ask myself" what I have to say
| about various topics. (I suppose nowadays I would use MCP
| instead of RAG?) I was not especially impressed by the results
| with the models I was able to run: long-winded responses full
| of slop and repetition, irrelevant information pulled in from
| notes that had some semantically similar ideas, and such. I'm
| certainly not going to feed the contents of my private
| notebooks to any of the AI companies.
| notfromhere wrote:
| You'd still use RAG, just use MCP to more easily connect an
| LLM to your RAG pipeline
| cratermoon wrote:
| To clarify: what I was doing was first querying for the
| documents via a standard document database query and then
| feeding the best matching documents to the LLM. My
| understanding is that with MCP I'd delegate the document
| query from the LLM to the tool.
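       |
       | For what it's worth, that retrieve-then-generate flow can be
       | sketched in a few lines of Python. This is just an assumed
       | minimal setup: sentence-transformers for embeddings and a local
       | OpenAI-compatible server (e.g. llama.cpp's llama-server on its
       | default port) for generation; names and URLs are placeholders:
       |
       |     import numpy as np, requests
       |     from sentence_transformers import SentenceTransformer
       |
       |     notes = ["note one ...", "note two ...", "note three ..."]
       |     question = "What did I write about X?"
       |
       |     embedder = SentenceTransformer("all-MiniLM-L6-v2")
       |     doc_vecs = embedder.encode(notes, normalize_embeddings=True)
       |     q_vec = embedder.encode([question], normalize_embeddings=True)[0]
       |     top = np.argsort(doc_vecs @ q_vec)[::-1][:2]  # best 2 notes
       |
       |     context = "\n\n".join(notes[i] for i in top)
       |     r = requests.post(
       |         "http://localhost:8080/v1/chat/completions",
       |         json={"model": "local", "messages": [
       |             {"role": "system",
       |              "content": "Answer only from the provided notes."},
       |             {"role": "user",
       |              "content": f"Notes:\n{context}\n\nQuestion: {question}"},
       |         ]},
       |         timeout=120,
       |     )
       |     print(r.json()["choices"][0]["message"]["content"])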
| longtimelistnr wrote:
         | As a beginner, I haven't had much luck with embedded
| vector queries either. Firstly, setting it up was a major
| pain in the ass and I couldn't even get it to ingest
| anything beyond .txt files. Second, maybe it was my AI
| system prompt or the lack of outside search capabilities
         | but unless I was very specific with my query the response
| was essentially "can't find what youre looking for"
| moffkalast wrote:
| > unless you have a beefy machine
|
| The average person in r/locallama has a machine that would make
| r/pcmasterrace users blush.
| rollcat wrote:
| An Apple M1 is decent enough for LMs. My friend wondered why
| I got so excited about it when it came out five years ago. It
| wasn't that it was particularly powerful - it's decent. What
| it did was to set a new bar for "low end".
| vel0city wrote:
| A new Mac is easily starting around $1k and quickly goes up
| from there if you want a storage or RAM upgrade, especially
| for enough memory to really run some local models. Insane
| that a $1,000 computer is called "decent" and "low end". My
| daily driver personal laptop brand new was $300.
| moffkalast wrote:
| That's fun to hear given that low end laptops are now
| $800, mid range is like $1.5k and upper end is $3k+ even
| for non-Apple vendors. Inflation makes fools of us all.
| vel0city wrote:
| Low end laptops can still easily be found for far less
| than $800.
|
| https://www.microcenter.com/product/676305/acer-
| aspire-3-a31...
| Mr-Frog wrote:
| The first IBM PC in 1981 cost $1,565, which is comparable
| to $5,500 after inflation.
| evilduck wrote:
| An M1 Mac is about 5 years old at this point and can be
| had for far less than a grand.
|
| A brand new Mac Mini M4 is only $499.
| vel0city wrote:
           | Ah, I was focusing on the laptops, my bad. But still it's
| more than $499. Just looked on the Apple store website,
| Mac Mini M4 starting at $599 (not $499), with only 256GB
| of storage.
|
| https://www.apple.com/shop/buy-mac/mac-mini/m4
| nickthegreek wrote:
| microcenter routinely sells that system for $450.
|
| https://www.microcenter.com/product/688173/Mac_mini_MU9D3
| LL-...
| fennecfoxy wrote:
| You're right - memory size and then bandwidth is
| imperative for LLMs. Apple currently lacks great memory
| bandwidth with their unified memory. But it's not a bad
| option if you can find one for a good price. The prices
| for new are just bonkers.
| rollcat wrote:
| Of course it depends on what you consider "low end" -
| it's relative to your expectations. I have a G4 TiBook,
| _the_ definition of a high-end laptop, by 2002 standards.
| If you consider a $300 laptop a good daily driver, I 'll
| one-up you with this: <https://www.chrisfenton.com/diy-
| laptop-v2/>
| vel0city wrote:
| My $300 laptop is a few years old. It has a Ryzen 3 3200U
| CPU, it has a 14" 1080p display, backlit keyboard. It
| came with 8GB of RAM and a 128GB SSD, I upgraded to 16GB
| from RAM acquired by a dumpster dive and a 256GB SSD for
| like $10 on clearance at Microcenter. I upgraded the WiFi
| to an Intel AX210 6e for about another $10 off Amazon. It
           | gets 6-8 hours of battery life doing browsing and text
           | editing kind of workloads.
|
| The only thing that is itching me to get a new machine is
| it needs a 19V power supply. Luckily it's a pretty common
| barrel size, I already had several power cables laying
| around that work just fine. I'd prefer to just have all
| my portable devices to run off USB-C though.
| pram wrote:
| I know I speak for everyone that your dumpster laptop is
| very impressive, give yourself a big pat on the back. You
| deserve it.
| ozim wrote:
| You still can get decent stuff out of local ones.
|
   | Mostly I use it for testing tools and integrations via API so as
   | not to spend money on subscriptions. When I see something working I
| switch it to proprietary one to get best results.
| nomel wrote:
| If you're comfortable with the API, all the services provide
| pay-as-you-go API access that can be much cheaper. I've tried
| local, but the _time_ cost of getting it to spit out
     | something reasonable wasn't worth the _literal pennies_ the
| answers from the flagship would cost.
| qingcharles wrote:
| This. The APIs are so cheap and they are up and running
| _right now_ with 10x better quality output. Unless whatever
| you are doing is Totally Top Secret or completely
| nefarious, then send your prompts to an API.
| barnabee wrote:
| I generally try a local model first for most prompts. It's good
| enough surprisingly often (over 50% for sure). Every time I
| avoid using a cloud service is a win.
| mixmastamyk wrote:
| Shouldn't the (MoE) mixture of experts approach allow one to
| conserve memory by working on specific problem type at a time?
|
| > (MoE) divides an AI model into separate sub-networks (or
| "experts"), each specializing in a subset of the input data, to
| jointly perform a task.
| ijk wrote:
| Sort of, but the "experts" aren't easily divisible in a
| conceptually interpretable way so the naive understanding of
| MoE is misleading.
|
| What you typically end up with in memory constrained
| environments is that the core shared layers are in fast
| memory (VRAM, ideally) and the rest are in slower memory
| (system RAM or even a fast SSD).
|
| MoE models are typically very shallow-but-wide in comparison
| with the dense models, so they end up being faster than an
     | equivalent dense model, because they're ultimately activating
     | fewer weights for each token.
| qingcharles wrote:
| If you look on localllama you'll see most of the people there
| are really just trying to do NSFW or other questionable or
| unethical things with it.
|
| The stuff you can run on reasonable home hardware (e.g. a
| single GPU) isn't going to blow your mind. You can get pretty
| close to GPT3.5, but it'll feel dated and clunky compared to
| what you're used to.
|
| Unless you have already spent big $$ on a GPU for gaming, I
| really don't think buying GPUs for home makes sense,
| considering the hardware and running costs, when you can go to
| a site like vast.ai and borrow one for an insanely cheap amount
| to try it out. You'll probably get bored and be glad you didn't
| spend your kids' college fund on a rack of H100s.
| teleforce wrote:
| This is an excellent example of local LLM application [1].
|
| It's an AI-driven chat system designed to support students in
| the Introduction to Computing course (ECE 120) at UIUC,
| offering assistance with course content, homework, or
| troubleshooting common problems.
|
| It serves as an educational aid integrated into the course's
| learning environment using UIUC Illinois Chat system [2].
|
   | Personally I've found it really useful that it provides the
   | relevant portions of the course study materials, for example the
   | slides directly related to the discussion, so the students can
   | check the sources and veracity of the answers provided by the LLM.
|
| It seems to me that RAG is the killer feature for local LLM
| [3]. It directly addressed the main pain point of LLM
| hallucinations and help LLMs stick to the facts.
|
| [1] Introduction to Computing course (ECE 120) Chatbot:
|
| https://www.uiuc.chat/ece120/chat
|
| [2] UIUC Illinois Chat:
|
| https://uiuc.chat/
|
| [3] Retrieval-augmented generation [RAG]:
|
| https://en.wikipedia.org/wiki/Retrieval-augmented_generation
| staticcaucasian wrote:
| Does this actually need to be local? Since the chat bot is
| open to the public and I assume the course material used for
| RAG all on this page
| (https://canvas.illinois.edu/courses/54315/pages/exam-
| schedul...) all stays freely accessible - I clicked a few
| links without being a student - I assume a pre-prompted
| larger non-local LLM would outperform the local instance.
| Though, you can imagine an equivalent course with all of its
| content ACL-gated/'paywalled' could benefit from local RAG, I
| guess.
| ijk wrote:
| General local inference strengths:
|
| - Experiments with inference-level control; can't do the
| Outlines / Instructor stuff with most API services, can't do
| the advanced sampling strategies, etc. (They're catching up but
| they're 12 months behind what you can do locally.)
|
   | - Small, fast, finetuned models; _if you know your domain well
   | enough to train a model you can outperform everything
| else_. General models _usually_ win, if only due to ease of
| prompt engineering, but not always.
|
| - Control over which model is being run. Some drift is
| inevitable as your real-world data changes, but when your model
| is also changing underneath you it can be harder to build
| something sustainable.
|
| - More control over costs; this is the classic on-prem versus
   | cloud decision. In most cases you just want to pay for the
| cloud...but we're not in ZIRP anymore and having a predictable
| power bill can trump sudden unpredictable API bills.
|
| In general, the move to cloud services was originally a cynical
| OpenAI move to keep GPT-3 locked away. They've built up a bunch
| of reasons to prefer the in-cloud models (heavily subsidized
| fast inference, the biggest and most cutting edge models, etc.)
| so if you need the latest and greatest right now and are
| willing to pay, it's probably the right business move for most
| businesses.
|
| This is likely to change as we get models that can reasonably
| run on edge devices; right now it's hard to build an app or a
| video game that incidentally uses LLM tech because user revenue
| is unlikely to exceed inference costs without a lot of careful
| planning or a subscription. Not impossible, but definitely adds
| business challenges. Small models running on end-user devices
| opens up an entirely new level of applications in terms of
| cost-effectiveness.
|
| If you need the right answer, sometimes only the biggest cloud
| API model is acceptable. If you've got some wiggle room on
| accuracy and can live with sometimes getting a substandard
| response, then you've got a lot more options. The trick is that
| the things that an LLM is best at are always going to be things
| where less than five nines of reliability are acceptable, so
   | even though the biggest models have more reliability, on
| average there are many tasks where you might be just fine with
| a small fast model that you have more control over.
| drillsteps5 wrote:
| I avoid using cloud whenever I can on principle. For instance,
| OpenAI recently indicated that they are working on some social
| network-like service for ChatGPT users to share their chats.
|
| Running it locally helps me understand how these things work
| under the hood, which raises my value on the job market. I also
| play with various ideas which have LLM on the backend (think
| LLM-powered Web search, agents, things of that nature), I don't
| have to pay cloud providers, and I already had a gaming rig
| when LLaMa was released.
| jaggs wrote:
| hf.co/bartowski/deepseek-ai_DeepSeek-R1-0528-Qwen3-8B-GGUF:Q6_K
| is a decent performing model, if you're not looking for blinding
| speed. It definitely ticks all the boxes in terms of model
| quality. Try a smaller quant if you need more speed.
| cowpig wrote:
| Ollama[0] has a collection of models that are either already
| small or quantized/distilled, and come with hyperparameters that
| are pretty reasonable, and they make it easy to try them out. I
| recommend you install it and just try a bunch because they all
| have different "personalities", different strengths and
| weaknesses. My personal go-tos are:
|
| Qwen3 family from Alibaba seem to be the best reasoning models
| that fit on local hardware right now. Reasoning models on local
| hardware are annoying in contexts where you just want an
| immediate response, but vastly outperform non-reasoning models on
| things where you want the model to be less naive/foolish.
|
| Gemma3 from google is really good at intuition-oriented stuff,
| but with an obnoxious HR Boy Scout personality where you
| basically have to add "please don't add any disclaimers" to the
| system prompt for it to function. Like, just tell me how long you
| think this sprain will take to heal, I already know you are not a
| medical professional, jfc.
|
| Devstral from Mistral performs the best on my command line
| utility where I describe the command I want and it executes that
| for me (e.g. give me a 1-liner to list the dotfiles in this
| folder and all subfolders that were created in the last month).
|
| Nemo from Mistral, I have heard (but not tested) is really good
 | for routing-type jobs, where you need something to make a
| simple multiple-choice decision competently with low latency, and
| is easy to fine-tune if you want to get that sophisticated.
|
| [0] https://ollama.com/search
| unethical_ban wrote:
| Phi-4 is scared to talk about anything controversial, as if
| they're being watched.
|
| I asked it a question about militias. It thought for a few pages
| about the answer and whether to tell me, then came back with "I
| cannot comply".
|
| Nidum is the name of uncensored Gemma, it does a good job most of
| the time.
| emmelaich wrote:
| Generally speaking, how can you tell how much vram a model will
| take? It seems to be a valuable bit of data which is missing from
 | downloadable model (GGUF) files.
| omneity wrote:
   | Very roughly you can consider the Bs of a model as GBs of memory
| then it depends on the quantization level. Say for an 8B model:
|
| - FP16: 2x 8GB = 16GB
|
| - Q8: 1x 8GB
|
| - Q4: 0.5x 8GB = 4GB
|
| It doesn't 100% neatly map like this but this gives you a rough
   | measure. On top of this you need some more memory depending on
| the context length and some other stuff.
|
   | Rationale for the calculation above: A model is basically
   | billions of variables with a floating-point value. So the size
| of a model roughly maps to number of variables (weights) x
| word-precision of each variable (4, 8, 16bits..)
|
   | You don't have to quantize all layers to the same precision;
   | this is why sometimes you see fractional quantizations like
| 1.58bits.
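   |
   | A tiny sketch of that arithmetic (weights only; KV cache and
   | runtime buffers come on top):
   |
   |     def approx_weight_gb(params_billion, bits_per_weight):
   |         return params_billion * 1e9 * bits_per_weight / 8 / 1e9
   |
   |     for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
   |         print(name, approx_weight_gb(8, bits), "GB")  # 16 / 8 / 4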
| rhdunn wrote:
| The 1.58bit quantization is using 3 values -- -1, 0, 1. The
| bits number comes from log_2(3) = 1.58....
|
     | For that level you can pack 4 weights in a byte using 2 bits
     | per weight. However, each 2-bit field has one bit configuration
     | that is unused.
|
| More complex packing arrangements are done by grouping
| weights together (e.g. a group of 3) and assigning a bit
| configuration to each combination of values into a lookup
     | table. This allows greater compression, closer to the 1.58
| bits value.
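     |
     | A toy sketch of that grouping idea (simplified; real kernels use
     | fancier block layouts): five ternary weights fit in one byte via
     | base-3 encoding, since 3**5 = 243 <= 256, i.e. 1.6 bits per
     | weight instead of 2.
     |
     |     def pack5(ws):  # ws: five values in {-1, 0, 1}
     |         b = 0
     |         for w in reversed(ws):
     |             b = b * 3 + (w + 1)  # map -1/0/1 -> 0/1/2
     |         return b  # 0..242, fits in a single byte
     |
     |     def unpack5(b):
     |         ws = []
     |         for _ in range(5):
     |             b, digit = divmod(b, 3)
     |             ws.append(digit - 1)
     |         return ws
     |
     |     ws = [1, -1, 0, 0, 1]
     |     assert unpack5(pack5(ws)) == ws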
| fennecfoxy wrote:
| Depends on quantization etc. But there are good calculators
| that will calculate for your KV cache etc as well:
| https://apxml.com/tools/vram-calculator.
| binary132 wrote:
| I've had awesome results with Qwen3-30B-A3B compared to other
| local LMs I've tried. Still not crazy good but a lot better and
| very fast. I have 24GB of VRAM though
| tiahura wrote:
| What about for a 5090?
| sgt wrote:
| Comes with 32GB VRAM right?
|
| Speaking of, would a Ryzen 9 12 core be nice for a 5090 setup?
|
| Or should one really go dual 5090?
| mrbonner wrote:
| Did someone have a chance to try local llama for the new AMD AI
| Max+ with 128gb of unified RAM?
| nosecreek wrote:
| Related question: what is everyone using to run a local LLM? I'm
| using Jan.ai and it's been okay. I also see OpenWebUI mentioned
| quite often.
| op00to wrote:
| LMStudio, and sometimes AnythingLLM.
| Havoc wrote:
| LM studio if you just want an app. openwebui is just a front
| end - you'd need to have either llama.cpp or vllm behind it to
| serve the model
| fennecfoxy wrote:
| KoboldCPP + SillyTavern, has worked the best for me.
| Havoc wrote:
| It's a bit like asking what flavour of icecream is the best. Try
| a few and see.
|
 | For 16gb and speed you could try Qwen3-30B-A3B with some offload
 | to system ram or use a dense model, probably a 14B quant.
| fennecfoxy wrote:
| Basic conversations are essentially RP I suppose. You can look at
| KoboldCPP or SillyTavern reddit.
|
| I was trying Patricide unslop mell and some of the Qwen ones
| recently. Up to a point more params is better than worrying about
| quantization. But eventually you'll hit a compute wall with high
| params.
|
| KV cache quantization is awesome (I use q4 for a 32k context with
| a 1080ti!) and context shifting is also awesome for long
| conversations/stories/games. I was using ooba but found recently
| that KoboldCPP not only runs faster for the same model/settings
| but also Kobold's context shifting works much more consistently
| than Ooba's "streaming_llm" option, which almost always re-
| evaluates the prompt when hooked up to something like ST.
| sabareesh wrote:
 | This is what I have: https://sabareesh.com/posts/llm-rig/ ("All
 | You Need is 4x 4090 GPUs to Train Your Own Model")
| yapyap wrote:
   | Four 4090s are easily $8,000, nothing to scoff at IMO
| ge96 wrote:
| Imagine some SLI 16 x 1080
| dr_kiszonka wrote:
| Could you explain what is your use case for training 1B models?
| Learning or perhaps fine tuning?
| sabareesh wrote:
| Learning, prototype and then scale it in to cloud. Also can
| be used as inference engine to train another model if you are
| using model as a judge for RL.
| arh68 wrote:
| Wow, a 5060Ti. 16gb + I'm guessing >=32gb ram. And here I am
| spinning Ye Olde RX 570 4gb + 32gb.
|
| I'd like to know how many tokens you can get out of the larger
| models especially (using Ollama + Open WebUI on Docker Desktop,
| or LM Studio whatever). I'm probably not upgrading GPU this year,
 | but I'd appreciate an anecdotal benchmark.
 |
 |   - gemma3:12b
 |   - phi4:latest (14b)
 |   - qwen2.5:14b [I get ~3 t/s on all these small models, acceptably slow]
 |   - qwen2.5:32b [this is about my machine's limit; verrry slow, ~1 t/s]
 |   - qwen2.5:72b [beyond my machine's limit, but maybe not yours]
| diggan wrote:
| I'm guessing you probably also want to include the quantization
| levels you're using, as otherwise they'll be a huge variance in
| your comparisons with others :)
| arh68 wrote:
| True, true. All Q4_K_M unless I'm mistaken. Thanks
| reissbaker wrote:
| At 16GB a Q4 quant of Mistral Small 3.1, or Qwen3-14B at FP8,
| will probably serve you best. You'd be cutting it a little close
| on context length due to the VRAM usage... If you want longer
| context, a Q4 quant of Qwen3-14B will be a bit dumber than FP8
| but will leave you more breathing room. Mistral Small can take
| images as input, and Qwen3 will be a bit better at math/coding;
| YMMV otherwise.
|
| Going below Q4 isn't worth it IMO. If you want significantly more
| context, probably drop down to a Q4 quant of Qwen3-8B rather than
| continuing to lobotomize the 14B.
|
 | Some folks have been recommending Qwen3-30B-A3B, but I think 16GB
| of VRAM is probably not quite enough for that: at Q4 you'd be
| looking at 15GB for the weights alone. Qwen3-14B should be pretty
| similar in practice though despite being lower in param count,
| since it's a dense model rather than a sparse one: dense models
| are generally smarter-per-param than sparse models, but somewhat
| slower. Your 5060 should be plenty fast enough for the 14B as
| long as you keep everything on-GPU and stay away from CPU
| offloading.
|
| Since you're on a Blackwell-generation Nvidia chip, using LLMs
| quantized to NVFP4 specifically will provide some speed
| improvements at some quality cost compared to FP8 (and will be
| faster than Q4 GGUF, although ~equally dumb). Ollama doesn't
| support NVFP4 yet, so you'd need to use vLLM (which isn't too
| hard, and will give better token throughput anyway). Finding pre-
| quantized models at NVFP4 will be more difficult since there's
| less-broad support, but you can use llmcompressor [1] to
| statically compress any FP16 LLM to NVFP4 locally -- you'll
| probably need to use accelerate to offload params to CPU during
| the one-time compression process, which llmcompressor has
| documentation for.
|
| I wouldn't reach for this particular power tool until you've
| decided on an LLM already, and just want faster perf, since it's
| a bit more involved than just using ollama and the initial
| quantization process will be slow due to CPU offload during
| compression (albeit it's only a one-time cost). But if you land
| on a Q4 model, it's not a bad choice once you have a favorite.
|
| 1: https://github.com/vllm-project/llm-compressor
| FuriouslyAdrift wrote:
| I've had good luck with GPT4All (Nomic) and either reason v1
| (Qwen 2.5 - Coder 7B) or Llama 3 8B Instruct.
| DiabloD3 wrote:
| I'd suggest buying a better GPU, only because all the models you
| want need a 24GB card. Nvidia... more or less robbed you.
|
| That said, Unsloth's version of Qwen3 30B, running via llama.cpp
| (don't waste your time with any other inference engine), with the
| following arguments (documented in Unsloth's docs, but sometimes
 | hard to find): `--threads (number of threads your CPU has)
 | --ctx-size 16384 --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU"
 | --seed 3407 --prio 3 --temp 0.6 --min-p 0.0 --top-p 0.95
 | --top-k 20` along with the other arguments you need.
|
| Qwen3 30B: https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF
| (since you have 16GB, grab Q3_K_XL, since it fits in vram and
| leaves about 3-4GB left for the other apps on your desktop and
| other allocations llama.cpp needs to make).
|
| Also, why 30B and not the full fat 235B? You don't have 120-240GB
| of VRAM. The 14B and less ones are also not what you want: more
| parameters are better, parameter precision is vastly less
| important (which is why Unsloth has their specially crafted
| <=2bit versions that are 85%+ as good, yet are ridiculously tiny
| in comparison to their originals).
|
| Full Qwen3 writeup here: https://unsloth.ai/blog/qwen3
| bigyabai wrote:
| > only because all the models you want need a 24GB card
|
| ???
|
| Just run a q4 quant of the same model and it will fit no
| problem.
| DiabloD3 wrote:
| Q4_K_M is the "default" for a lot of models on HF, and they
| generally require ~20GB of VRAM to run. It will not fit
     | entirely on a 16GB card. You want about 3-4GB of VRAM headroom
     | on top of what the model requires.
|
| A back of the envelope estimate of specifically
| unsloth/Qwen3-30B-A3B-128K-GGUF is 18.6GB for Q4_K_M.
| antirez wrote:
| The largest Gemma 3 and Qwen 3 you can run. Offload to RAM as
| many layers you can.
| lxe wrote:
| There's a new one every day it seems. Follow
| https://x.com/reach_vb from huggingface.
| PhilippGille wrote:
| Haven't seen Mozilla's LocalScore [1] mentioned in the comments
| yet. It's exactly made for the purpose of finding out how well
| different models run on different hardware.
|
| [1] https://www.localscore.ai/
| janalsncm wrote:
| People ask this question a lot and annoyingly the answer is:
| there are many definitions of "best". Speed, capabilities (e.g.
| do you want it to be able to handle images or just text?),
| quality, etc.
|
| It's like asking what the best pair of shoes is.
|
| Go on Ollama and look at the most popular models. You can decide
| for yourself what you value.
|
| And start small, these things are GBs in size so you don't want
| to wait an hour for a download only to find out a model runs at 1
| token / second.
| speedgoose wrote:
| My personal preference this month is the biggest Gemma3 you can
| fit on your hardware.
| arnaudsm wrote:
| Qwen3 family (and the R1 qwen3-8b distill) is #1 in programming
| and reasoning.
|
| However it's heavily censored on political topics because of its
| Chinese origin. For world knowledge, I'd recommend Gemma3.
|
| This post will be outdated in a month. Check https://livebench.ai
| and https://aider.chat/docs/leaderboards/ for up to date
| benchmarks
| the_sleaze_ wrote:
| > This post will be outdated in a month
|
| The pace of change is mind boggling. Not only for the models
| but even the tools to put them to use. Routers, tools, MCP,
| streaming libraries, SDKs...
|
| Do you have any advice for someone who is interested,
| developing alone and not surrounded by coworkers or meetups who
| wants to be able to do discovery and stay up to date?
| nickdothutton wrote:
| I find Ollama + TypingMind (or similar interface) to work well
| for me. As for which models, I think this is changing from one
| month to the next (perhaps not quite that fast). We are in that
| kind of period. You'll need to make sure the model layers fit in
| VRAM.
| nsxwolf wrote:
| I'm running llama3.2 out of the box on my 2013 Mac Pro, the low
| end quad core Xeon one, with 64GB of RAM.
|
| It's slow-ish but still useful, getting 5-10 tokens per second.
| nurettin wrote:
 | pretty much all Q4 models on huggingface fit in consumer grade
| cards.
| depingus wrote:
| Captain Eris Violet 12B fits those requirements.
| drillsteps5 wrote:
 | I concur with the LocalLLaMA subreddit recommendation. Not in terms
| choosing "the best model" but to answer questions, find guides,
| latest news and gossip, names of the tools, various models and
| how they stack against each other, etc.
|
| There's no one "best" model, you just try a few and play with
| parameters and see which one fits your needs the best.
|
| Since you're on HN, I'd recommend skipping Ollama and LMStudio.
| They might restrict access to the latest models and you typically
| only choose from the ones they tested with. And besides what kind
| of fun is this when you don't get to peek under the hood?
|
 | llamacpp can do a lot itself, and you can run most recently
| released models (when changes are needed they adjust literally
| within a few days). You can get models from huggingface
| obviously. I prefer GGUF format, saves me some memory (you can
| use lower quantization, I find most 6-bit somewhat satisfactory).
|
 | I find that the size of the model's GGUF file will roughly
 | tell me if it'll fit in my VRAM. For example a 24GB GGUF model will
 | NOT fit in 16GB, whereas a 12GB one likely will. However, the more
| context you add the more RAM will be needed.
|
 | Keep in mind that models are trained with a certain context window.
 | If it has an 8K context (like most older models do) and you load it
 | with a 32K context it won't be much help.
|
| You can run llamacpp on Linux, Windows, or MacOS, you can get the
| binaries or compile on your local. It can split the model between
 | VRAM and RAM (if the model doesn't fit in your 16GB). It even has a
 | simple React front-end (llama-server). The same binary provides a
 | REST service whose protocol is similar to (but simpler than) that of
 | OpenAI and all the other "big" guys.
|
 | Since it implements the OpenAI REST API, it also works with a lot of
 | front-end tools if you want more functionality (i.e. oobabooga, aka
 | text-generation-webui).
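 |
 | Because of that, the standard openai Python client also works
 | against it; a minimal sketch (default llama-server port, dummy API
 | key, model name is largely ignored by the server):
 |
 |     from openai import OpenAI
 |
 |     client = OpenAI(base_url="http://127.0.0.1:8080/v1",
 |                     api_key="not-needed")
 |     resp = client.chat.completions.create(
 |         model="local",
 |         messages=[{"role": "user", "content": "Give me one fun fact."}],
 |         max_tokens=128,
 |     )
 |     print(resp.choices[0].message.content)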
|
| Koboldcpp is another backend you can try if you find llamacpp to
 | be too raw (I believe it's still llamacpp under the hood).
___________________________________________________________________
(page generated 2025-05-30 23:01 UTC)