[HN Gopher] Gemma 3 QAT Models: Bringing AI to Consumer GPUs
___________________________________________________________________
Gemma 3 QAT Models: Bringing AI to Consumer GPUs
Author : emrah
Score : 577 points
Date : 2025-04-20 12:22 UTC (1 days ago)
(HTM) web link (developers.googleblog.com)
(TXT) w3m dump (developers.googleblog.com)
| emrah wrote:
| Available on ollama: https://ollama.com/library/gemma3
| jinay wrote:
| Make sure you're using the "-it-qat" suffixed models like
| "gemma3:27b-it-qat"
| Zambyte wrote:
| Here are the direct links:
|
| https://ollama.com/library/gemma3:27b-it-qat
|
| https://ollama.com/library/gemma3:12b-it-qat
|
| https://ollama.com/library/gemma3:4b-it-qat
|
| https://ollama.com/library/gemma3:1b-it-qat
| ein0p wrote:
| Thanks. I was wondering why my open-webui said that I already
| had the model. I bet a lot of people are making the same
| mistake I did and downloading just the old, post-quantized
| 27B.
| Der_Einzige wrote:
| How many times do I have to say this? Ollama, llamacpp, and
| many other projects are slower than vLLM/sglang. vLLM is a much
| superior inference engine and is fully supported by the only
| LLM frontends that matter (sillytavern).
|
| The community getting obsessed with Ollama has done huge damage
| to the field, as it's inefficient compared to vLLM. Many people
| can get far more tok/s than they think they could if only they
| knew the right tools.
| m00dy wrote:
| Ollama is definitely not for production loads but vLLm is.
| janderson215 wrote:
| I did not know this, so thank you. I read a blogpost a while
| back that encouraged using Ollama and never mentioned vLLM. Do
| you recommend reading any particular resource?
| Zambyte wrote:
| The significant convenience benefits outweigh the higher TPS
| that vLLM offers in the context of my single machine homelab
| GPU server. If I was hosting it for something more critical
| than just myself and a few friends chatting with it, sure.
| Being able to just paste a model name into Open WebUI and run
| it is important to me though.
|
| It is important to know about both to decide between the two
| for your use case though.
| Der_Einzige wrote:
| Running any HF model on vllm is as simple as pasting a
| model name into one command in your terminal.
| Zambyte wrote:
| What command is it? Because that was not at all my
| experience.
| Der_Einzige wrote:
| Vllm serve... huggingface gives run instructions for
| every model with vllm on their website.
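|
| A rough sketch (the model name and flag are just an example;
| the gated Gemma weights also need a Hugging Face token):
|
|     vllm serve google/gemma-3-27b-it --max-model-len 8192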
| Zambyte wrote:
| How do I serve multiple models? I can pick from dozens of
| models that I have downloaded through Open WebUI.
| iAMkenough wrote:
| Had to build it from source to run on my Mac, and the
| experimental support doesn't seem to include these latest
| Gemma 3 QAT models on Apple Silicon.
| oezi wrote:
| Why is sillytavern the only LLM frontend which matters?
| GordonS wrote:
| I tried sillytavern a few weeks ago... wow, that is an
| "interesting" UI! I blundered around for a while, couldn't
| figure out how to do _anything_ useful... and then
| installed LM Studio instead.
| imtringued wrote:
| I personally thought the lorebook feature was quite neat
| and then quickly gave up on it because I couldn't get it
| to trigger, ever.
|
| Whatever those keyword things are, they certainly don't
| seem to be doing any form of RAG.
| Der_Einzige wrote:
| It supports more samplers and other settings than anyone
| else.
| ach9l wrote:
| instead of ranting, maybe explain how to make a qat q4 work
| with images in vllm, afaik it is not yet possible
| oezi wrote:
| Somebody in this thread mentioned 20.x tok/s on ollama. What
| are you seeing in vLLM?
| Zambyte wrote:
| FWIW I'm getting 29 TPS on Ollama on my 7900 XTX with the
| 27b qat. You can't really compare inference engine to
| inference engine without keeping the hardware and model
| fixed.
|
| Unfortunately Ollama and vLLM are therefore incomparable at
| the moment, because vLLM does not support these models yet.
|
| https://github.com/vllm-project/vllm/issues/16856
| simonw wrote:
| Last I looked vLLM didn't work on a Mac.
| mitjam wrote:
| Afaik vllm is for concurrent serving with batched inference
| for higher throughput, not single-user inference. I doubt
| inference throughput is higher with single prompts at a
| time than Ollama. Update: this is a good Intro to
| continuous batching in llm inference:
| https://www.anyscale.com/blog/continuous-batching-llm-
| infere...
| Der_Einzige wrote:
| It is much faster on single prompts than ollama. 3X is
| not unheard of
| prometheon1 wrote:
| From the HN guidelines:
| https://news.ycombinator.com/newsguidelines.html
|
| > Be kind. Don't be snarky.
|
| > Please don't post shallow dismissals, especially of other
| people's work.
|
| In my opinion, your comment is not in line with the
| guidelines. Especially the part about sillytavern being the
| only LLM frontend that matters. Telling the devs of any LLM
| frontend except sillytavern that their app doesn't matter
| seems exactly like a shallow dismissal of other people's work
| to me.
| holografix wrote:
| Could 16gb vram be enough for the 27b QAT version?
| halflings wrote:
| That's what the chart says yes. 14.1GB VRAM usage for the 27B
| model.
| erichocean wrote:
| That's the VRAM required just to load the model weights.
|
| To actually use a model, you need a context window.
| Realistically, you'll want a 20GB GPU or larger, depending on
| how many tokens you need.
| oezi wrote:
| I didn't realize that the context would require so much
| memory. Is this the KV cache? It would seem like a big
| advantage if this memory requirement could be reduced.
| jffry wrote:
| With `ollama run gemma3:27b-it-qat "What is blue"`, GPU memory
| usage is just a hair over 20GB, so no, probably not without a
| nerfed context window
| woadwarrior01 wrote:
| Indeed, the default context length in ollama is a mere 2048
| tokens.
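|
| If I remember the syntax right, you can raise it per session
| from the interactive prompt:
|
|     ollama run gemma3:27b-it-qat
|     >>> /set parameter num_ctx 8192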
| hskalin wrote:
| With ollama you could offload a few layers to cpu if they don't
| fit in the VRAM. This will cost some performance of course, but
| it's much better than the alternative (everything on cpu)
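|
| If you drive llama.cpp directly, the equivalent knob is -ngl
| (the number of layers kept on the GPU); a rough example:
|
|     llama-cli -m gemma-3-27b-it-q4_0.gguf -ngl 24 -p "hello"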
| senko wrote:
| I'm doing that with a 12GB card, ollama supports it out of
| the box.
|
| For some reason, it only uses around 7GB of VRAM, probably
| due to how the layers are scheduled, maybe I could tweak
| something there, but didn't bother just for testing.
|
| Obviously, perf depends on CPU, GPU and RAM, but on my
| machine (3060 + i5-13500) it's around 2 t/s.
| dockerd wrote:
| Does it work on LM Studio? Loading 27b-it-qat takes up more
| than 22GB on a 24GB Mac.
| parched99 wrote:
| I am only able to get the Gemma-3-27b-it-qat-Q4_0.gguf (15.6GB)
| to run with a 100 token context size on a 5070 ti (16GB) using
| llamacpp.
|
| Prompt Tokens: 10
|
| Time: 229.089 ms
|
| Speed: 43.7 t/s
|
| Generation Tokens: 41
|
| Time: 959.412 ms
|
| Speed: 42.7 t/s
| floridianfisher wrote:
| Try one of the smaller versions. 27b is too big for your gpu
| parched99 wrote:
| I'm aware. I was addressing the question being asked.
| tbocek wrote:
| This is probably due to this: https://github.com/ggml-
| org/llama.cpp/issues/12637. This GitHub issue is about
| interleaved sliding window attention (iSWA) not available in
| llama.cpp for Gemma 3. This could reduce the memory
| requirements a lot: they mention one scenario going from
| 62GB to 10GB.
| parched99 wrote:
| Resolving that issue would help reduce (not eliminate) the
| size of the context. The model will still only just barely
| fit in 16 GB, which is what the parent comment asked.
|
| Best to have two or more low-end, 16GB GPUs for a total of
| 32GB VRAM to run most of the better local models.
| nolist_policy wrote:
| Ollama supports iSWA.
| idonotknowwhy wrote:
| I didn't realise the 5070 is slower than the 3090. Thanks.
|
| If you want a bit more context, try -ctv q8 -ctk q8 (from
| memory so look it up) to quant the kv cache.
|
| Also an imatrix gguf like iq4xs might be smaller with better
| quality
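|
| For reference, the exact spelling is -ctk q8_0 -ctv q8_0, and
| the quantized V cache also needs -fa; roughly:
|
|     llama-cli -m gemma-3-27b-it-q4_0.gguf -ngl 99 -c 8192 \
|         -fa -ctk q8_0 -ctv q8_0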
| parched99 wrote:
| I answered the question directly. IQ4_X_S is smaller, but
| slower and less accurate than Q4_0. The parent comment
| specifically asked about the QAT version. That's literally
| what this thread is about. The context-length mention was
| relevant to show how it's only barely usable.
| abawany wrote:
| I tried the 27b-it-qat model on a 4090m with 16GB VRAM with mostly
| default args via llama.cpp and it didn't fit - used up the vram
| and tried to use about 2gb of system ram: performance in this
| setup was < 5 tps.
| diggan wrote:
| First graph is a comparison of the "Elo Score" while using
| "native" BF16 precision in various models, second graph is
| comparing VRAM usage between native BF16 precision and their QAT
| models, but since this method is about doing quantization while
| also maintaining quality, isn't the obvious graph comparing
| the quality between BF16 and QAT missing? The text doesn't seem
| to talk about it either, yet it's basically the topic of the blog
| post.
| croemer wrote:
| Indeed, the one thing I was looking for was Elo/performance of
| the quantized models, not how good the base model is. Showing
| how much memory is saved by quantization in a figure is a bit
| of an insult to the intelligence of the reader.
| nithril wrote:
| In addition, the "Massive VRAM Savings" graph states what
| looks like a tautology: reducing from 16 bits to 4 bits
| unsurprisingly leads to a 4x reduction in memory usage.
| claiir wrote:
| Yea they mention a "perplexity drop" relative to naive
| quantization, but that's meaningless to me.
|
| > We reduce the perplexity drop by 54% (using llama.cpp
| perplexity evaluation) when quantizing down to Q4_0.
|
| Wish they showed benchmarks / added quantized versions to the
| arena! :>
| jarbus wrote:
| Very excited to see these kinds of techniques, I think getting a
| 30B level reasoning model usable on consumer hardware is going to
| be a game changer, especially if it uses less power.
| apples_oranges wrote:
| Deepseek does reasoning on my home Linux pc but not sure how
| power hungry it is
| gcr wrote:
| what variant? I'd considered DeepSeek far too large for any
| consumer GPUs
| scosman wrote:
| Some people run Deepseek on CPU. 37B active params - it
| isn't fast but it's passable.
| danielbln wrote:
| Actual deepseek or some qwen/llama reasoning fine-tune?
| scosman wrote:
| Actual Deepseek. 500gb of memory and a threadripper
| works. Not a standard PC spec, but a common-ish homebrew
| setup for single-user Deepseek.
| wtcactus wrote:
| They keep mentioning the RTX 3090 (with 24 GB VRAM), but the
| model is only 14.1 GB.
|
| Shouldn't it fit a 5060 Ti 16GB, for instance?
| jsnell wrote:
| Memory is needed for more than just the parameters, e.g. the KV
| cache.
| cubefox wrote:
| KV = key-value
| oktoberpaard wrote:
| With a 128K context length and 8 bit KV cache, the 27b model
| occupies 22 GiB on my system. With a smaller context length you
| should be able to fit it on a 16 GiB GPU.
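|
| For anyone wanting to reproduce this on Ollama: recent builds
| can enable the 8-bit KV cache via environment variables (names
| from memory, so double-check them):
|
|     OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve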
| Havoc wrote:
| Just checked - 19 gigs with 8k context @ q8 kv. Plus another
| 2.5-ish or so for OS etc.
|
| ...so yeah 3090
| noodletheworld wrote:
| ?
|
| Am I missing something?
|
| These have been out for a while; if you follow the HF link you
| can see, for example, the 27b quant has been downloaded from HF
| 64,000 times over the last 10 days.
|
| Is there something more to this, or is just a follow up blog
| post?
|
| (is it just that ollama finally has partial (no images right?)
| support? Or something else?)
| deepsquirrelnet wrote:
| QAT "quantization aware training" means they had it quantized
| to 4 bits during training rather than after training in full or
| half precision. It's supposedly a higher quality, but
| unfortunately they don't show any comparisons between QAT and
| post-training quantization.
| noodletheworld wrote:
| I understand that, but the qat models (1) are not new
| uploads.
|
| How is this more significant now than when they were uploaded
| 2 weeks ago?
|
| Are we expecting new models? I don't understand the timing.
| This post feels like it's two weeks late.
|
| [1] - https://huggingface.co/collections/google/gemma-3-qat-6
| 7ee61...
| llmguy wrote:
| 8 days is closer to 1 week than 2. And it's a blog post,
| nobody owes you realtime updates.
| noodletheworld wrote:
| https://huggingface.co/google/gemma-3-27b-it-
| qat-q4_0-gguf/t...
|
| > 17 days ago
|
| Anywaaay...
|
| I'm literally asking, quite honestly, if this is just an
| 'after the fact' update literally weeks later, that they
| uploaded a bunch of models, or if there is something more
| significant about this I'm missing.
| timcobb wrote:
| Probably the former... I see your confusion but it's
| really only a couple weeks at most. The news cycle is
| strong in you, grasshopper :)
| osanseviero wrote:
| Hi! Omar from the Gemma team here.
|
| Last time we only released the quantized GGUFs. Only
| llama.cpp users could use it (+ Ollama, but without
| vision).
|
| Now, we released the unquantized checkpoints, so anyone
| can quantize themselves and use in their favorite tools,
| including Ollama with vision, MLX, LM Studio, etc. MLX
| folks also found that the QAT model quantized to 3 bits held
| up decently compared to naive 3-bit quantization, so by releasing the
| unquantized checkpoints we allow further experimentation
| and research.
|
| TL;DR. One was a release in a specific format/tool, we
| followed up with a full release of artifacts that enable
| the community to do much more.
| oezi wrote:
| Hey Omar, is there any chance that Gemma 3 might get a
| speech (ASR/AST/TTS) release?
| simonw wrote:
| The official announcement of the QAT models happened on
| Friday 18th, two days ago. It looks like they uploaded them
| to HF in advance of that announcement:
| https://developers.googleblog.com/en/gemma-3-quantized-
| aware...
|
| The partnership with Ollama and MLX and LM Studio and
| llama.cpp was revealed in that announcement, which made the
| models a lot easier for people to use.
| xnx wrote:
| The linked blog post was 2 days ago
| Patrick_Devine wrote:
| Ollama has had vision support for Gemma3 since it came out. The
| implementation is _not_ based on llama.cpp's version.
| behnamoh wrote:
| This is what local LLMs need--being treated like first-class
| citizens by the companies that make them.
|
| That said, the first graph is misleading about the number of
| H100s required to run DeepSeek r1 at FP16. The model is FP8.
| freeamz wrote:
| so what is the real comparison against DeepSeek r1? Would be
| good to know which is actually more cost efficient and open
| (reproducible build) to run locally.
| behnamoh wrote:
| Half the amount of those dots is what it takes. But also, why
| compare a 27B model with a 600B+ one? That doesn't make sense.
| smallerize wrote:
| It's an older image that they just reused for the blog
| post. It's on https://ai.google.dev/gemma for example
| mmoskal wrote:
| Also ~no one runs an H100 at home, i.e. at batch size 1. What
| matters is throughput. With 37B active parameters and a massive
| deployment, throughput (per GPU) should be similar to Gemma.
| mythz wrote:
| The speed gains are real: after downloading the latest QAT
| gemma3:27b, eval perf is now 1.47x faster on Ollama, up from
| 13.72 to 20.11 tok/s (on A4000s).
| btbuildem wrote:
| Is 27B the largest QAT Gemma 3? Given these size reductions, it
| would be amazing to have the 70B!
| arnaudsm wrote:
| The original Gemma 3 does not have a 70B version.
| btbuildem wrote:
| Ah thank you
| umajho wrote:
| I am currently using the Q4_K_M quantized version of
| gemma-3-27b-it locally. I previously assumed that a 27B model
| with image input support wouldn't be very high quality, but after
| actually using it, the generated responses feel better than those
| from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M),
| and its recognition of images is also stronger than I expected.
| (I thought the model could only roughly understand the concepts
| in the image, but I didn't expect it to be able to recognize text
| within the image.)
|
| Since this article publishes the optimized Q4 quantized version,
| it would be great if it included more comparisons between the new
| version and my currently used unoptimized Q4 version (such as
| benchmark scores).
|
| (I deliberately wrote this reply in Chinese and had
| gemma-3-27b-it Q4_K_M translate it into English.)
| rob_c wrote:
| Given how long between this being released and this community
| picking up on it... Lol
| GaunterODimm wrote:
| 2 days :/...
| rob_c wrote:
| Given I know people who have been running gemma3 on local
| devices for almost a month now, this is either a very slow news
| day or evidence of fingers missing the pulse...
| https://blog.google/technology/developers/gemma-3/
| simonw wrote:
| This is new. These are new QAT (Quantization-Aware
| Training) models released by the Gemma team.
| rob_c wrote:
| There's nothing more than an iteration on the topic,
| gemma3 was smashing local results a month ago and made no
| waves as it dropped...
| simonw wrote:
| Quoting the linked story:
|
| > Last month, we launched Gemma 3, our latest generation
| of open models. Delivering state-of-the-art performance,
| Gemma 3 quickly established itself as a leading model
| capable of running on a single high-end GPU like the
| NVIDIA H100 using its native BFloat16 (BF16) precision.
|
| > To make Gemma 3 even more accessible, we are announcing
| new versions optimized with Quantization-Aware Training
| (QAT) that dramatically reduces memory requirements while
| maintaining high quality.
|
| The thing that's new, and that is clearly resonating with
| people, is the "To make Gemma 3 even more accessible..."
| bit.
| rob_c wrote:
| As I've said in my lectures on how to perform 1bit
| training of QAT systems to build classifiers...
|
| "An iteration on a theme".
|
| Once the network design is proven to work yes it's an
| impressive technical achievement, but as I've said given
| I've known people in multiple research institutes and
| companies using Gemma3 for a month mostly saying they're
| surprised it's not getting noticed... This is just
| enabling more users, but the non-QAT version will almost
| always perform better...
| simonw wrote:
| Sounds like you're excited to see Gemma 3 get the
| recognition it deserves on Hacker News then.
| rob_c wrote:
| No just pointing out the flooding obvious as usual and
| collecting down votes for it
| fragmede wrote:
| Speaking for myself, my downvotes are not because of the
| content of your arguments, but because your tone is
| consistently condescending and dismissive. Comments like
| "just pointing out the flooding obvious" come off as smug
| and combative rather than constructive.
|
| HN works best when people engage in good faith, stay
| curious, and try to move the conversation forward. That
| kind of tone -- even when technically accurate --
| discourages others from participating and derails
| meaningful discussion.
|
| If you're getting downvotes regularly, maybe it's worth
| considering how your comments are landing with others,
| not just whether they're "right."
| rob_c wrote:
| My tone only switches once people get uppity. The
| original comment is on point and accurate, not combative
| and not insulting (unless the community seriously takes a
| 'lol'....
|
| Tbh I give up writing that in response to this rant. My
| polite poke holds and it's non insulting so I'm not going
| to capitulate to those childish enough to not look
| inwards.
| simonw wrote:
| I think gemma-3-27b-it-qat-4bit is my new favorite local model -
| or at least it's right up there with Mistral Small 3.1 24B.
|
| I've been trying it on an M2 64GB via both Ollama and MLX. It's
| very, very good, and it only uses ~22GB (via Ollama) or ~15GB
| (MLX) leaving plenty of memory for running other apps.
|
| Some notes here:
| https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
|
| Last night I had it write me a complete plugin for my LLM tool
| like this:
|
|     llm install llm-mlx
|     llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
|     llm -m mlx-community/gemma-3-27b-it-qat-4bit \
|       -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
|       -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
|       -s 'Write a new fragments plugin in Python that registers
|           issue:org/repo/123 which fetches that issue number from the
|           specified github repo and uses the same markdown logic as
|           the HTML page to turn that into a fragment'
|
| It gave a solid response!
| https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... -
| more notes here: https://simonwillison.net/2025/Apr/20/llm-
| fragments-github/
| rs186 wrote:
| Can you quote tps?
|
| More and more I start to realize that cost saving is a small
| problem for local LLMs. If it is too slow, it becomes unusable,
| so much that you might as well use public LLM endpoints. Unless
| you really care about getting things done locally without
| sending information to another server.
|
| With the OpenAI API/ChatGPT, I get responses much faster than I
| can read, and for a simple question I just need a glimpse of the
| response, copy & paste, and get things done. Whereas on a local
| LLM, I watch it painstakingly print preambles that I don't care
| about, and get what I actually need after 20 seconds
| (on a fast GPU).
|
| And I am not yet talking about context window etc.
|
| I have been researching about how people integrate local LLMs
| in their workflows. My finding is that most people play with it
| for a short time and that's about it, and most people are much
| better off spending money on OpenAI credits (which can last a
| very long time with typical usage) than getting a beefed up Mac
| Studio or building a machine with 4090.
| simonw wrote:
| My tooling doesn't measure TPS yet. It feels snappy to me on
| MLX.
|
| I agree that hosted models are usually a better option for
| most people - much faster, higher quality, handle longer
| inputs, really cheap.
|
| I enjoy local models for research and for the occasional
| offline scenario.
|
| I'm also interested in their applications for journalism,
| specifically for dealing with extremely sensitive data like
| leaked information from confidential sources.
| freeamz wrote:
| >I'm also interested in their applications for journalism,
| specifically for dealing with extremely sensitive data like
| leaked information from confidential sources.
|
| Think it is NOT just you. Most company with decent
| management also would not want their data going to anything
| outside the physical server they have in control of. But
| yeah for most people just use an app and hosted server. But
| this is HN, there are ppl here hosting their own email
| servers, so shouldn't be too hard to run llm locally.
| simonw wrote:
| "Most company with decent management also would not want
| their data going to anything outside the physical server
| they have in control of."
|
| I don't think that's been true for over a decade: AWS
| wouldn't be trillion dollar business if most companies
| still wanted to stay on-premise.
| terhechte wrote:
| Or GitHub. I'm always amused when people don't want to
| send fractions of their code to a LLM but happily host it
| on GitHub. All big llm providers offer no-training-on-
| your-data business plans.
| tarruda wrote:
| > I'm always amused when people don't want to send
| fractions of their code to a LLM but happily host it on
| GitHub
|
| What amuses me even more is people thinking their code is
| too unique and precious, and that GitHub/Microsoft wants
| to steal it.
| AlexCoventry wrote:
| Concern about platform risk in regard to Microsoft is
| historically justified.
| Terretta wrote:
| Unlikely they think Microsoft or GitHub wants to steal
| it.
|
| With LLMs, they're thinking of examples that regurgitated
| proprietary code, and contrary to everyday general
| observation, valuable proprietary code does exist.
|
| But with GitHub, the thinking is generally the opposite:
| the worry is that the code is terrible, and seeing it
| would be like giant blinkenlights* indicating the way in.
|
| * https://en.wikipedia.org/wiki/Blinkenlights
| vikarti wrote:
| Regulations sometimes matter. Stupid "security" rules
| sometimes matter too.
| __float wrote:
| While none of that is false, I think there's a big
| difference from shipping your data to an external LLM API
| and using AWS.
|
| Using AWS is basically a "physical server they have
| control of".
| simonw wrote:
| That's why AWS Bedrock and Google Vertex AI and Azure AI
| model inference exist - they're all hosted LLM services
| that offer the same compliance guarantees that you get
| from regular AWS-style hosting agreements.
| IanCal wrote:
| As in aws is a much bigger security concern?
| ipdashc wrote:
| Yeah, this has been confusing me a bit. I'm not
| complaining by ANY means, but why does it suddenly feel
| like everyone cares about data privacy in LLM contexts,
| way more than previous attitudes to allowing data to sit
| on a bunch of random SaaS products?
|
| I assume because of the assumption that the AI companies
| will train off of your data, causing it to leak? But I
| thought all these services had enterprise tiers where
| they'll promise not to do that?
|
| Again, I'm not complaining, it's good to see people
| caring about where their data goes. Just interesting that
| they care now, but not before. (In some ways LLMs should
| be one of the safer services, since they don't even
| really need to store any data, they can delete it after
| the query or conversation is over.)
| pornel wrote:
| It is due to the risk of a leak.
|
| Laundering of data through training makes it a more
| complicated case than a simple data theft or copyright
| infringement.
|
| Leaks could be accidental, e.g. due to an employee
| logging in to their free-as-in-labor personal account
| instead of a no-training Enterprise account. It's safer
| to have a complete ban on providers that _may_ collect
| data for training.
| 6510 wrote:
| Their entire business model is based on taking other people's
| stuff. I can't imagine someone would willingly drown with
| the sinking ship if the entire cargo is filled with
| lifeboats - just because they promised they would.
| vbezhenar wrote:
| How can you be sure that AWS will not use your data to
| train their models? They got enormous data, probably most
| data in the world.
| simonw wrote:
| Being caught doing that would be wildly harmful to their
| business - billions of dollars harmful, especially given
| the contracts they sign with their customers. The brand
| damage would be unimaginably expensive too.
|
| There is no world in which training on customer data
| without permission would be worth it for AWS.
|
| Your data really isn't that useful anyway.
| mdp2021 wrote:
| > _Your data really isn't that useful anyway_
|
| ? One single random document, maybe, but as an aggregate,
| I understood some parties were trying to scrape
| indiscriminately - the "big data" way. And if some of
| that input is sensitive, and is stored somewhere in the
| NN, it may come out in an output - in theory...
|
| Actually I never researched the details of the potential
| phenomenon - that anything personal may be stored (not
| just George III but Random Randy) - but it seems
| possible.
| simonw wrote:
| There's a pretty common misconception that training LLMs
| is about loading in as much data as possible no matter
| the source.
|
| That might have been true a few years ago but today the
| top AI labs are all focusing on quality: they're trying
| to find the best possible sources of high quality tokens,
| not randomly dumping in anything they can obtain.
|
| Andrej Karpathy said this last year:
| https://twitter.com/karpathy/status/1797313173449764933
|
| > Turns out that LLMs learn a lot better and faster from
| educational content as well. This is partly because the
| average Common Crawl article (internet pages) is not of
| very high value and distracts the training, packing in
| too much irrelevant information. The average webpage on
| the internet is so random and terrible it's not even
| clear how prior LLMs learn anything at all.
| mdp2021 wrote:
| Obviously the training data should be preferably high
| quality - but there you have a (pseudo-, I insisted also
| elsewhere citing the rights to have read whatever is in
| any public library) problem with "copyright".
|
| If there exists some advantage on quantity though, then
| achieving high quality imposes questions about tradeoffs
| and workflows - sources where authors are "free
| participants" could have odd data sip in.
|
| And the matter of whether such data may be reflected in
| outputs remains as a question (probably tackled by some I
| have not read... Ars longa, vita brevis).
| freeamz wrote:
| In Scandinavia, financial-related servers must be in the
| country! That always sounded like a sane approach. The
| whole putting your data on saas or AWS just seems like
| the same "Let's shift the responsibility to a big
| player".
|
| Any important data should NOT be on devices that are NOT
| physically within our jurisdiction.
| mjlee wrote:
| AWS has a strong track record, a clear business model
| that isn't predicated on gathering as much data as
| possible, and an awful lot to lose if they break their
| promises.
|
| Lots of AI companies have some of these, but not to the
| same extent.
| Tepix wrote:
| on-premises
|
| https://twominenglish.com/premise-vs-premises/
| belter wrote:
| > "Most company with decent management also would not
| want their data going to anything outside the physical
| server they have in control of."
|
| Most companies' physical and digital security controls are
| so much worse than anything from AWS or Google. Note I
| dont include Azure...but a _physical server_ they have
| _control of_ is a phrase that screams vulnerability.
| triyambakam wrote:
| > specifically for dealing with extremely sensitive data
| like leaked information from confidential sources.
|
| Can you explain this further? It seems in contrast to your
| previous comment about trusting Anthropic with your data
| simonw wrote:
| I trust Anthropic not to train on my data.
|
| If they get hit by a government subpoena because a
| journalist has been using them to analyze leaked
| corporate or government secret files I also trust them to
| honor that subpoena.
|
| Sometimes journalists deal with material that they cannot
| risk leaving their own machine.
|
| "News is what somebody somewhere wants to suppress"
| overfeed wrote:
| > Whereas on a local LLM, I watch it painstakingly print
| preambles that I don't care about, and get what I actually
| need after 20 seconds.
|
| You may need to "right-size" the models you use to match your
| hardware, model, and TPS expectations, which may involve
| using a smaller version of the model with faster TPS,
| upgrading your hardware, or paying for hosted models.
|
| Alternatively, if you can use agentic workflows or tools like
| Aider, you don't have to watch the model work slowly with
| large models locally. Instead you queue work for it, go to
| sleep, or eat, or do other work, and then much later look
| over the Pull Requests whenever it completes them.
| rs186 wrote:
| I have a 4070 super for gaming, and used it to play with
| LLM a few times. It is by no means a bad card, but I
| realize that unless I want to get a 4090 or a new Mac that I
| don't have any other use for, I can only use it to run
| smaller models. However, most smaller models aren't
| satisfactory and are still slower than hosted LLMs. I
| haven't found a model that I am happy with for my hardware.
|
| Regarding agentic workflows -- sounds nice but I am too
| scared to try it out, based on my experience with standard
| LLMs like GPT or Claude for writing code. Small snippets or
| filling in missing unit tests, fine, anything more
| complicated? Has been a disaster for me.
| taneq wrote:
| As I understand it, these models are limited on GPU
| memory far more than GPU compute. You'd be better off
| with dual 4070s than with a single 4090 unless the 4090
| has more RAM than the other two combined.
| adastra22 wrote:
| I have never found any agent able to put together sensible
| pull requests without constant hand holding. I shudder to
| think of what those repositories must look like.
| otabdeveloper4 wrote:
| The only actually useful application of LLM's is processing
| large amounts of data for classification and/or summarizing
| purposes.
|
| That's not the stuff you want to send to a public API, this
| is something you want as a 24/7 locally running batch job.
|
| ("AI assistant" is an evolutionary dead end, and Star Trek be
| damned.)
| DJHenk wrote:
| > More and more I start to realize that cost saving is a
| small problem for local LLMs. If it is too slow, it becomes
| unusable, so much that you might as well use public LLM
| endpoints. Unless you really care about getting things done
| locally without sending information to another server.
|
| There is another aspect to consider, aside from privacy.
|
| These models are trained by downloading every scrap of
| information from the internet, including the works of many,
| many authors who have never consented to that. And they for
| sure are not going to get a share of the profits, if there are
| ever going to be any. If you use a cloud provider, you are
| basically saying that is all fine. You are happy to pay them,
| and make yourself dependent on their service, based on work
| that wasn't theirs to use.
|
| However, if you use a local model, the authors still did not
| give consent, but one could argue that the company that made
| the model is at least giving back to the community. They
| don't get any money out of it, and you are not becoming
| dependent on their hyper capitalist service. No rent-seeking.
| The benefits of the work are free to use for everyone. This
| makes using AI a little more acceptable from a moral
| standpoint.
| ein0p wrote:
| Sometimes TPS doesn't matter. I've generated textual
| descriptions for 100K or so images in my photo archive, some
| of which I have absolutely no interest in uploading to
| someone else's computer. This works pretty well with Gemma. I
| use local LLMs all the time for things where privacy is even
| remotely important. I estimate this constitutes easily a
| quarter of my LLM usage.
| lodovic wrote:
| This is a really cool idea. Do you pretrain the model so it
| can tag people? I have so many photos that it seems
| impossible to ever categorize them; using a workflow like
| yours might help a lot.
| ein0p wrote:
| No, tagging of people is already handled by another
| model. Gemma just describes what's in the image, and
| produces a comma separated list of keywords. No
| additional training is required besides a few tweaks to
| the prompt so that it outputs just the description,
| without any "fluff". E.g. it normally prepends such
| outputs with "Here's a description of the image:" unless
| you really insist that it should output only the
| description. I suppose I could use constrained decoding
| into JSON or something to achieve the same, but I didn't
| mess with that.
|
| On some images where Gemma3 struggles Mistral Small
| produces better descriptions, BTW. But it seems harder to
| make it follow my instructions exactly.
|
| I'm looking forward to the day when I can also do this
| with videos, a lot of which I also have no interest in
| uploading to someone else's computer.
| fer wrote:
| How do you use the keywords after? I have Immich running
| which does some analysis, but the querying is a bit of a
| hit and miss.
| ein0p wrote:
| Search is indeed hit and miss. Immich, for instance,
| currently does absolutely nothing with the EXIF
| "description" field, so I store textual descriptions on
| the side as well. I have found Immich's search by image
| embeddings to be pretty weak at recall, and even weaker
| at ranking. IIRC Lightroom Classic (which I also use, but
| haven't found a way to automate this for without writing
| an extension) does search that field, but ranking is a
| bit of a dumpster fire, so your best bet is searching
| uncommon terms or constraining search by metadata (e.g.
| not just "black kitten" but "black kitten AND 2025"). I
| expect this to improve significantly over time - it's a
| fairly obvious thing to add given the available tech.
| ethersteeds wrote:
| > No, tagging of people is already handled by another
| model.
|
| As an aside, what model/tools do you prefer for tagging
| people?
| mentalgear wrote:
| Since you already seem to have done some impressive work
| on this for your personal use, would you mind open
| sourcing it?
| starik36 wrote:
| I was thinking of doing the same, but I would like to
| include people's names in the description. For example
| "Jennifer looking out in the desert sky.".
|
| As it stands, Gemma will just say "Woman looking out in the
| desert sky."
| ein0p wrote:
| Most search rankers do not consider word order, so if you
| could also append the person's name at the end of text
| description, it'd probably work well enough for retrieval
| and ranking at least.
|
| If you want natural language to resolve the names, that'd
| at a minimum require bounding boxes of the faces and
| their corresponding names. It'd also require either
| preprocessing, or specialized training, or both. To my
| knowledge no locally-hostable model as of today has that.
| I don't know if any proprietary models can do this
| either, but it's certainly worth a try - they might just
| do it. The vast majority of the things they can do is
| emergent, meaning they were never specifically trained to
| do them.
| k__ wrote:
| The local LLM is your project manager, the big remote ones
| are the engineers and designers :D
| trees101 wrote:
| Not sure how accurate my stats are. I used ollama with the
| --verbose flag. Using a 4090 and all default settings, I get
| 40 TPS for the Gemma 3 27B model
|
| `ollama run gemma3:27b --verbose` gives me 42.5 TPS +-0.3TPS
|
| `ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS
| +-0.3TPS
|
| Strange results; the full model gives me slightly more TPS.
| orangecat wrote:
| ollama's `gemma3:27b` is also 4-bit quantized, you need
| `27b-it-q8_0` for 8 bit or `27b-it-fp16` for FP16. See
| https://ollama.com/library/gemma3/tags.
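|
| So for an apples-to-apples comparison against the QAT build,
| you'd run something like:
|
|     ollama run gemma3:27b-it-q8_0 --verbose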
| starik36 wrote:
| On an A5000 with 24GB, this model typically gets between 20
| to 25 tps.
| a_e_k wrote:
| I'm seeing ~38--42 tps on a 4090 in a fresh build of
| llama.cpp under Fedora 42 on my personal machine.
|
| (-t 32 -ngl 100 -c 8192 -fa -ctk q8_0 -ctv q8_0 -m
| models/gemma-3-27b-it-qat-q4_0.gguf)
| pantulis wrote:
| > Can you quote tps?
|
| LM Studio running on a Mac Studio M4 Max with 128GB,
| gemma-3-27B-it-QAT-Q4_0.gguf with a 4096 token context I get
| 8.89 tps.
| jychang wrote:
| That's pretty terrible. I'm getting 18tok/sec Gemma 3 27b
| QAT on a M1 Max 32gb macbook.
| pantulis wrote:
| Yeah, I know. Not sure if this is due to something in LM
| Studio or whatever.
| kristianp wrote:
| Is QAT a different quantisation format to Q4_0? Can you try
| "gemma-3-27b-it-qat" for a model:
| https://lmstudio.ai/model/gemma-3-27b-it-qat
| nico wrote:
| Been super impressed with local models on mac. Love that the
| gemma models have 128k token context input size. However,
| outputs are usually pretty short
|
| Any tips on generating long output? Like multiple pages of a
| document, a story, a play or even a book?
| simonw wrote:
| The tool you are using may set a default max output size
| without you realizing. Ollama has a num_ctx that defaults to
| 2048 for example: https://github.com/ollama/ollama/blob/main/
| docs/faq.md#how-c...
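|
| A quick sketch of overriding it per request via the API:
|
|     curl http://localhost:11434/api/generate -d '{
|       "model": "gemma3:27b-it-qat",
|       "prompt": "What is blue?",
|       "options": {"num_ctx": 8192}
|     }'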
| nico wrote:
| Been playing with that, but doesn't seem to have much
| effect. It works very well to limit output to smaller bits,
| like setting it to 100-200. But above 2-4k the output seems
| to never get longer than about 1 page
|
| Might try using the models with mlx instead of ollama to
| see if that makes a difference
|
| Any tips on prompting to get longer outputs?
|
| Also, does the model context size determine max output
| size? Are the two related or are they independent
| characteristics of the model?
| simonw wrote:
| Interestingly the Gemma 3 docs say: https://ai.google.dev
| /gemma/docs/core/model_card_3#:~:text=T...
|
| > Total output context up to 128K tokens for the 4B, 12B,
| and 27B sizes, and 32K tokens for the 1B size per
| request, subtracting the request input tokens
|
| I don't know how to get it to output anything that length
| though.
| nico wrote:
| Thank you for the insights and useful links
|
| Will keep experimenting, will also try mistral3.1
|
| edit: just tried mistral3.1 and the quality of the output
| is very good, at least compared to the other models I
| tried (llama2:7b-chat, llama2:latest, gemma3:12b, qwq and
| deepseek-r1:14b)
|
| Doing some research, because of their training sets, it
| seems like most models are not trained on producing long
| outputs so even if they technically could, they won't.
| Might require developing my own training dataset and then
| doing some fine tuning. Apparently the models and ollama
| have some safeguards against rambling and repetition
| Gracana wrote:
| You can probably find some long-form tuned models on HF.
| I've had decent results with QwQ-32B (which I can run on
| my desktop) and Mistral Large (which I have to run on my
| server). Generating and refining an outline before
| writing the whole piece can help, and you can also split
| the piece up into multiple outputs (working a paragraph
| or two at a time, for instance). So far I've found it to
| be a tough process, with mixed results.
| nico wrote:
| Thank you, will try out your suggestions
|
| Have you used something like a director model to
| supervise the output? If so, could you comment on the
| effectiveness of it and potentially any tips?
| Gracana wrote:
| Nope, sounds neat though. There's so much to keep up with
| in this space.
| Casteil wrote:
| This is basically the opposite of what I've experienced - at
| least compared to another recent entry like IBM's Granite
| 3.3.
|
| By comparison, Gemma3's output (both 12b and 27b) seems to
| typically be more long/verbose, but not problematically so.
| nico wrote:
| I agree with you. The outputs are usually good, it's just
| that for the use case I have now (writing several pages of
| long dialogs), the output is not as long as I'd want it,
| and definitely not as long as it's supposedly capable of
| doing
| tootie wrote:
| I'm using 12b and getting seriously verbose answers. It's
| squeezed into 8GB and takes its sweet time but answers are
| really solid.
| tomrod wrote:
| Simon, what is your local GPU setup? (No doubt you've covered
| this, but I'm not sure where to dig up).
| simonw wrote:
| MacBook Pro M2 with 64GB of RAM. That's why I tend to be
| limited to Ollama and MLX - stuff that requires NVIDIA
| doesn't work for me locally.
| Elucalidavah wrote:
| > MacBook Pro M2 with 64GB of RAM
|
| Are there non-mac options with similar capabilities?
| simonw wrote:
| Yes, but I don't really know anything about those.
| https://www.reddit.com/r/LocalLLaMA/ is full of people
| running models on PCs with NVIDIA cards.
|
| The unique benefit of an Apple Silicon Mac at the moment
| is that the 64GB of RAM is available to both the GPU and
| the CPU at once. With other hardware you usually need
| dedicated separate VRAM for the GPU.
| _neil wrote:
| It's not out yet, but the upcoming Framework desktop [0]
| is supposed to have a similar unified memory setup.
|
| [0] https://frame.work/desktop
| dwood_dev wrote:
| Anything with the Radeon 8060S/Ryzen AI Max+ 395. One of
| the popular MiniPC Chinese brands has them for
| preorder[0] with shipping starting May 7th. Framework
| also has them, but shipping Q3.
|
| 0: https://www.gmktec.com/products/prepaid-deposit-amd-
| ryzen(tm)-a...
| chpatrick wrote:
| I've never been able to get ROCm working reliably
| personally.
| danans wrote:
| Nvidia Orin AGX if a desktop form factor works for you.
| chpatrick wrote:
| I remember seeing a post about someone running the full
| size DeepSeek model in a dual-Xeon server with a ton of
| RAM.
| jychang wrote:
| MLX is slower than GGUFs on Macs.
|
| On my M1 Max macbook pro, the GGUF version
| bartowski/google_gemma-3-27b-it-qat-GGUF is 15.6gb and runs
| at 17tok/sec, whereas mlx-community/gemma-3-27b-it-qat-4bit
| is 16.8gb and runs at 15tok/sec. Note that both of these
| are the new QAT 4bit quants.
| phaedrix wrote:
| No, in general MLX versions are always faster; I've tested
| most of them.
| 85392_school wrote:
| What TPS difference are you getting?
| littlestymaar wrote:
| > and it only uses ~22GB (via Ollama) or ~15GB (MLX)
|
| Why is the memory use different? Are you using different
| context size in both set-ups?
| simonw wrote:
| No idea. MLX is its own thing, optimized for Apple Silicon.
| Ollama uses GGUFs.
|
| https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0.
| https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit
| says it's 4bit. I think those are the same quantization?
| jychang wrote:
| Those are the same quant, but this is a good example of why
| you shouldn't use ollama. Either directly use llama.cpp, or
| use something like LM Studio if you want something with a
| GUI/easier user experience.
|
| The Gemma 3 27B QAT GGUF should be taking up ~15GB, not
| 22GB.
| Patrick_Devine wrote:
| The vision tower is 7GB, so I was wondering if you were
| loading it without vision?
| paprots wrote:
| The original gemma3:27b also took only 22GB using Ollama on my
| 64GB MacBook. I'm quite confused that the QAT took the same. Do
| you know why? Which model is better? `gemma3:27b`, or
| `gemma3:27b-qat`?
| nolist_policy wrote:
| I suspect your "original gemma3:27b" was a quantized model
| since the non-quantized (16bit) version needs around 54gb.
| kgwgk wrote:
| Look up 27b in https://ollama.com/library/gemma3/tags
|
| You'll find the id a418f5838eaf which also corresponds to
| 27b-it-q4_K_M
| superkuh wrote:
| Quantization aware training just means having the model deal
| with quantized values a bit during training so it handles the
| quantization better when it is quantized after training/etc.
| It doesn't change the model size itself.
| zorgmonkey wrote:
| Both versions are quantized and should use the same amount of
| RAM, the difference with QAT is the quantization happens
| during training time and it should result in slightly better
| (closer to the bf16 weights) output
| prvc wrote:
| > ~15GB (MLX) leaving plenty of memory for running other apps.
|
| Is that small enough to run well (without thrashing) on a
| system with only 16GiB RAM?
| simonw wrote:
| I expect not. On my Mac at least I've found I need a bunch of
| GB free to have anything else running at all.
| mnoronha wrote:
| Any idea why MLX and ollama use such different amounts of
| ram?
| jychang wrote:
| I don't think ollama is quantizing the embeddings table,
| which is still full FP16.
|
| If you're using MLX, that means you're on a mac, in which
| case ollama actually isn't your best option. Either
| directly use llama.cpp if you're a power user, or use LM
| Studio if you want something a bit better than ollama but
| more user friendly than llama.cpp. (LM Studio has a GUI
| and is also more user friendly than ollama, but has the
| downsides of not being as scriptable. You win some, you
| lose some.)
|
| Don't use MLX, it's not as fast/small as the best GGUFs
| currently (and also tends to be more buggy, it currently
| has some known bugs with japanese). Download the LM
| Studio version of the Gemma 3 QAT GGUF quants, which are
| made by Bartowski. Google actually directly mentions
| Bartowski in blog post linked above (ctrl-f his name),
| and his models are currently the best ones to use.
|
| https://huggingface.co/bartowski/google_gemma-3-27b-it-
| qat-G...
|
| The "best Gemma 3 27b model to download" crown has taken
| a very roundabout path. After the initial Google release,
| it went from Unsloth Q4_K_M, to Google QAT Q4_0, to
| stduhpf Q4_0_S, to Bartowski Q4_0 now.
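|
| To grab them without LM Studio, something like this should
| work (the --include pattern is a guess at the file naming):
|
|     huggingface-cli download \
|         bartowski/google_gemma-3-27b-it-qat-GGUF \
|         --include "*q4_0*" --local-dir .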
| codybontecou wrote:
| Can you run the mlx-variation of this model through Ollama so
| that I can interact with it in Open WebUI?
| simonw wrote:
| I haven't tried it yet but there's an MLX project that
| exposes an OpenAI-compatible serving endpoint that should
| work with Open WebUI: https://github.com/madroidmaq/mlx-omni-
| server
| codybontecou wrote:
| Appreciate the link. I'll try to tinker with it later
| today.
| bobjordan wrote:
| Thanks for the call out on this model! I have 42gb usable VRAM
| on my ancient (~10yrs old) quad-sli titan-x workstation and
| have been looking for a model to balance large context window
| with output quality. I'm able to run this model with a 56K
| context window and it just fits into my 42gb VRAM to run 100%
| GPU. The output quality is really good and 56K context window
| is very usable. Nice find!
| ygreif wrote:
| Do many consumer GPUs have >20 gigabytes RAM? That sounds like
| a lot to me
| mcintyre1994 wrote:
| I don't think so, but Apple's unified memory architecture
| makes it a possibility for people with Macbook Pros.
| justanotheratom wrote:
| Anyone packaged one of these in an iPhone App? I am sure it is
| doable, but I am curious what tokens/sec is possible these days.
| I would love to ship "private" AI Apps if we can get reasonable
| tokens/sec.
| Alifatisk wrote:
| If you ever ship a private AI app, don't forget to implement
| the export functionality, please!
| idonotknowwhy wrote:
| You mean conversations? Just the jsonl of the standard hf
| dataset format to import into other systems?
| Alifatisk wrote:
| Yeah I mean conversations.
| nico wrote:
| What kind of functionality do you need from the model?
|
| For basic conversation and RAG, you can use tinyllama or
| qwen-2.5-0.5b, both of which run on a raspberry pi at around
| 5-20 tokens per second
| justanotheratom wrote:
| I am looking for structured output at about 100-200
| tokens/second on iPhone 14+. Any pointers?
| zamadatix wrote:
| There are many such apps, e.g. Mollama, Enclave AI or
| PrivateLLM or dozens of others, but you could tell me it runs
| at 1,000,000 tokens/second on an iPhone and I wouldn't care
| because the largest model version you're going to be able to
| load is Gemma 3 4B q4 (12B won't fit in 8 GB with the OS + you
| still need context) and it's just not worth the time to use.
|
| That said, if you really care, it generates faster than reading
| speed (on an A18 based model at least).
| woodson wrote:
| Some of these small models still have their uses, e.g. for
| summarization. Don't expect them to fully replace ChatGPT.
| zamadatix wrote:
| The use case is more "I'm willing to have really bad
| answers that have extremely high rates of making things up"
| than based on the application. The same goes for
| summarization, it's not like it does it well like a large
| model would.
| nolist_policy wrote:
| FWIW, I can run Gemma-3-12b-it-qat on my Galaxy Fold 4 with
| 12Gb ram at around 1.5 tokens / s. I use plain llama.cpp with
| Termux.
| Casteil wrote:
| Does this turn your phone into a personal space heater too?
| Alifatisk wrote:
| Except for being lighter than the other models, is there
| anything else the Gemma model is specifically good at or better
| than the other models at doing?
| itake wrote:
| Google claims to have better multi-language support, due
| to tokenizer improvements.
| nico wrote:
| They are multimodal. Haven't tried the QAT one yet. But the
| gemma3s released a few weeks ago are pretty good at processing
| images and telling you details about what's in them
| Zambyte wrote:
| I have found Gemma models are able to produce useful
| information about more niche subjects that other models like
| Mistral Small cannot, at the expense of never really saying "I
| don't know" where other models would; it will instead produce
| false information.
|
| For example, if I ask mistral small who I am by name, it will
| say there is no known notable figure by that name before the
| knowledge cutoff. Gemma 3 will say I am a well known <random
| profession> and make up facts. On the other hand, I have asked
| both about local organization in my area that I am involved
| with, and Gemma 3 could produce useful and factual information,
| where Mistral Small said it did not know.
| trebligdivad wrote:
| It seems pretty impressive - I'm running it on my CPU (16 core
| AMD 3950x) and it's very very impressive at translation, and the
| image description is very impressive as well. I'm getting about
| 2.3token/s on it (compared to under 1/s on the Calme-3.2 I was
| previously using). It does tend to be a bit chatty unless you
| tell it not to be; pretty much everything it'll give you a
| 'breakdown' unless you tell it not to - so for traslation my
| prompt is 'Translate the input to English, only output the
| translation' to stop it giving a breakdown of the input language.
| simonw wrote:
| What are you using to run it? I haven't got image input working
| yet myself.
| trebligdivad wrote:
| I'm using llama.cpp - built last night from head; to do image
| stuff you have to run a separate client they provide, with
| something like:
|
| ./build/bin/llama-gemma3-cli -m
| /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj
| /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this
| image." --image ~/Downloads/surprise.png
|
| Note the 2nd gguf in there - I'm not sure, but I think that's
| for encoding the image.
| terhechte wrote:
| Image input has been working with LM Studio for quite some
| time
| Havoc wrote:
| The upcoming qwen3 series is supposed to be MoE...likely to
| give better tk/s on CPU
| slekker wrote:
| What's MoE?
| zamalek wrote:
| Mixture of Experts. Very broadly speaking, there are a
| bunch of mini networks (experts) which can be independently
| activated.
| Havoc wrote:
| Mixture of experts like other guy said - everything gets
| loaded into mem but not every byte is needed to generate a
| token (unlike classic LLMs like gemma).
|
| So for devices that have lots of mem but weaker processing
| power it can get you similar output quality but faster. So
| it tends to do better on CPU and APU-like setups.
| trebligdivad wrote:
| I'm not even sure they're loading everything into memory
| for MoE; maybe they can get away with only the relevant
| experts being paged in.
| XCSme wrote:
| So how does 27b-it-qat (18GB) compare to 27b-it-q4_K_M (17GB)?
| perching_aix wrote:
| This is my first time trying to locally host a model - gave both
| the 12B and 27B QAT models a shot.
|
| I was both impressed and disappointed. Setup was piss easy, and
| the models are great conversationalists. I have a 12 gig card
| available and the 12B model ran very nice and swift.
|
| However, they're seemingly terrible at actually assisting with
| stuff. Tried something very basic: asked for a powershell one
| liner to get the native blocksize of my disks. Ended up
| hallucinating fields, then telling me to go off into the deep
| end, first elevating to admin, then using WMI, then bringing up
| IOCTL. Pretty unfortunate. Not sure I'll be able to put it to
| actual meaningful use as a result.
| parched99 wrote:
| I think Powershell is a bad test. I've noticed all local models
| have trouble providing accurate responses to Powershell-related
| prompts. Strangely, even Microsoft's model, Phi 4, is bad at
| answering these questions without careful prompting. Though, MS
| can't even provide accurate PS docs.
|
| My best guess is that there's not enough discussion/development
| related to Powershell in training data.
| fragmede wrote:
| Which, like, you'd think Microsoft has an entire team there
| whose purpose would be to generate good PowerShell for it to
| train on.
| terhechte wrote:
| Local models, due to their size more than big cloud models,
| favor popular languages rather than more niche ones. They work
| fantastic for JavaScript, Python, Bash but much worse at less
| popular things like Clojure, Nim or Haskell. Powershell is
| probably on the less popular side compared to Js or Bash.
|
| If this is your main use case you can always try to fine tune a
| model. I maintain a small llm bench of different programming
| languages and the performance difference between say Python and
| Rust on some smaller models is up to 70%
| perching_aix wrote:
| How accessible and viable is model fine-tuning? I'm not in
| the loop at all unfortunately.
| terhechte wrote:
| This is a very accessible way of playing around with the
| topic: https://transformerlab.ai
| HachiWari8 wrote:
| I tried the 27B QAT model and it hallucinates like crazy. When
| I ask it for information about some made up person, restaurant,
| place name, etc., it never says "I don't know about that" and
| instead seems eager to just make up details. The larger local
| models like the older Llama 3.3 70B seem better at this, but
| are also too big to fit on a 24GB GPU.
| jayavanth wrote:
| you should set a lower temperature
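|
| e.g. with the ollama Python client it's just a request option (a
| minimal sketch - model tag taken from upthread, prompt made up):
|
|     import ollama
|
|     resp = ollama.chat(
|         model="gemma3:27b-it-qat",
|         messages=[{"role": "user",
|                    "content": "Tell me about <made-up place>."}],
|         options={"temperature": 0.2},  # lower = less creative guessing
|     )
|     print(resp["message"]["content"])
|
| It won't eliminate hallucination, but it does cut down on the
| model confidently elaborating low-probability "facts".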
| CyberShadow wrote:
| How does it compare to CodeGemma for programming tasks?
| api wrote:
| When I see 32B or 70B models performing similarly to 200+B
| models, I don't know what to make of it. Either the larger
| models contain more breadth of information but we have managed
| to distill the latent capabilities down to similar levels, or
| the larger models are just less efficient, or the tests are not
| very good.
| simonw wrote:
| It makes intuitive sense to me that this would be possible,
| because LLMs are still mostly opaque black boxes. I expect you
| could drop a whole bunch of the weights without having a huge
| impact on quality - maybe you end up mostly ditching the parts
| that are derived from shitposts on Reddit but keep the bits
| from Arxiv for example.
|
| (That's a massive simplification of how any of this works, but
| it's how I think about it at a high level.)
| retinaros wrote:
| It's just BS benchmarks. They are all cheating at this point,
| feeding the test data into the training set. Doesn't mean the
| LLMs aren't becoming better, but when they all lie...
| mekpro wrote:
| Gemma 3 is way, way better than Llama 4. I think Meta will start
| to lose its position in LLM mindshare. Another weakness of Llama
| 4 is its model size, which is too large (even though it can run
| fast with MoE); that greatly limits the applicable users to the
| small percentage of enthusiasts who have enough GPU VRAM.
| Meanwhile, Gemma 3 is widely usable across all hardware sizes.
| miki123211 wrote:
| What would be the best way to deploy this if you're maximizing
| for GPU utilization in a multi-user (API) scenario? Structured
| output support would be a big plus.
|
| We're working with a GPU-poor organization with very strict data
| residency requirements, and these models might be exactly what we
| need.
|
| I would normally say vLLM, but the blog post notably does not
| mention vLLM support.
| PhilippGille wrote:
| vLLM lists Gemma 3 as supported, if I'm not mistaken:
| https://docs.vllm.ai/en/latest/models/supported_models.html#...
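|
| For the multi-user, structured-output case the usual pattern is
| vLLM's OpenAI-compatible server (vllm serve <model>) plus its
| guided decoding. A sketch of the client side (the model id is
| the standard HF checkpoint, not the GGUF QAT build, and the
| exact extra_body field names can differ between vLLM versions):
|
|     from openai import OpenAI
|
|     client = OpenAI(base_url="http://localhost:8000/v1",
|                     api_key="EMPTY")   # vLLM ignores the key
|     schema = {"type": "object",
|               "properties": {"verdict": {"type": "string"},
|                              "confidence": {"type": "number"}},
|               "required": ["verdict"]}
|     resp = client.chat.completions.create(
|         model="google/gemma-3-27b-it",
|         messages=[{"role": "user",
|                    "content": "Classify this document: ..."}],
|         extra_body={"guided_json": schema},   # vLLM-specific
|     )
|     print(resp.choices[0].message.content)
|
| vLLM's continuous batching is what lets concurrent API users
| share the GPU efficiently, which is the main thing you'd gain
| over an Ollama-style single-stream setup.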
| 999900000999 wrote:
| Assuming this can match Claude's latest, and full-time usage (as
| in, you have a system that's constantly running code without any
| user input), you'd probably save $600 to $700 a month. A 4090 is
| only $2K, so you'd see an ROI within 90 days.
|
| I can imagine this will serve to drive prices for hosted LLMs
| lower.
|
| At this level, any company that produces even a nominal amount
| of code should be running LLMs on prem (or on AWS if you're on
| the cloud).
| rafaelmn wrote:
| I'd say using a Mac studio with M4 Max and 128 GB RAM will get
| you way further than 4090 in context size and model size.
| Cheaper than 2x4090 and less power while being a great overall
| machine.
|
| I think these consumer GPUs are way too expensive for the amount
| of memory they pack - and that's intentional price
| discrimination. Also the builds are gimmicky: they're just not
| set up for AI models, and the versions that are cost $20K.
|
| AMD has that 128 GB RAM Strix Halo chip, but even with soldered
| RAM the bandwidth there is very limited - half of the M4 Max,
| which is half of a 4090.
|
| I think this generation of hardware and local models is not
| there yet - would wait for M5/M6 release.
| tootie wrote:
| There's certainly room to grow but I'm running Gemma 12b on a
| 4060 (8GB VRAM) which I bought for gaming and it's a tad slow
| but still gives excellent results. And it certainly seems
| software is outpacing hardware right now. The target is
| making a good enough model that can run on a phone.
| retinaros wrote:
| two 3090 are the way to go
| briandear wrote:
| The normal Gemma models seem to work fine on Apple silicon with
| Metal. Am I missing something?
| simonw wrote:
| These new special editions of those models claim to work better
| with less memory.
| porphyra wrote:
| It is funny that Microsoft had been peddling "AI PCs" and Apple
| had been peddling "made for Apple Intelligence" for a while now,
| when in fact usable models for consumer GPUs are only barely
| starting to be a thing on extremely high end GPUs like the 3090.
| ivape wrote:
| This is why the "AI hardware cycle is hype" crowd is so wrong.
| We're not even close, we're basically at ColecoVision/Atari
| stage of hardware here. It's going be quite a thing when
| _everyone_ gets a SNES /Genesis.
| icedrift wrote:
| Capable local models have been usable on Macs for a while now
| thanks to their unified memory.
| NorwegianDude wrote:
| A 3090 is not an extremely high-end GPU. It's a consumer GPU
| launched in 2020, and even in price and compute it's around a
| mid-range consumer GPU these days.
|
| The high end consumer card from Nvidia is the RTX 5090, and the
| professional version of the card is the RTX PRO 6000.
| zapnuk wrote:
| A 3090 still costs 1800 EUR. That's not mid-range by a long
| shot.
|
| The 5070 or 5070 Ti are mid-range. They cost 650/900 EUR.
| NorwegianDude wrote:
| 3090s are no longer produced, that's why new ones are so
| expensive. At least here, used 3090s are around EUR650, and a
| RTX 5070 is around EUR625.
|
| It's definitely not extremely high end any more; the price is
| (at least here) the same as the new mid-range consumer cards.
|
| I guess the price can vary by location, but EUR1800 for a 3090
| is crazy - that's more than the new price in 2020.
| sentimentscan wrote:
| A year ago, I bought a brand-new EVGA hybrid-cooled 3090 Ti
| for 700 euros. I'm still astonished at how good of a
| decision it was, especially considering the scarcity of
| 24GB cards available for a similar price. For pure gaming,
| many cards perform better, but they mostly come with 12 to
| 16GB of VRAM.
| dragonwriter wrote:
| For model usability as a binary yes/no, pretty much the only
| dimension that matters is VRAM, and at 24GB the 3090 is still
| high end for a consumer NVidia GPU. Yes, the 5090 (and _only_
| the 5090) is above it, at 32GB, but 24GB is way ahead of the
| mid-range.
| NorwegianDude wrote:
| 24 GB of VRAM is a large amount of VRAM on a consumer GPU,
| that I totally agree with you on. But it's definitely not
| an extremely high end GPU these days. It is suitable, yes,
| but not high end. The high end alternative for a consumer
| GPU would be the RTX 5090, but that is only available for
| EUR3000 now, while used 3090s are around EUR650.
| dragonwriter wrote:
| AI PCs aren't about running the kind of models that take a
| 3090-class GPU, or even _running on GPU at all_ , but systems
| where the local end is running something like Phi-3.5-vision-
| instruct, on system RAM using a CPU with an integrated NPU,
| which is why the AI PC requirements specify an NPU, a certain
| amount of processing capacity, and a minimum amount of
| DDR5/LPDDR5 system RAM.
| mark_l_watson wrote:
| Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using
| Ollama for routine work on my 32G memory Mac.
|
| gemma3:27b-it-qat with open-codex, running locally, is just
| amazingly useful, not only for Python dev, but for Haskell and
| Common Lisp also.
|
| I still like Gemini 2.5 Pro and o3 for brainstorming or working
| on difficult problems, but for routine work it (simply) makes me
| feel good to have everything open source/weights running on my
| own system.
|
| When I bought my 32G Mac a year ago, I didn't expect to be this
| happy running gemma3:27b-it-qat with open-codex locally.
| Tsarp wrote:
| What tps are you hitting? And did you have to change KV size?
| nxobject wrote:
| Fellow owner of a 32GB MBP here: how much memory does it use
| while resident - or, if swapping happens, do you see the
| effects in your day to day work? I'm in the awkward position of
| using on a daily basis a lot of virtualized bloated Windows
| software (mostly SAS).
| mark_l_watson wrote:
| I have the usual programs running on my Mac, along with open-
| codex: Emacs, web browser, terminals, VSCode, etc. Even with
| large contexts, open-codex with Ollama and Gemma 3 27B QAT
| does not seem to overload my system.
|
| To be clear, I sometimes toggle open-codex to use the Gemini
| 2.5 Pro API also, but I enjoy running locally for simpler
| routine work.
| pantulis wrote:
| How did you manage to run open-codex against a local ollama? I
| keep getting 400 Errors no matter what I try with the
| --provider and --model options.
| pantulis wrote:
| Never mind, found your Leanpub book and followed the
| instructions and at least I have it running with qwen-2.5.
| I'll investigate what happens with Gemma.
| piyh wrote:
| Meta Maverick is crying in the shower getting so handily beat by
| a model with 15x fewer params
| Samin100 wrote:
| I have a few private "vibe check" questions and the 4 bit QAT 27B
| model got them all correctly. I'm kind of shocked at the
| information density locked in just 13 GB of weights. If anyone at
| Deepmind is reading this -- Gemma 3 27B is the single most
| impressive open source model I have ever used. Well done!
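|
| (For scale, the arithmetic roughly checks out - 4-bit weights at
| 27B params:
|
|     >>> 27e9 * 4 / 8 / 1e9   # params * bits / 8, in GB
|     13.5
|
| before the unquantized embeddings and other overhead.)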
| itake wrote:
| I tried to use the -it models for translation, but it
| completely failed at translating adult content.
|
| I think this means I either have to train the -pt model with my
| own instruction tuning or use another provider :(
| andhuman wrote:
| Have you tried Mistral Small 24b?
| itake wrote:
| My current architecture is an on-device model for fast
| translation, which is then replaced by a slower translation (via
| an API call) when it's ready.
|
| 24B would be too big to run on device, and I'm trying to keep my
| cloud costs low (meaning I can't afford to host a small 24B
| 24/7).
| jychang wrote:
| Try mradermacher/amoral-gemma3-27B-v2-qat-GGUF
| itake wrote:
| My current architecture is an on-device model for fast
| translation, which is then replaced by a slower translation (via
| an API call) when it's ready.
|
| 24B would be too big to run on device, and I'm trying to keep my
| cloud costs low (meaning I can't afford to host a small 27B
| 24/7).
| cheriot wrote:
| Is there already a Helium for GPUs?
| punnerud wrote:
| Just tested the 27B, and it's not very good at following
| instructions and is very limited on more complex code problems.
|
| Mapping from one JSON with a lot of plain text into a new
| structure fails every time.
|
| Ask it to generate SVG, and the output is very simple and almost
| too dumb.
|
| Nice that it doesn't need a huge amount of RAM, and it performs
| OK on smaller languages in my initial tests.
| Havoc wrote:
| Definitely my current fav. Also interesting that for many
| questions the response is very similar to the gemini series. Must
| be sharing training datasets pretty directly.
| mattfrommars wrote:
| Anyone had success using Gemma 3 QAT models on Ollama with
| Cline? They just don't work as well as Gemini 2.0 Flash provided
| via the API.
| technologesus wrote:
| Just for fun I created a new personal benchmark for vision-
| enabled LLMs: playing minecraft. I used JSON structured output in
| LM Studio to create basic controls for the game. Unfortunately no
| matter how hard I proompted, gemma-3-27b QAT is not really able
| to understand simple minecraft scenarios. It would say things
| like "I'm now looking at a stone block. I need to break it" when
| it is looking out at the horizon in the desert.
|
| Here is the JSON schema: https://pastebin.com/SiEJ6LEz System
| prompt: https://pastebin.com/R68QkfQu
| jvictor118 wrote:
| i've found the vision capabilities are very bad with spatial
| awareness/reasoning. They seem to know that certain things are
| in the image, but not where they are relative to each other,
| their relative sizes, etc.
| gigel82 wrote:
| FWIW, the 27b Q4_K_M takes about 23GB of VRAM with 4k context
| and 29GB with 16k context, and runs at ~61t/s on my 5090.
| manjunaths wrote:
| I am running this on 16 GB AMD Radeon 7900 GRE with 64 GB machine
| with ROCm and llama.cpp on Windows 11. I can use Open-webui or
| the native gui for the interface. It is made available via an
| internal IP to all members of my home.
|
| It runs at around 26 tokens/sec at FP16; FP8 is not supported by
| the Radeon 7900 GRE.
|
| I just love it.
|
| For coding QwQ 32b is still king. But with a 16GB VRAM card it
| gives me ~3 tokens/sec, which is unusable.
|
| I tried to make Gemma 3 write a powershell script with Terminal
| gui interface and it ran into dead-ends and finally gave up. QwQ
| 32B performed a lot better.
|
| But for most general purposes it is great. My kid's been using it
| to feed his school textbooks and ask it questions. It is better
| than anything else currently.
|
| Somehow it is more "uptight" than llama or the chinese models
| like Qwen. Can't put my finger on it, the Chinese models seem
| nicer and more talkative.
| mdp2021 wrote:
| > _My kid 's been using it to feed his school textbooks and ask
| it questions_
|
| Which method are you employing to feed a textbook into the
| model?
| anshumankmr wrote:
| my trusty RTX 3060 is gonna have its day in the sun... though I
| have run a bunch of 7B models fairly easily on Ollama.
| ece wrote:
| On Hugging Face:
| https://huggingface.co/collections/google/gemma-3-qat-67ee61...
| yuweiloopy2 wrote:
| Been using the 27B QAT model for batch processing 50K+ internal
| documents. The 128K context is game-changing for our legal review
| pipeline. Though I wish the token generation was faster - at
| 20tps it's still too slow for interactive use compared to Claude
| Opus.
| gitroom wrote:
| Nice, loving the push with local models lately - it always makes
| me wonder, though: do you think privacy wins out over speed and
| convenience in the long run, or do people just stick with what's
| quickest?
| simonw wrote:
| Speed and convenience will definitely win for most people.
| Hosted LLMs are _so cheap_ these days, and are massively more
| capable than anything you can fit on even a very beefy
| ($4,000+) consumer machine.
|
| The privacy concerns are honestly _mostly_ imaginary at this
| point, too. Plenty of hosted LLM vendors will promise not to
| train on your data. The bigger threat is if they themselves log
| data and then have a security incident, but honestly the risk
| that your own personal machine gets stolen or hacked is a lot
| higher than that.
___________________________________________________________________
(page generated 2025-04-21 23:01 UTC)