[HN Gopher] Gemma 3 QAT Models: Bringing AI to Consumer GPUs
       ___________________________________________________________________
        
       Gemma 3 QAT Models: Bringing AI to Consumer GPUs
        
       Author : emrah
       Score  : 376 points
       Date   : 2025-04-20 12:22 UTC (10 hours ago)
        
 (HTM) web link (developers.googleblog.com)
 (TXT) w3m dump (developers.googleblog.com)
        
       | emrah wrote:
       | Available on ollama: https://ollama.com/library/gemma3
        
         | jinay wrote:
         | Make sure you're using the "-it-qat" suffixed models like
         | "gemma3:27b-it-qat"
        
           | Zambyte wrote:
           | Here are the direct links:
           | 
           | https://ollama.com/library/gemma3:27b-it-qat
           | 
           | https://ollama.com/library/gemma3:12b-it-qat
           | 
           | https://ollama.com/library/gemma3:4b-it-qat
           | 
           | https://ollama.com/library/gemma3:1b-it-qat
        
           | ein0p wrote:
           | Thanks. I was wondering why my open-webui said that I already
           | had the model. I bet a lot of people are making the same
           | mistake I did and downloading just the old, post-quantized
           | 27B.
        
         | Der_Einzige wrote:
         | How many times do I have to say this? Ollama, llamacpp, and
         | many other projects are slower than vLLM/sglang. vLLM is a much
         | superior inference engine and is fully supported by the only
         | LLM frontends that matter (sillytavern).
         | 
         | The community getting obsessed with Ollama has done huge damage
          | to the field, as it's inefficient compared to vLLM. Many people
         | can get far more tok/s than they think they could if only they
         | knew the right tools.
        
           | m00dy wrote:
            | Ollama is definitely not for production loads, but vLLM is.
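            | 
            | For reference, serving a model with vLLM's OpenAI-compatible
            | server looks roughly like this (a sketch; the model id and the
            | flags are just examples):
            | 
            |     pip install vllm
            |     # starts an OpenAI-compatible API on localhost:8000
            |     vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 8192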
        
           | janderson215 wrote:
           | I did not know this, so thank you. I read a blogpost a while
            | back that encouraged using Ollama and never mentioned vLLM. Do
           | you recommend reading any particular resource?
        
           | Zambyte wrote:
           | The significant convenience benefits outweigh the higher TPS
           | that vLLM offers in the context of my single machine homelab
           | GPU server. If I was hosting it for something more critical
           | than just myself and a few friends chatting with it, sure.
           | Being able to just paste a model name into Open WebUI and run
           | it is important to me though.
           | 
           | It is important to know about both to decide between the two
           | for your use case though.
        
             | Der_Einzige wrote:
             | Running any HF model on vllm is as simple as pasting a
             | model name into one command in your terminal.
        
               | Zambyte wrote:
               | What command is it? Because that was not at all my
               | experience.
        
           | oezi wrote:
           | Why is sillytavern the only LLM frontend which matters?
        
             | GordonS wrote:
             | I tried sillytavern a few weeks ago... wow, that is an
             | "interesting" UI! I blundered around for a while, couldn't
             | figure out how to do _anything_ useful... and then
             | installed LM Studio instead.
        
               | imtringued wrote:
               | I personally thought the lorebook feature was quite neat
               | and then quickly gave up on it because I couldn't get it
               | to trigger, ever.
               | 
               | Whatever those keyword things are, they certainly don't
               | seem to be doing any form of RAG.
        
             | Der_Einzige wrote:
             | It supports more sampler and other settings than anyone
             | else.
        
           | ach9l wrote:
           | instead of ranting, maybe explain how to make a qat q4 work
           | with images in vllm, afaik it is not yet possible
        
           | oezi wrote:
           | Somebody in this thread mentioned 20.x tok/s on ollama. What
           | are you seeing in vLLM?
        
             | Zambyte wrote:
             | FWIW I'm getting 29 TPS on Ollama on my 7900 XTX with the
             | 27b qat. You can't really compare inference engine to
             | inference engine without keeping the hardware and model
             | fixed.
             | 
             | Unfortunately Ollama and vLLM are therefore incomparable at
             | the moment, because vLLM does not support these models yet.
             | 
             | https://github.com/vllm-project/vllm/issues/16856
        
           | simonw wrote:
           | Last I looked vLLM didn't work on a Mac.
        
             | mitjam wrote:
             | Afaik vllm is for concurrent serving with batched inference
             | for higher throughput, not single-user inference. I doubt
             | inference throughput is higher with single prompts at a
             | time than Ollama. Update: this is a good Intro to
             | continuous batching in llm inference:
             | https://www.anyscale.com/blog/continuous-batching-llm-
             | infere...
        
               | Der_Einzige wrote:
               | It is much faster on single prompts than ollama. 3X is
               | not unheard of
        
       | holografix wrote:
       | Could 16gb vram be enough for the 27b QAT version?
        
         | halflings wrote:
          | That's what the chart says, yes: 14.1GB VRAM usage for the 27B
         | model.
        
           | erichocean wrote:
           | That's the VRAM required just to load the model weights.
           | 
           | To actually use a model, you need a context window.
           | Realistically, you'll want a 20GB GPU or larger, depending on
           | how many tokens you need.
        
             | oezi wrote:
              | I didn't realize that the context would require so much
              | memory. Is this the KV cache? It would seem like a big
             | advantage if this memory requirement could be reduced.
        
         | jffry wrote:
         | With `ollama run gemma3:27b-it-qat "What is blue"`, GPU memory
         | usage is just a hair over 20GB, so no, probably not without a
         | nerfed context window
        
           | woadwarrior01 wrote:
           | Indeed, the default context length in ollama is a mere 2048
           | tokens.
        
         | hskalin wrote:
         | With ollama you could offload a few layers to cpu if they don't
          | fit in the VRAM. This will cost some performance of course, but
         | it's much better than the alternative (everything on cpu)
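          | 
          | Rough sketch of what that looks like (layer counts are just
          | examples and depend on your card): in llama.cpp you choose how
          | many layers go to the GPU with -ngl, and I believe Ollama
          | exposes the same knob as the num_gpu option.
          | 
          |     # llama.cpp: put 30 layers on the GPU, the rest on the CPU
          |     ./llama-cli -m gemma-3-27b-it-qat-q4_0.gguf -ngl 30 -p "hi"
          | 
          |     # Ollama, interactively:
          |     ollama run gemma3:27b-it-qat
          |     >>> /set parameter num_gpu 30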
        
           | senko wrote:
           | I'm doing that with a 12GB card, ollama supports it out of
           | the box.
           | 
           | For some reason, it only uses around 7GB of VRAM, probably
           | due to how the layers are scheduled, maybe I could tweak
           | something there, but didn't bother just for testing.
           | 
           | Obviously, perf depends on CPU, GPU and RAM, but on my
           | machine (3060 + i5-13500) it's around 2 t/s.
        
         | parched99 wrote:
         | I am only able to get the Gemma-3-27b-it-qat-Q4_0.gguf (15.6GB)
         | to run with a 100 token context size on a 5070 ti (16GB) using
         | llamacpp.
         | 
         | Prompt Tokens: 10
         | 
         | Time: 229.089 ms
         | 
         | Speed: 43.7 t/s
         | 
         | Generation Tokens: 41
         | 
         | Time: 959.412 ms
         | 
         | Speed: 42.7 t/s
        
           | floridianfisher wrote:
           | Try one of the smaller versions. 27b is too big for your gpu
        
             | parched99 wrote:
             | I'm aware. I was addressing the question being asked.
        
           | tbocek wrote:
            | This is probably due to this:
            | https://github.com/ggml-org/llama.cpp/issues/12637
            | 
            | That GitHub issue is about interleaved sliding window
            | attention (iSWA) not being available in llama.cpp for Gemma 3.
            | This could reduce the memory requirements a lot. They
            | mentioned, for a certain scenario, going from 62GB to 10GB.
        
             | parched99 wrote:
              | Resolving that issue would help reduce (not eliminate) the
              | size of the context. The model will still only just barely
              | fit in 16 GB, which is what the parent comment asked about.
             | 
             | Best to have two or more low-end, 16GB GPUs for a total of
             | 32GB VRAM to run most of the better local models.
        
       | diggan wrote:
        | The first graph compares the "Elo Score" of various models at
        | "native" BF16 precision, and the second graph compares VRAM usage
        | between native BF16 precision and their QAT models. But since
        | this method is about doing quantization while also maintaining
        | quality, isn't the obvious graph, comparing quality between BF16
        | and QAT, missing? The text doesn't seem to talk about it either,
        | yet it's basically the topic of the blog post.
        
         | croemer wrote:
         | Indeed, the one thing I was looking for was Elo/performance of
         | the quantized models, not how good the base model is. Showing
         | how much memory is saved by quantization in a figure is a bit
         | of an insult to the intelligence of the reader.
        
         | nithril wrote:
          | In addition, the "Massive VRAM Savings" graph states what
          | looks like a tautology: reducing from 16 bits to 4 bits leads,
          | unsurprisingly, to a 4x reduction in memory usage.
        
         | claiir wrote:
          | Yea they mention a "perplexity drop" relative to naive
          | quantization, but that's meaningless to me.
          | 
          | > We reduce the perplexity drop by 54% (using llama.cpp
          | perplexity evaluation) when quantizing down to Q4_0.
         | 
         | Wish they showed benchmarks / added quantized versions to the
         | arena! :>
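          | 
          | For anyone wanting to check themselves, the llama.cpp perplexity
          | numbers come from its perplexity tool, something roughly like
          | this (paths and eval text are just examples):
          | 
          |     ./llama-perplexity -m gemma-3-27b-it-qat-q4_0.gguf \
          |         -f wikitext-2-raw/wiki.test.raw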
        
       | jarbus wrote:
       | Very excited to see these kinds of techniques, I think getting a
       | 30B level reasoning model usable on consumer hardware is going to
       | be a game changer, especially if it uses less power.
        
         | apples_oranges wrote:
         | Deepseek does reasoning on my home Linux pc but not sure how
         | power hungry it is
        
           | gcr wrote:
           | what variant? I'd considered DeepSeek far too large for any
           | consumer GPUs
        
             | scosman wrote:
             | Some people run Deepseek on CPU. 37B active params - it
              | isn't fast but it's passable.
        
               | danielbln wrote:
               | Actual deepseek or some qwen/llama reasoning fine-tune?
        
               | scosman wrote:
                | Actual Deepseek. 500 GB of memory and a Threadripper
                | works. Not a standard PC spec, but a common-ish home-brew
                | setup for single-user Deepseek.
        
       | wtcactus wrote:
       | They keep mentioning the RTX 3090 (with 24 GB VRAM), but the
       | model is only 14.1 GB.
       | 
       | Shouldn't it fit a 5060 Ti 16GB, for instance?
        
         | jsnell wrote:
         | Memory is needed for more than just the parameters, e.g. the KV
         | cache.
        
           | cubefox wrote:
           | KV = key-value
        
         | oktoberpaard wrote:
         | With a 128K context length and 8 bit KV cache, the 27b model
         | occupies 22 GiB on my system. With a smaller context length you
         | should be able to fit it on a 16 GiB GPU.
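          | 
          | In llama.cpp terms, that setup looks roughly like this (a
          | sketch; paths are examples, and the quantized KV cache needs
          | flash attention enabled):
          | 
          |     ./llama-server -m gemma-3-27b-it-qat-q4_0.gguf \
          |         -c 131072 -ctk q8_0 -ctv q8_0 -fa -ngl 99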
        
       | noodletheworld wrote:
       | ?
       | 
       | Am I missing something?
       | 
       | These have been out for a while; if you follow the HF link you
       | can see, for example, the 27b quant has been downloaded from HF
       | 64,000 times over the last 10 days.
       | 
       | Is there something more to this, or is just a follow up blog
       | post?
       | 
       | (is it just that ollama finally has partial (no images right?)
       | support? Or something else?)
        
         | deepsquirrelnet wrote:
         | QAT "quantization aware training" means they had it quantized
         | to 4 bits during training rather than after training in full or
         | half precision. It's supposedly a higher quality, but
         | unfortunately they don't show any comparisons between QAT and
         | post-training quantization.
        
           | noodletheworld wrote:
           | I understand that, but the qat models (1) are not new
           | uploads.
           | 
           | How is this more significant now than when they were uploaded
           | 2 weeks ago?
           | 
           | Are we expecting new models? I don't understand the timing.
           | This post feels like it's two weeks late.
           | 
           | [1] - https://huggingface.co/collections/google/gemma-3-qat-6
           | 7ee61...
        
             | llmguy wrote:
                | 8 days is closer to 1 week than 2. And it's a blog post,
             | nobody owes you realtime updates.
        
               | noodletheworld wrote:
               | https://huggingface.co/google/gemma-3-27b-it-
               | qat-q4_0-gguf/t...
               | 
               | > 17 days ago
               | 
               | Anywaaay...
               | 
               | I'm literally asking, quite honestly, if this is just an
               | 'after the fact' update literally weeks later, that they
               | uploaded a bunch of models, or if there is something more
               | significant about this I'm missing.
        
               | timcobb wrote:
               | Probably the former... I see your confusion but it's
               | really only a couple weeks at most. The news cycle is
               | strong in you, grasshopper :)
        
               | osanseviero wrote:
               | Hi! Omar from the Gemma team here.
               | 
               | Last time we only released the quantized GGUFs. Only
               | llama.cpp users could use it (+ Ollama, but without
               | vision).
               | 
                | Now, we released the unquantized checkpoints, so anyone
                | can quantize the models themselves and use them in their
                | favorite tools, including Ollama with vision, MLX, LM
                | Studio, etc. MLX folks also found that the QAT model held
                | up decently when quantized to 3 bits, compared to naive
                | 3-bit quantization, so by releasing the unquantized
                | checkpoints we allow further experimentation and research.
               | 
                | TL;DR: one was a release in a specific format/tool; we
                | followed up with a full release of artifacts that enable
               | the community to do much more.
        
               | oezi wrote:
               | Hey Omar, is there any chance that Gemma 3 might get a
               | speech (ASR/AST/TTS) release?
        
             | simonw wrote:
             | The official announcement of the QAT models happened on
             | Friday 18th, two days ago. It looks like they uploaded them
             | to HF in advance of that announcement:
             | https://developers.googleblog.com/en/gemma-3-quantized-
             | aware...
             | 
             | The partnership with Ollama and MLX and LM Studio and
             | llama.cpp was revealed in that announcement, which made the
             | models a lot easier for people to use.
        
         | xnx wrote:
         | The linked blog post was 2 days ago
        
       | behnamoh wrote:
       | This is what local LLMs need--being treated like first-class
       | citizens by the companies that make them.
       | 
       | That said, the first graph is misleading about the number of
       | H100s required to run DeepSeek r1 at FP16. The model is FP8.
        
         | freeamz wrote:
          | So what is the real comparison against DeepSeek r1? Would be
          | good to know which is actually more cost-efficient and open
         | (reproducible build) to run locally.
        
           | behnamoh wrote:
            | Half the number of those dots is what it takes. But also, why
            | compare a 27B model with a 600B+ one? That doesn't make sense.
        
             | smallerize wrote:
             | It's an older image that they just reused for the blog
             | post. It's on https://ai.google.dev/gemma for example
        
         | mmoskal wrote:
          | Also, ~no one runs an H100 at home, i.e. at batch size 1. What
          | matters is throughput. With 37B active parameters and a massive
          | deployment, throughput (per GPU) should be similar to Gemma.
        
       | mythz wrote:
        | The speed gains are real: after downloading the latest QAT
        | gemma3:27b, eval perf is now 1.47x faster on ollama, up from 13.72
        | to 20.11 tok/s (on A4000s).
        
       | btbuildem wrote:
       | Is 27B the largest QAT Gemma 3? Given these size reductions, it
       | would be amazing to have the 70B!
        
         | arnaudsm wrote:
         | The original Gemma 3 does not have a 70B version.
        
       | umajho wrote:
       | I am currently using the Q4_K_M quantized version of
       | gemma-3-27b-it locally. I previously assumed that a 27B model
       | with image input support wouldn't be very high quality, but after
       | actually using it, the generated responses feel better than those
       | from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M),
       | and its recognition of images is also stronger than I expected.
       | (I thought the model could only roughly understand the concepts
       | in the image, but I didn't expect it to be able to recognize text
       | within the image.)
       | 
       | Since this article publishes the optimized Q4 quantized version,
       | it would be great if it included more comparisons between the new
       | version and my currently used unoptimized Q4 version (such as
       | benchmark scores).
       | 
       | (I deliberately wrote this reply in Chinese and had
       | gemma-3-27b-it Q4_K_M translate it into English.)
        
       | rob_c wrote:
        | Given how long it took between this being released and this
        | community picking up on it... Lol
        
         | GaunterODimm wrote:
          | 2 days :/...
        
           | rob_c wrote:
            | Given I know people who have been running gemma3 on local
            | devices for almost a month now, this is either a very slow
            | news day or evidence of fingers missing the pulse...
           | https://blog.google/technology/developers/gemma-3/
        
             | simonw wrote:
             | This is new. These are new QAT (Quantization-Aware
             | Training) models released by the Gemma team.
        
               | rob_c wrote:
                | This is nothing more than an iteration on the topic;
                | gemma3 was smashing local results a month ago and made no
                | waves when it dropped...
        
               | simonw wrote:
               | Quoting the linked story:
               | 
               | > Last month, we launched Gemma 3, our latest generation
               | of open models. Delivering state-of-the-art performance,
               | Gemma 3 quickly established itself as a leading model
               | capable of running on a single high-end GPU like the
               | NVIDIA H100 using its native BFloat16 (BF16) precision.
               | 
               | > To make Gemma 3 even more accessible, we are announcing
               | new versions optimized with Quantization-Aware Training
               | (QAT) that dramatically reduces memory requirements while
               | maintaining high quality.
               | 
               | The thing that's new, and that is clearly resonating with
               | people, is the "To make Gemma 3 even more accessible..."
               | bit.
        
               | rob_c wrote:
                | As I've said in my lectures on how to perform 1-bit
                | training of QAT systems to build classifiers...
                | 
                | "An iteration on a theme".
                | 
                | Once the network design is proven to work, yes, it's an
                | impressive technical achievement. But as I've said, given
                | I've known people in multiple research institutes and
                | companies using Gemma3 for a month, mostly saying they're
                | surprised it's not getting noticed... This is just
                | enabling more users, but the non-QAT version will almost
                | always perform better...
        
               | simonw wrote:
               | Sounds like you're excited to see Gemma 3 get the
               | recognition it deserves on Hacker News then.
        
               | rob_c wrote:
               | No just pointing out the flooding obvious as usual and
               | collecting down votes for it
        
               | fragmede wrote:
               | Speaking for myself, my downvotes are not because of the
               | content of your arguments, but because your tone is
               | consistently condescending and dismissive. Comments like
               | "just pointing out the flooding obvious" come off as smug
               | and combative rather than constructive.
               | 
               | HN works best when people engage in good faith, stay
               | curious, and try to move the conversation forward. That
               | kind of tone -- even when technically accurate --
               | discourages others from participating and derails
               | meaningful discussion.
               | 
               | If you're getting downvotes regularly, maybe it's worth
               | considering how your comments are landing with others,
               | not just whether they're "right."
        
       | simonw wrote:
       | I think gemma-3-27b-it-qat-4bit is my new favorite local model -
       | or at least it's right up there with Mistral Small 3.1 24B.
       | 
       | I've been trying it on an M2 64GB via both Ollama and MLX. It's
        | very, very good, and it only uses ~22GB (via Ollama) or ~15GB
       | (MLX) leaving plenty of memory for running other apps.
       | 
       | Some notes here:
       | https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
       | 
       | Last night I had it write me a complete plugin for my LLM tool
        | like this:
        | 
        |     llm install llm-mlx
        |     llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
        |     llm -m mlx-community/gemma-3-27b-it-qat-4bit \
        |       -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
        |       -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
        |       -s 'Write a new fragments plugin in Python that registers
        |          issue:org/repo/123 which fetches that issue
        |          number from the specified github repo and uses the same
        |          markdown logic as the HTML page to turn that into a
        |          fragment'
       | 
       | It gave a solid response!
       | https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... -
       | more notes here: https://simonwillison.net/2025/Apr/20/llm-
       | fragments-github/
        
         | rs186 wrote:
         | Can you quote tps?
         | 
         | More and more I start to realize that cost saving is a small
         | problem for local LLMs. If it is too slow, it becomes unusable,
         | so much that you might as well use public LLM endpoints. Unless
         | you really care about getting things done locally without
         | sending information to another server.
         | 
          | With the OpenAI API/ChatGPT, I get responses much faster than I
          | can read, and for a simple question, it means I just need a
          | glimpse of the response, copy & paste, and get things done.
          | Whereas with a local LLM, I watch it painstakingly print
          | preambles that I don't care about, and get what I actually need
          | after 20 seconds (on a fast GPU).
         | 
         | And I am not yet talking about context window etc.
         | 
         | I have been researching about how people integrate local LLMs
         | in their workflows. My finding is that most people play with it
         | for a short time and that's about it, and most people are much
         | better off spending money on OpenAI credits (which can last a
         | very long time with typical usage) than getting a beefed up Mac
         | Studio or building a machine with 4090.
        
           | simonw wrote:
           | My tooling doesn't measure TPS yet. It feels snappy to me on
           | MLX.
           | 
           | I agree that hosted models are usually a better option for
           | most people - much faster, higher quality, handle longer
           | inputs, really cheap.
           | 
           | I enjoy local models for research and for the occasional
           | offline scenario.
           | 
           | I'm also interested in their applications for journalism,
           | specifically for dealing with extremely sensitive data like
           | leaked information from confidential sources.
        
             | freeamz wrote:
             | >I'm also interested in their applications for journalism,
             | specifically for dealing with extremely sensitive data like
             | leaked information from confidential sources.
             | 
              | Think it is NOT just you. Most companies with decent
              | management also would not want their data going anywhere
              | outside the physical servers they have control of. But yeah,
              | for most people, just use an app and a hosted server. But
              | this is HN; there are people here hosting their own email
              | servers, so it shouldn't be too hard to run an LLM locally.
        
               | simonw wrote:
               | "Most company with decent management also would not want
               | their data going to anything outside the physical server
               | they have in control of."
               | 
               | I don't think that's been true for over a decade: AWS
               | wouldn't be trillion dollar business if most companies
               | still wanted to stay on-premise.
        
               | terhechte wrote:
               | Or GitHub. I'm always amused when people don't want to
               | send fractions of their code to a LLM but happily host it
               | on GitHub. All big llm providers offer no-training-on-
               | your-data business plans.
        
               | tarruda wrote:
               | > I'm always amused when people don't want to send
               | fractions of their code to a LLM but happily host it on
               | GitHub
               | 
               | What amuses me even more is people thinking their code is
               | too unique and precious, and that GitHub/Microsoft wants
               | to steal it.
        
               | AlexCoventry wrote:
               | Concern about platform risk in regard to Microsoft is
               | historically justified.
        
               | Terretta wrote:
               | Unlikely they think Microsoft or GitHub wants to steal
               | it.
               | 
               | With LLMs, they're thinking of examples that regurgitated
               | proprietary code, and contrary to everyday general
               | observation, valuable proprietary code does exist.
               | 
               | But with GitHub, the thinking is generally the opposite:
               | the worry is that the code is terrible, and seeing it
               | would be like giant blinkenlights* indicating the way in.
               | 
               | * https://en.wikipedia.org/wiki/Blinkenlights
        
               | vikarti wrote:
               | Regulations sometimes matter. Stupid "security" rules
               | sometimes matter too.
        
               | __float wrote:
                | While none of that is false, I think there's a big
                | difference between shipping your data to an external LLM
                | API and using AWS.
               | 
               | Using AWS is basically a "physical server they have
               | control of".
        
               | simonw wrote:
               | That's why AWS Bedrock and Google Vertex AI and Azure AI
               | model inference exist - they're all hosted LLM services
               | that offer the same compliance guarantees that you get
               | from regular AWS-style hosting agreements.
        
               | IanCal wrote:
                | As in, AWS is a much bigger security concern?
        
           | overfeed wrote:
           | > Whereas on local LLM, I watch it painstakingly prints
           | preambles that I don't care about, and get what I actually
           | need after 20 seconds.
           | 
           | You may need to "right-size" the models you use to match your
           | hardware, model, and TPS expectations, which may involve
           | using a smaller version of the model with faster TPS,
            | upgrading your hardware, or paying for hosted models.
           | 
           | Alternatively, if you can use agentic workflows or tools like
           | Aider, you don't have to watch the model work slowly with
            | large models locally. Instead you queue work for it, go to
           | sleep, or eat, or do other work, and then much later look
           | over the Pull Requests whenever it completes them.
        
             | rs186 wrote:
             | I have a 4070 super for gaming, and used it to play with
             | LLM a few times. It is by no means a bad card, but I
              | realize that unless I want to get a 4090 or a new Mac that
              | I don't have any other use for, I can only use it to run
             | smaller models. However, most smaller models aren't
             | satisfactory and are still slower than hosted LLMs. I
             | haven't found a model that I am happy with for my hardware.
             | 
             | Regarding agentic workflows -- sounds nice but I am too
             | scared to try it out, based on my experience with standard
             | LLMs like GPT or Claude for writing code. Small snippets or
             | filling in missing unit tests, fine, anything more
             | complicated? Has been a disaster for me.
        
           | otabdeveloper4 wrote:
            | The only actually useful application of LLMs is processing
           | large amounts of data for classification and/or summarizing
           | purposes.
           | 
           | That's not the stuff you want to send to a public API, this
           | is something you want as a 24/7 locally running batch job.
           | 
           | ("AI assistant" is an evolutionary dead end, and Star Trek be
           | damned.)
        
           | DJHenk wrote:
           | > More and more I start to realize that cost saving is a
           | small problem for local LLMs. If it is too slow, it becomes
           | unusable, so much that you might as well use public LLM
           | endpoints. Unless you really care about getting things done
           | locally without sending information to another server.
           | 
           | There is another aspect to consider, aside from privacy.
           | 
           | These models are trained by downloading every scrap of
           | information from the internet, including the works of many,
           | many authors who have never consented to that. And they for
           | sure are not going to get a share of the profits, if there is
           | every going to be any. If you use a cloud provider, you are
           | basically saying that is all fine. You are happy to pay them,
           | and make yourself dependent on their service, based on work
           | that wasn't theirs to use.
           | 
           | However, if you use a local model, the authors still did not
           | give consent, but one could argue that the company that made
           | the model is at least giving back to the community. They
           | don't get any money out of it, and you are not becoming
           | dependent on their hyper capitalist service. No rent-seeking.
           | The benefits of the work are free to use for everyone. This
           | makes using AI a little more acceptable from a moral
           | standpoint.
        
           | ein0p wrote:
           | Sometimes TPS doesn't matter. I've generated textual
           | descriptions for 100K or so images in my photo archive, some
           | of which I have absolutely no interest in uploading to
           | someone else's computer. This works pretty well with Gemma. I
           | use local LLMs all the time for things where privacy is even
           | remotely important. I estimate this constitutes easily a
           | quarter of my LLM usage.
        
             | lodovic wrote:
             | This is a really cool idea. Do you pretrain the model so it
              | can tag people? I have so many photos that it seems
              | impossible to ever categorize them; using a workflow like
              | yours might help a lot.
        
               | ein0p wrote:
               | No, tagging of people is already handled by another
               | model. Gemma just describes what's in the image, and
               | produces a comma separated list of keywords. No
               | additional training is required besides a few tweaks to
               | the prompt so that it outputs just the description,
               | without any "fluff". E.g. it normally prepends such
               | outputs with "Here's a description of the image:" unless
               | you really insist that it should output only the
               | description. I suppose I could use constrained decoding
               | into JSON or something to achieve the same, but I didn't
               | mess with that.
               | 
               | On some images where Gemma3 struggles Mistral Small
               | produces better descriptions, BTW. But it seems harder to
               | make it follow my instructions exactly.
               | 
               | I'm looking forward to the day when I can also do this
               | with videos, a lot of which I also have no interest in
               | uploading to someone else's computer.
        
               | fer wrote:
               | How do you use the keywords after? I have Immich running
               | which does some analysis, but the querying is a bit of a
               | hit and miss.
        
               | ein0p wrote:
               | Search is indeed hit and miss. Immich, for instance,
               | currently does absolutely nothing with the EXIF
               | "description" field, so I store textual descriptions on
               | the side as well. I have found Immich's search by image
               | embeddings to be pretty weak at recall, and even weaker
               | at ranking. IIRC Lightroom Classic (which I also use, but
               | haven't found a way to automate this for without writing
               | an extension) does search that field, but ranking is a
               | bit of a dumpster fire, so your best bet is searching
               | uncommon terms or constraining search by metadata (e.g.
               | not just "black kitten" but "black kitten AND 2025"). I
               | expect this to improve significantly over time - it's a
               | fairly obvious thing to add given the available tech.
        
           | k__ wrote:
           | The local LLM is your project manager, the big remote ones
           | are the engineers and designers :D
        
           | trees101 wrote:
           | Not sure how accurate my stats are. I used ollama with the
           | --verbose flag. Using a 4090 and all default settings, I get
            | around 40 TPS for the Gemma 27B model.
           | 
           | `ollama run gemma3:27b --verbose` gives me 42.5 TPS +-0.3TPS
           | 
           | `ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS
           | +-0.3TPS
           | 
           | Strange results; the full model gives me slightly more TPS.
        
         | nico wrote:
         | Been super impressed with local models on mac. Love that the
         | gemma models have 128k token context input size. However,
         | outputs are usually pretty short
         | 
         | Any tips on generating long output? Like multiple pages of a
         | document, a story, a play or even a book?
        
           | simonw wrote:
           | The tool you are using may set a default max output size
           | without you realizing. Ollama has a num_ctx that defaults to
           | 2048 for example: https://github.com/ollama/ollama/blob/main/
           | docs/faq.md#how-c...
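            | 
            | Raising it looks something like this (a sketch; pick a size
            | your RAM can handle):
            | 
            |     # interactively
            |     ollama run gemma3:27b-it-qat
            |     >>> /set parameter num_ctx 32768
            | 
            |     # or per-request via the API
            |     curl http://localhost:11434/api/generate -d '{
            |       "model": "gemma3:27b-it-qat",
            |       "prompt": "Write a very long story",
            |       "options": {"num_ctx": 32768}
            |     }'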
        
             | nico wrote:
             | Been playing with that, but doesn't seem to have much
             | effect. It works very well to limit output to smaller bits,
             | like setting it to 100-200. But above 2-4k the output seems
             | to never get longer than about 1 page
             | 
             | Might try using the models with mlx instead of ollama to
             | see if that makes a difference
             | 
             | Any tips on prompting to get longer outputs?
             | 
             | Also, does the model context size determine max output
             | size? Are the two related or are they independent
             | characteristics of the model?
        
               | simonw wrote:
               | Interestingly the Gemma 3 docs say: https://ai.google.dev
               | /gemma/docs/core/model_card_3#:~:text=T...
               | 
               | > Total output context up to 128K tokens for the 4B, 12B,
               | and 27B sizes, and 32K tokens for the 1B size per
               | request, subtracting the request input tokens
               | 
               | I don't know how to get it to output anything that length
               | though.
        
               | nico wrote:
               | Thank you for the insights and useful links
               | 
               | Will keep experimenting, will also try mistral3.1
               | 
               | edit: just tried mistral3.1 and the quality of the output
               | is very good, at least compared to the other models I
               | tried (llama2:7b-chat, llama2:latest, gemma3:12b, qwq and
               | deepseek-r1:14b)
               | 
               | Doing some research, because of their training sets, it
               | seems like most models are not trained on producing long
               | outputs so even if they technically could, they won't.
               | Might require developing my own training dataset and then
               | doing some fine tuning. Apparently the models and ollama
               | have some safeguards against rambling and repetition
        
               | Gracana wrote:
               | You can probably find some long-form tuned models on HF.
               | I've had decent results with QwQ-32B (which I can run on
               | my desktop) and Mistral Large (which I have to run on my
               | server). Generating and refining an outline before
               | writing the whole piece can help, and you can also split
               | the piece up into multiple outputs (working a paragraph
               | or two at a time, for instance). So far I've found it to
               | be a tough process, with mixed results.
        
               | nico wrote:
               | Thank you, will try out your suggestions
               | 
               | Have you used something like a director model to
               | supervise the output? If so, could you comment on the
               | effectiveness of it and potentially any tips?
        
           | Casteil wrote:
           | This is basically the opposite of what I've experienced - at
           | least compared to another recent entry like IBM's Granite
           | 3.3.
           | 
           | By comparison, Gemma3's output (both 12b and 27b) seems to
           | typically be more long/verbose, but not problematically so.
        
             | nico wrote:
             | I agree with you. The outputs are usually good, it's just
             | that for the use case I have now (writing several pages of
             | long dialogs), the output is not as long as I'd want it,
             | and definitely not as long as it's supposedly capable of
             | doing
        
         | tomrod wrote:
         | Simon, what is your local GPU setup? (No doubt you've covered
          | this, but I'm not sure where to dig it up).
        
           | simonw wrote:
           | MacBook Pro M2 with 64GB of RAM. That's why I tend to be
           | limited to Ollama and MLX - stuff that requires NVIDIA
           | doesn't work for me locally.
        
             | Elucalidavah wrote:
             | > MacBook Pro M2 with 64GB of RAM
             | 
             | Are there non-mac options with similar capabilities?
        
               | simonw wrote:
               | Yes, but I don't really know anything about those.
               | https://www.reddit.com/r/LocalLLaMA/ is full of people
               | running models on PCs with NVIDIA cards.
               | 
               | The unique benefit of an Apple Silicon Mac at the moment
               | is that the 64GB of RAM is available to both the GPU and
               | the CPU at once. With other hardware you usually need
               | dedicated separate VRAM for the GPU.
        
               | _neil wrote:
               | It's not out yet, but the upcoming Framework desktop [0]
               | is supposed to have a similar unified memory setup.
               | 
               | [0] https://frame.work/desktop
        
               | dwood_dev wrote:
               | Anything with the Radeon 8060S/Ryzen AI Max+ 395. One of
               | the popular MiniPC Chinese brands has them for
               | preorder[0] with shipping starting May 7th. Framework
               | also has them, but shipping Q3.
               | 
               | 0: https://www.gmktec.com/products/prepaid-deposit-amd-
               | ryzen(tm)-a...
        
         | littlestymaar wrote:
          | > and it only uses ~22GB (via Ollama) or ~15GB (MLX)
         | 
         | Why is the memory use different? Are you using different
         | context size in both set-ups?
        
           | simonw wrote:
           | No idea. MLX is its own thing, optimized for Apple Silicon.
           | Ollama uses GGUFs.
           | 
           | https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0.
           | https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit
           | says it's 4bit. I think those are the same quantization?
        
         | paprots wrote:
         | The original gemma3:27b also took only 22GB using Ollama on my
         | 64GB MacBook. I'm quite confused that the QAT took the same. Do
         | you know why? Which model is better? `gemma3:27b`, or
         | `gemma3:27b-qat`?
        
           | nolist_policy wrote:
           | I suspect your "original gemma3:27b" was a quantized model
            | since the non-quantized (16-bit) version needs around 54 GB.
        
           | kgwgk wrote:
           | Look up 27b in https://ollama.com/library/gemma3/tags
           | 
           | You'll find the id a418f5838eaf which also corresponds to
           | 27b-it-q4_K_M
        
           | superkuh wrote:
            | Quantization-aware training just means having the model deal
            | with quantized values during training, so it handles the
            | quantization better when it is actually quantized afterwards.
           | It doesn't change the model size itself.
        
         | prvc wrote:
         | > ~15GB (MLX) leaving plenty of memory for running other apps.
         | 
         | Is that small enough to run well (without thrashing) on a
         | system with only 16GiB RAM?
        
           | simonw wrote:
           | I expect not. On my Mac at least I've found I need a bunch of
           | GB free to have anything else running at all.
        
             | mnoronha wrote:
             | Any idea why MLX and ollama use such different amounts of
             | ram?
        
         | codybontecou wrote:
         | Can you run the mlx-variation of this model through Ollama so
         | that I can interact with it in Open WebUI?
        
           | simonw wrote:
           | I haven't tried it yet but there's an MLX project that
           | exposes an OpenAI-compatible serving endpoint that should
           | work with Open WebUI: https://github.com/madroidmaq/mlx-omni-
           | server
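            | 
            | If it follows the usual OpenAI-compatible shape, pointing any
            | client at it should look roughly like this (the port is a
            | placeholder - use whatever the server reports on startup):
            | 
            |     curl http://localhost:PORT/v1/chat/completions \
            |       -H "Content-Type: application/json" \
            |       -d '{
            |         "model": "mlx-community/gemma-3-27b-it-qat-4bit",
            |         "messages": [{"role": "user", "content": "Hello"}]
            |       }'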
        
       | justanotheratom wrote:
       | Anyone packaged one of these in an iPhone App? I am sure it is
       | doable, but I am curious what tokens/sec is possible these days.
       | I would love to ship "private" AI Apps if we can get reasonable
       | tokens/sec.
        
         | Alifatisk wrote:
         | If you ever ship a private AI app, don't forget to implement
         | the export functionality, please!
        
         | nico wrote:
         | What kind of functionality do you need from the model?
         | 
         | For basic conversation and RAG, you can use tinyllama or
         | qwen-2.5-0.5b, both of which run on a raspberry pi at around
         | 5-20 tokens per second
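          | 
          | e.g. with Ollama it's just (model tags from the Ollama library;
          | double-check the exact names):
          | 
          |     ollama run qwen2.5:0.5b
          |     ollama run tinyllama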
        
         | zamadatix wrote:
         | There are many such apps, e.g. Mollama, Enclave AI or
         | PrivateLLM or dozens of others, but you could tell me it runs
         | at 1,000,000 tokens/second on an iPhone and I wouldn't care
         | because the largest model version you're going to be able to
          | load is Gemma 3 4B q4 (12B won't fit in 8 GB with the OS + you
         | still need context) and it's just not worth the time to use.
         | 
         | That said, if you really care, it generates faster than reading
         | speed (on an A18 based model at least).
        
           | woodson wrote:
           | Some of these small models still have their uses, e.g. for
           | summarization. Don't expect them to fully replace ChatGPT.
        
             | zamadatix wrote:
              | The use case is more "I'm willing to accept really bad
              | answers that have extremely high rates of making things up"
              | than anything application-specific. The same goes for
              | summarization: it's not like it does it as well as a large
              | model would.
        
         | nolist_policy wrote:
          | FWIW, I can run Gemma-3-12b-it-qat on my Galaxy Fold 4 with
          | 12 GB of RAM at around 1.5 tokens/s. I use plain llama.cpp with
         | Termux.
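          | 
          | Rough setup sketch if anyone wants to try (from memory, so
          | package names and paths may need tweaking):
          | 
          |     pkg install git cmake clang
          |     git clone https://github.com/ggml-org/llama.cpp
          |     cd llama.cpp && cmake -B build && cmake --build build -j
          |     ./build/bin/llama-cli -m gemma-3-12b-it-qat-q4_0.gguf \
          |         -p "Hello"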
        
           | Casteil wrote:
           | Does this turn your phone into a personal space heater too?
        
       | Alifatisk wrote:
        | Aside from being lighter than the other models, is there
        | anything else the Gemma model is specifically good at, or better
        | than the other models at doing?
        
         | itake wrote:
          | Google claims to have better multi-language support, due to
          | tokenizer improvements.
        
         | nico wrote:
          | They are multimodal. Haven't tried the QAT ones yet, but the
          | gemma3s released a few weeks ago are pretty good at processing
          | images and telling you details about what's in them.
        
         | Zambyte wrote:
          | I have found Gemma models are able to produce useful
          | information about more niche subjects that other models like
          | Mistral Small cannot, at the expense of never really saying "I
          | don't know" where other models will; it will instead produce
          | false information.
         | 
         | For example, if I ask mistral small who I am by name, it will
         | say there is no known notable figure by that name before the
         | knowledge cutoff. Gemma 3 will say I am a well known <random
         | profession> and make up facts. On the other hand, I have asked
         | both about local organization in my area that I am involved
         | with, and Gemma 3 could produce useful and factual information,
         | where Mistral Small said it did not know.
        
       | trebligdivad wrote:
       | It seems pretty impressive - I'm running it on my CPU (16 core
       | AMD 3950x) and it's very very impressive at translation, and the
       | image description is very impressive as well. I'm getting about
        | 2.3 token/s on it (compared to under 1/s on the Calme-3.2 I was
        | previously using). It does tend to be a bit chatty unless you
        | tell it not to be; for pretty much everything it'll give you a
        | 'breakdown' unless you tell it not to - so for translation my
        | prompt is 'Translate the input to English, only output the
        | translation' to stop it giving a breakdown of the input language.
        
         | simonw wrote:
         | What are you using to run it? I haven't got image input working
         | yet myself.
        
           | trebligdivad wrote:
           | I'm using llama.cpp - built last night from head; to do image
           | stuff you have to run a separate client they provide, with
           | something like:
           | 
            |     ./build/bin/llama-gemma3-cli \
            |       -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf \
            |       --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf \
            |       -p "Describe this image." \
            |       --image ~/Downloads/surprise.png
           | 
           | Note the 2nd gguf in there - I'm not sure, but I think that's
           | for encoding the image.
        
           | terhechte wrote:
           | Image input has been working with LM Studio for quite some
           | time
        
       | XCSme wrote:
       | So how does 27b-it-qat (18GB) compare to 27b-it-q4_K_M (17GB)?
        
       | perching_aix wrote:
       | This is my first time trying to locally host a model - gave both
       | the 12B and 27B QAT models a shot.
       | 
       | I was both impressed and disappointed. Setup was piss easy, and
       | the models are great conversationalists. I have a 12 gig card
       | available and the 12B model ran very nice and swift.
       | 
       | However, they're seemingly terrible at actually assisting with
        | stuff. Tried something very basic: asked for a PowerShell one-
        | liner to get the native blocksize of my disks. It ended up
        | hallucinating fields, then telling me to go off into the deep
       | end, first elevating to admin, then using WMI, then bringing up
       | IOCTL. Pretty unfortunate. Not sure I'll be able to put it to
       | actual meaningful use as a result.
        
         | parched99 wrote:
         | I think Powershell is a bad test. I've noticed all local models
         | have trouble providing accurate responses to Powershell-related
         | prompts. Strangely, even Microsoft's model, Phi 4, is bad at
         | answering these questions without careful prompting. Though, MS
         | can't even provide accurate PS docs.
         | 
         | My best guess is that there's not enough discussion/development
         | related to Powershell in training data.
        
           | fragmede wrote:
            | Which, like, you'd think Microsoft has an entire team there
            | whose purpose would be to generate good PowerShell for it to
            | train on.
        
         | terhechte wrote:
          | Local models, due to their size, favor popular languages over
          | more niche ones even more than big cloud models do. They work
          | fantastically for JavaScript, Python, and Bash but much worse
          | for less popular things like Clojure, Nim or Haskell. PowerShell
          | is probably on the less popular side compared to JS or Bash.
         | 
         | If this is your main use case you can always try to fine tune a
         | model. I maintain a small llm bench of different programming
         | languages and the performance difference between say Python and
         | Rust on some smaller models is up to 70%
        
           | perching_aix wrote:
           | How accessible and viable is model fine-tuning? I'm not in
           | the loop at all unfortunately.
        
       | CyberShadow wrote:
       | How does it compare to CodeGemma for programming tasks?
        
       | api wrote:
       | When I see 32B or 70B models performing similarly to 200+B
       | models, I don't know what to make of this. Either the latter
       | contains more breadth of information but we have managed to
       | distill latent capabilities to be similar, the larger models are
       | just less efficient, or the tests are not very good.
        
         | simonw wrote:
         | It makes intuitive sense to me that this would be possible,
         | because LLMs are still mostly opaque black boxes. I expect you
          | could drop a whole bunch of the weights without having a huge
         | impact on quality - maybe you end up mostly ditching the parts
         | that are derived from shitposts on Reddit but keep the bits
         | from Arxiv for example.
         | 
         | (That's a massive simplification of how any of this works, but
         | it's how I think about it at a high level.)
        
         | retinaros wrote:
          | It's just BS benchmarks. They are all cheating at this point,
          | feeding the data into the training set. Doesn't mean the LLMs
          | aren't becoming better, but when they all lie...
        
       | mekpro wrote:
       | Gemma 3 is way way better than Llama 4. I think Meta will start
        | to lose its position in LLM mindshare. Another weakness of Llama
        | 4 is its model size, which is too large (even though it can run
        | fast with MoE); this greatly limits the applicable users to a
       | small percentage of enthusiasts who have enough GPU VRAM.
       | Meanwhile, Gemma 3 is widely usable across all hardware sizes.
        
       | miki123211 wrote:
       | What would be the best way to deploy this if you're maximizing
       | for GPU utilization in a multi-user (API) scenario? Structured
       | output support would be a big plus.
       | 
       | We're working with a GPU-poor organization with very strict data
       | residency requirements, and these models might be exactly what we
       | need.
       | 
       | I would normally say VLLM, but the blog post notably does not
       | mention VLLM support.
        
       | 999900000999 wrote:
        | Assuming this can match Claude's latest, and full-time usage (as
        | in you have a system that's constantly running code without any
        | user input), you'd probably save 600 to 700 a month. A 4090 is
       | only 2K and you'll see an ROI within 90 days.
       | 
        | I can imagine this will serve to drive prices for hosted LLMs
       | lower.
       | 
        | At this level, any company that produces even a nominal amount of
        | code should be running LLMs on prem (or AWS if you're on the
        | cloud).
        
         | rafaelmn wrote:
         | I'd say using a Mac studio with M4 Max and 128 GB RAM will get
         | you way further than 4090 in context size and model size.
         | Cheaper than 2x4090 and less power while being a great overall
         | machine.
         | 
         | I think these consumer GPUs are way too expensive for the
         | amount of memory they pack - and that's intentional price
         | discrimination. Also the builds are gimmicky. It's just not
          | set up for AI models, and the versions that are cost 20k.
         | 
         | AMD has that 128GB RAM strix halo chip but even with soldered
         | ram the bandwidth there is very limited, half of M4 Max, which
         | is half of 4090.
         | 
         | I think this generation of hardware and local models is not
         | there yet - would wait for M5/M6 release.
        
       | briandear wrote:
       | The normal Gemma models seem to work fine on Apple silicon with
       | Metal. Am I missing something?
        
         | simonw wrote:
         | These new special editions of those models claim to work better
         | with less memory.
        
       | porphyra wrote:
       | It is funny that Microsoft had been peddling "AI PCs" and Apple
       | had been peddling "made for Apple Intelligence" for a while now,
       | when in fact usable models for consumer GPUs are only barely
       | starting to be a thing on extremely high end GPUs like the 3090.
        
         | ivape wrote:
         | This is why the "AI hardware cycle is hype" crowd is so wrong.
          | We're not even close; we're basically at the ColecoVision/Atari
          | stage of hardware here. It's going to be quite a thing when
          | _everyone_ gets a SNES/Genesis.
        
         | icedrift wrote:
         | Capable local models have been usable on Macs for a while now
         | thanks to their unified memory.
        
         | NorwegianDude wrote:
          | A 3090 is not an extremely high-end GPU. It's a consumer GPU
          | launched in 2020, and in both price and compute it's around a
          | mid-range consumer GPU these days.
         | 
         | The high end consumer card from Nvidia is the RTX 5090, and the
         | professional version of the card is the RTX PRO 6000.
        
           | zapnuk wrote:
            | A 3090 still costs 1800EUR. That's not mid-range by a long
            | shot.
            | 
            | The 5070 or 5070 Ti are mid-range. They cost 650/900EUR.
        
             | NorwegianDude wrote:
             | 3090s are no longer produced, that's why new ones are so
             | expensive. At least here, used 3090s are around EUR650, and
             | a RTX 5070 is around EUR625.
             | 
              | It's definitely not extremely high end any more; the price
              | is (at least here) the same as the new mid-range consumer
              | cards.
             | 
             | I guess the price can vary by location, but EUR1800 for a
             | 3090 is crazy, that's more than the new price in 2020.
        
           | dragonwriter wrote:
           | For model usability as a binary yes/no, pretty much the only
           | dimension that matters is VRAM, and at 24GB the 3090 is still
            | high end for consumer NVIDIA GPUs. Yes, the 5090 (and
           | _only_ the 5090) is above it, at 32GB, but 24GB is way ahead
           | of the mid-range.
        
             | NorwegianDude wrote:
             | 24 GB of VRAM is a large amount of VRAM on a consumer GPU,
             | that I totally agree with you on. But it's definitely not
             | an extremely high end GPU these days. It is suitable, yes,
             | but not high end. The high end alternative for a consumer
             | GPU would be the RTX 5090, but that is only available for
             | EUR3000 now, while used 3090s are around EUR650.
        
         | dragonwriter wrote:
         | AI PCs aren't about running the kind of models that take a
          | 3090-class GPU, or even _running on GPU at all_, but systems
         | where the local end is running something like Phi-3.5-vision-
         | instruct, on system RAM using a CPU with an integrated NPU,
         | which is why the AI PC requirements specify an NPU, a certain
         | amount of processing capacity, and a minimum amount of
         | DDR5/LPDDR5 system RAM.
        
       | mark_l_watson wrote:
       | Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using
       | Ollama for routine work on my 32G memory Mac.
       | 
       | gemma3:27b-it-qat with open-codex, running locally, is just
       | amazingly useful, not only for Python dev, but for Haskell and
       | Common Lisp also.
       | 
       | I still like Gemini 2.5 Pro and o3 for brainstorming or working
       | on difficult problems, but for routine work it (simply) makes me
       | feel good to have everything open source/weights running on my
       | own system.
       | 
        | When I bought my 32G Mac a year ago, I didn't expect to be so
        | happy as I am running gemma3:27b-it-qat with open-codex locally.
        
       | piyh wrote:
        | Meta Maverick is crying in the shower, getting so handily beaten
        | by a model with 15x fewer params.
        
       | Samin100 wrote:
       | I have a few private "vibe check" questions and the 4 bit QAT 27B
       | model got them all correctly. I'm kind of shocked at the
       | information density locked in just 13 GB of weights. If anyone at
       | Deepmind is reading this -- Gemma 3 27B is the single most
       | impressive open source model I have ever used. Well done!
        
       | cheriot wrote:
       | Is there already a Helium for GPUs?
        
       | punnerud wrote:
       | Just tested the 27B, and it's not very good at following
       | instructions and is very limited on more complex code problems.
       | 
        | Mapping from one JSON file with a lot of plain text into a new
        | structure fails every time.
       | 
        | Ask it to generate SVG, and the output is very simple and almost
        | too dumb.
       | 
        | Nice that it doesn't need a huge amount of RAM, and it performs
        | OK on smaller languages, from my initial tests.
        
       | intalentive wrote:
       | Looking forward to usable uncensored fine tunes.
        
       ___________________________________________________________________
       (page generated 2025-04-20 23:00 UTC)