[HN Gopher] Gemma 3 QAT Models: Bringing AI to Consumer GPUs
       ___________________________________________________________________
        
       Gemma 3 QAT Models: Bringing AI to Consumer GPUs
        
       Author : emrah
       Score  : 577 points
       Date   : 2025-04-20 12:22 UTC (1 day ago)
        
 (HTM) web link (developers.googleblog.com)
 (TXT) w3m dump (developers.googleblog.com)
        
       | emrah wrote:
       | Available on ollama: https://ollama.com/library/gemma3
        
         | jinay wrote:
         | Make sure you're using the "-it-qat" suffixed models like
         | "gemma3:27b-it-qat"
        
           | Zambyte wrote:
           | Here are the direct links:
           | 
           | https://ollama.com/library/gemma3:27b-it-qat
           | 
           | https://ollama.com/library/gemma3:12b-it-qat
           | 
           | https://ollama.com/library/gemma3:4b-it-qat
           | 
           | https://ollama.com/library/gemma3:1b-it-qat
        
           | ein0p wrote:
           | Thanks. I was wondering why my open-webui said that I already
           | had the model. I bet a lot of people are making the same
           | mistake I did and downloading just the old, post-quantized
           | 27B.
        
         | Der_Einzige wrote:
         | How many times do I have to say this? Ollama, llamacpp, and
         | many other projects are slower than vLLM/sglang. vLLM is a much
         | superior inference engine and is fully supported by the only
         | LLM frontends that matter (sillytavern).
         | 
         | The community getting obsessed with Ollama has done huge damage
         | to the field, as it's inefficient compared to vLLM. Many people
         | can get far more tok/s than they think they could if only they
         | knew the right tools.
        
           | m00dy wrote:
           | Ollama is definitely not for production loads, but vLLM is.
        
           | janderson215 wrote:
           | I did not know this, so thank you. I read a blogpost a while
           | back that encouraged using Ollama and never mentioned vLLM. Do
           | you recommend reading any particular resource?
        
           | Zambyte wrote:
           | The significant convenience benefits outweigh the higher TPS
           | that vLLM offers in the context of my single machine homelab
           | GPU server. If I was hosting it for something more critical
           | than just myself and a few friends chatting with it, sure.
           | Being able to just paste a model name into Open WebUI and run
           | it is important to me though.
           | 
           | It is important to know about both to decide between the two
           | for your use case though.
        
             | Der_Einzige wrote:
             | Running any HF model on vllm is as simple as pasting a
             | model name into one command in your terminal.
        
               | Zambyte wrote:
               | What command is it? Because that was not at all my
               | experience.
        
               | Der_Einzige wrote:
               | Vllm serve... huggingface gives run instructions for
               | every model with vllm on their website.
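               | 
               | For example, something along these lines (assuming your
               | vLLM build supports the model; the repo id is just an
               | illustration, and --max-model-len only matters if the
               | default context doesn't fit your VRAM):
               | 
               |     vllm serve google/gemma-3-27b-it --max-model-len 8192
               | 
               | That exposes an OpenAI-compatible endpoint on port 8000
               | that most frontends can point at.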
        
               | Zambyte wrote:
               | How do I serve multiple models? I can pick from dozens of
               | models that I have downloaded through Open WebUI.
        
               | iAMkenough wrote:
               | Had to build it from source to run on my Mac, and the
               | experimental support doesn't seem to include these latest
               | Gemma 3 QAT models on Apple Silicon.
        
           | oezi wrote:
           | Why is sillytavern the only LLM frontend which matters?
        
             | GordonS wrote:
             | I tried sillytavern a few weeks ago... wow, that is an
             | "interesting" UI! I blundered around for a while, couldn't
             | figure out how to do _anything_ useful... and then
             | installed LM Studio instead.
        
               | imtringued wrote:
               | I personally thought the lorebook feature was quite neat
               | and then quickly gave up on it because I couldn't get it
               | to trigger, ever.
               | 
               | Whatever those keyword things are, they certainly don't
               | seem to be doing any form of RAG.
        
             | Der_Einzige wrote:
             | It supports more sampler settings and other options than
             | anything else.
        
           | ach9l wrote:
           | instead of ranting, maybe explain how to make a qat q4 work
           | with images in vllm, afaik it is not yet possible
        
           | oezi wrote:
           | Somebody in this thread mentioned 20.x tok/s on ollama. What
           | are you seeing in vLLM?
        
             | Zambyte wrote:
             | FWIW I'm getting 29 TPS on Ollama on my 7900 XTX with the
             | 27b qat. You can't really compare inference engine to
             | inference engine without keeping the hardware and model
             | fixed.
             | 
             | Unfortunately Ollama and vLLM are therefore incomparable at
             | the moment, because vLLM does not support these models yet.
             | 
             | https://github.com/vllm-project/vllm/issues/16856
        
           | simonw wrote:
           | Last I looked vLLM didn't work on a Mac.
        
             | mitjam wrote:
             | Afaik vllm is for concurrent serving with batched inference
             | for higher throughput, not single-user inference. I doubt its
             | throughput is higher than Ollama's when you feed it a single
             | prompt at a time. Update: this is a good intro to
             | continuous batching in llm inference:
             | https://www.anyscale.com/blog/continuous-batching-llm-
             | infere...
        
               | Der_Einzige wrote:
               | It is much faster on single prompts than ollama. 3X is
               | not unheard of
        
           | prometheon1 wrote:
           | From the HN guidelines:
           | https://news.ycombinator.com/newsguidelines.html
           | 
           | > Be kind. Don't be snarky.
           | 
           | > Please don't post shallow dismissals, especially of other
           | people's work.
           | 
           | In my opinion, your comment is not in line with the
           | guidelines. Especially the part about sillytavern being the
           | only LLM frontend that matters. Telling the devs of any LLM
           | frontend except sillytavern that their app doesn't matter
           | seems exactly like a shallow dismissal of other people's work
           | to me.
        
       | holografix wrote:
       | Could 16gb vram be enough for the 27b QAT version?
        
         | halflings wrote:
         | That's what the chart says yes. 14.1GB VRAM usage for the 27B
         | model.
        
           | erichocean wrote:
           | That's the VRAM required just to load the model weights.
           | 
           | To actually use a model, you need a context window.
           | Realistically, you'll want a 20GB GPU or larger, depending on
           | how many tokens you need.
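           | 
           | As a rough back-of-the-envelope (the layer/head counts here
           | are illustrative placeholders, not the exact Gemma 3 27B
           | config, which also uses sliding-window attention to shrink
           | this):
           | 
           |     KV bytes ~= 2 (K and V) x n_layers x n_kv_heads
           |                 x head_dim x context_len x bytes_per_value
           | 
           |     e.g. 2 x 60 x 16 x 128 x 32768 x 2 bytes ~= 15 GiB
           |          for a 32K-token cache at fp16.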
        
             | oezi wrote:
             | I didn't realize that the context would require so much
             | memory. Is this the KV cache? It would seem like a big
             | advantage if this memory requirement could be reduced.
        
         | jffry wrote:
         | With `ollama run gemma3:27b-it-qat "What is blue"`, GPU memory
         | usage is just a hair over 20GB, so no, probably not without a
         | nerfed context window
        
           | woadwarrior01 wrote:
           | Indeed, the default context length in ollama is a mere 2048
           | tokens.
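           | 
           | You can raise it per session (per the ollama FAQ; 8192 is
           | just an example and costs proportionally more VRAM):
           | 
           |     ollama run gemma3:27b-it-qat
           |     >>> /set parameter num_ctx 8192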
        
         | hskalin wrote:
         | With ollama you could offload a few layers to cpu if they don't
         | fit in the VRAM. This will cost some performance of course, but
         | it's much better than the alternative (everything on cpu)
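         | 
         | Ollama picks the CPU/GPU split automatically; with llama.cpp
         | you control it with -ngl (layers kept on the GPU), roughly
         | like this (30 is just an example count; the binary is called
         | main in older builds):
         | 
         |     llama-cli -m gemma-3-27b-it-qat-Q4_0.gguf -ngl 30 \
         |       -c 4096 -p "What is blue?"
         | 
         | Lower -ngl until it fits, at the cost of speed.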
        
           | senko wrote:
           | I'm doing that with a 12GB card, ollama supports it out of
           | the box.
           | 
           | For some reason, it only uses around 7GB of VRAM, probably
           | due to how the layers are scheduled, maybe I could tweak
           | something there, but didn't bother just for testing.
           | 
           | Obviously, perf depends on CPU, GPU and RAM, but on my
           | machine (3060 + i5-13500) it's around 2 t/s.
        
           | dockerd wrote:
           | Does it work on LM Studio? Loading 27b-it-qat is taking up
           | more than 22GB on a 24GB Mac.
        
         | parched99 wrote:
         | I am only able to get the Gemma-3-27b-it-qat-Q4_0.gguf (15.6GB)
         | to run with a 100 token context size on a 5070 ti (16GB) using
         | llamacpp.
         | 
         | Prompt Tokens: 10
         | 
         | Time: 229.089 ms
         | 
         | Speed: 43.7 t/s
         | 
         | Generation Tokens: 41
         | 
         | Time: 959.412 ms
         | 
         | Speed: 42.7 t/s
        
           | floridianfisher wrote:
           | Try one of the smaller versions. 27b is too big for your gpu
        
             | parched99 wrote:
             | I'm aware. I was addressing the question being asked.
        
           | tbocek wrote:
           | This is probably due to this: https://github.com/ggml-
           | org/llama.cpp/issues/12637. This GitHub issue is about
           | interleaved sliding window attention (iSWA) not available in
           | llama.cpp for Gemma 3. This could reduce the memory
           | requirements a lot. They mentioned that in one scenario it
           | goes from 62GB to 10GB.
        
             | parched99 wrote:
             | Resolving that issue would help reduce (not eliminate) the
             | size of the context. The model will still only just barely
             | fit in 16 GB, which is what the parent comment asked.
             | 
             | Best to have two or more low-end, 16GB GPUs for a total of
             | 32GB VRAM to run most of the better local models.
        
             | nolist_policy wrote:
             | Ollama supports iSWA.
        
           | idonotknowwhy wrote:
           | I didn't realise the 5070 is slower than the 3090. Thanks.
           | 
           | If you want a bit more context, try -ctv q8 -ctk q8 (from
           | memory so look it up) to quant the kv cache.
           | 
           | Also an imatrix gguf like iq4xs might be smaller with better
           | quality
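           | 
           | For reference, the llama.cpp spelling (also visible elsewhere
           | in this thread) is -ctk q8_0 / -ctv q8_0, and flash attention
           | has to be on for the V-cache quant to apply:
           | 
           |     llama-cli -m gemma-3-27b-it-qat-Q4_0.gguf -ngl 99 \
           |       -c 8192 -fa -ctk q8_0 -ctv q8_0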
        
             | parched99 wrote:
             | I answered the question directly. IQ4_X_S is smaller, but
             | slower and less accurate than Q4_0. The parent comment
             | specifically asked about the QAT version. That's literally
             | what this thread is about. The context-length mention was
             | relevant to show how it's only barely usable.
        
         | abawany wrote:
         | I tried the 27b-it-qat model on a 4090m with 16gb vram with
         | mostly default args via llama.cpp and it didn't fit - used up
         | the vram and tried to use about 2gb of system ram: performance
         | in this setup was < 5 tps.
        
       | diggan wrote:
       | The first graph compares the "Elo Score" of various models at
       | "native" BF16 precision, and the second graph compares VRAM
       | usage between native BF16 precision and their QAT models. But
       | since this method is about quantizing while also maintaining
       | quality, isn't the obvious graph missing, the one comparing
       | quality between BF16 and QAT? The text doesn't seem to talk
       | about it either, yet it's basically the topic of the blog post.
        
         | croemer wrote:
         | Indeed, the one thing I was looking for was Elo/performance of
         | the quantized models, not how good the base model is. Showing
         | how much memory is saved by quantization in a figure is a bit
         | of an insult to the intelligence of the reader.
        
         | nithril wrote:
         | In addition, the "Massive VRAM Savings" graph states what
         | looks like a tautology: reducing from 16 bits to 4 bits
         | unsurprisingly leads to a 4x reduction in memory usage.
        
         | claiir wrote:
         | Yea they mention a "perplexity drop" relative to naive
         | quantization, but that's meaningless to me.
         | 
         | > We reduce the perplexity drop by 54% (using llama.cpp
         | perplexity evaluation) when quantizing down to Q4_0.
         | 
         | Wish they showed benchmarks / added quantized versions to the
         | arena! :>
        
       | jarbus wrote:
       | Very excited to see these kinds of techniques, I think getting a
       | 30B level reasoning model usable on consumer hardware is going to
       | be a game changer, especially if it uses less power.
        
         | apples_oranges wrote:
         | Deepseek does reasoning on my home Linux PC, but I'm not sure
         | how power hungry it is.
        
           | gcr wrote:
           | what variant? I'd considered DeepSeek far too large for any
           | consumer GPUs
        
             | scosman wrote:
             | Some people run Deepseek on CPU. 37B active params - it
             | isn't fast but it's passable.
        
               | danielbln wrote:
               | Actual deepseek or some qwen/llama reasoning fine-tune?
        
               | scosman wrote:
               | Actual Deepseek. 500gb of memory and a threadripper
               | works. Not a standard PC spec, but a common-ish homebrew
               | setup for single user Deepseek.
        
       | wtcactus wrote:
       | They keep mentioning the RTX 3090 (with 24 GB VRAM), but the
       | model is only 14.1 GB.
       | 
       | Shouldn't it fit a 5060 Ti 16GB, for instance?
        
         | jsnell wrote:
         | Memory is needed for more than just the parameters, e.g. the KV
         | cache.
        
           | cubefox wrote:
           | KV = key-value
        
         | oktoberpaard wrote:
         | With a 128K context length and 8 bit KV cache, the 27b model
         | occupies 22 GiB on my system. With a smaller context length you
         | should be able to fit it on a 16 GiB GPU.
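         | 
         | If that's via ollama, the 8-bit KV cache is set with environment
         | variables on the server (names from memory, so double-check the
         | docs; flash attention has to be enabled for it to apply):
         | 
         |     OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 \
         |       ollama serve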
        
         | Havoc wrote:
         | Just checked - 19 gigs with 8k context @ q8 kv. Plus another
         | 2.5-ish or so for OS etc.
         | 
         | ...so yeah 3090
        
       | noodletheworld wrote:
       | ?
       | 
       | Am I missing something?
       | 
       | These have been out for a while; if you follow the HF link you
       | can see, for example, the 27b quant has been downloaded from HF
       | 64,000 times over the last 10 days.
       | 
       | Is there something more to this, or is it just a follow-up
       | blog post?
       | 
       | (is it just that ollama finally has partial (no images right?)
       | support? Or something else?)
        
         | deepsquirrelnet wrote:
         | QAT "quantization aware training" means they had it quantized
         | to 4 bits during training rather than after training in full or
         | half precision. It's supposedly a higher quality, but
         | unfortunately they don't show any comparisons between QAT and
         | post-training quantization.
        
           | noodletheworld wrote:
           | I understand that, but the qat models [1] are not new
           | uploads.
           | 
           | How is this more significant now than when they were uploaded
           | 2 weeks ago?
           | 
           | Are we expecting new models? I don't understand the timing.
           | This post feels like it's two weeks late.
           | 
           | [1] - https://huggingface.co/collections/google/gemma-3-qat-6
           | 7ee61...
        
             | llmguy wrote:
             | 8 days is closer to 1 week than 2. And it's a blog post,
             | nobody owes you realtime updates.
        
               | noodletheworld wrote:
               | https://huggingface.co/google/gemma-3-27b-it-
               | qat-q4_0-gguf/t...
               | 
               | > 17 days ago
               | 
               | Anywaaay...
               | 
               | I'm literally asking, quite honestly, if this is just an
               | 'after the fact' update literally weeks later, that they
               | uploaded a bunch of models, or if there is something more
               | significant about this I'm missing.
        
               | timcobb wrote:
               | Probably the former... I see your confusion but it's
               | really only a couple weeks at most. The news cycle is
               | strong in you, grasshopper :)
        
               | osanseviero wrote:
               | Hi! Omar from the Gemma team here.
               | 
               | Last time we only released the quantized GGUFs. Only
               | llama.cpp users could use it (+ Ollama, but without
               | vision).
               | 
               | Now, we released the unquantized checkpoints, so anyone
               | can quantize them themselves and use them in their
               | favorite tools, including Ollama with vision, MLX, LM
               | Studio, etc. MLX folks also found that the model worked
               | decently at 3 bits compared to naive 3-bit quantization,
               | so by releasing the unquantized checkpoints we allow
               | further experimentation and research.
               | 
               | TL;DR: one was a release in a specific format/tool; we
               | followed up with a full release of artifacts that enable
               | the community to do much more.
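               | 
               | For example, with llama.cpp the rough flow would be
               | something like this (paths are placeholders; script and
               | binary names as of recent llama.cpp builds):
               | 
               |     python convert_hf_to_gguf.py /path/to/unquantized-qat-checkpoint \
               |       --outfile gemma-3-27b-it-qat-f16.gguf
               |     llama-quantize gemma-3-27b-it-qat-f16.gguf \
               |       gemma-3-27b-it-qat-Q4_0.gguf Q4_0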
        
               | oezi wrote:
               | Hey Omar, is there any chance that Gemma 3 might get a
               | speech (ASR/AST/TTS) release?
        
             | simonw wrote:
             | The official announcement of the QAT models happened on
             | Friday 18th, two days ago. It looks like they uploaded them
             | to HF in advance of that announcement:
             | https://developers.googleblog.com/en/gemma-3-quantized-
             | aware...
             | 
             | The partnership with Ollama and MLX and LM Studio and
             | llama.cpp was revealed in that announcement, which made the
             | models a lot easier for people to use.
        
         | xnx wrote:
         | The linked blog post is from 2 days ago
        
         | Patrick_Devine wrote:
         | Ollama has had vision support for Gemma3 since it came out. The
         | implementation is _not_ based on llama.cpp's version.
        
       | behnamoh wrote:
       | This is what local LLMs need--being treated like first-class
       | citizens by the companies that make them.
       | 
       | That said, the first graph is misleading about the number of
       | H100s required to run DeepSeek r1 at FP16. The model is FP8.
        
         | freeamz wrote:
         | So what is the real comparison against DeepSeek r1? It would
         | be good to know which is actually more cost-efficient and
         | open (reproducible build) to run locally.
        
           | behnamoh wrote:
           | Half the amount of those dots is what it takes. But also, why
           | compare a 27B model with a 600B+ one? That doesn't make sense.
        
             | smallerize wrote:
             | It's an older image that they just reused for the blog
             | post. It's on https://ai.google.dev/gemma for example
        
         | mmoskal wrote:
         | Also ~no one runs an H100 at home, i.e. at batch size 1. What
         | matters is throughput. With 37b active parameters and a
         | massive deployment, throughput (per GPU) should be similar to
         | Gemma.
        
       | mythz wrote:
       | The speed gains are real: after downloading the latest QAT
       | gemma3:27b, eval perf is now 1.47x faster on ollama, up from
       | 13.72 to 20.11 tok/s (on A4000's).
        
       | btbuildem wrote:
       | Is 27B the largest QAT Gemma 3? Given these size reductions, it
       | would be amazing to have the 70B!
        
         | arnaudsm wrote:
         | The original Gemma 3 does not have a 70B version.
        
           | btbuildem wrote:
           | Ah thank you
        
       | umajho wrote:
       | I am currently using the Q4_K_M quantized version of
       | gemma-3-27b-it locally. I previously assumed that a 27B model
       | with image input support wouldn't be very high quality, but after
       | actually using it, the generated responses feel better than those
       | from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M),
       | and its recognition of images is also stronger than I expected.
       | (I thought the model could only roughly understand the concepts
       | in the image, but I didn't expect it to be able to recognize text
       | within the image.)
       | 
       | Since this article publishes the optimized Q4 quantized version,
       | it would be great if it included more comparisons between the new
       | version and my currently used unoptimized Q4 version (such as
       | benchmark scores).
       | 
       | (I deliberately wrote this reply in Chinese and had
       | gemma-3-27b-it Q4_K_M translate it into English.)
        
       | rob_c wrote:
       | Given how long between this being released and this community
       | picking up on it... Lol
        
         | GaunterODimm wrote:
         | 2 days :/...
        
           | rob_c wrote:
           | Given I know people who have been running gemma3 on local
           | devices for almost a month now, this is either a very slow
           | news day or evidence of fingers missing the pulse...
           | https://blog.google/technology/developers/gemma-3/
        
             | simonw wrote:
             | This is new. These are new QAT (Quantization-Aware
             | Training) models released by the Gemma team.
        
               | rob_c wrote:
               | This is nothing more than an iteration on the topic;
               | gemma3 was smashing local results a month ago and made no
               | waves when it dropped...
        
               | simonw wrote:
               | Quoting the linked story:
               | 
               | > Last month, we launched Gemma 3, our latest generation
               | of open models. Delivering state-of-the-art performance,
               | Gemma 3 quickly established itself as a leading model
               | capable of running on a single high-end GPU like the
               | NVIDIA H100 using its native BFloat16 (BF16) precision.
               | 
               | > To make Gemma 3 even more accessible, we are announcing
               | new versions optimized with Quantization-Aware Training
               | (QAT) that dramatically reduces memory requirements while
               | maintaining high quality.
               | 
               | The thing that's new, and that is clearly resonating with
               | people, is the "To make Gemma 3 even more accessible..."
               | bit.
        
               | rob_c wrote:
               | As I've said in my lectures on how to perform 1bit
               | training of QAT systems to build classifiers...
               | 
               | "An iteration on a theme".
               | 
               | Once the network design is proven to work yes it's an
               | impressive technical achievement, but as I've said given
               | I've known people in multiple research institutes and
               | companies using Gemma3 for a month mostly saying they're
               | surprised it's not getting noticed... This is just
               | enabling more users, but the non-QAT version will almost
               | always perform better...
        
               | simonw wrote:
               | Sounds like you're excited to see Gemma 3 get the
               | recognition it deserves on Hacker News then.
        
               | rob_c wrote:
               | No just pointing out the flooding obvious as usual and
               | collecting down votes for it
        
               | fragmede wrote:
               | Speaking for myself, my downvotes are not because of the
               | content of your arguments, but because your tone is
               | consistently condescending and dismissive. Comments like
               | "just pointing out the flooding obvious" come off as smug
               | and combative rather than constructive.
               | 
               | HN works best when people engage in good faith, stay
               | curious, and try to move the conversation forward. That
               | kind of tone -- even when technically accurate --
               | discourages others from participating and derails
               | meaningful discussion.
               | 
               | If you're getting downvotes regularly, maybe it's worth
               | considering how your comments are landing with others,
               | not just whether they're "right."
        
               | rob_c wrote:
               | My tone only switches once people get uppity. The
               | original comment is on point and accurate, not combative
               | and not insulting (unless the community seriously takes a
               | 'lol'....
               | 
               | Tbh I give up writing that in response to this rant. My
               | polite poke holds and it's non insulting so I'm not going
               | to capitulate to those childish enough to not look
               | inwards.
        
       | simonw wrote:
       | I think gemma-3-27b-it-qat-4bit is my new favorite local model -
       | or at least it's right up there with Mistral Small 3.1 24B.
       | 
       | I've been trying it on an M2 64GB via both Ollama and MLX. It's
       | very, very good, and it only uses ~22Gb (via Ollama) or ~15GB
       | (MLX) leaving plenty of memory for running other apps.
       | 
       | Some notes here:
       | https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
       | 
       | Last night I had it write me a complete plugin for my LLM tool
       | like this:
       | 
       |     llm install llm-mlx
       |     llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
       |     llm -m mlx-community/gemma-3-27b-it-qat-4bit \
       |       -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
       |       -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
       |       -s 'Write a new fragments plugin in Python that registers
       |       issue:org/repo/123 which fetches that issue number from the
       |       specified github repo and uses the same markdown logic as
       |       the HTML page to turn that into a fragment'
       | 
       | It gave a solid response!
       | https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... -
       | more notes here:
       | https://simonwillison.net/2025/Apr/20/llm-fragments-github/
        
         | rs186 wrote:
         | Can you quote tps?
         | 
         | More and more I start to realize that cost saving is a small
         | problem for local LLMs. If it is too slow, it becomes unusable,
         | so much that you might as well use public LLM endpoints. Unless
         | you really care about getting things done locally without
         | sending information to another server.
         | 
         | With OpenAI API/ChatGPT, I get response much faster than I can
         | read, and for simple question, it means I just need a glimpse
         | of the response, copy & paste and get things done. Whereas on
         | local LLM, I watch it painstakingly print preambles that I
         | don't care about, and get what I actually need after 20 seconds
         | (on a fast GPU).
         | 
         | And I am not yet talking about context window etc.
         | 
         | I have been researching how people integrate local LLMs
         | in their workflows. My finding is that most people play with it
         | for a short time and that's about it, and most people are much
         | better off spending money on OpenAI credits (which can last a
         | very long time with typical usage) than getting a beefed up Mac
         | Studio or building a machine with a 4090.
        
           | simonw wrote:
           | My tooling doesn't measure TPS yet. It feels snappy to me on
           | MLX.
           | 
           | I agree that hosted models are usually a better option for
           | most people - much faster, higher quality, handle longer
           | inputs, really cheap.
           | 
           | I enjoy local models for research and for the occasional
           | offline scenario.
           | 
           | I'm also interested in their applications for journalism,
           | specifically for dealing with extremely sensitive data like
           | leaked information from confidential sources.
        
             | freeamz wrote:
             | >I'm also interested in their applications for journalism,
             | specifically for dealing with extremely sensitive data like
             | leaked information from confidential sources.
             | 
             | Think it is NOT just you. Most companies with decent
             | management also would not want their data going anywhere
             | outside the physical servers they have control of. But
             | yeah, for most people, just use an app and a hosted server.
             | But this is HN, there are ppl here hosting their own email
             | servers, so it shouldn't be too hard to run an llm locally.
        
               | simonw wrote:
               | "Most company with decent management also would not want
               | their data going to anything outside the physical server
               | they have in control of."
               | 
               | I don't think that's been true for over a decade: AWS
               | wouldn't be trillion dollar business if most companies
               | still wanted to stay on-premise.
        
               | terhechte wrote:
               | Or GitHub. I'm always amused when people don't want to
               | send fractions of their code to a LLM but happily host it
               | on GitHub. All big llm providers offer no-training-on-
               | your-data business plans.
        
               | tarruda wrote:
               | > I'm always amused when people don't want to send
               | fractions of their code to a LLM but happily host it on
               | GitHub
               | 
               | What amuses me even more is people thinking their code is
               | too unique and precious, and that GitHub/Microsoft wants
               | to steal it.
        
               | AlexCoventry wrote:
               | Concern about platform risk in regard to Microsoft is
               | historically justified.
        
               | Terretta wrote:
               | Unlikely they think Microsoft or GitHub wants to steal
               | it.
               | 
               | With LLMs, they're thinking of examples that regurgitated
               | proprietary code, and contrary to everyday general
               | observation, valuable proprietary code does exist.
               | 
               | But with GitHub, the thinking is generally the opposite:
               | the worry is that the code is terrible, and seeing it
               | would be like giant blinkenlights* indicating the way in.
               | 
               | * https://en.wikipedia.org/wiki/Blinkenlights
        
               | vikarti wrote:
               | Regulations sometimes matter. Stupid "security" rules
               | sometimes matter too.
        
               | __float wrote:
               | While none of that is false, I think there's a big
               | difference between shipping your data to an external LLM
               | API and using AWS.
               | 
               | Using AWS is basically a "physical server they have
               | control of".
        
               | simonw wrote:
               | That's why AWS Bedrock and Google Vertex AI and Azure AI
               | model inference exist - they're all hosted LLM services
               | that offer the same compliance guarantees that you get
               | from regular AWS-style hosting agreements.
        
               | IanCal wrote:
               | As in aws is a much bigger security concern?
        
               | ipdashc wrote:
               | Yeah, this has been confusing me a bit. I'm not
               | complaining by ANY means, but why does it suddenly feel
               | like everyone cares about data privacy in LLM contexts,
               | way more than previous attitudes to allowing data to sit
               | on a bunch of random SaaS products?
               | 
               | I assume because of the assumption that the AI companies
               | will train off of your data, causing it to leak? But I
               | thought all these services had enterprise tiers where
               | they'll promise not to do that?
               | 
               | Again, I'm not complaining, it's good to see people
               | caring about where their data goes. Just interesting that
               | they care now, but not before. (In some ways LLMs should
               | be one of the safer services, since they don't even
               | really need to store any data, they can delete it after
               | the query or conversation is over.)
        
               | pornel wrote:
               | It is due to the risk of a leak.
               | 
               | Laundering of data through training makes it a more
               | complicated case than a simple data theft or copyright
               | infringement.
               | 
               | Leaks could be accidental, e.g. due to an employee
               | logging in to their free-as-in-labor personal account
               | instead of a no-training Enterprise account. It's safer
               | to have a complete ban on providers that _may_ collect
               | data for training.
        
               | 6510 wrote:
               | Their entire business model is based on taking other
               | people's stuff. I can't imagine someone would willingly
               | drown with the sinking ship if the entire cargo is filled
               | with lifeboats - just because they promised they would.
        
               | vbezhenar wrote:
               | How can you be sure that AWS will not use your data to
               | train their models? They have enormous amounts of data,
               | probably the most data in the world.
        
               | simonw wrote:
               | Being caught doing that would be wildly harmful to their
               | business - billions of dollars harmful, especially given
               | the contracts they sign with their customers. The brand
               | damage would be unimaginably expensive too.
               | 
               | There is no world in which training on customer data
               | without permission would be worth it for AWS.
               | 
               | Your data really isn't that useful anyway.
        
               | mdp2021 wrote:
               | > _Your data really isn 't that useful anyway_
               | 
               | ? One single random document, maybe, but as an aggregate,
               | I understood some parties were trying to scrape
               | indiscriminately - the "big data" way. And if some of
               | that input is sensitive, and is stored somewhere in the
               | NN, it may come out in an output - in theory...
               | 
               | Actually I never researched the details of the potential
               | phenomenon - that anything personal may be stored (not
               | just George III but Random Randy) -, but it seems
               | possible.
        
               | simonw wrote:
               | There's a pretty common misconception that training LLMs
               | is about loading in as much data as possible no matter
               | the source.
               | 
               | That might have been true a few years ago but today the
               | top AI labs are all focusing on quality: they're trying
               | to find the best possible sources of high quality tokens,
               | not randomly dumping in anything they can obtain.
               | 
               | Andrej Karpathy said this last year:
               | https://twitter.com/karpathy/status/1797313173449764933
               | 
               | > Turns out that LLMs learn a lot better and faster from
               | educational content as well. This is partly because the
               | average Common Crawl article (internet pages) is not of
               | very high value and distracts the training, packing in
               | too much irrelevant information. The average webpage on
               | the internet is so random and terrible it's not even
               | clear how prior LLMs learn anything at all.
        
               | mdp2021 wrote:
               | Obviously the training data should be preferably high
               | quality - but there you have a (pseudo-, I insisted also
               | elsewhere citing the rights to have read whatever is in
               | any public library) problem with "copyright".
               | 
               | If there exists some advantage on quantity though, then
               | achieving high quality imposes questions about tradeoffs
               | and workflows - sources where authors are "free
               | participants" could have odd data sip in.
               | 
               | And the matter of whether such data may be reflected in
               | outputs remains as a question (probably tackled by some I
               | have not read... Ars longa, vita brevis).
        
               | freeamz wrote:
               | In Scandinavia, financial-related servers must be in the
               | country! That always sounded like a sane approach. The
               | whole putting-your-data-on-SaaS-or-AWS thing just seems
               | like the same "Let's shift the responsibility to a big
               | player".
               | 
               | Any important data should NOT be on devices that are NOT
               | physically within our jurisdiction.
        
               | mjlee wrote:
               | AWS has a strong track record, a clear business model
               | that isn't predicated on gathering as much data as
               | possible, and an awful lot to lose if they break their
               | promises.
               | 
               | Lots of AI companies have some of these, but not to the
               | same extent.
        
               | Tepix wrote:
               | on-premises
               | 
               | https://twominenglish.com/premise-vs-premises/
        
               | belter wrote:
               | > "Most company with decent management also would not
               | want their data going to anything outside the physical
               | server they have in control of."
               | 
               | Most companies' physical and digital security controls are
               | so much worse than anything from AWS or Google. Note I
               | don't include Azure... but a _physical server_ they have
               | _control of_ is a phrase that screams vulnerability.
        
             | triyambakam wrote:
             | > specifically for dealing with extremely sensitive data
             | like leaked information from confidential sources.
             | 
             | Can you explain this further? It seems in contrast to your
             | previous comment about trusting Anthropic with your data
        
               | simonw wrote:
               | I trust Anthropic not to train on my data.
               | 
               | If they get hit by a government subpoena because a
               | journalist has been using them to analyze leaked
               | corporate or government secret files I also trust them to
               | honor that subpoena.
               | 
               | Sometimes journalists deal with material that they cannot
               | risk leaving their own machine.
               | 
               | "News is what somebody somewhere wants to suppress"
        
           | overfeed wrote:
           | > Whereas on local LLM, I watch it painstakingly prints
           | preambles that I don't care about, and get what I actually
           | need after 20 seconds.
           | 
           | You may need to "right-size" the models you use to match your
           | hardware and TPS expectations, which may involve using a
           | smaller version of the model with faster TPS, upgrading your
           | hardware, or paying for hosted models.
           | 
           | Alternatively, if you can use agentic workflows or tools like
           | Aider, you don't have to watch the model work slowly with
           | large models locally. Instead you queue work for it, go to
           | sleep, or eat, or do other work, and then much later look
           | over the Pull Requests whenever it completes them.
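           | 
           | With aider that looks roughly like this (the provider prefix
           | and env var are per aider's ollama docs, so double-check
           | them):
           | 
           |     export OLLAMA_API_BASE=http://127.0.0.1:11434
           |     aider --model ollama_chat/gemma3:27b-it-qat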
        
             | rs186 wrote:
             | I have a 4070 super for gaming, and used it to play with
             | LLMs a few times. It is by no means a bad card, but I
             | realize that unless I want to get 4090 or new Macs that I
             | don't have any other use for, I can only use it to run
             | smaller models. However, most smaller models aren't
             | satisfactory and are still slower than hosted LLMs. I
             | haven't found a model that I am happy with for my hardware.
             | 
             | Regarding agentic workflows -- sounds nice but I am too
             | scared to try it out, based on my experience with standard
             | LLMs like GPT or Claude for writing code. Small snippets or
             | filling in missing unit tests, fine, anything more
             | complicated? Has been a disaster for me.
        
               | taneq wrote:
               | As I understand it, these models are limited on GPU
               | memory far more than GPU compute. You'd be better off
               | with dual 4070s than with a single 4090 unless the 4090
               | has more RAM than the other two combined.
        
             | adastra22 wrote:
             | I have never found any agent able to put together sensible
             | pull requests without constant hand holding. I shudder to
             | think of what those repositories must look like.
        
           | otabdeveloper4 wrote:
           | The only actually useful application of LLMs is processing
           | large amounts of data for classification and/or summarizing
           | purposes.
           | 
           | That's not the stuff you want to send to a public API, this
           | is something you want as a 24/7 locally running batch job.
           | 
           | ("AI assistant" is an evolutionary dead end, and Star Trek be
           | damned.)
        
           | DJHenk wrote:
           | > More and more I start to realize that cost saving is a
           | small problem for local LLMs. If it is too slow, it becomes
           | unusable, so much that you might as well use public LLM
           | endpoints. Unless you really care about getting things done
           | locally without sending information to another server.
           | 
           | There is another aspect to consider, aside from privacy.
           | 
           | These models are trained by downloading every scrap of
           | information from the internet, including the works of many,
           | many authors who have never consented to that. And they for
           | sure are not going to get a share of the profits, if there is
           | ever going to be any. If you use a cloud provider, you are
           | basically saying that is all fine. You are happy to pay them,
           | and make yourself dependent on their service, based on work
           | that wasn't theirs to use.
           | 
           | However, if you use a local model, the authors still did not
           | give consent, but one could argue that the company that made
           | the model is at least giving back to the community. They
           | don't get any money out of it, and you are not becoming
           | dependent on their hyper capitalist service. No rent-seeking.
           | The benefits of the work are free to use for everyone. This
           | makes using AI a little more acceptable from a moral
           | standpoint.
        
           | ein0p wrote:
           | Sometimes TPS doesn't matter. I've generated textual
           | descriptions for 100K or so images in my photo archive, some
           | of which I have absolutely no interest in uploading to
           | someone else's computer. This works pretty well with Gemma. I
           | use local LLMs all the time for things where privacy is even
           | remotely important. I estimate this constitutes easily a
           | quarter of my LLM usage.
        
             | lodovic wrote:
             | This is a really cool idea. Do you pretrain the model so it
             | can tag people? I have so many photos that it seems
             | impossible to ever categorize them; using a workflow like
             | yours might help a lot.
        
               | ein0p wrote:
               | No, tagging of people is already handled by another
               | model. Gemma just describes what's in the image, and
               | produces a comma separated list of keywords. No
               | additional training is required besides a few tweaks to
               | the prompt so that it outputs just the description,
               | without any "fluff". E.g. it normally prepends such
               | outputs with "Here's a description of the image:" unless
               | you really insist that it should output only the
               | description. I suppose I could use constrained decoding
               | into JSON or something to achieve the same, but I didn't
               | mess with that.
               | 
               | On some images where Gemma3 struggles Mistral Small
               | produces better descriptions, BTW. But it seems harder to
               | make it follow my instructions exactly.
               | 
               | I'm looking forward to the day when I can also do this
               | with videos, a lot of which I also have no interest in
               | uploading to someone else's computer.
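               | 
               | The driver loop can be as simple as this sketch (prompt
               | and model tag are just examples; ollama attaches the
               | image from the file path in the prompt for vision
               | models):
               | 
               |     for f in ~/photos/*.jpg; do
               |       ollama run gemma3:27b-it-qat \
               |         "Describe this photo, then give a comma-separated
               |          list of keywords. Output only that. $f" \
               |         > "${f%.jpg}.txt"
               |     done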
        
               | fer wrote:
               | How do you use the keywords after? I have Immich running
               | which does some analysis, but the querying is a bit of a
               | hit and miss.
        
               | ein0p wrote:
               | Search is indeed hit and miss. Immich, for instance,
               | currently does absolutely nothing with the EXIF
               | "description" field, so I store textual descriptions on
               | the side as well. I have found Immich's search by image
               | embeddings to be pretty weak at recall, and even weaker
               | at ranking. IIRC Lightroom Classic (which I also use, but
               | haven't found a way to automate this for without writing
               | an extension) does search that field, but ranking is a
               | bit of a dumpster fire, so your best bet is searching
               | uncommon terms or constraining search by metadata (e.g.
               | not just "black kitten" but "black kitten AND 2025"). I
               | expect this to improve significantly over time - it's a
               | fairly obvious thing to add given the available tech.
        
               | ethersteeds wrote:
               | > No, tagging of people is already handled by another
               | model.
               | 
               | As an aside, what model/tools do you prefer for tagging
               | people?
        
               | mentalgear wrote:
               | Since you already seem to have done some impressive work
               | on this for your personal use, would you mind open
               | sourcing it?
        
             | starik36 wrote:
             | I was thinking of doing the same, but I would like to
             | include people's names in the description. For example,
             | "Jennifer looking out in the desert sky."
             | 
             | As it stands, Gemma will just say "Woman looking out in the
             | desert sky."
        
               | ein0p wrote:
               | Most search rankers do not consider word order, so if you
               | could also append the person's name at the end of text
               | description, it'd probably work well enough for retrieval
               | and ranking at least.
               | 
               | If you want natural language to resolve the names, that'd
               | at a minimum require bounding boxes of the faces and
               | their corresponding names. It'd also require either
               | preprocessing, or specialized training, or both. To my
               | knowledge no locally-hostable model as of today has that.
               | I don't know if any proprietary models can do this
               | either, but it's certainly worth a try - they might just
               | do it. The vast majority of the things they can do is
               | emergent, meaning they were never specifically trained to
               | do them.
        
           | k__ wrote:
           | The local LLM is your project manager, the big remote ones
           | are the engineers and designers :D
        
           | trees101 wrote:
           | Not sure how accurate my stats are. I used ollama with the
           | --verbose flag. Using a 4090 and all default settings, I get
           | 40TPS for the Gemma 3 27B model
           | 
           | `ollama run gemma3:27b --verbose` gives me 42.5 TPS +-0.3TPS
           | 
           | `ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS
           | +-0.3TPS
           | 
           | Strange results; the full model gives me slightly more TPS.
        
             | orangecat wrote:
             | ollama's `gemma3:27b` is also 4-bit quantized, you need
             | `27b-it-q8_0` for 8 bit or `27b-it-fp16` for FP16. See
             | https://ollama.com/library/gemma3/tags.
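             | 
             | So to compare against a genuinely higher-precision build
             | (the q8_0 weights alone are roughly 30GB, so it needs a lot
             | more memory):
             | 
             |     ollama pull gemma3:27b-it-q8_0
             |     ollama run gemma3:27b-it-q8_0 --verbose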
        
           | starik36 wrote:
           | On an A5000 with 24GB, this model typically gets between 20
           | to 25 tps.
        
           | a_e_k wrote:
           | I'm seeing ~38--42 tps on a 4090 in a fresh build of
           | llama.cpp under Fedora 42 on my personal machine.
           | 
           | (-t 32 -ngl 100 -c 8192 -fa -ctk q8_0 -ctv q8_0 -m
           | models/gemma-3-27b-it-qat-q4_0.gguf)
        
           | pantulis wrote:
           | > Can you quote tps?
           | 
           | LM Studio running on a Mac Studio M4 Max with 128GB,
           | gemma-3-27B-it-QAT-Q4_0.gguf with a 4096 token context I get
           | 8.89 tps.
        
             | jychang wrote:
             | That's pretty terrible. I'm getting 18tok/sec Gemma 3 27b
             | QAT on a M1 Max 32gb macbook.
        
               | pantulis wrote:
               | Yeah, I know. Not sure if this is due to something in LM
               | Studio or whatever.
        
             | kristianp wrote:
             | Is QAT a different quantisation format to Q4_0? Can you try
             | "gemma-3-27b-it-qat" for a model:
             | https://lmstudio.ai/model/gemma-3-27b-it-qat
        
         | nico wrote:
         | Been super impressed with local models on mac. Love that the
         | gemma models have 128k token context input size. However,
         | outputs are usually pretty short
         | 
         | Any tips on generating long output? Like multiple pages of a
         | document, a story, a play or even a book?
        
           | simonw wrote:
           | The tool you are using may set a default max output size
           | without you realizing. Ollama has a num_ctx that defaults to
           | 2048 for example: https://github.com/ollama/ollama/blob/main/
           | docs/faq.md#how-c...
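           | 
           | If you're hitting the API directly you can override both the
           | context size and the output cap per request, something like
           | this (option names per the ollama docs; num_predict -1 means
           | no output limit):
           | 
           |     curl http://localhost:11434/api/generate -d '{
           |       "model": "gemma3:27b-it-qat",
           |       "prompt": "Write a multi-page story about ...",
           |       "options": {"num_ctx": 16384, "num_predict": -1}
           |     }'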
        
             | nico wrote:
             | Been playing with that, but doesn't seem to have much
             | effect. It works very well to limit output to smaller bits,
             | like setting it to 100-200. But above 2-4k the output seems
             | to never get longer than about 1 page
             | 
             | Might try using the models with mlx instead of ollama to
             | see if that makes a difference
             | 
             | Any tips on prompting to get longer outputs?
             | 
             | Also, does the model context size determine max output
             | size? Are the two related or are they independent
             | characteristics of the model?
        
               | simonw wrote:
               | Interestingly the Gemma 3 docs say: https://ai.google.dev
               | /gemma/docs/core/model_card_3#:~:text=T...
               | 
               | > Total output context up to 128K tokens for the 4B, 12B,
               | and 27B sizes, and 32K tokens for the 1B size per
               | request, subtracting the request input tokens
               | 
               | I don't know how to get it to output anything that length
               | though.
        
               | nico wrote:
               | Thank you for the insights and useful links
               | 
               | Will keep experimenting, will also try mistral3.1
               | 
               | edit: just tried mistral3.1 and the quality of the output
               | is very good, at least compared to the other models I
               | tried (llama2:7b-chat, llama2:latest, gemma3:12b, qwq and
               | deepseek-r1:14b)
               | 
               | From some research, it seems like most models, because of
               | their training sets, are not trained to produce long
               | outputs, so even if they technically could, they won't.
               | Might require developing my own training dataset and then
               | doing some fine tuning. Apparently the models and ollama
               | have some safeguards against rambling and repetition
        
               | Gracana wrote:
               | You can probably find some long-form tuned models on HF.
               | I've had decent results with QwQ-32B (which I can run on
               | my desktop) and Mistral Large (which I have to run on my
               | server). Generating and refining an outline before
               | writing the whole piece can help, and you can also split
               | the piece up into multiple outputs (working a paragraph
               | or two at a time, for instance). So far I've found it to
               | be a tough process, with mixed results.
        
               | nico wrote:
               | Thank you, will try out your suggestions
               | 
               | Have you used something like a director model to
               | supervise the output? If so, could you comment on the
               | effectiveness of it and potentially any tips?
        
               | Gracana wrote:
               | Nope, sounds neat though. There's so much to keep up with
               | in this space.
        
           | Casteil wrote:
           | This is basically the opposite of what I've experienced - at
           | least compared to another recent entry like IBM's Granite
           | 3.3.
           | 
           | By comparison, Gemma3's output (both 12b and 27b) seems to
           | typically be more long/verbose, but not problematically so.
        
             | nico wrote:
              | I agree with you. The outputs are usually good; it's
              | just that for the use case I have now (writing several
              | pages of long dialogs), the output is not as long as
              | I'd want, and definitely not as long as it's supposedly
              | capable of producing.
        
           | tootie wrote:
           | I'm using 12b and getting seriously verbose answers. It's
           | squeezed into 8GB and takes its sweet time but answers are
           | really solid.
        
         | tomrod wrote:
          | Simon, what is your local GPU setup? (No doubt you've
          | covered this, but I'm not sure where to dig it up.)
        
           | simonw wrote:
           | MacBook Pro M2 with 64GB of RAM. That's why I tend to be
           | limited to Ollama and MLX - stuff that requires NVIDIA
           | doesn't work for me locally.
        
             | Elucalidavah wrote:
             | > MacBook Pro M2 with 64GB of RAM
             | 
             | Are there non-mac options with similar capabilities?
        
               | simonw wrote:
               | Yes, but I don't really know anything about those.
               | https://www.reddit.com/r/LocalLLaMA/ is full of people
               | running models on PCs with NVIDIA cards.
               | 
               | The unique benefit of an Apple Silicon Mac at the moment
               | is that the 64GB of RAM is available to both the GPU and
               | the CPU at once. With other hardware you usually need
               | dedicated separate VRAM for the GPU.
        
               | _neil wrote:
               | It's not out yet, but the upcoming Framework desktop [0]
               | is supposed to have a similar unified memory setup.
               | 
               | [0] https://frame.work/desktop
        
               | dwood_dev wrote:
               | Anything with the Radeon 8060S/Ryzen AI Max+ 395. One of
               | the popular MiniPC Chinese brands has them for
               | preorder[0] with shipping starting May 7th. Framework
               | also has them, but shipping Q3.
               | 
               | 0: https://www.gmktec.com/products/prepaid-deposit-amd-
               | ryzen(tm)-a...
        
               | chpatrick wrote:
               | I've never been able to get ROCm working reliably
               | personally.
        
               | danans wrote:
               | Nvidia Orin AGX if a desktop form factor works for you.
        
               | chpatrick wrote:
               | I remember seeing a post about someone running the full
               | size DeepSeek model in a dual-Xeon server with a ton of
               | RAM.
        
             | jychang wrote:
             | MLX is slower than GGUFs on Macs.
             | 
             | On my M1 Max macbook pro, the GGUF version
             | bartowski/google_gemma-3-27b-it-qat-GGUF is 15.6gb and runs
             | at 17tok/sec, whereas mlx-community/gemma-3-27b-it-qat-4bit
             | is 16.8gb and runs at 15tok/sec. Note that both of these
             | are the new QAT 4bit quants.
        
               | phaedrix wrote:
                | No, in general MLX versions are always faster; I've
                | tested most of them.
        
               | 85392_school wrote:
               | What TPS difference are you getting?
        
         | littlestymaar wrote:
         | > and it only uses ~22Gb (via Ollama) or ~15GB (MLX)
         | 
          | Why is the memory use different? Are you using different
          | context sizes in the two setups?
        
           | simonw wrote:
           | No idea. MLX is its own thing, optimized for Apple Silicon.
           | Ollama uses GGUFs.
           | 
           | https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0.
           | https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit
           | says it's 4bit. I think those are the same quantization?
        
             | jychang wrote:
             | Those are the same quant, but this is a good example of why
             | you shouldn't use ollama. Either directly use llama.cpp, or
             | use something like LM Studio if you want something with a
             | GUI/easier user experience.
             | 
              | The Gemma 3 27B QAT GGUF should be taking up ~15 GB,
              | not 22 GB.
        
           | Patrick_Devine wrote:
           | The vision tower is 7GB, so I was wondering if you were
           | loading it without vision?
        
         | paprots wrote:
          | The original gemma3:27b also took only 22GB using Ollama on
          | my 64GB MacBook. I'm quite confused that the QAT version
          | took the same amount. Do you know why? Which model is
          | better, `gemma3:27b` or `gemma3:27b-qat`?
        
           | nolist_policy wrote:
           | I suspect your "original gemma3:27b" was a quantized model
           | since the non-quantized (16bit) version needs around 54gb.
        
           | kgwgk wrote:
           | Look up 27b in https://ollama.com/library/gemma3/tags
           | 
           | You'll find the id a418f5838eaf which also corresponds to
           | 27b-it-q4_K_M
        
           | superkuh wrote:
            | Quantization-aware training just means exposing the model
            | to quantized values during training so that it handles
            | quantization better when it is actually quantized after
            | training. It doesn't change the model size itself.
        
           | zorgmonkey wrote:
            | Both versions are quantized and should use the same
            | amount of RAM. The difference with QAT is that the
            | quantization happens during training, and it should
            | result in slightly better (closer to the bf16 weights)
            | output.
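            | 
            | To make that concrete, here is a minimal PyTorch-style
            | sketch of the general idea (fake-quantizing the weights
            | in the forward pass with a straight-through estimator).
            | This is just an illustration of the technique, not
            | Google's actual QAT recipe:
            | 
            | import torch
            | 
            | def fake_quantize(w, bits=4):
            |     # Symmetric per-tensor "fake" quantization: round to the
            |     # nearest int4 level, then dequantize, so the forward
            |     # pass sees the quantization error during training.
            |     qmax = 2 ** (bits - 1) - 1
            |     scale = w.abs().max().clamp(min=1e-8) / qmax
            |     return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
            | 
            | class QATLinear(torch.nn.Linear):
            |     def forward(self, x):
            |         # Straight-through estimator: quantized weights in the
            |         # forward pass, full-precision gradients in the backward.
            |         w_q = self.weight + (fake_quantize(self.weight) - self.weight).detach()
            |         return torch.nn.functional.linear(x, w_q, self.bias)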
        
         | prvc wrote:
         | > ~15GB (MLX) leaving plenty of memory for running other apps.
         | 
         | Is that small enough to run well (without thrashing) on a
         | system with only 16GiB RAM?
        
           | simonw wrote:
           | I expect not. On my Mac at least I've found I need a bunch of
           | GB free to have anything else running at all.
        
             | mnoronha wrote:
             | Any idea why MLX and ollama use such different amounts of
             | ram?
        
               | jychang wrote:
               | I don't think ollama is quantizing the embeddings table,
               | which is still full FP16.
               | 
               | If you're using MLX, that means you're on a mac, in which
               | case ollama actually isn't your best option. Either
               | directly use llama.cpp if you're a power user, or use LM
               | Studio if you want something a bit better than ollama but
               | more user friendly than llama.cpp. (LM Studio has a GUI
               | and is also more user friendly than ollama, but has the
               | downsides of not being as scriptable. You win some, you
               | lose some.)
               | 
                | Don't use MLX; it's not as fast/small as the best
                | GGUFs currently (and it also tends to be more buggy -
                | it currently has some known bugs with Japanese).
                | Download the LM Studio version of the Gemma 3 QAT
                | GGUF quants, which are made by Bartowski. Google
                | actually directly mentions Bartowski in the blog post
                | linked above (ctrl-f his name), and his models are
                | currently the best ones to use.
               | 
               | https://huggingface.co/bartowski/google_gemma-3-27b-it-
               | qat-G...
               | 
               | The "best Gemma 3 27b model to download" crown has taken
               | a very roundabout path. After the initial Google release,
               | it went from Unsloth Q4_K_M, to Google QAT Q4_0, to
               | stduhpf Q4_0_S, to Bartowski Q4_0 now.
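                | 
                | If it helps, a sketch of grabbing that file with
                | huggingface_hub (the exact .gguf filename below is an
                | assumption - check the repo's file list):
                | 
                | from huggingface_hub import hf_hub_download
                | 
                | # Filename is a guess; confirm it on the repo page.
                | path = hf_hub_download(
                |     repo_id="bartowski/google_gemma-3-27b-it-qat-GGUF",
                |     filename="google_gemma-3-27b-it-qat-Q4_0.gguf",
                | )
                | print(path)  # point llama.cpp / LM Studio at this path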
        
         | codybontecou wrote:
          | Can you run the MLX variant of this model through Ollama so
          | that I can interact with it in Open WebUI?
        
           | simonw wrote:
           | I haven't tried it yet but there's an MLX project that
           | exposes an OpenAI-compatible serving endpoint that should
           | work with Open WebUI: https://github.com/madroidmaq/mlx-omni-
           | server
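            | 
            | Untested sketch on my part, but since it speaks the
            | OpenAI API, something like this should work once the
            | server is running (the base URL/port and model name are
            | assumptions - check the project's README):
            | 
            | from openai import OpenAI
            | 
            | # Point the standard OpenAI client at the local MLX server.
            | client = OpenAI(base_url="http://localhost:10240/v1",
            |                 api_key="not-needed")
            | 
            | reply = client.chat.completions.create(
            |     model="mlx-community/gemma-3-27b-it-qat-4bit",
            |     messages=[{"role": "user", "content": "Hello from MLX!"}],
            | )
            | print(reply.choices[0].message.content)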
        
             | codybontecou wrote:
             | Appreciate the link. I'll try to tinker with it later
             | today.
        
         | bobjordan wrote:
          | Thanks for the call-out on this model! I have 42 GB of
          | usable VRAM on my ancient (~10-year-old) quad-SLI Titan X
          | workstation and have been looking for a model that balances
          | a large context window with output quality. I'm able to run
          | this model with a 56K context window, and it just fits into
          | my 42 GB of VRAM to run 100% on GPU. The output quality is
          | really good, and a 56K context window is very usable. Nice
          | find!
        
         | ygreif wrote:
         | Do many consumer GPUs have >20 gigabytes RAM? That sounds like
         | a lot to me
        
           | mcintyre1994 wrote:
            | I don't think so, but Apple's unified memory architecture
            | makes it a possibility for people with MacBook Pros.
        
       | justanotheratom wrote:
       | Anyone packaged one of these in an iPhone App? I am sure it is
       | doable, but I am curious what tokens/sec is possible these days.
       | I would love to ship "private" AI Apps if we can get reasonable
       | tokens/sec.
        
         | Alifatisk wrote:
         | If you ever ship a private AI app, don't forget to implement
         | the export functionality, please!
        
           | idonotknowwhy wrote:
            | You mean conversations? Just export the JSONL in the
            | standard HF dataset format so it can be imported into
            | other systems?
        
             | Alifatisk wrote:
             | Yeah I mean conversations.
        
         | nico wrote:
         | What kind of functionality do you need from the model?
         | 
         | For basic conversation and RAG, you can use tinyllama or
         | qwen-2.5-0.5b, both of which run on a raspberry pi at around
         | 5-20 tokens per second
        
           | justanotheratom wrote:
           | I am looking for structured output at about 100-200
           | tokens/second on iPhone 14+. Any pointers?
        
         | zamadatix wrote:
          | There are many such apps, e.g. Mollama, Enclave AI,
          | PrivateLLM, or dozens of others. But you could tell me it
          | runs at 1,000,000 tokens/second on an iPhone and I wouldn't
          | care, because the largest model version you're going to be
          | able to load is Gemma 3 4B q4 (12B won't fit in 8 GB with
          | the OS, plus you still need context) and it's just not
          | worth the time to use.
         | 
         | That said, if you really care, it generates faster than reading
         | speed (on an A18 based model at least).
        
           | woodson wrote:
           | Some of these small models still have their uses, e.g. for
           | summarization. Don't expect them to fully replace ChatGPT.
        
             | zamadatix wrote:
              | The use case is more "I'm willing to accept really bad
              | answers with extremely high rates of making things up"
              | than anything application-specific. The same goes for
              | summarization: it doesn't do it nearly as well as a
              | large model would.
        
         | nolist_policy wrote:
          | FWIW, I can run Gemma-3-12b-it-qat on my Galaxy Fold 4 with
          | 12 GB of RAM at around 1.5 tokens/s. I use plain llama.cpp
          | with Termux.
        
           | Casteil wrote:
           | Does this turn your phone into a personal space heater too?
        
       | Alifatisk wrote:
        | Aside from being lighter than the other models, is there
        | anything else the Gemma model is specifically good at, or
        | does better than the other models?
        
         | itake wrote:
          | Google claims better multilingual support, due to tokenizer
          | improvements.
        
         | nico wrote:
          | They are multimodal. Haven't tried the QAT one yet, but the
          | Gemma 3 models released a few weeks ago are pretty good at
          | processing images and telling you details about what's in
          | them.
        
         | Zambyte wrote:
          | I have found that Gemma models can produce useful
          | information about more niche subjects that other models
          | like Mistral Small cannot, at the expense of never really
          | saying "I don't know" where other models will; instead,
          | Gemma will produce false information.
          | 
          | For example, if I ask Mistral Small who I am by name, it
          | will say there is no known notable figure by that name
          | before the knowledge cutoff. Gemma 3 will say I am a
          | well-known <random profession> and make up facts. On the
          | other hand, I have asked both about a local organization in
          | my area that I am involved with, and Gemma 3 could produce
          | useful and factual information, where Mistral Small said it
          | did not know.
        
       | trebligdivad wrote:
        | It seems pretty impressive - I'm running it on my CPU
        | (16-core AMD 3950x) and it's very, very impressive at
        | translation, and the image description is very impressive as
        | well. I'm getting about 2.3 tokens/s on it (compared to under
        | 1/s on the Calme-3.2 I was previously using). It does tend to
        | be a bit chatty unless you tell it not to be; for pretty much
        | everything it'll give you a 'breakdown' unless you tell it
        | not to - so for translation my prompt is 'Translate the input
        | to English, only output the translation' to stop it giving a
        | breakdown of the input language.
        
         | simonw wrote:
         | What are you using to run it? I haven't got image input working
         | yet myself.
        
           | trebligdivad wrote:
           | I'm using llama.cpp - built last night from head; to do image
           | stuff you have to run a separate client they provide, with
           | something like:
           | 
           | ./build/bin/llama-gemma3-cli -m
           | /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj
           | /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this
           | image." --image ~/Downloads/surprise.png
           | 
           | Note the 2nd gguf in there - I'm not sure, but I think that's
           | for encoding the image.
        
           | terhechte wrote:
           | Image input has been working with LM Studio for quite some
           | time
        
         | Havoc wrote:
          | The upcoming Qwen3 series is supposed to be MoE... likely
          | to give better tok/s on CPU.
        
           | slekker wrote:
           | What's MoE?
        
             | zamalek wrote:
             | Mixture of Experts. Very broadly speaking, there are a
             | bunch of mini networks (experts) which can be independently
             | activated.
        
             | Havoc wrote:
              | Mixture of experts, like the other commenter said -
              | everything gets loaded into memory, but not every byte
              | is needed to generate a token (unlike classic dense
              | LLMs like Gemma).
              | 
              | So for devices that have lots of memory but weaker
              | processing power it can get you similar output quality
              | but faster. It tends to do better on CPU and APU-like
              | setups.
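              | 
              | A toy sketch of the routing idea (all experts sit in
              | memory, but each token only runs through its top-k
              | experts); illustrative only, not how any particular
              | model implements it:
              | 
              | import torch
              | import torch.nn as nn
              | 
              | class TinyMoE(nn.Module):
              |     def __init__(self, dim=64, n_experts=8, k=2):
              |         super().__init__()
              |         self.experts = nn.ModuleList(
              |             [nn.Linear(dim, dim) for _ in range(n_experts)])
              |         self.router = nn.Linear(dim, n_experts)
              |         self.k = k
              | 
              |     def forward(self, x):              # x: (tokens, dim)
              |         scores = self.router(x)        # (tokens, n_experts)
              |         weights, idx = scores.topk(self.k, dim=-1)
              |         weights = weights.softmax(dim=-1)
              |         out = torch.zeros_like(x)
              |         for slot in range(self.k):
              |             for e, expert in enumerate(self.experts):
              |                 # Only tokens routed to expert e do its math.
              |                 mask = idx[:, slot] == e
              |                 if mask.any():
              |                     w_sel = weights[mask, slot].unsqueeze(1)
              |                     out[mask] += w_sel * expert(x[mask])
              |         return out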
        
               | trebligdivad wrote:
               | I'm not even sure they're loading everything into memory
               | for MoE; maybe they can get away with only the relevant
               | experts being paged in.
        
       | XCSme wrote:
       | So how does 27b-it-qat (18GB) compare to 27b-it-q4_K_M (17GB)?
        
       | perching_aix wrote:
       | This is my first time trying to locally host a model - gave both
       | the 12B and 27B QAT models a shot.
       | 
        | I was both impressed and disappointed. Setup was piss easy,
        | and the models are great conversationalists. I have a 12 gig
        | card available, and the 12B model ran very nicely and
        | swiftly.
        | 
        | However, they're seemingly terrible at actually assisting
        | with stuff. I tried something very basic: asked for a
        | PowerShell one-liner to get the native block size of my
        | disks. It ended up hallucinating fields, then telling me to
        | go off into the deep end, first elevating to admin, then
        | using WMI, then bringing up IOCTL. Pretty unfortunate. Not
        | sure I'll be able to put it to actual meaningful use as a
        | result.
        
         | parched99 wrote:
          | I think PowerShell is a bad test. I've noticed all local
          | models have trouble providing accurate responses to
          | PowerShell-related prompts. Strangely, even Microsoft's
          | model, Phi 4, is bad at answering these questions without
          | careful prompting. Though MS can't even provide accurate PS
          | docs.
          | 
          | My best guess is that there's not enough
          | discussion/development related to PowerShell in the
          | training data.
        
           | fragmede wrote:
            | Which, like, you'd think Microsoft has an entire team
            | there whose purpose would be to generate good PowerShell
            | for it to train on.
        
         | terhechte wrote:
          | Local models, due to their smaller size, favor popular
          | languages even more than big cloud models do. They work
          | fantastically for JavaScript, Python, and Bash, but much
          | worse for less popular things like Clojure, Nim, or
          | Haskell. PowerShell is probably on the less popular side
          | compared to JS or Bash.
          | 
          | If this is your main use case, you can always try to
          | fine-tune a model. I maintain a small LLM benchmark of
          | different programming languages, and the performance
          | difference between, say, Python and Rust on some smaller
          | models is up to 70%.
        
           | perching_aix wrote:
           | How accessible and viable is model fine-tuning? I'm not in
           | the loop at all unfortunately.
        
             | terhechte wrote:
             | This is a very accessible way of playing around with the
             | topic: https://transformerlab.ai
        
         | HachiWari8 wrote:
         | I tried the 27B QAT model and it hallucinates like crazy. When
         | I ask it for information about some made up person, restaurant,
         | place name, etc., it never says "I don't know about that" and
         | instead seems eager to just make up details. The larger local
         | models like the older Llama 3.3 70B seem better at this, but
         | are also too big to fit on a 24GB GPU.
        
         | jayavanth wrote:
         | you should set a lower temperature
        
       | CyberShadow wrote:
       | How does it compare to CodeGemma for programming tasks?
        
       | api wrote:
        | When I see 32B or 70B models performing similarly to 200+B
        | models, I don't know what to make of it. Either the latter
        | contain more breadth of information but we have managed to
        | distill the latent capabilities to a similar level, or the
        | larger models are just less efficient, or the tests are not
        | very good.
        
         | simonw wrote:
          | It makes intuitive sense to me that this would be possible,
          | because LLMs are still mostly opaque black boxes. I expect
          | you could drop a whole chunk of the weights without having
          | a huge impact on quality - maybe you end up mostly ditching
          | the parts that are derived from shitposts on Reddit but
          | keep the bits from arXiv, for example.
         | 
         | (That's a massive simplification of how any of this works, but
         | it's how I think about it at a high level.)
        
         | retinaros wrote:
          | It's just BS benchmarks. They are all cheating at this
          | point, feeding the benchmark data into the training set.
          | That doesn't mean the LLMs aren't becoming better, but when
          | they all lie...
        
       | mekpro wrote:
        | Gemma 3 is way, way better than Llama 4. I think Meta will
        | start to lose its position in LLM mindshare. Another weakness
        | of Llama 4 is its model size, which is too large (even though
        | it can run fast with MoE); this limits the applicable users
        | to the small percentage of enthusiasts who have enough GPU
        | VRAM. Meanwhile, Gemma 3 is widely usable across all hardware
        | sizes.
        
       | miki123211 wrote:
       | What would be the best way to deploy this if you're maximizing
       | for GPU utilization in a multi-user (API) scenario? Structured
       | output support would be a big plus.
       | 
       | We're working with a GPU-poor organization with very strict data
       | residency requirements, and these models might be exactly what we
       | need.
       | 
        | I would normally say vLLM, but the blog post notably does not
        | mention vLLM support.
        
         | PhilippGille wrote:
         | vLLM lists Gemma 3 as supported, if I'm not mistaken:
         | https://docs.vllm.ai/en/latest/models/supported_models.html#...
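          | 
          | A minimal sketch of offline batch inference with vLLM
          | (assumes a recent vLLM build, the HF model id
          | google/gemma-3-27b-it, and enough VRAM; for the multi-user
          | API case you'd run `vllm serve` instead, which exposes an
          | OpenAI-compatible endpoint):
          | 
          | from vllm import LLM, SamplingParams
          | 
          | # Offline/batch usage; adjust max_model_len to your VRAM.
          | llm = LLM(model="google/gemma-3-27b-it", max_model_len=8192)
          | params = SamplingParams(temperature=0.7, max_tokens=256)
          | 
          | outputs = llm.generate(
          |     ["Explain quantization-aware training briefly."], params)
          | print(outputs[0].outputs[0].text)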
        
       | 999900000999 wrote:
        | Assuming this can match Claude's latest, and full-time usage
        | (as in, you have a system that's constantly running code
        | without any user input), you'd probably save $600 to $700 a
        | month. A 4090 is only $2K, and you'll see an ROI within 90
        | days.
        | 
        | I can imagine this will serve to drive prices for hosted LLMs
        | lower.
        | 
        | At this level, any company that produces even a nominal
        | amount of code should be running LLMs on-prem (or on AWS if
        | you're in the cloud).
        
         | rafaelmn wrote:
          | I'd say a Mac Studio with an M4 Max and 128 GB of RAM will
          | get you way further than a 4090 in context size and model
          | size. It's cheaper than 2x 4090, uses less power, and is a
          | great overall machine.
          | 
          | I think these consumer GPUs are way too expensive for the
          | amount of memory they pack - and that's intentional price
          | discrimination. The builds are also gimmicky; they're just
          | not set up for AI models, and the versions that are cost
          | $20K.
          | 
          | AMD has that 128 GB RAM Strix Halo chip, but even with
          | soldered RAM the bandwidth there is very limited - half of
          | the M4 Max, which is half of a 4090.
         | 
         | I think this generation of hardware and local models is not
         | there yet - would wait for M5/M6 release.
        
           | tootie wrote:
            | There's certainly room to grow, but I'm running Gemma 12B
            | on a 4060 (8GB VRAM), which I bought for gaming, and it's
            | a tad slow but still gives excellent results. It
            | certainly seems software is outpacing hardware right now.
            | The target is making a good-enough model that can run on
            | a phone.
        
           | retinaros wrote:
           | two 3090 are the way to go
        
       | briandear wrote:
       | The normal Gemma models seem to work fine on Apple silicon with
       | Metal. Am I missing something?
        
         | simonw wrote:
         | These new special editions of those models claim to work better
         | with less memory.
        
       | porphyra wrote:
        | It is funny that Microsoft has been peddling "AI PCs" and
        | Apple has been peddling "made for Apple Intelligence" for a
        | while now, when in fact usable models for consumer hardware
        | are only barely starting to be a thing, on extremely high end
        | GPUs like the 3090.
        
         | ivape wrote:
          | This is why the "AI hardware cycle is hype" crowd is so
          | wrong. We're not even close; we're basically at the
          | ColecoVision/Atari stage of hardware here. It's going to be
          | quite a thing when _everyone_ gets an SNES/Genesis.
        
         | icedrift wrote:
         | Capable local models have been usable on Macs for a while now
         | thanks to their unified memory.
        
         | NorwegianDude wrote:
          | A 3090 is not an extremely high-end GPU. It's a consumer
          | GPU launched in 2020, and in both price and compute it's
          | around a mid-range consumer GPU these days.
          | 
          | The high-end consumer card from Nvidia is the RTX 5090, and
          | the professional version of the card is the RTX PRO 6000.
        
           | zapnuk wrote:
            | A 3090 still costs EUR1800. That's not mid-range by a
            | long shot.
            | 
            | The 5070 or 5070 Ti are mid-range. They cost EUR650/900.
        
             | NorwegianDude wrote:
             | 3090s are no longer produced, that's why new ones are so
             | expensive. At least here, used 3090s are around EUR650, and
             | a RTX 5070 is around EUR625.
             | 
              | It's definitely not extremely high end any more; the
              | price is (at least here) the same as the new mid-range
              | consumer cards.
             | 
             | I guess the price can vary by location, but EUR1800 for a
             | 3090 is crazy, that's more than the new price in 2020.
        
             | sentimentscan wrote:
             | A year ago, I bought a brand-new EVGA hybrid-cooled 3090 Ti
             | for 700 euros. I'm still astonished at how good of a
             | decision it was, especially considering the scarcity of
             | 24GB cards available for a similar price. For pure gaming,
             | many cards perform better, but they mostly come with 12 to
             | 16GB of VRAM.
        
           | dragonwriter wrote:
            | For model usability as a binary yes/no, pretty much the
            | only dimension that matters is VRAM, and at 24GB the 3090
            | is still high end for consumer Nvidia GPUs. Yes, the 5090
            | (and _only_ the 5090) is above it, at 32GB, but 24GB is
            | way ahead of the mid-range.
        
             | NorwegianDude wrote:
             | 24 GB of VRAM is a large amount of VRAM on a consumer GPU,
             | that I totally agree with you on. But it's definitely not
             | an extremely high end GPU these days. It is suitable, yes,
             | but not high end. The high end alternative for a consumer
             | GPU would be the RTX 5090, but that is only available for
             | EUR3000 now, while used 3090s are around EUR650.
        
         | dragonwriter wrote:
          | AI PCs aren't about running the kind of models that take a
          | 3090-class GPU, or even _running on GPU at all_, but
          | systems where the local end is running something like
          | Phi-3.5-vision-instruct on system RAM using a CPU with an
          | integrated NPU, which is why the AI PC requirements specify
          | an NPU, a certain amount of processing capacity, and a
          | minimum amount of DDR5/LPDDR5 system RAM.
        
       | mark_l_watson wrote:
       | Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using
       | Ollama for routine work on my 32G memory Mac.
       | 
       | gemma3:27b-it-qat with open-codex, running locally, is just
       | amazingly useful, not only for Python dev, but for Haskell and
       | Common Lisp also.
       | 
       | I still like Gemini 2.5 Pro and o3 for brainstorming or working
       | on difficult problems, but for routine work it (simply) makes me
       | feel good to have everything open source/weights running on my
       | own system.
       | 
        | When I bought my 32G Mac a year ago, I didn't expect to be so
        | happy running gemma3:27b-it-qat with open-codex locally.
        
         | Tsarp wrote:
         | What tps are you hitting? And did you have to change KV size?
        
         | nxobject wrote:
          | Fellow owner of a 32GB MBP here: how much memory does it
          | use while resident - or, if swapping happens, do you see
          | the effects in your day-to-day work? I'm in the awkward
          | position of using a lot of virtualized, bloated Windows
          | software (mostly SAS) on a daily basis.
        
           | mark_l_watson wrote:
           | I have the usual programs running on my Mac, along with open-
           | codex: Emacs, web browser, terminals, VSCode, etc. Even with
           | large contexts, open-codex with Ollama and Gemma 3 27B QAT
           | does not seem to overload my system.
           | 
            | To be clear, I sometimes toggle open-codex to use the
            | Gemini 2.5 Pro API as well, but I enjoy running locally
            | for simpler routine work.
        
         | pantulis wrote:
         | How did you manage to run open-codex against a local ollama? I
         | keep getting 400 Errors no matter what I try with the
         | --provider and --model options.
        
           | pantulis wrote:
           | Never mind, found your Leanpub book and followed the
           | instructions and at least I have it running with qwen-2.5.
           | I'll investigate what happens with Gemma.
        
       | piyh wrote:
       | Meta Maverick is crying in the shower getting so handily beat by
       | a model with 15x fewer params
        
       | Samin100 wrote:
       | I have a few private "vibe check" questions and the 4 bit QAT 27B
       | model got them all correctly. I'm kind of shocked at the
       | information density locked in just 13 GB of weights. If anyone at
       | Deepmind is reading this -- Gemma 3 27B is the single most
       | impressive open source model I have ever used. Well done!
        
         | itake wrote:
         | I tried to use the -it models for translation, but it
         | completely failed at translating adult content.
         | 
         | I think this means I either have to train the -pt model with my
         | own instruction tuning or use another provider :(
        
           | andhuman wrote:
           | Have you tried Mistral Small 24b?
        
             | itake wrote:
              | My current architecture is an on-device model for fast
              | translation, which then gets replaced with a slower
              | translation (via an API call) when it's ready.
              | 
              | 24B would be too big to run on device, and I'm trying
              | to keep my cloud costs low (meaning I can't afford to
              | host a small 24B 24/7).
        
           | jychang wrote:
           | Try mradermacher/amoral-gemma3-27B-v2-qat-GGUF
        
             | itake wrote:
              | My current architecture is an on-device model for fast
              | translation, which then gets replaced with a slower
              | translation (via an API call) when it's ready.
              | 
              | 24B would be too big to run on device, and I'm trying
              | to keep my cloud costs low (meaning I can't afford to
              | host a small 27B 24/7).
        
       | cheriot wrote:
       | Is there already a Helium for GPUs?
        
       | punnerud wrote:
        | Just tested the 27B, and it's not very good at following
        | instructions and is very limited on more complex code
        | problems.
        | 
        | Mapping from one JSON file with a lot of plain text into a
        | new structure fails every time.
        | 
        | Ask it to generate SVG, and the result is very simple and
        | almost too dumb.
        | 
        | It's nice that it doesn't need a huge amount of RAM, and it
        | performs OK on smaller languages, from my initial tests.
        
       | Havoc wrote:
       | Definitely my current fav. Also interesting that for many
       | questions the response is very similar to the gemini series. Must
       | be sharing training datasets pretty directly.
        
       | mattfrommars wrote:
        | Anyone had success using Gemma 3 QAT models on Ollama with
        | Cline? They just don't work as well compared to Gemini 2.0
        | Flash provided via the API.
        
       | technologesus wrote:
       | Just for fun I created a new personal benchmark for vision-
       | enabled LLMs: playing minecraft. I used JSON structured output in
       | LM Studio to create basic controls for the game. Unfortunately no
       | matter how hard I proompted, gemma-3-27b QAT is not really able
       | to understand simple minecraft scenarios. It would say things
       | like "I'm now looking at a stone block. I need to break it" when
       | it is looking out at the horizon in the desert.
       | 
       | Here is the JSON schema: https://pastebin.com/SiEJ6LEz System
       | prompt: https://pastebin.com/R68QkfQu
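        | 
        | For anyone curious, the structured-output calls look roughly
        | like this against LM Studio's OpenAI-compatible local server
        | (default port 1234; the model name and schema here are
        | simplified placeholders, not the ones from the pastebins
        | above):
        | 
        | from openai import OpenAI
        | 
        | client = OpenAI(base_url="http://localhost:1234/v1",
        |                 api_key="lm-studio")
        | 
        | # Simplified placeholder schema: one action plus a reason.
        | schema = {
        |     "name": "minecraft_action",
        |     "schema": {
        |         "type": "object",
        |         "properties": {
        |             "action": {"type": "string",
        |                        "enum": ["move_forward", "turn_left",
        |                                 "turn_right", "mine_block"]},
        |             "reason": {"type": "string"},
        |         },
        |         "required": ["action", "reason"],
        |     },
        | }
        | 
        | reply = client.chat.completions.create(
        |     model="gemma-3-27b-it-qat",
        |     messages=[{"role": "user",
        |                "content": "You see a stone block ahead. What next?"}],
        |     response_format={"type": "json_schema", "json_schema": schema},
        | )
        | print(reply.choices[0].message.content)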
        
         | jvictor118 wrote:
          | I've found the vision capabilities are very bad at spatial
          | awareness/reasoning. They seem to know that certain things
          | are in the image, but not where they are relative to each
          | other, their relative sizes, etc.
        
       | gigel82 wrote:
        | FWIW, the 27b Q4_K_M takes about 23 GB of VRAM with 4k
        | context and 29 GB with 16k context, and runs at ~61 t/s on my
        | 5090.
        
       | manjunaths wrote:
        | I am running this on a 16 GB AMD Radeon 7900 GRE in a 64 GB
        | machine, with ROCm and llama.cpp on Windows 11. I can use
        | Open WebUI or the native GUI for the interface. It is made
        | available via an internal IP to all members of my home.
       | 
        | It runs at around 26 tokens/sec at FP16; FP8 is not supported
        | by the Radeon 7900 GRE.
       | 
       | I just love it.
       | 
       | For coding QwQ 32b is still king. But with a 16GB VRAM card it
       | gives me ~3 tokens/sec, which is unusable.
       | 
        | I tried to make Gemma 3 write a PowerShell script with a
        | terminal GUI interface, and it ran into dead ends and finally
        | gave up. QwQ 32B performed a lot better.
       | 
       | But for most general purposes it is great. My kid's been using it
       | to feed his school textbooks and ask it questions. It is better
       | than anything else currently.
       | 
        | Somehow it is more "uptight" than Llama or the Chinese models
        | like Qwen. I can't put my finger on it; the Chinese models
        | seem nicer and more talkative.
        
         | mdp2021 wrote:
         | > _My kid 's been using it to feed his school textbooks and ask
         | it questions_
         | 
         | Which method are you employing to feed a textbook into the
         | model?
        
       | anshumankmr wrote:
       | my trusty RTX 3060 is gonna have its day in the sun... though I
       | have run a bunch of 7B models fairly easily on Ollama.
        
       | ece wrote:
       | On Hugging Face:
       | https://huggingface.co/collections/google/gemma-3-qat-67ee61...
        
       | yuweiloopy2 wrote:
       | Been using the 27B QAT model for batch processing 50K+ internal
       | documents. The 128K context is game-changing for our legal review
       | pipeline. Though I wish the token generation was faster - at
       | 20tps it's still too slow for interactive use compared to Claude
       | Opus.
        
       | gitroom wrote:
        | Nice, loving the push with local models lately - it always
        | makes me wonder, though: do you think privacy wins out over
        | speed and convenience in the long run, or do people just
        | stick with what's quickest?
        
         | simonw wrote:
         | Speed and convenience will definitely win for most people.
         | Hosted LLMs are _so cheap_ these days, and are massively more
         | capable than anything you can fit on even a very beefy
         | ($4,000+) consumer machine.
         | 
         | The privacy concerns are honestly _mostly_ imaginary at this
         | point, too. Plenty of hosted LLM vendors will promise not to
         | train on your data. The bigger threat is if they themselves log
         | data and then have a security incident, but honestly the risk
         | that your own personal machine gets stolen or hacked is a lot
         | higher than that.
        
       ___________________________________________________________________
       (page generated 2025-04-21 23:01 UTC)