[HN Gopher] Qwen3-4B-Thinking-2507
       ___________________________________________________________________
        
       Qwen3-4B-Thinking-2507
        
       Author : IdealeZahlen
       Score  : 163 points
       Date   : 2025-08-06 15:50 UTC (7 hours ago)
        
 (HTM) web link (huggingface.co)
 (TXT) w3m dump (huggingface.co)
        
       | gok wrote:
        | So this 4B dense model gets very similar performance to the 30B
        | MoE variant with a 7.5x smaller footprint.
        
         | smallerize wrote:
         | It gets similar performance to the old version of the 30B MoE
         | model, but not the updated version.
         | https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
        
           | Imustaskforhelp wrote:
            | I still think it's very commendable though.
           | 
           | I am running this beast on my dumb pc with no gpu, now we are
           | talking!
        
       | esafak wrote:
       | This one should work on personal computers! I'm thankful for
       | Chinese companies raising the floor.
        
       | frontsideair wrote:
        | According to the benchmarks, this one improves on the previous
        | version in every single one, and on some it even beats 30B-A3B.
        | Definitely worth a try; it'll easily fit into memory and token
        | generation speed will be pleasantly fast.
        
         | GaggiX wrote:
          | There is a new Qwen3-30B-A3B; you are comparing it to the old
          | one. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
        
       | tolerance wrote:
       | Is there like a leaderboard or power rankings sort of thing that
       | tracks these small open models and assigns ratings or grades to
       | them based on particular use cases?
        
         | esafak wrote:
         | https://artificialanalysis.ai/leaderboards/models?open_weigh...
        
           | cowpig wrote:
           | Compare these rankings to actual usage:
           | https://openrouter.ai/rankings
           | 
            | Claude is not cheap, so why is it far and away the most
            | popular if it's not in the top 10 in performance?
           | 
           | Qwen3 235b ranks highest on these benchmarks among open
           | models, but I have never met someone who prefers its output
           | over Deepseek R1. It's extremely wordy and often gets caught
           | in thought loops.
           | 
            | My interpretation is that the models at the top of
            | ArtificialAnalysis are the ones focusing the most on public
            | benchmarks in their training. Note I am not saying xAI is
            | necessarily doing this nefariously; it could just be that they
            | decided it's better bang for the buck to rely on public
            | benchmarks than to build their own evaluation systems.
           | 
            | But Grok is not very good compared to the Anthropic, OpenAI,
            | or Google models despite ranking so highly in benchmarks.
        
             | GaggiX wrote:
              | Claude Opus is in the top 10. Also, people on OpenRouter
              | mostly use these models for coding, and Claude models are
              | particularly good at that; the benchmark doesn't account
              | only for coding ability, though.
        
             | byefruit wrote:
              | The OpenRouter rankings can be biased.
              | 
              | For example, Google's inexplicable design decisions around
              | libraries and APIs mean it's often worth the 5% premium to
              | just use OpenRouter to access their models. In other cases
              | it's about which models particular agents default to.
              | 
              | Sonnet 4 is extremely good for tool-usage agentic setups
              | though - something I have found other models struggle with
              | over a long context.
        
             | ImageXav wrote:
             | Thanks for sharing that. Interesting that the leaderboard
              | is dominated by Anthropic, Google and DeepSeek. OpenAI
              | doesn't even register.
        
               | reilly3000 wrote:
                | OpenAI has a lot of usage share that simply never goes
                | through OpenRouter. Typical enterprise chatbot apps use it
                | directly without paying a tax and may use LiteLLM with
                | another vendor for fallback.
        
             | esafak wrote:
             | I shared a link to small, open source models; Claude is
             | neither.
        
             | whimsicalism wrote:
             | grok is not bad, i think 4 is better than claude for most
             | things other than tool calling.
             | 
             | of course, this is a politically charged subject now so
             | fair assessments might be hard to come by - as evidenced by
             | the downvotes i've already gotten on this comment
        
             | threeducks wrote:
             | OpenRouter rankings conflate many factors like output
              | quality, popularity, price and legal concerns. They cannot
             | tell us whether a model is popular because it is genuinely
             | good, or because many people have heard about it, or
             | because it is free, or because the lawyers trust the
             | provider.
        
           | decide1000 wrote:
            | Qwen3-30B-A3B-2507 is much faster on my machine compared to
           | gpt-oss-20B. This leaderboard does not reflect that.
        
           | tolerance wrote:
           | This is perfect. Thanks.
        
       | jampa wrote:
        | Am I reading this right, is this model way better than Gemma
        | 3n[1]? (For only the benchmarks that are common to both models)
       | 
       | =====
       | 
       | LiveCodeBench
       | 
       | E4B IT: 13.2
       | 
       | Qwen: 55.2
       | 
       | ===== AIME25
       | 
       | E4B IT: 11.6
       | 
       | Qwen: 81.3
       | 
       | [1]: https://huggingface.co/google/gemma-3n-E4B
        
         | meatmanek wrote:
         | Reasoning models do a lot better at AIME than non-reasoning
         | models, with o3 mini getting 85% and 4o-mini getting 11%. It
         | makes some sense that this would apply to small models as well.
        
       | film42 wrote:
       | Is there a crowd-sourced sentiment score for models? I know all
       | these scores are juiced like crazy. I stopped taking them at face
       | value months ago. What I want to know is if other folks out there
       | actually use them or if they are unreliable.
        
         | nurettin wrote:
         | This has been around for a while
         | https://lmarena.ai/leaderboard/text/coding
        
         | klohto wrote:
         | openrouter usage stats
        
           | esafak wrote:
           | https://openrouter.ai/rankings
           | 
           | The new qwen3 model is not out yet.
        
           | setsewerd wrote:
           | Since the ranking is based on token usage, wouldn't this
           | ranking be skewed by the fact that small models' APIs are
           | often used for consumer products, especially free ones?
           | Meanwhile reasoning models skew it in the opposite direction,
           | but to what extent I don't know.
           | 
           | It's an interesting proxy, but idk how reliable it'd be.
        
             | matznerd wrote:
              | Also, these small models are meant to be run locally, so
              | they're not going to appear on OpenRouter...
        
         | hnfong wrote:
         | Besides the LM Arena Leaderboard mentioned by a sibling
          | comment, if you go to the r/LocalLlama/ subreddit you can very
          | unscientifically get a rough sentiment of the performance of
          | the models by reading the comments (and maybe even checking the
          | upvotes). I think the crowd's knee-jerk reaction is unreliable,
          | but that's what you asked for.
        
           | NitpickLawyer wrote:
           | Not anymore tho. It used to be the place to vibe-check a
           | model ~1 year ago, but lately it's filled with toxic my team
           | vs. your team, memes about CEOs (wtf) and general poor takes
           | on a lot of things.
           | 
            | For a while it was China vs. the world, but lately it's even
            | more divided, with heavy camping on specific models. You can
            | still get some signal, but you have to either ban a lot of
            | accounts, or read the new queue during different time zones
            | so you can get some of that "i'm just here for the tech
            | stack" vibe from posters.
        
             | littlestymaar wrote:
              | Yeah, some people just can't stop acting as if tech
              | companies were sports teams, and it gets annoying fast.
        
       | svnt wrote:
       | It is interesting to think about how they are achieving these
       | scores. The evals are rated by GPT-4.1. Beyond just overfitting
       | to benchmarks, is it possible the models are internalizing how to
       | manipulate the ratings model/agent? Is anyone manually auditing
       | these performance tables?
        
       | nisten wrote:
        | If you want to have an opinion on it, just install LM Studio
        | (https://lmstudio.ai/) and run the q8_0 version of it, e.g. here:
        | https://huggingface.co/bartowski/Qwen_Qwen3-4B-Instruct-2507....
        | 
        | You can even run it on a 4GB Raspberry Pi with
        | Qwen_Qwen3-4B-Instruct-2507-Q4_K_L.gguf
       | 
        | Keep in mind that if you run it at the full 262144 tokens of
        | context you'll need ~65GB of RAM.
       | 
        | Anyway, if you're on a Mac you can search for "qwen3 4b 2507 mlx
        | 4bit" and run the MLX version, which is often faster on M-series
        | chips. Crazy impressive what you get from a 2GB file, in my
        | opinion.
       | 
        | It's pretty good for summaries etc., and can even make simple
        | index.html sites if you're teaching students, but it can't really
        | vibecode in my opinion. However, for local automation tasks like
        | summarizing your emails, or home automation, or whatever, it is
        | excellent.
       | 
       | It's crazy that we're at this point now.
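        | 
        | For reference, here's a minimal sketch of running it through the
        | mlx-lm Python package (the repo id and prompt below are
        | illustrative; point it at whichever MLX conversion you actually
        | downloaded):
        | 
        |     # pip install mlx-lm  (Apple silicon only)
        |     from mlx_lm import load, generate
        | 
        |     # Illustrative repo id, not necessarily the exact upload
        |     model, tokenizer = load("mlx-community/Qwen3-4B-Thinking-2507-4bit")
        | 
        |     # Build a chat-formatted prompt and generate a reply
        |     prompt = tokenizer.apply_chat_template(
        |         [{"role": "user", "content": "Summarize this email: ..."}],
        |         add_generation_prompt=True,
        |         tokenize=False,
        |     )
        |     print(generate(model, tokenizer, prompt=prompt, max_tokens=512))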
        
         | Aeroi wrote:
          | How about on Apple silicon for the iPhone?
        
           | jasonjmcghee wrote:
           | https://joejoe1313.github.io/2025-05-06-chat-qwen3-ios.html
        
         | esafak wrote:
         | Thank you. To spare Mac readers time:
         | 
         | mlx 4bit: https://huggingface.co/lmstudio-
         | community/Qwen3-4B-Thinking-...
         | 
         | mlx 5bit: https://huggingface.co/lmstudio-
         | community/Qwen3-4B-Thinking-...
         | 
         | mlx 6bit: https://huggingface.co/lmstudio-
         | community/Qwen3-4B-Thinking-...
         | 
         | mlx 8bit: https://huggingface.co/lmstudio-
         | community/Qwen3-4B-Thinking-...
         | 
         | edit: corrected the 4b link
        
           | ckcheng wrote:
           | Did you mean mlx 4bit:
           | 
           | https://huggingface.co/lmstudio-
           | community/Qwen3-4B-Thinking-...
        
           | belter wrote:
           | This comment saved 3 tons of CO2
        
         | magnat wrote:
         | > if you run it at the full 262144 tokens of context youll need
         | ~65gb of ram
         | 
            | What is the relationship between context size and the RAM
            | required? Isn't the amount of RAM related only to the number
            | of parameters and the quantization?
        
           | DSingularity wrote:
           | No. Your KV cache is kept in memory also.
        
           | Gracana wrote:
            | The context cache (or KV cache) is where intermediate
            | attention results are stored, one entry per token of context.
            | Its size depends on the model architecture and dimensions.
           | 
           | KV cache size = 2 * batch_size * context_len *
           | num_key_value_heads * head_dim * num_layers * element_size.
           | The "2" is for the two parts, key and value. Element size is
           | the precision in bytes. This model uses grouped query
           | attention, which reduces num_key_value_heads compared to a
            | multi-head attention (MHA) model.
           | 
           | With batch size 1 (for low-latency single-user inference),
           | 32k context (recommended in the model card), fp16 precision:
           | 
           | 2 * 1 * 32768 * 8 * 128 * 36 * 2 = 4.5GiB.
           | 
           | I think, anyway. It's hard to keep up with this stuff. :)
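            | 
            | As a quick sanity check of that arithmetic (dims as described
            | above, so treat them as approximate):
            | 
            |     # KV cache = 2 (K and V) * batch * ctx * kv_heads
            |     #            * head_dim * layers * bytes_per_element
            |     batch, ctx = 1, 32768          # single user, 32k context
            |     kv_heads, head_dim = 8, 128    # grouped query attention
            |     layers, elem_bytes = 36, 2     # fp16
            |     kv = 2 * batch * ctx * kv_heads * head_dim * layers * elem_bytes
            |     print(kv / 2**30)              # ~4.5 GiB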
        
           | hnuser123456 wrote:
            | A 24GB GPU can run a ~30B parameter model at 4-bit
            | quantization with about 8k-12k of context length before every
            | GB of VRAM is occupied.
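            | 
            | Rough back-of-the-envelope, with illustrative numbers (a
            | Q4_K-style quant averages closer to ~4.5-5 bits/weight, and
            | assuming a GQA 30B-class model with 48 layers, 8 KV heads,
            | head_dim 128):
            | 
            |     weights_gb = 30e9 * 4.8 / 8 / 1e9     # ~18 GB of weights
            |     kv_per_tok = 2 * 8 * 128 * 48 * 2     # fp16 KV, ~197 KB/token
            |     kv_gb = kv_per_tok * 10_000 / 1e9     # ~2 GB at 10k context
            |     print(weights_gb + kv_gb)             # ~20 GB + runtime overhead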
        
           | 0x457 wrote:
           | I mean...where do you think context is stored?
        
       ___________________________________________________________________
       (page generated 2025-08-06 23:01 UTC)