[HN Gopher] Qwen3-4B-Thinking-2507
___________________________________________________________________
Qwen3-4B-Thinking-2507
Author : IdealeZahlen
Score : 163 points
Date : 2025-08-06 15:50 UTC (7 hours ago)
(HTM) web link (huggingface.co)
(TXT) w3m dump (huggingface.co)
| gok wrote:
| So this 4B dense model gets very similar performance to the 30B
| MoE variant with 7.5x smaller footprint.
| smallerize wrote:
| It gets similar performance to the old version of the 30B MoE
| model, but not the updated version.
| https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
| Imustaskforhelp wrote:
| I still think that it's very commendable though.
|
| I am running this beast on my dumb pc with no gpu, now we are
| talking!
| esafak wrote:
| This one should work on personal computers! I'm thankful for
| Chinese companies raising the floor.
| frontsideair wrote:
| According to the benchmarks, this one improves on the previous
| version in every single one, and in some it even beats 30B-A3B.
| Definitely worth a try; it'll easily fit into memory and token
| generation speed will be pleasantly fast.
| GaggiX wrote:
| There is a new Qwen3-30B-A3B; you are comparing it to the old
| one. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
| tolerance wrote:
| Is there like a leaderboard or power rankings sort of thing that
| tracks these small open models and assigns ratings or grades to
| them based on particular use cases?
| esafak wrote:
| https://artificialanalysis.ai/leaderboards/models?open_weigh...
| cowpig wrote:
| Compare these rankings to actual usage:
| https://openrouter.ai/rankings
|
| Claude is not cheap, so why is it far and away the most popular
| if it's not in the top 10 on performance?
|
| Qwen3 235b ranks highest on these benchmarks among open
| models, but I have never met someone who prefers its output
| over Deepseek R1. It's extremely wordy and often gets caught
| in thought loops.
|
| My interpretation is that the models at the top of
| ArtificialAnalysis are focusing the most on public benchmarks
| in their training. Note I am not saying xAI is necessarily doing
| this nefariously; it could just be that they decided it's better
| bang for the buck to rely on public benchmarks than to focus on
| building their own evaluation systems.
|
| But Grok is not very good compared to the Anthropic, OpenAI,
| or Google models despite ranking so highly in benchmarks.
| GaggiX wrote:
| Claude Opus is in the top 10. Also, people on OpenRouter
| mostly use these models for coding, and Claude models are
| particularly good at that, though the benchmark doesn't
| measure only coding ability.
| byefruit wrote:
| The openrouter rankings can be biased.
|
| For example, Google's inexplicable design decisions around
| libraries and APIs mean it's often worth the 5% premium to
| just use OpenRouter to access their models. In other cases
| it's about which models particular agents default to.
|
| Sonnet 4 is extremely good for tool-use agentic setups,
| though, something I have found other models struggle to
| sustain over a long context.
| ImageXav wrote:
| Thanks for sharing that. Interesting that the leaderboard
| is dominated by Anthropic, Google and DeepSeek. OpenAI
| doesn't even register.
| reilly3000 wrote:
| OpenAI has a lot of share that simply doesn't exist via
| OpenRouter. Typical enterprise chat bot apps use it
| directly without paying a tax and may use litellm with
| another vendor for fallback.
| esafak wrote:
| I shared a link to small, open source models; Claude is
| neither.
| whimsicalism wrote:
| Grok is not bad; I think 4 is better than Claude for most
| things other than tool calling.
|
| Of course, this is a politically charged subject now, so
| fair assessments might be hard to come by, as evidenced by
| the downvotes I've already gotten on this comment.
| threeducks wrote:
| OpenRouter rankings conflate many factors like output
| quality, popularity, price and legal concerns. They cannot
| tell us whether a model is popular because it is genuinely
| good, or because many people have heard about it, or
| because it is free, or because the lawyers trust the
| provider.
| decide1000 wrote:
| Qwen3-30B-A3B-2507 is much faster on my machine than
| gpt-oss-20B. This leaderboard does not reflect that.
| tolerance wrote:
| This is perfect. Thanks.
| jampa wrote:
| Am I reading this right? Is this model way better than Gemma
| 3n[1]? (Looking only at the benchmarks that are common to both
| models.)
|
|
| LiveCodeBench: E4B IT 13.2 vs. Qwen 55.2
|
| AIME25: E4B IT 11.6 vs. Qwen 81.3
|
| [1]: https://huggingface.co/google/gemma-3n-E4B
| meatmanek wrote:
| Reasoning models do a lot better at AIME than non-reasoning
| models, with o3 mini getting 85% and 4o-mini getting 11%. It
| makes some sense that this would apply to small models as well.
| film42 wrote:
| Is there a crowd-sourced sentiment score for models? I know all
| these scores are juiced like crazy. I stopped taking them at face
| value months ago. What I want to know is if other folks out there
| actually use them or if they are unreliable.
| nurettin wrote:
| This has been around for a while
| https://lmarena.ai/leaderboard/text/coding
| klohto wrote:
| openrouter usage stats
| esafak wrote:
| https://openrouter.ai/rankings
|
| The new qwen3 model is not out yet.
| setsewerd wrote:
| Since the ranking is based on token usage, wouldn't this
| ranking be skewed by the fact that small models' APIs are
| often used for consumer products, especially free ones?
| Meanwhile reasoning models skew it in the opposite direction,
| but to what extent I don't know.
|
| It's an interesting proxy, but idk how reliable it'd be.
| matznerd wrote:
| Also, these small models are meant to be run locally, so
| they're not going to appear on OpenRouter...
| hnfong wrote:
| Besides the LM Arena Leaderboard mentioned by a sibling
| comment, if you go to the r/LocalLlama/ subreddit, you can very
| unscientifically get a rough sense of the models' performance
| by reading the comments (and maybe even checking the upvotes).
| I think the crowd's knee-jerk reaction is unreliable, though,
| but that's what you asked for.
| NitpickLawyer wrote:
| Not anymore, though. It used to be the place to vibe-check a
| model ~1 year ago, but lately it's filled with toxic
| my-team-vs.-your-team fights, memes about CEOs (wtf) and
| generally poor takes on a lot of things.
|
| For a while it was China vs. the world, but lately it's even
| more divided, with heavy camping on specific models. You can
| still get some signal, but you have to either ban a lot of
| accounts, or read /new during different time zones so you can
| get some of that "i'm just here for the tech stack" vibe from
| posters.
| littlestymaar wrote:
| Yeah, some people just can't stop acting as if tech
| companies were sport teams, and it gets annoying fast.
| svnt wrote:
| It is interesting to think about how they are achieving these
| scores. The evals are rated by GPT-4.1. Beyond just overfitting
| to benchmarks, is it possible the models are internalizing how to
| manipulate the ratings model/agent? Is anyone manually auditing
| these performance tables?
| nisten wrote:
| If you want to have an opinion on it,
|
| just install LM Studio and run the q8_0 version of it, i.e. here:
| https://huggingface.co/bartowski/Qwen_Qwen3-4B-Instruct-2507....
|
| You can even run it on a 4 GB Raspberry Pi with the
| Qwen_Qwen3-4B-Instruct-2507-Q4_K_L.gguf quant. https://lmstudio.ai/
|
| Keep in mind that if you run it at the full 262144 tokens of
| context you'll need ~65 GB of RAM.
|
| Anyway, if you're on a Mac you can search for "qwen3 4b 2507 mlx
| 4bit" and run the MLX version, which is often faster on M-series
| chips. Crazy impressive what you get from a 2 GB file, in my
| opinion.
|
| It's pretty good for summaries etc., and can even make simple
| index.html sites if you're teaching students, but it can't really
| vibecode in my opinion. However, for local automation tasks like
| summarizing your emails, or home automation, or whatever, it is
| excellent.
|
| It's crazy that we're at this point now.
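|
| As a rough sketch of what driving one of these local GGUF quants
| looks like outside LM Studio (assuming llama-cpp-python; the file
| path and prompt below are placeholders, not taken from anything
| above):
|
|     from llama_cpp import Llama  # pip install llama-cpp-python
|
|     # Hypothetical path to a locally downloaded quant file.
|     llm = Llama(
|         model_path="Qwen_Qwen3-4B-Instruct-2507-Q4_K_L.gguf",
|         n_ctx=32768,  # well below the full 262144 to keep RAM modest
|     )
|
|     out = llm.create_chat_completion(
|         messages=[{"role": "user", "content": "Summarize: ..."}],
|         max_tokens=256,
|     )
|     print(out["choices"][0]["message"]["content"])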
| Aeroi wrote:
| how about on apple silicon for the iphone
| jasonjmcghee wrote:
| https://joejoe1313.github.io/2025-05-06-chat-qwen3-ios.html
| esafak wrote:
| Thank you. To spare Mac readers time:
|
| mlx 4bit: https://huggingface.co/lmstudio-
| community/Qwen3-4B-Thinking-...
|
| mlx 5bit: https://huggingface.co/lmstudio-
| community/Qwen3-4B-Thinking-...
|
| mlx 6bit: https://huggingface.co/lmstudio-
| community/Qwen3-4B-Thinking-...
|
| mlx 8bit: https://huggingface.co/lmstudio-
| community/Qwen3-4B-Thinking-...
|
| edit: corrected the 4b link
| ckcheng wrote:
| Did you mean mlx 4bit:
|
| https://huggingface.co/lmstudio-
| community/Qwen3-4B-Thinking-...
| belter wrote:
| This comment saved 3 tons of CO2
| magnat wrote:
| > if you run it at the full 262144 tokens of context you'll need
| ~65 GB of RAM
|
| What is the relationship between context size and RAM required?
| Isn't the size of RAM related only to number of parameters and
| quantization?
| DSingularity wrote:
| No. Your KV cache is kept in memory also.
| Gracana wrote:
| The context cache (or KV cache) is where intermediate results
| are stored. One for each output token. Its size depends on
| the model architecture and dimensions.
|
| KV cache size = 2 * batch_size * context_len *
| num_key_value_heads * head_dim * num_layers * element_size.
| The "2" is for the two parts, key and value. Element size is
| the precision in bytes. This model uses grouped query
| attention, which reduces num_key_value_heads compared to a
| multi head attention (MHA) model.
|
| With batch size 1 (for low-latency single-user inference),
| 32k context (recommended in the model card), fp16 precision:
|
| 2 * 1 * 32768 * 8 * 128 * 36 * 2 bytes ≈ 4.5 GiB.
|
| I think, anyway. It's hard to keep up with this stuff. :)
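|
| The same arithmetic as a tiny Python sketch (the head, dim and
| layer counts are the ones quoted above, taken as assumptions
| rather than a verified model config):
|
|     def kv_cache_bytes(batch, ctx, kv_heads, head_dim, layers, elem):
|         # 2x for the key and value tensors, one entry per token per layer.
|         return 2 * batch * ctx * kv_heads * head_dim * layers * elem
|
|     # batch 1, 32k context, fp16 elements -> ~4.5 GiB
|     print(kv_cache_bytes(1, 32768, 8, 128, 36, 2) / 2**30)
|
|     # same per-layer figures at the full 262144-token context -> ~36 GiB
|     print(kv_cache_bytes(1, 262144, 8, 128, 36, 2) / 2**30)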
| hnuser123456 wrote:
| A 24GB GPU can run a ~30b parameter model at 4bit
| quantization at about 8k-12k context length before every GB
| of VRAM is occupied.
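|
| Back-of-envelope for that, assuming roughly 0.56 bytes per weight
| for a typical 4-bit quant and ignoring activation overhead:
|
|     params = 30e9
|     weights_gb = params * 0.56 / 1e9   # ~16.8 GB of quantized weights
|     left_for_kv = 24 - weights_gb      # ~7 GB left for KV cache etc.
|     print(weights_gb, left_for_kv)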
| 0x457 wrote:
| I mean...where do you think context is stored?
___________________________________________________________________
(page generated 2025-08-06 23:01 UTC)