[HN Gopher] GPT-OSS vs. Qwen3 and a detailed look how things evo...
___________________________________________________________________
GPT-OSS vs. Qwen3 and a detailed look how things evolved since
GPT-2
Author : ModelForge
Score : 254 points
Date : 2025-08-10 15:06 UTC (7 hours ago)
(HTM) web link (magazine.sebastianraschka.com)
(TXT) w3m dump (magazine.sebastianraschka.com)
| homarp wrote:
| "From GPT-2 to gpt-oss: Analyzing the Architectural Advances And
| How They Stack Up Against Qwen3"
| 7moritz7 wrote:
| Qwen3 is substantially better in my local testing. As in, it
| adheres to the prompt better (pretty much exactly, for the 32B-
| parameter variant - very impressive) and sounds more organic.
|
| In SimpleBench, gpt-oss (120B) flopped hard, so it doesn't appear
| particularly good at logical puzzles either.
|
| So presumably, this comes down to...
|
| - training technique or data
|
| - model dimensions
|
| - a smaller number of large experts vs. a larger number of small
| experts
| jszymborski wrote:
| If I had to make a guess, I'd say this has much, much less to
| do with the architecture and far more to do with the data and
| training pipeline. Many have speculated that gpt-oss has
| adopted a Phi-like synthetic-only dataset and focused mostly on
| gaming metrics, and I've found the evidence so far to be
| sufficiently compelling.
| 7moritz7 wrote:
| That would be interesting. I've been a bit sceptical of the
| entire strategy from the beginning. If gpt-oss were actually as
| good as o3-mini, and in some cases o4-mini, outside of
| benchmarks, that would undermine OpenAI's API offering for GPT-5
| nano and maybe mini too.
|
| Edit: found this analysis, it's on the HN frontpage right now
|
| > this thing is clearly trained via RL to think and solve
| tasks for specific reasoning benchmarks. nothing else.
|
| https://x.com/jxmnop/status/1953899426075816164
| CuriouslyC wrote:
| The strategy of Phi isn't bad; it's just not general. It's
| really a model that's meant to be fine-tuned, but unfortunately
| fine-tuning tends to shit on RL'd behavior, so it ended up not
| being that useful. If someone made a Phi-style model with an
| architecture designed to take knowledge adapters/experts (i.e. a
| small MoE model designed to have separately trained networks
| plugged into it, with routing updates via a special LoRA), it'd
| actually be super useful.
| adastra22 wrote:
| The Phi strategy is bad. It results in very bad models
| that are useless in production, while gaming the
| benchmark to appear like it is actually able to do
| something. This is objectively bad.
| CuriouslyC wrote:
| I like the idea of having a _HIGHLY_ unopinionated base model
| that's just good at basic logic and instruction following, which
| I can fine-tune to my use case. Sadly, full fine-tuning tends to
| make models derpy, and LoRAs are limited in terms of what they
| can achieve.
| adastra22 wrote:
| That seems unrelated? I think we are talking past each other.
| Phi was trained on purely synthetic data derived from emulating
| the benchmark suite. Not surprisingly, this resulted in state-
| of-the-art scores. And a model that was 100% useless at anything
| other than making the benchmark number go up.
| unstatusthequo wrote:
| Yes. I tried asking gpt-oss to ask me a riddle. The response
| was absurd. It came up with a nonsensical question, then told me
| the answer. The answer was a four-letter "word" that wasn't
| actually a real word.
|
| "What is the word that starts with S, ends with E, and
| contains A? - SAEA"
|
| Then when I said that's not a word and you gave me the answer
| already, no fun, it said
|
| "I do not have access to confirm that word."
| verisimi wrote:
| lol. The answer it gave doesn't even end in an 'E'.
| roscas wrote:
| "The word is SAME." for qwen3-coder and it took like 10
| seconds to write all the thinking because it writes it
| down, but processing was 0 seconds.
|
| "I need to find a word that starts with S, ends with E, and
| contains A.
|
| Let me think through this step by step:
| Starts with S Ends with E Contains A
| somewhere in the middle
|
| One word that fits this pattern is "SAGE" - but that only
| has 4 letters and would be S-A-G-E, which doesn't work
| because it doesn't contain A in the middle.
|
| Wait, let me reconsider. The word "SAGE" actually works if
| I read it correctly: Starts with S
| Ends with E Contains A
|
| Actually, "SAGE" is S-A-G-E, so it starts with S and ends
| with E, but it contains A in the second position.
|
| Let me think of other possibilities:
| "SQUARE" - starts with S, ends with E, contains A (in the
| middle) "SALE" - starts with S, ends with E,
| contains A "SAME" - starts with S, ends with E,
| contains A
|
| Wait, "SAME" is S-A-M-E - that works! It starts with S,
| ends with E, and contains A.
|
| The word is SAME. "
| adastra22 wrote:
| He was asking the llm to come up with the riddle.
| threeducks wrote:
| FWIW, I asked gpt-oss-120b this question 10 times and the
| answer was always "sauce", "sane" or "sale". I also tried
| different temperatures (from 0 to 1), which did not seem to
| have an effect on the correctness of the answer.
|
| EDIT: I now have also questioned the smaller gpt-oss-20b
| (free) 10 times via OpenRouter (default settings, provider
| was AtlasCloud) and the answers were: sage, sane, sane,
| space, sane, sane, sane, sane, space, sane.
|
| You are either very unlucky, your configuration is
| suboptimal (weird system prompt perhaps?) or there is some
| bug in whichever system you are using for inference.
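|
| Rough sketch of the loop I ran, in case anyone wants to
| reproduce it (the model ID is what OpenRouter lists for the free
| 20b tier; prompt wording approximated):
|
|     # ask the riddle N times and collect the answers
|     # (assumes the openai package and an OPENROUTER_API_KEY env var)
|     import os
|     from openai import OpenAI
|
|     client = OpenAI(
|         base_url="https://openrouter.ai/api/v1",
|         api_key=os.environ["OPENROUTER_API_KEY"],
|     )
|
|     for _ in range(10):
|         resp = client.chat.completions.create(
|             model="openai/gpt-oss-20b:free",
|             messages=[{"role": "user", "content": "What is the word "
|                        "that starts with S, ends with E, and contains A?"}],
|         )
|         print(resp.choices[0].message.content)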
| yunusabd wrote:
| GP asked the model to _create_ a riddle, not solve a
| given one.
| threeducks wrote:
| Yes, but the odds of getting GPT-OSS to respond with that
| riddle are pretty low and it is not necessary to
| demonstrate whether the LLM can answer the riddle
| correctly.
| BoorishBears wrote:
| MoE expected performance ~ sqrt(active parameters * total
| parameters)
|
| sqrt(5 * 120) ~= 24.5
|
| GPT-OSS 120B is effectively a ~24B-parameter dense model with
| the speed of a much smaller model.
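|
| As a quick sanity check of the heuristic (a folk rule of thumb,
| not an exact law; parameter counts from the gpt-oss model card):
|
|     # geometric-mean rule of thumb for an MoE's "effective" dense size
|     def moe_effective_params(active_b: float, total_b: float) -> float:
|         return (active_b * total_b) ** 0.5
|
|     # gpt-oss-120b: ~5.1B active, ~117B total parameters
|     print(moe_effective_params(5.1, 117))  # ~24.4 (billions)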
| cranberryturkey wrote:
| qwen3 is slow though. i used it. it worked, but it was slow and
| lacking features.
| roscas wrote:
| From my experience, qwen3-coder is way better. I only have gpt-
| oss:20b installed to run a few more tests, but when I give it a
| program and ask for a summary of what it does, qwen3 finishes in
| a few seconds, while gpt-oss got cancelled after 5 minutes...
| doing nothing.
|
| So I just use qwen3. Fast and great output. If for some reason I
| don't get what I need, I might use search engines or Perplexity.
|
| I have a 10GB 3080 and a Ryzen 3600X with 32 GB of RAM.
|
| Qwen3-coder is amazing. Best I used so far.
| smokel wrote:
| The 20B version doesn't fit in 10GB. That might explain some
| issues?
| mhitza wrote:
| I've been using gpt-oss-20b lightly, but what I've found is that
| with smaller (single-sentence) prompts it was easy to get it to
| loop infinitely. Since I'm running it with llama.cpp, I've set a
| small repetition penalty (see the sketch below) and haven't
| encountered the issue since (I only use it a couple of times a
| day to analyze diffs, so I might have just gotten lucky).
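|
| Roughly the invocation I mean (model path and penalty value are
| illustrative, not tuned):
|
|     # llama.cpp server with a mild repetition penalty
|     ./llama-server \
|         -m models/gpt-oss-20b.gguf \
|         --repeat-penalty 1.1 \
|         --temp 0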
| ModelForge wrote:
| I've been using the ollama version (uses about 13 GB of RAM on
| macOS) and haven't had that issue yet. I wonder if it's maybe an
| issue with the llama.cpp port?
| mhitza wrote:
| Never used ollama, only ready-to-go models via llamafile and
| llama.cpp.
|
| Maybe ollama applies some defaults to its models? I start
| testing models at temperature 0 and tweak from there depending
| on how they behave.
| nicolaslem wrote:
| I had the same issue with other models, where they would loop,
| repeating the same character, sentence, or paragraph
| indefinitely. It turns out the context size some tools set by
| default is 2k tokens, which is way too small (see below).
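|
| If you're on llama.cpp, raising it is a single flag (8192 here
| is just an example; size it to your RAM):
|
|     # many frontends default to 2048 tokens of context; set it explicitly
|     ./llama-server -m model.gguf -c 8192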
| lvl155 wrote:
| Qwen3 coder 480B is quite good and on par with Sonnet 4. It's
| the first time I realized the Chinese models are probably going
| to eclipse US-based models pretty soon, at least for coding.
| indigodaddy wrote:
| Where do you use qwen3 480b from? I'm not even seeing it on
| OpenRouter. EDIT: nm, OpenRouter just calls it qwen3-coder;
| when I click for more info, it shows it's
| Qwen3-Coder-480B-A35B-Instruct. And it's one of their free
| models. Nice.
| cpursley wrote:
| That might be a stretch, maybe Sonnet 3.5. But it is pretty
| impressive as is Kimi on opencode.
| SV_BubbleTime wrote:
| Are you using this in an agentic way, or in a copy-paste, "code
| this", single-input/single-output way?
|
| I'd like to know how far the frontier models are from the local
| for agentic coding.
| Scene_Cast2 wrote:
| I find it interesting that the architectures of modern open
| weight LLMs are so similar, and that most innovation seems to be
| happening on the training (data, RL) front.
|
| This is contrary to what I've seen in a large ML shop, where
| architectural tuning was king.
| ModelForge wrote:
| Good point. LLMs lower the barrier to entry for anyone with
| enough resources, because these architectures are robust to
| tweaks as long as you throw enough compute and data at them. You
| can even violate scaling laws and still get a good model (as
| Llama 3 showed back then).
| bobbylarrybobby wrote:
| My guess is that at LLM scale, you really can't try to
| hyperparameter tune -- it's just too expensive. You probably
| have to do some _basic_ testing of different architectures,
| settle on one, and then figure out how to make best use of it
| (data and RL).
| storus wrote:
| In my tests, GPT-OSS-120B Q8 was close to DeepSeek R1 671B Q16 in
| solving graduate-level math but much faster with way fewer
| thinking tokens.
| overfeed wrote:
| Supporting TFA's thesis that it's trained to be good at
| benchmarks.
| mark_l_watson wrote:
| Wow, Sebastian Raschka's blog articles are jewels - much
| appreciated.
|
| I use the gpt-oss and qwen3 models a lot (smaller models locally
| using Ollama and LM Studio) and commercial APIs for the full-
| size models.
|
| For local model use, I get very good results with gpt-oss when I
| "over-prompt," that is, when I specify a larger amount of
| context information than I usually do. Qwen3 is simply awesome.
|
| Until about three years ago, I had always understood neural
| network models (starting in the 1980s: GANs, recurrent networks,
| LSTMs, etc.) well enough to write implementations. I really miss
| the feeling that I could develop at least simpler LLMs on my
| own. I am slowly working through Sebastian Raschka's excellent
| book
| https://www.manning.com/books/build-a-large-language-model-f...
| but I will probably never finish it (to be honest).
| lvl155 wrote:
| He does an amazing job of keeping me up to date on this
| insanely fast-paced space.
| pryelluw wrote:
| The Qwen3 4B has been very good to use locally. I barely use the
| online models anymore. Web searches are now more targeted thanks
| to it. I don't quite fully trust the output, but it's generally
| good. Models like these will revolutionize local knowledge and
| automation.
| indigodaddy wrote:
| Is Qwen suggesting better search parameters for you to then
| search the web with, or is Qwen actually doing the web searches
| for you?
| gglon wrote:
| > At the time of writing, the highest-ranking non-purely-
| transformer-based model on the LM Arena is Jamba, which is a
| transformer-state space model hybrid, at rank 96.)
|
| Tencent's hunyuan-turbos, another hybrid, is currently ranked at
| 22. https://arxiv.org/abs/2505.15431
| oezi wrote:
| One question I was wondering about regarding the open models
| released by big labs is how much more they could improve with
| additional training. GPT-OSS had 2.1M hours of training; how
| much score improvement could we see at double that?
| poorman wrote:
| As we saw with GPT-5, the RL training technique doesn't scale
| forever.
| oezi wrote:
| I meant scaling the base training before RL.
| ModelForge wrote:
| I think GPT-4.5 was potentially the original GPT-5 model, larger
| and pre-trained on more data. Too bad it was too expensive to
| deploy at scale, so we never saw the RL-ed version.
| chaos_emergent wrote:
| > This is likely because LLMs are typically trained for only a
| single epoch over massive datasets, which is in contrast to the
| multi-hundred-epoch training regimes for which dropout was first
| introduced.
|
| Wait, is this true? That seems like a wild statement to make,
| relatively unsubstantiated?
| typon wrote:
| No, this is well known. See Table 2.2 in the GPT-3 paper.
| poorman wrote:
| This article really goes into a lot of detail, which is nice.
| gpt-oss is just not good for agentic use, in my observation.
|
| tl;dr: I'll save you a lot of time trying things out yourself.
| If you are on a >=32 GB Mac, download LM Studio and then the
| `qwen3-coder-30b-a3b-instruct-mlx@5bit` model. It uses ~20 GB of
| RAM, so a 32 GB machine is plenty. Set it up with opencode [1]
| and you're off to the races! It has great tool-calling ability.
| The tool calling of gpt-oss doesn't even come close in my
| observations.
|
| [1] https://opencode.ai/
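|
| If you want to sanity-check the local server before wiring up
| opencode, LM Studio exposes an OpenAI-compatible endpoint
| (localhost:1234 is its default port; the key can be any string):
|
|     # smoke test against LM Studio's local server
|     from openai import OpenAI
|
|     client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
|     resp = client.chat.completions.create(
|         model="qwen3-coder-30b-a3b-instruct-mlx@5bit",
|         messages=[{"role": "user", "content": "Summarize what this repo does"}],
|     )
|     print(resp.choices[0].message.content)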
| ModelForge wrote:
| The ollama one uses even less (around 13 GB), which is nice.
| Apparently the gpt-oss team also shared the mxfp4 optimizations
| for Metal.
| eurekin wrote:
| I'm still in awe that a local 3090 GPU was able to run qwen3-
| coder instruct 30b-a3b (EXL3, Q6) and...
|
| It was able to create a sample page, try starting a server,
| recognize that a leftover server was already running, kill it
| (after forcing a prompt for my permission), retry, and find its
| IP for me to open in the browser.
|
| This isn't a demo anymore. That's actually very useful help for
| interns/juniors already.
| ahmedfromtunis wrote:
| When I visit the site I get the error "Your connection is not
| private". Also: "You cannot visit magazine.sebastianraschka.com
| right now because the website uses HSTS."
|
| Chrome latest on Ubuntu.
___________________________________________________________________
(page generated 2025-08-10 23:00 UTC)