[HN Gopher] GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2
       ___________________________________________________________________
        
       GPT-OSS vs. Qwen3 and a detailed look how things evolved since
       GPT-2
        
       Author : ModelForge
       Score  : 254 points
       Date   : 2025-08-10 15:06 UTC (7 hours ago)
        
 (HTM) web link (magazine.sebastianraschka.com)
 (TXT) w3m dump (magazine.sebastianraschka.com)
        
       | homarp wrote:
       | "From GPT-2 to gpt-oss: Analyzing the Architectural Advances And
       | How They Stack Up Against Qwen3"
        
       | 7moritz7 wrote:
        | Qwen3 is substantially better in my local testing. As in, it
        | adheres to the prompt better (pretty much exactly for the 32B-
        | parameter variant, very impressive) and sounds more organic.
        | 
        | In SimpleBench, gpt-oss (120B) flopped hard, so it doesn't
        | appear particularly good at logic puzzles either.
       | 
       | So presumably, this comes down to...
       | 
       | - training technique or data
       | 
       | - dimension
       | 
        | - fewer, larger experts vs. more, smaller experts
        
         | jszymborski wrote:
         | If I had to make a guess, I'd say this has much, much less to
         | do with the architecture and far more to do with the data and
         | training pipeline. Many have speculated that gpt-oss has
         | adopted a Phi-like synthetic-only dataset and focused mostly on
         | gaming metrics, and I've found the evidence so far to be
         | sufficiently compelling.
        
           | 7moritz7 wrote:
            | That would be interesting. I've been a bit sceptical of the
            | entire strategy from the beginning. If gpt-oss were actually
            | as good as o3-mini (and in some cases o4-mini) outside of
            | benchmarks, it would undermine OpenAI's API offering for
            | GPT-5 nano, and maybe mini too.
           | 
           | Edit: found this analysis, it's on the HN frontpage right now
           | 
           | > this thing is clearly trained via RL to think and solve
           | tasks for specific reasoning benchmarks. nothing else.
           | 
           | https://x.com/jxmnop/status/1953899426075816164
        
             | CuriouslyC wrote:
              | The strategy of Phi isn't bad, it's just not general. It's
              | really a model that's meant to be fine-tuned, but
              | unfortunately fine-tuning tends to shit on RL'd behavior,
              | so it ended up not being that useful. If someone made a
              | Phi-style model with an architecture designed to take
              | knowledge adapters/experts (i.e. a small MoE model
              | designed to have separately trained networks plugged into
              | it, with routing updates via a special LoRA), it'd
              | actually be super useful.
        
               | adastra22 wrote:
                | The Phi strategy is bad. It results in very bad models
                | that are useless in production while gaming the
                | benchmarks to make it appear as though the model can
                | actually do something. This is objectively bad.
        
               | CuriouslyC wrote:
                | I like the idea of having a _HIGHLY_ unopinionated base
                | model that's just good at basic logic and instruction
                | following, which I can fine-tune to my use case. Sadly,
                | full fine-tuning tends to make models derpy, and LoRAs
                | are limited in terms of what they can achieve.
        
               | adastra22 wrote:
               | That seems unrelated? I think we are talking about past
               | each other. Phi was trained on purely synthetic data
               | derived from emulating the benchmark suite. Not
               | surprisingly, this resulted in state of the art scores.
               | And a model that was 100% useless at anything other than
               | making the benchmark number go up.
        
           | unstatusthequo wrote:
            | Yes. I asked gpt-oss to ask me a riddle. The response was
            | absurd. It came up with a nonsensical question, then told me
            | the answer. The answer was a four-letter "word" that wasn't
            | actually a real word.
           | 
           | "What is the word that starts with S, ends with E, and
           | contains A? - SAEA"
           | 
           | Then when I said that's not a word and you gave me the answer
           | already, no fun, it said
           | 
           | "I do not have access to confirm that word."
        
             | verisimi wrote:
             | lol. The answer it gave doesn't even end in an 'E'.
        
             | roscas wrote:
             | "The word is SAME." for qwen3-coder and it took like 10
             | seconds to write all the thinking because it writes it
             | down, but processing was 0 seconds.
             | 
             | "I need to find a word that starts with S, ends with E, and
             | contains A.
             | 
             | Let me think through this step by step:
             | Starts with S         Ends with E         Contains A
             | somewhere in the middle
             | 
             | One word that fits this pattern is "SAGE" - but that only
             | has 4 letters and would be S-A-G-E, which doesn't work
             | because it doesn't contain A in the middle.
             | 
             | Wait, let me reconsider. The word "SAGE" actually works if
             | I read it correctly:                   Starts with S
             | Ends with E           Contains A
             | 
             | Actually, "SAGE" is S-A-G-E, so it starts with S and ends
             | with E, but it contains A in the second position.
             | 
             | Let me think of other possibilities:
             | "SQUARE" - starts with S, ends with E, contains A (in the
             | middle)         "SALE" - starts with S, ends with E,
             | contains A         "SAME" - starts with S, ends with E,
             | contains A
             | 
             | Wait, "SAME" is S-A-M-E - that works! It starts with S,
             | ends with E, and contains A.
             | 
             | The word is SAME. "
        
               | adastra22 wrote:
                | He was asking the LLM to come up with the riddle.
        
             | threeducks wrote:
              | FWIW, I asked gpt-oss-120b this question 10 times and the
              | answer was always "sauce", "sane" or "sale". I also tried
              | different temperatures (from 0 to 1), which did not seem
              | to have an effect on the correctness of the answer.
              | 
              | EDIT: I have now also queried the smaller gpt-oss-20b
              | (free) 10 times via OpenRouter (default settings, provider
              | was AtlasCloud) and the answers were: sage, sane, sane,
              | space, sane, sane, sane, sane, space, sane.
              | 
              | You are either very unlucky, your configuration is
              | suboptimal (weird system prompt perhaps?), or there is
              | some bug in whichever system you are using for inference.
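              | 
              | For the curious, a minimal sketch of that kind of repeated
              | sampling against OpenRouter's OpenAI-compatible API (the
              | model id and prompt wording are assumptions):
              | 
              |     import os
              |     from openai import OpenAI
              | 
              |     # OpenRouter speaks the OpenAI wire protocol.
              |     client = OpenAI(
              |         base_url="https://openrouter.ai/api/v1",
              |         api_key=os.environ["OPENROUTER_API_KEY"],
              |     )
              |     q = ("What is the word that starts with S, "
              |          "ends with E, and contains A?")
              |     for temp in (0.0, 0.5, 1.0):
              |         for _ in range(10):
              |             r = client.chat.completions.create(
              |                 model="openai/gpt-oss-20b:free",
              |                 temperature=temp,
              |                 messages=[{"role": "user",
              |                            "content": q}],
              |             )
              |             print(temp,
              |                   r.choices[0].message.content)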
        
               | yunusabd wrote:
               | GP asked the model to _create_ a riddle, not solve a
               | given one.
        
               | threeducks wrote:
                | Yes, but the odds of getting gpt-oss to produce that
                | exact riddle again are pretty low, and reproducing it is
                | not necessary to demonstrate whether the LLM can answer
                | the riddle correctly.
        
         | BoorishBears wrote:
          | MoE expected performance ~= sqrt(active parameter count *
          | total parameter count)
          | 
          | sqrt(5 * 120) ~= 24 (in billions)
          | 
          | GPT-OSS 120B is effectively a ~24B-parameter model with the
          | speed of a much smaller model.
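          | 
          | A quick back-of-the-envelope in Python (the sqrt rule is a
          | rough community heuristic, not an exact law; the 5.1B active
          | / 117B total figures for gpt-oss-120b are approximate):
          | 
          |     from math import sqrt
          | 
          |     def dense_equivalent(active_b, total_b):
          |         # Heuristic: an MoE behaves roughly like a dense
          |         # model at the geometric mean of its active and
          |         # total parameter counts.
          |         return sqrt(active_b * total_b)
          | 
          |     print(dense_equivalent(5.1, 117))  # ~24 (billions)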
        
         | cranberryturkey wrote:
          | Qwen3 is slow though. I used it. It worked, but it was slow
          | and lacking features.
        
       | roscas wrote:
        | From my experience, qwen3-coder is way better. I still have
        | gpt-oss:20b installed to run a few more tests, but when I give
        | it a program and ask for a summary of what it does, qwen3
        | finishes in a few seconds, while gpt-oss got cancelled after 5
        | minutes... doing nothing.
        | 
        | So I just use qwen3. Fast and great output. If for some reason I
        | don't get what I need, I might use search engines or Perplexity.
        | 
        | I have a 10 GB 3080 and a Ryzen 3600X with 32 GB of RAM.
       | 
       | Qwen3-coder is amazing. Best I used so far.
        
         | smokel wrote:
         | The 20B version doesn't fit in 10GB. That might explain some
         | issues?
        
         | mhitza wrote:
          | I've been using gpt-oss-20b lightly, but what I've found is
          | that with smaller (single-sentence) prompts it was easy to get
          | it to loop infinitely. Since I'm running it with llama.cpp,
          | I've set a small repetition penalty and haven't encountered
          | those issues since (I only use it a couple of times a day to
          | analyze diffs, so I might have just gotten lucky).
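          | 
          | A minimal sketch of that setup via llama-cpp-python (the
          | model path and values are assumptions; repeat_penalty=1.0
          | would disable the penalty entirely):
          | 
          |     from llama_cpp import Llama
          | 
          |     # Assumed local GGUF path and context size.
          |     llm = Llama(model_path="gpt-oss-20b.gguf",
          |                 n_ctx=8192)
          |     out = llm.create_completion(
          |         "Summarize this diff: ...",
          |         temperature=0.0,
          |         repeat_penalty=1.1,  # small penalty vs. loops
          |         max_tokens=512,
          |     )
          |     print(out["choices"][0]["text"])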
        
           | ModelForge wrote:
            | I've been using the ollama version (uses about 13 GB of RAM
            | on macOS) and haven't had that issue yet. I wonder if
            | that's maybe an issue with the llama.cpp port?
        
             | mhitza wrote:
              | Never used ollama, only ready-to-go models via llamafile
              | and llama.cpp.
              | 
              | Maybe ollama has some defaults it applies to models? I
              | start testing models at temperature 0 and tweak from
              | there depending on how they behave.
        
           | nicolaslem wrote:
            | I had the same issue with other models, where they would
            | loop, repeating the same character, sentence, or paragraph
            | indefinitely. It turns out the context size some tools set
            | by default is 2k, which is way too small.
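            | 
            | For example, with Ollama the context window can be raised
            | per request (a sketch; the model tag and window size are
            | assumptions, and num_ctx has historically defaulted to
            | 2048):
            | 
            |     import requests
            | 
            |     r = requests.post(
            |         "http://localhost:11434/api/generate",
            |         json={
            |             "model": "gpt-oss:20b",
            |             "prompt": "Summarize this diff: ...",
            |             "options": {"num_ctx": 8192},
            |             "stream": False,
            |         },
            |     )
            |     print(r.json()["response"])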
        
         | lvl155 wrote:
         | Qwen3 coder 480B is quite good and on par with Sonnet 4. It's
         | the first time I realized the Chinese models are probably going
         | to eclipse US-based models pretty soon, at least for coding.
        
           | indigodaddy wrote:
            | Where do you use Qwen3 480B from? I'm not even seeing it on
            | OpenRouter. EDIT: nm, OpenRouter is just calling it
            | qwen3-coder; when I click for more info it shows it's
            | Qwen3-Coder-480B-A35B-Instruct. And it's one of their free
            | models. Nice.
        
           | cpursley wrote:
            | That might be a stretch; maybe Sonnet 3.5. But it is pretty
            | impressive, as is Kimi on opencode.
        
         | SV_BubbleTime wrote:
          | Are you using this in an agentic way, or in a copy-and-paste,
          | "code this", single-input/single-output way?
          | 
          | I'd like to know how far the frontier models are from the
          | local ones for agentic coding.
        
       | Scene_Cast2 wrote:
       | I find it interesting that the architectures of modern open
       | weight LLMs are so similar, and that most innovation seems to be
       | happening on the training (data, RL) front.
       | 
       | This is contrary to what I've seen in a large ML shop, where
       | architectural tuning was king.
        
         | ModelForge wrote:
          | Good point. LLMs lower the barrier to entry for anyone with
          | enough resources, because these architectures are robust to
          | tweaks as long as you throw enough compute and data at them.
          | You can even violate scaling laws and still get a good model
          | (as Llama 3 showed back then).
        
         | bobbylarrybobby wrote:
         | My guess is that at LLM scale, you really can't try to
         | hyperparameter tune -- it's just too expensive. You probably
         | have to do some _basic_ testing of different architectures,
         | settle on one, and then figure out how to make best use of it
         | (data and RL).
        
       | storus wrote:
       | In my tests, GPT-OSS-120B Q8 was close to DeepSeek R1 671B Q16 in
       | solving graduate-level math but much faster with way fewer
       | thinking tokens.
        
         | overfeed wrote:
          | Supporting TFA's thesis that it's trained to be good at
          | benchmarks.
        
       | mark_l_watson wrote:
        | Wow, Sebastian Raschka's blog articles are jewels - much
        | appreciated.
        | 
        | I use the gpt-oss and qwen3 models a lot (smaller models
        | locally using Ollama and LM Studio) and commercial APIs for the
        | full-size models.
        | 
        | For local model use, I get very good results with gpt-oss when
        | I "over prompt," that is, when I specify a larger amount of
        | context information than I usually do. Qwen3 is simply awesome.
        | 
        | Until about three years ago, I had always understood neural
        | network models (starting in the 1980s) - GANs, recurrent
        | networks, LSTMs, etc. - well enough to write implementations. I
        | really miss the feeling that I could develop at least simpler
        | LLMs on my own. I am slowly working through Sebastian Raschka's
        | excellent book
        | https://www.manning.com/books/build-a-large-language-model-f...
        | but I will probably never finish it (to be honest).
        
         | lvl155 wrote:
         | He does an amazing job of keeping me up to date on this
         | insanely fast-paced space.
        
       | pryelluw wrote:
        | The Qwen3 4B has been very good to use locally. I barely use
        | the online models. Web searches are now more targeted thanks to
        | it. I don't quite fully trust the output, but it's generally
        | good. Models like these will revolutionize local knowledge and
        | automation.
        
         | indigodaddy wrote:
          | Is Qwen telling you better search parameters to then search
          | the web with, or is Qwen actually doing the web searches for
          | you?
        
       | gglon wrote:
       | > At the time of writing, the highest-ranking non-purely-
       | transformer-based model on the LM Arena is Jamba, which is a
       | transformer-state space model hybrid, at rank 96.)
       | 
       | Tencent's hunyuan-turbos, another hybrid, is currently ranked at
       | 22. https://arxiv.org/abs/2505.15431
        
       | oezi wrote:
        | One question I was wondering about regarding the open models
        | released by big labs is how much more they could improve with
        | additional training. GPT-OSS had 2.1M hours of training; how
        | much score improvement could we see at double that?
        
         | poorman wrote:
          | As we saw with GPT-5, the RL approach to training doesn't
          | scale forever.
        
           | oezi wrote:
           | I meant scaling the base training before RL.
        
         | ModelForge wrote:
          | I think GPT-4.5 was potentially the original GPT-5 model,
          | larger and pre-trained on more data. Too bad it was too
          | expensive to deploy at scale, so we never saw the RL'd
          | version.
        
       | chaos_emergent wrote:
       | > This is likely because LLMs are typically trained for only a
       | single epoch over massive datasets, which is in contrast to the
       | multi-hundred-epoch training regimes for which dropout was first
       | introduced.
       | 
       | Wait, is this true? That seems like a wild statement to make,
       | relatively unsubstantiated?
        
         | typon wrote:
          | No, this is well known. Look at Table 2.2 in the GPT-3 paper.
        
       | poorman wrote:
        | This article really goes into a lot of detail, which is nice.
        | gpt-oss is just not good for agentic use in my observation.
        | 
        | tldr; I'll save you a lot of time trying things out for
        | yourself. If you are on a >=32 GB Mac, download LM Studio and
        | then the `qwen3-coder-30b-a3b-instruct-mlx@5bit` model. It uses
        | ~20 GB of RAM, so a 32 GB machine is plenty. Set it up with
        | opencode [1] and you're off to the races! It has great tool-
        | calling ability. The tool-calling ability of gpt-oss doesn't
        | even come close in my observations.
        | 
        | [1] https://opencode.ai/
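        | 
        | As a rough illustration, a minimal tool-calling smoke test
        | against LM Studio's local OpenAI-compatible server (port 1234
        | is LM Studio's default; the model id and read_file tool are
        | assumptions):
        | 
        |     from openai import OpenAI
        | 
        |     client = OpenAI(base_url="http://localhost:1234/v1",
        |                     api_key="lm-studio")  # key is ignored
        |     tools = [{
        |         "type": "function",
        |         "function": {
        |             "name": "read_file",  # hypothetical tool
        |             "description": "Read a file from disk",
        |             "parameters": {
        |                 "type": "object",
        |                 "properties": {
        |                     "path": {"type": "string"},
        |                 },
        |                 "required": ["path"],
        |             },
        |         },
        |     }]
        |     r = client.chat.completions.create(
        |         model="qwen3-coder-30b-a3b-instruct-mlx@5bit",
        |         messages=[{"role": "user",
        |                    "content": "Open README.md"}],
        |         tools=tools,
        |     )
        |     print(r.choices[0].message.tool_calls)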
        
         | ModelForge wrote:
          | The ollama one uses even less (around 13 GB), which is nice.
          | Apparently the gpt-oss team also shared the mxfp4
          | optimizations for Metal.
        
       | eurekin wrote:
        | I'm still in awe that a local 3090 GPU was able to run Qwen3
        | Coder Instruct 30B-A3B (EXL3, Q6) and...
        | 
        | It was able to create a sample page, try starting a server,
        | recognise that a leftover server was running, kill it (after
        | forcing a prompt for my permission), retry, and find its IP for
        | me to open in the browser.
        | 
        | This isn't a demo anymore. That's actually very useful help for
        | interns/juniors already.
        
       | ahmedfromtunis wrote:
       | When I visit the site I get the error "Your connection is not
       | private". Also: "You cannot visit magazine.sebastianraschka.com
       | right now because the website uses HSTS."
       | 
       | Chrome latest on Ubuntu.
        
       ___________________________________________________________________
       (page generated 2025-08-10 23:00 UTC)