[HN Gopher] Qwen3.6-35B-A3B on my laptop drew me a better pelica...
       ___________________________________________________________________
        
       Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude
       Opus 4.7
        
       Author : simonw
       Score  : 237 points
       Date   : 2026-04-16 17:37 UTC (5 hours ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | ericpauley wrote:
       | Going to have to disagree on the backup test. Opus flamingo is
       | actually on the pedals and seat with functional spokes and beak.
       | In terms of adherence to physical reality Qwen is completely off.
       | To me it's a little puzzling that someone would prefer the Qwen
       | output.
       | 
       | I'd say the example actually does (vaguely) suggest that Qwen
       | might be overfitting to the Pelican.
        
         | wongarsu wrote:
         | Qwen's flamingo is artistically far more interesting. It's a
         | one-eyed flamingo with sunglasses and a bow tie who smokes pot.
         | Meanwhile Opus just made a boring, somewhat dorky flamingo.
         | Even the ground and sky are more interesting in Qwen's version
         | 
         | But in terms of making something physically plausible, Opus
         | certainly got a lot closer
        
           | kmacdough wrote:
           | Given adherence is a more significant practical barrier, it's
           | probably the better signal. That is, if we decide to look
           | for signal here.
        
             | doobiedowner wrote:
             | Is "signal" the new word you weirdos grossly abuse?... The
             | new "optics"?
        
         | tecoholic wrote:
         | Even the first one - Qwen added extra details in the background
         | sure. But the Pelican itself is a stork with a bent beak and
         | its feet are cut off its legs. While impressive for a local
         | model, I don't think it's a winner.
        
           | mejutoco wrote:
           | Did you see opus bike though for that same test? I know it is
           | about the flamingo but that is bad.
        
         | irthomasthomas wrote:
         | It's a 3B model. It should not be this close. Debating their
         | artistic qualities is missing the point.
        
       | comandillos wrote:
       | I've been using Qwen3.5-35B-A3B for a bit via OpenCode and oMLX
       | on an M5 Max with 128GB of RAM and I have to say it's impressively
       | good for a model of that size. I've seen a huge jump in the
       | quality of the tool calls and how well it handles the agentic
       | workflow.
        
         | iib wrote:
         | This is about the newly released Qwen3.6. Just wanted to make
         | sure you got that correctly.
        
       | mentalgear wrote:
       | I understand the 'fun factor' but at this point I really wonder
       | what this pelican still proves? I mean, providers certainly
       | could have adapted for it if they wanted, and if you want to test
       | how well a model adapts to potential out of distribution
       | contexts, it might be more worthwhile to mix different animals
       | with different activity types (a whale on a skateboard) than
       | always the same.
        
         | simonw wrote:
         | That's why I did the flamingo on a unicycle.
         | 
         | For a delightful moment this morning I thought I might have
         | finally caught a model provider cheating by training for the
         | pelican, but the flamingo convinced me that wasn't the case.
        
           | prodigycorp wrote:
           | To me the opus flamingo is waaaay better than the qwen one.
           | qwen has the better pelican, though.
        
           | dude250711 wrote:
           | Is a flamingo on a unicycle not merely a special case of a
           | pelican on a bicycle?
        
           | furyofantares wrote:
           | It is completely wild to me that you prefer Qwen's flamingo.
           | I think it's really bad and Opus' is pretty good.
        
             | simonw wrote:
             | The Opus one doesn't even have a bowtie.
        
               | furyofantares wrote:
               | The Opus one looks like a flamingo, and looks like it's
               | riding the unicycle. Sitting on the seat. Feet on the
               | pedals.
               | 
               | The Qwen one looks like a 3-tailed, broken-winged,
               | beakless (I guess? Is that offset white thing a beak? Or
               | is it chewing on a pelican feather like it's a piece of
               | straw?) monstrosity not sitting on the seat, with its one
               | foot off the pedal (the other chopped off at the knee) of
               | a malmanufactured wheel that has bonus spokes that are
               | longer than the wheel.
               | 
               | But yeah, it does have a bowtie and sunglasses that you
               | didn't ask for! Plus it says "<3 Flamingo on a Unicycle
               | <3", which perhaps resolves all ambiguity.
        
               | bigyabai wrote:
               | Let's not oversell Opus' output. The Qwen flamingo is
               | flawed but could be easily fixed with 1-2 prompts if
               | you're really upset with it. The Opus SVG is not any
               | better than something that I could make in Inkscape with
               | 3 minutes and sufficient motivation. Calling Opus'
               | flamingo "programmer art" would be an insult to
               | programmers.
        
               | monksy wrote:
               | Game over opus
        
           | akavel wrote:
           | r/LocalLlama is now doing a horse in a racing car:
           | 
           | https://redd.it/1slz38i
        
         | stephbook wrote:
         | They're certainly aware of the test, but a turtle doing a
         | kickflip on a skateboard? I seriously doubt they train their
         | models for that.
         | 
         | https://x.com/JeffDean/status/2024525132266688757
         | 
         | If anything, the disastrous Opus4.7 pelican shows us they don't
         | pelicanmaxx
        
           | bitwize wrote:
           | I think I found the leaked Claude Mythos version of the
           | turtle benchmark: https://www.youtube.com/watch?v=l82XWTKLZuk
        
         | BoorishBears wrote:
         | This is a gag that's long outlived its humor, but we're in a
         | space so driven by hype there are people who will unironically
         | take some signal from it. They'll swear up and down they know
         | it's for fun, but let a great pelican come out and see if they
         | don't wave it as proof the model is great alongside their
         | carwash test.
        
       | jbellis wrote:
       | For coding, qwen 3.6 35b a3b solved 11/98 of the Power Ranking
       | tasks (best-of-two), compared to 10/98 for the same size qwen
       | 3.5. So it's at best very slightly improved and not at all in the
       | class of qwen 3.5 27b dense (26 solved) let alone opus (95/98
       | solved, for 4.6).
        
         | __natty__ wrote:
         | You're comparing a tiny model for local inference with a
         | proprietary, expensive frontier model. It would be fairer to
         | compare against a similarly priced model or tiny frontier
         | models like haiku, flash or gpt nano.
        
           | ericd wrote:
           | Eh it's important perspective, lest someone start thinking
           | they can drop $5k on a laptop and be free of
           | Anthropic/OpenAI. Expensive lesson.
        
           | javawizard wrote:
           | Not when the article they're commenting on was doing
           | literally exactly the same thing.
        
         | kristianp wrote:
         | This has similar problems to swe bench in that models are
         | likely trained on the same open source projects that the
         | benchmark uses.
         | 
         | https://blog.brokk.ai/introducing-the-brokk-power-ranking/
        
           | yorwba wrote:
           | If all models are trained on the benchmark data, you cannot
           | extrapolate the benchmark scores to performance on unseen
           | data, but the ranking of different models still tells you
           | something. A model that solves 95/98 benchmark problems may
           | turn out much worse than that in real life, but probably not
           | much worse than the one that only solved 11/98 despite
           | training on the benchmark problems.
           | 
           | This doesn't hold if some models trained on the benchmark and
           | some didn't, but you can fix this by deliberately fine-tuning
           | all models for the benchmark before comparing them. For more
           | in-depth discussion of this, see
           | https://mlbenchmarks.org/11-evaluating-language-models.html#...
        
       | 19qUq wrote:
       | How about switching to MechaStalin on a tricycle? It gets kind of
       | boring.
        
         | mvanbaak wrote:
         | boring ... the ways all the models fail at a simple task never
         | get boring to me
        
       | VHRanger wrote:
       | That's not surprising; Opus & Sonnet have been regressing on many
       | non-coding tasks since about the 4.1 release in our testing
        
       | aliljet wrote:
       | I'm really curious about what competes with Claude Code to drive
       | a local LLM like Qwen 3.6?
        
         | smashed wrote:
         | OpenCode?
        
         | chabes wrote:
         | OpenCode or Pi are popular agent harnesses. Lots of IDEs
         | integrate LLMs now. I believe there's also a Qwen Code that
         | exists, but I have yet to try it.
        
       | lofaszvanitt wrote:
       | That Qwen flamingo on the unicycle is actually quite good. A work
       | of art.
        
       | jedisct1 wrote:
       | I'm currently testing Qwen3.6-35B-A3B with https://swival.dev for
       | security reviews.
       | 
       | It's pretty good at finding bugs, but not so good at writing
       | patches to fix them.
        
       | throwuxiytayq wrote:
       | I literally cannot believe that people are wasting their time
       | doing this either as a benchmark _or_ for fun. After every single
       | language model release, no less.
        
         | sharkjacobs wrote:
         | It feels like the results stopped being interesting a little
         | while ago but the practice has become part of simonw's brand,
         | and it gives him something to post even when there is nothing
         | interesting to say about another incremental improvement to a
         | model, and so I don't imagine he'll stop.
        
           | stephbook wrote:
           | I, for one, expected progress. Uneven, sometimes delayed, but
           | ever increasing progress.
           | 
           | But that Opus pelican?
        
         | segmondy wrote:
         | I can't believe you're such a party pooper. It's exciting
         | times, the silly things do matter!
        
         | recursive wrote:
         | Fun is so un-productive. Everyone doing things for "fun" is
         | going to be sorry when they look back and realize they were
         | wasting time having a "good time" rather than optimizing their
         | KPIs.
        
         | cedws wrote:
         | It's not a waste of time. As the boundaries of AI are pushed we
         | increasingly struggle to define what intelligence actually is.
         | It becomes more useful to test what models cannot do instead of
         | what they can. Random tasks like the pelican test can show how
         | general the intelligence really is, putting aside the obvious
         | flaw that the labs can optimise for such a simple public
         | benchmark.
        
       | sailingcode wrote:
       | I'm an iguana and need to wash my bicycle in the carwash. Shall I
       | walk or take the bus?
        
         | DANmode wrote:
         | That's a long walk! You should reserve a ride with
         | $PartnerRideshareCo.
        
         | layer8 wrote:
         | You should have the pelican ride it to the carwash and wash it
         | for you.
        
       | wood_spirit wrote:
       | Such a disconnect from the minutes I've lost and given up on
       | Gemini trying to get it to update a diagram in a slide today. The
       | one shot joke stuff is great but trying to say "that is close but
       | just make this small change" seems impossible. It's the gap
       | between toy and tool.
        
       | JaggerFoo wrote:
       | FYI, using a 128GB M5 MacBook Pro, sourced from another article
       | by the author.
        
       | bottlepalm wrote:
       | I really wish they spent some time training for computer use.
       | This model is incapable of finding anywhere near the correct x,y
       | coordinate of a simple object in a picture.
        
       | simon_is_genius wrote:
       | Great analysis
        
       | justinbaker84 wrote:
       | I love this benchmark!
        
       | refulgentis wrote:
       | I liked both of Opus' better, it was very illuminating, in both
       | cases I didn't see the errors Simon saw and wondered why Simon
       | skipped over the errors I saw.
       | 
       | Pelican: saturated!
        
       | nba456_ wrote:
       | Good reminder that these tests have always been useless, even
       | before they started training on it.
        
       | f33d5173 wrote:
       | I don't know what such a demo would prove in the first place.
       | LLMs are good at things that they have been trained on, or are
       | analogues of things they have been trained on. SVG generation
       | isn't really an analogue to any task that we usually call on LLMs
       | to do. Early models were bad at it because their training only
       | had poor examples of it. At a certain point model companies
       | decided it would be good PR to be halfway decent at generating
       | SVGs, added a bunch of examples to the finetuning, and voila.
       | They still aren't good enough to be useful for anything, and such
       | improvements don't lead them to be good at anything else - likely
       | the opposite - but it makes for cute demos.
       | 
       | I guess initially it would have been a silly way to demonstrate
       | the effect of model size. But the size of the largest models
       | stopped increasing a while ago; recent improvements are driven
       | principally by optimizing for specific tasks. If you had some
       | secret task that you knew they weren't training for then you
       | could use that as a benchmark for how much the models are
       | improving versus overfitting for their training set, but this is
       | not that.
        
         | simonw wrote:
         | Comparing the SVGs I got for GPT-5.4, -mini and -nano at the
         | different thinking levels was surprisingly interesting:
         | https://simonwillison.net/2026/Mar/17/mini-and-nano/ (bottom of
         | post)
        
       | whywhywhywhy wrote:
       | How can this be a test when every lab is testing against it...
       | You spam this every model release but it's asinine.
        
         | simonw wrote:
         | If they're testing against it why do most of their attempts
         | suck so much?
        
       | yieldcrv wrote:
       | All those models that were just at version 1.x in 2024
       | 
       | That's so wild
        
       ___________________________________________________________________
       (page generated 2026-04-16 23:00 UTC)