[HN Gopher] Qwen3.6-35B-A3B on my laptop drew me a better pelica...
___________________________________________________________________
Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude
Opus 4.7
Author : simonw
Score : 237 points
Date : 2026-04-16 17:37 UTC (5 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| ericpauley wrote:
| Going to have to disagree on the backup test. Opus flamingo is
| actually on the pedals and seat with functional spokes and beak.
| In terms of adherence to physical reality Qwen is completely off.
| To me it's a little puzzling that someone would prefer the Qwen
| output.
|
| I'd say the example actually does (vaguely) suggest that Qwen
| might be overfitting to the Pelican.
| wongarsu wrote:
| Qwen's flamingo is artistically far more interesting. It's a
| one-eyed flamingo with sunglasses and a bow tie who smokes pot.
| Meanwhile Opus just made a boring, somewhat dorky flamingo.
| Even the ground and sky are more interesting in Qwen's version
|
| But in terms of making something physically plausible, Opus
| certainly got a lot closer
| kmacdough wrote:
| Given adherence is a more significant practical barrier, it's
| probably the better signal. That is, if we decide to look
| for signal here.
| doobiedowner wrote:
| Is "signal" the new word you weirdos grossly abuse?... The
| new "optics"?
| tecoholic wrote:
| Even the first one - Qwen added extra details in the background
| sure. But the Pelican itself is a stork with a bent beak and
| its feet are cut off its legs. While impressive for a local
| model, I don't think it's a winner.
| mejutoco wrote:
| Did you see Opus's bike for that same test, though? I know it
| is about the flamingo, but that bike is bad.
| irthomasthomas wrote:
| It's a 3B model. It should not be this close. Debating their
| artistic qualities is missing the point.
| comandillos wrote:
| I've been using Qwen3.5-35B-A3B for a bit via open code and oMLX
| on an M5 Max with 128GB of RAM and I have to say it's impressively
| good for a model of that size. I've seen a huge jump in the
| quality of the tool calls and how well it handles the agentic
| workflow.
| iib wrote:
| This is about the newly released Qwen3.6. Just wanted to make
| sure you got that right.
| mentalgear wrote:
| I understand the 'fun factor' but at this point I really wonder
| what this pelican still proves? I mean, providers certainly
| could have adapted for it if they wanted, and if you want to test
| how well a model adapts to potential out of distribution
| contexts, it might be more worthwhile to mix different animals
| with different activity types (a whale on a skateboard) than
| always the same.
| simonw wrote:
| That's why I did the flamingo on a unicycle.
|
| For a delightful moment this morning I thought I might have
| finally caught a model provider cheating by training for the
| pelican, but the flamingo convinced me that wasn't the case.
| prodigycorp wrote:
| To me the opus flamingo is waaaay better than the qwen one.
| qwen has the better pelican, though.
| dude250711 wrote:
| Is a flamingo on a unicycle not merely a special case of a
| pelican on a bicycle?
| furyofantares wrote:
| It is completely wild to me that you prefer Qwen's flamingo.
| I think it's really bad and Opus' is pretty good.
| simonw wrote:
| The Opus one doesn't even have a bowtie.
| furyofantares wrote:
| The Opus one looks like a flamingo, and looks like it's
| riding the unicycle. Sitting on the seat. Feet on the
| pedals.
|
| The Qwen one looks like a 3-tailed, broken-winged,
| beakless (I guess? Is that offset white thing a beak? Or
| is it chewing on a pelican feather like it's a piece of
| straw?) monstrosity not sitting on the seat, with its one
| foot off the pedal (the other chopped off at the knee) of
| a malmanufactured wheel that has bonus spokes that are
| longer than the wheel.
|
| But yeah, it does have a bowtie and sunglasses that you
| didn't ask for! Plus it says "<3 Flamingo on a Unicycle
| <3", which perhaps resolves all ambiguity.
| bigyabai wrote:
| Let's not oversell Opus' output. The Qwen flamingo is
| flawed but could be easily fixed with 1-2 prompts if
| you're really upset with it. The Opus SVG is not any
| better than something that I could make in Inkscape with
| 3 minutes and sufficient motivation. Calling Opus'
| flamingo "programmer art" would be an insult to
| programmers.
| monksy wrote:
| Game over opus
| akavel wrote:
| r/LocalLlama is now doing a horse in a racing car:
|
| https://redd.it/1slz38i
| stephbook wrote:
| They're certainly aware of the test, but a turtle doing a
| kickflip on a skateboard? I seriously doubt they train their
| models for that.
|
| https://x.com/JeffDean/status/2024525132266688757
|
| If anything, the disastrous Opus4.7 pelican shows us they don't
| pelicanmaxx
| bitwize wrote:
| I think I found the leaked Claude Mythos version of the
| turtle benchmark: https://www.youtube.com/watch?v=l82XWTKLZuk
| BoorishBears wrote:
| This is a gag that's long outlived its humor, but we're in a
| space so driven by hype there are people who will unironically
| take some signal from it. They'll swear up and down they know
| it's for fun, but let a great pelican come out and see if they
| don't wave it as proof the model is great alongside their
| carwash test.
| jbellis wrote:
| For coding, qwen 3.6 35b a3b solved 11/98 of the Power Ranking
| tasks (best-of-two), compared to 10/98 for the same size qwen
| 3.5. So it's at best very slightly improved and not at all in the
| class of qwen 3.5 27b dense (26 solved) let alone opus (95/98
| solved, for 4.6).
| __natty__ wrote:
| You're comparing a tiny model for local inference against a
| proprietary, expensive frontier model. It would be fairer to
| compare against a similarly priced model or tiny frontier
| models like Haiku, Flash or GPT nano.
| ericd wrote:
| Eh it's important perspective, lest someone start thinking
| they can drop $5k on a laptop and be free of
| Anthropic/OpenAI. Expensive lesson.
| javawizard wrote:
| Not when the article they're commenting on was doing
| literally exactly the same thing.
| kristianp wrote:
| This has similar problems to SWE-bench in that models are
| likely trained on the same open source projects that the
| benchmark uses.
|
| https://blog.brokk.ai/introducing-the-brokk-power-ranking/
| yorwba wrote:
| If all models are trained on the benchmark data, you cannot
| extrapolate the benchmark scores to performance on unseen
| data, but the ranking of different models still tells you
| something. A model that solves 95/98 benchmark problems may
| turn out much worse than that in real life, but probably not
| much worse than the one that only solved 11/98 despite
| training on the benchmark problems.
|
| This doesn't hold if some models trained on the benchmark and
| some didn't, but you can fix this by deliberately fine-tuning
| all models for the benchmark before comparing them. For more
| in-depth discussion of this, see
| https://mlbenchmarks.org/11-evaluating-language-models.html#...
| 19qUq wrote:
| How about switching to MechaStalin on a tricycle? It gets kind of
| boring.
| mvanbaak wrote:
| boring ... the ways all the models fail at a simple task never
| get boring to me
| VHRanger wrote:
| That's not surprising; Opus & Sonnet have been regressing on many
| non-coding tasks since about the 4.1 release in our testing
| aliljet wrote:
| I'm really curious about what competes with Claude Code to drive
| a local LLM like Qwen 3.6?
| smashed wrote:
| OpenCode?
| chabes wrote:
| OpenCode or Pi are popular agent harnesses. Lots of IDEs
| integrate LLMs now. I believe there's also a Qwen Code that
| exists, but I have yet to try it.
| lofaszvanitt wrote:
| That Qwen flamingo on the unicycle is actually quite good. A work
| of art.
| jedisct1 wrote:
| I'm currently testing Qwen3.6-35B-A3B with https://swival.dev for
| security reviews.
|
| It's pretty good at finding bugs, but not so good at writing
| patches to fix them.
| throwuxiytayq wrote:
| I literally cannot believe that people are wasting their time
| doing this either as a benchmark _or_ for fun. After every single
| language model release, no less.
| sharkjacobs wrote:
| It feels like the results stopped being interesting a little
| while ago but the practice has become part of simonw's brand,
| and it gives him something to post even when there is nothing
| interesting to say about another incremental improvement to a
| model, and so I don't imagine he'll stop.
| stephbook wrote:
| I, for one, expected progress. Uneven, sometimes delayed, but
| ever increasing progress.
|
| But that Opus pelican?
| segmondy wrote:
| I can't believe you're such a party pooper. It's exciting
| times, the silly things do matter!
| recursive wrote:
| Fun is so un-productive. Everyone doing things for "fun" is
| going to be sorry when they look back and realize they were
| wasting time having a "good time" rather than optimizing their
| KPIs.
| cedws wrote:
| It's not a waste of time. As the boundaries of AI are pushed we
| increasingly struggle to define what intelligence actually is.
| It becomes more useful to test what models cannot do instead of
| what they can. Random tasks like the pelican test can show how
| general the intelligence really is, putting aside the obvious
| flaw that the labs can optimise for such a simple public
| benchmark.
| sailingcode wrote:
| I'm an iguana and need to wash my bicycle in the carwash. Shall I
| walk or take the bus?
| DANmode wrote:
| That's a long walk! You should reserve a ride with
| $PartnerRideshareCo.
| layer8 wrote:
| You should have the pelican ride it to the carwash and wash it
| for you.
| wood_spirit wrote:
| Such a disconnect from the minutes I lost today trying to get
| Gemini to update a diagram in a slide before giving up. The
| one shot joke stuff is great but trying to say "that is close but
| just make this small change" seems impossible. It's the gap
| between toy and tool.
| JaggerFoo wrote:
| FYI, using a 128GB M5 MacBook Pro, sourced from another article
| by the author.
| bottlepalm wrote:
| I really wish they spent some time training for computer use.
| This model is incapable of finding anywhere near the correct x,y
| coordinate of a simple object in a picture.
| simon_is_genius wrote:
| Great analysis
| justinbaker84 wrote:
| I love this benchmark!
| refulgentis wrote:
| I liked both of Opus' better. It was very illuminating: in both
| cases I didn't see the errors Simon saw, and wondered why Simon
| skipped over the errors I saw.
|
| Pelican: saturated!
| nba456_ wrote:
| Good reminder that these tests have always been useless, even
| before they started training on them.
| f33d5173 wrote:
| I don't know what such a demo would prove in the first place.
| LLMs are good at things that they have been trained on, or are
| analogues of things they have been trained on. SVG generation
| isn't really an analogue to any task that we usually call on LLMs
| to do. Early models were bad at it because their training only
| had poor examples of it. At a certain point model companies
| decided it would be good PR to be halfway decent at generating
| SVGs, added a bunch of examples to the finetuning, and voila.
| They still aren't good enough to be useful for anything, and such
| improvements don't lead them to be good at anything else - likely
| the opposite - but it makes for cute demos.
|
| I guess initially it would have been a silly way to demonstrate
| the effect of model size. But the size of the largest models
| stopped increasing a while ago; recent improvements are driven
| principally by optimizing for specific tasks. If you had some
| secret task that you knew they weren't training for then you
| could use that as a benchmark for how much the models are
| improving versus overfitting for their training set, but this is
| not that.
| simonw wrote:
| Comparing the SVGs I got for GPT-5.4, -mini and -nano at the
| different thinking levels was surprisingly interesting:
| https://simonwillison.net/2026/Mar/17/mini-and-nano/ (bottom of
| post)
| whywhywhywhy wrote:
| How can this be a test when every lab is testing against it...
| You spam this every model release but it's asinine.
| simonw wrote:
| If they're testing against it why do most of their attempts
| suck so much?
| yieldcrv wrote:
| All those models that were just at version 1.x in 2024
|
| That's so wild
___________________________________________________________________
(page generated 2026-04-16 23:00 UTC)