Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
For anyone who has been taking my pelican riding a bicycle benchmark (https://simonwillison.net/tags/pelican-riding-a-bicycle/) seriously as a robust way to test models, here are pelicans from this morning’s two big model releases - Qwen3.6-35B-A3B from Alibaba (https://qwen.ai/blog?id=qwen3.6-35b-a3b) and Claude Opus 4.7 from Anthropic (https://www.anthropic.com/news/claude-opus-4-7).
Here’s the Qwen 3.6 pelican, generated using this 20.9GB Qwen3.6-35B-A3B-UD-Q4_K_S.gguf (https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf) quantized model by Unsloth, running on my MacBook Pro M5 via LM Studio (https://lmstudio.ai/) (and the llm-lmstudio (https://github.com/agustif/llm-lmstudio) plugin) - transcript here (https://gist.github.com/simonw/4389d355d8e162bc6e4547da214f7dd2):
And here’s one I got from Anthropic’s brand new Claude Opus 4.7 (https://www.anthropic.com/news/claude-opus-4-7) (transcript (https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c118)):
I’m giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!
I tried Opus a second time passing thinking_level: max. It didn’t do much better (transcript (https://gist.github.com/simonw/7566e04a81accfb9affda83451c0f363)):
I don’t think Qwen are cheating A lot of people are convinced that the labs train for my stupid benchmark (https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/). I don’t think they do, but honestly this result did give me a little glint of suspicion. So I’m burning one of my secret backup tests - here’s what I got from Qwen3.6-35B-A3B and Opus 4.7 for “Generate an SVG of a flamingo riding a unicycle”:
Qwen3.6-35B-A3B
(transcript (https://gist.github.com/simonw/f1d1ff01c34dda5fdedf684cfc430d92))
Opus 4.7
(transcript (https://gist.github.com/simonw/35121ad5dcf23bf860397a103ae88d50))
I’m giving this one to Qwen too, partly for the excellent SVG comment.
What can we learn from this? The pelican benchmark has always been meant as a joke - it’s mainly a statement on how obtuse and absurd the task of comparing these models is.
The weird thing about that joke is that, for the most part, there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models. Those first pelicans from October 2024 (https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/) were junk. The more recent entries (https://simonwillison.net/tags/pelican-riding-a-bicycle/) have generally been much, much better - to the point that Gemini 3.1 Pro produces illustrations you could actually use somewhere (https://simonwillison.net/2026/Feb/19/gemini-31-pro/), provided you had a pressing need to illustrate a pelican riding a bicycle.
Today, even that loose connection to utility has been broken. I have enormous respect for Qwen, but I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic’s latest proprietary release.
If the thing you need is an SVG illustration of a pelican riding a bicycle though, right now Qwen3.6-35B-A3B running on a laptop is a better bet than Opus 4.7!
Tags: ai (https://simonwillison.net/tags/ai), generative-ai (https://simonwillison.net/tags/generative-ai), local-llms (https://simonwillison.net/tags/local-llms), llms (https://simonwillison.net/tags/llms), anthropic (https://simonwillison.net/tags/anthropic), claude (https://simonwillison.net/tags/claude), qwen (https://simonwillison.net/tags/qwen), pelican-riding-a-bicycle (https://simonwillison.net/tags/pelican-riding-a-bicycle), lm-studio (https://simonwillison.net/tags/lm-studio)
Write a comment