SWE-bench February 2025 leaderboard update (https://www.swebench.com/)
SWE-bench is one of the benchmarks that the labs love to list in their model releases. The official leaderboard is infrequently updated, but they just did a full run against the current generation of models - notable because it's always good to see benchmark results like this that weren't self-reported by the labs.
The fresh results are for their “Bash Only” benchmark, which runs their mini-swe-agent (https://github.com/SWE-agent/mini-swe-agent) (~9,000 lines of Python, here are the prompts (https://github.com/SWE-agent/mini-swe-agent/blob/v2.2.1/src/minisweagent/config/benchmarks/swebench.yaml) they use) against the SWE-bench (https://huggingface.co/datasets/princeton-nlp/SWE-bench) dataset of coding problems - 2,294 real-world examples pulled from 12 open source repos: django/django (https://github.com/django/django) (850), sympy/sympy (https://github.com/sympy/sympy) (386), scikit-learn/scikit-learn (https://github.com/scikit-learn/scikit-learn) (229), sphinx-doc/sphinx (https://github.com/sphinx-doc/sphinx) (187), matplotlib/matplotlib (https://github.com/matplotlib/matplotlib) (184), pytest-dev/pytest (https://github.com/pytest-dev/pytest) (119), pydata/xarray (https://github.com/pydata/xarray) (110), astropy/astropy (https://github.com/astropy/astropy) (95), pylint-dev/pylint (https://github.com/pylint-dev/pylint) (57), psf/requests (https://github.com/psf/requests) (44), mwaskom/seaborn (https://github.com/mwaskom/seaborn) (22), pallets/flask (https://github.com/pallets/flask) (11).
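Those per-repository counts are easy to check yourself. Here's a quick Python sketch that loads the dataset from Hugging Face and tallies instances by repository - I'm assuming the relevant split is called test and the repository name lives in a repo field, so check the dataset card if the schema differs:

```python
# Count SWE-bench instances per repository.
# Assumes `pip install datasets`; the split name and "repo" field
# are my assumptions based on the dataset card.
from collections import Counter

from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench", split="test")
counts = Counter(row["repo"] for row in swebench)

print(f"Total instances: {len(swebench)}")  # expect 2,294
for repo, count in counts.most_common():
    print(f"{repo}: {count}")
```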
Here’s how the top ten models performed:
It’s interesting to see Claude Opus 4.5 beat Opus 4.6, though only by about a percentage point. Opus 4.5 is top, then Gemini 3 Flash, then MiniMax M2.5 - a 229B model released last week (https://www.minimax.io/news/minimax-m25) by Chinese lab MiniMax. GLM-5, Kimi K2.5 and DeepSeek V3.2 are three more Chinese models that make the top ten.
OpenAI’s GPT-5.2 is their highest-performing model, in sixth place, but it’s worth noting that their best coding model, GPT-5.3-Codex, is not represented - maybe because it’s not yet available in the OpenAI API.
This benchmark uses the same system prompt for every model, which is important for a fair comparison - but it does mean the quality of each lab’s own harness or optimized prompts isn’t being measured here.
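For a sense of how minimal that shared harness is: the loop essentially hands the model the fixed system prompt plus the issue, executes whatever single bash command the model replies with, and feeds the output back until the model says it's done. Here's a rough, hypothetical Python sketch of that pattern - call_model() is a placeholder for whichever LLM API you use, and none of this is the actual mini-swe-agent code:

```python
import subprocess

# Stand-in for the real shared system prompt used by the benchmark
SYSTEM_PROMPT = (
    "You are a software engineer. Reply with exactly one bash command "
    "per turn. Reply DONE when you are finished."
)


def call_model(messages: list[dict]) -> str:
    """Placeholder: send the conversation to your chosen LLM API and return its reply."""
    raise NotImplementedError


def solve(issue_text: str, max_turns: int = 50) -> None:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": issue_text},
    ]
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.strip() == "DONE":  # hypothetical finish marker
            break
        # Run the proposed bash command and feed its output back to the model
        result = subprocess.run(
            reply, shell=True, capture_output=True, text=True, timeout=120
        )
        messages.append({"role": "user", "content": result.stdout + result.stderr})
```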
The chart above is a screenshot from the SWE-bench website, but their charts don’t display the actual percentage values on the bars. I successfully used Claude for Chrome to add these - transcript here (https://claude.ai/share/81a0c519-c727-4caa-b0d4-0d866375d0da). My prompt sequence included:
Use claude in chrome to open https://www.swebench.com/
Click on “Compare results” and then select “Select top 10”
See those bar charts? I want them to display the percentage on each bar so I can take a better screenshot, modify the page like that
I’m impressed at how well this worked - Claude injected custom JavaScript into the page to draw additional labels on top of the existing chart.
Via @KLieret (https://twitter.com/KLieret/status/2024176335782826336)
Tags: benchmarks (https://simonwillison.net/tags/benchmarks), django (https://simonwillison.net/tags/django), ai (https://simonwillison.net/tags/ai), openai (https://simonwillison.net/tags/openai), generative-ai (https://simonwillison.net/tags/generative-ai), llms (https://simonwillison.net/tags/llms), anthropic (https://simonwillison.net/tags/anthropic), claude (https://simonwillison.net/tags/claude), coding-agents (https://simonwillison.net/tags/coding-agents), ai-in-china (https://simonwillison.net/tags/ai-in-china), minimax (https://simonwillison.net/tags/minimax)