SWE-bench February 2025 leaderboard update (https://www.swebench.com/)
SWE-bench is one of the benchmarks that the labs love to list in their model releases. The official leaderboard is infrequently updated, but they just did a full run against the current generation of models - notable because it's always good to see benchmark results like this that weren't self-reported by the labs.
The fresh results are for their “Bash Only” benchmark, which runs their mini-swe-agent (https://github.com/SWE-agent/mini-swe-agent) (~9,000 lines of Python, here are the prompts (https://github.com/SWE-agent/mini-swe-agent/blob/v2.2.1/src/minisweagent/config/benchmarks/swebench.yaml) they use) against the SWE-bench (https://huggingface.co/datasets/princeton-nlp/SWE-bench) dataset of coding problems - 2,294 real-world examples pulled from 12 open source repos: django/django (https://github.com/django/django) (850), sympy/sympy (https://github.com/sympy/sympy) (386), scikit-learn/scikit-learn (https://github.com/scikit-learn/scikit-learn) (229), sphinx-doc/sphinx (https://github.com/sphinx-doc/sphinx) (187), matplotlib/matplotlib (https://github.com/matplotlib/matplotlib) (184), pytest-dev/pytest (https://github.com/pytest-dev/pytest) (119), pydata/xarray (https://github.com/pydata/xarray) (110), astropy/astropy (https://github.com/astropy/astropy) (95), pylint-dev/pylint (https://github.com/pylint-dev/pylint) (57), psf/requests (https://github.com/psf/requests) (44), mwaskom/seaborn (https://github.com/mwaskom/seaborn) (22), pallets/flask (https://github.com/pallets/flask) (11).
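Those per-repository counts are easy to check yourself. Here's a quick Python sketch that loads the dataset from Hugging Face and tallies instances by repository - I'm assuming the relevant split is called test and the repository name lives in a repo field, so check the dataset card if the schema differs:

```python
# Count SWE-bench instances per repository.
# Assumes `pip install datasets`; the split name and "repo" field
# are my assumptions based on the dataset card.
from collections import Counter

from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench", split="test")
counts = Counter(row["repo"] for row in swebench)

print(f"Total instances: {len(swebench)}")  # expect 2,294
for repo, count in counts.most_common():
    print(f"{repo}: {count}")
```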
Here’s how the top ten models performed:
It’s interesting to see Claude Opus 4.5 beat Opus 4.6, though only by about a percentage point. Opus 4.5 is top, then Gemini 3 Flash, then MiniMax M2.5 - a 229B model released last week (https://www.minimax.io/news/minimax-m25) by Chinese lab MiniMax. GLM-5, Kimi K2.5 and DeepSeek V3.2 are three more Chinese models that make the top ten.
OpenAI’s GPT-5.2 is their highest-performing model, in sixth place, but it’s worth noting that their best coding model, GPT-5.3-Codex, is not represented - maybe because it’s not yet available in the OpenAI API.
This benchmark uses the same system prompt for every model, which is important for a fair comparison - but it does mean the quality of each lab’s own harness or optimized prompts isn’t being measured here.
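For a sense of how minimal that shared harness is: the loop essentially hands the model the fixed system prompt plus the issue, executes whatever single bash command the model replies with, and feeds the output back until the model says it's done. Here's a rough, hypothetical Python sketch of that pattern - call_model() is a placeholder for whichever LLM API you use, and none of this is the actual mini-swe-agent code:

```python
import subprocess

# Stand-in for the real shared system prompt used by the benchmark
SYSTEM_PROMPT = (
    "You are a software engineer. Reply with exactly one bash command "
    "per turn. Reply DONE when you are finished."
)


def call_model(messages: list[dict]) -> str:
    """Placeholder: send the conversation to your chosen LLM API and return its reply."""
    raise NotImplementedError


def solve(issue_text: str, max_turns: int = 50) -> None:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": issue_text},
    ]
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.strip() == "DONE":  # hypothetical finish marker
            break
        # Run the proposed bash command and feed its output back to the model
        result = subprocess.run(
            reply, shell=True, capture_output=True, text=True, timeout=120
        )
        messages.append({"role": "user", "content": result.stdout + result.stderr})
```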
The chart above is a screenshot from the SWE-bench website, but their charts don’t display the actual percentage values on the bars. I successfully used Claude for Chrome to add these - transcript here (https://claude.ai/share/81a0c519-c727-4caa-b0d4-0d866375d0da). My prompt sequence included:
Use claude in chrome to open https://www.swebench.com/
Click on “Compare results” and then select “Select top 10”
See those bar charts? I want them to display the percentage on each bar so I can take a better screenshot, modify the page like that
I’m impressed at how well this worked - Claude injected custom JavaScript into the page to draw additional labels on top of the existing chart.
Via @KLieret (https://twitter.com/KLieret/status/2024176335782826336)
Tags: benchmarks (https://simonwillison.net/tags/benchmarks), django (https://simonwillison.net/tags/django), ai (https://simonwillison.net/tags/ai), openai (https://simonwillison.net/tags/openai), generative-ai (https://simonwillison.net/tags/generative-ai), llms (https://simonwillison.net/tags/llms), anthropic (https://simonwillison.net/tags/anthropic), claude (https://simonwillison.net/tags/claude), coding-agents (https://simonwillison.net/tags/coding-agents), ai-in-china (https://simonwillison.net/tags/ai-in-china), minimax (https://simonwillison.net/tags/minimax)