SWE-bench Verified
SWE-bench Verified is a human-validated subset of real GitHub issues from popular Python repositories, testing end-to-end software engineering.
About this test
- What it measures
  - Real-world software engineering ability: understanding issues, navigating codebases, and writing patches.
- How it was administered
  - Models receive a GitHub issue and its repository, and must produce a git patch that resolves the issue and passes the repository's tests.
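The pass/fail criterion above can be sketched in a few lines. This is a hedged illustration, not the official harness: the function name `is_resolved` and the abstracted `test_results` mapping are assumptions, though the FAIL_TO_PASS / PASS_TO_PASS split (tests the fix must make pass, and tests that must keep passing) mirrors the benchmark's convention.

```python
# Hedged sketch of how a SWE-bench-style harness decides "resolved"
# after applying the model's patch and running the test suite.
# `test_results` maps test IDs to "PASSED"/"FAILED" outcomes.

def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """A patch resolves the issue only if the tests targeted by the fix
    now pass AND previously passing tests still pass (no regressions)."""
    return (
        all(test_results.get(t) == "PASSED" for t in fail_to_pass)
        and all(test_results.get(t) == "PASSED" for t in pass_to_pass)
    )

# Example: the targeted test passes and no regression is introduced.
results = {"test_bugfix": "PASSED", "test_existing": "PASSED", "test_unrelated": "FAILED"}
print(is_resolved(results, ["test_bugfix"], ["test_existing"]))  # True

# Example: the patch breaks a previously passing test.
results["test_existing"] = "FAILED"
print(is_resolved(results, ["test_bugfix"], ["test_existing"]))  # False
```

A submission scores as resolved only when both conditions hold; a patch that fixes the issue but breaks unrelated tests does not count.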
Model rankings
Models ranked by score on this benchmark. Higher is better.
| Rank | Model | Provider | Score | Percentile | Tags |
|---|---|---|---|---|---|
| 1 | | Anthropic | 49.0 | p92 | Autonomous, Multimodal, Proprietary |
| 2 | | Cognition | 41.5 | p88 | AI Agent, Autonomous, Code Assistant, Proprietary |
| 3 | | | | | |