HumanEval
HumanEval measures the functional correctness of code generation on 164 hand-written Python programming problems.
About this test
- What it measures
  - Code generation quality and correctness (pass@k metric).
- How it was administered
  - Models generate code completions; solutions are executed against unit tests; pass@1 and pass@100 are reported.
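The pass@k metric above is usually computed with the unbiased estimator introduced alongside HumanEval: given n generated samples per problem of which c pass the unit tests, it estimates the probability that at least one of k randomly drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated for a problem
    c: number of samples that pass the unit tests
    k: samples drawn per evaluation
    Returns P(at least one of k draws is correct) = 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k-draw must
        # include at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 4 correct
print(pass_at_k(10, 4, 1))  # equals c/n = 0.4
```

The benchmark's reported score is this estimate averaged over all 164 problems; pass@1 with a single greedy sample reduces to the plain fraction of problems solved.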
Model rankings
Models ranked by score on this benchmark. Higher is better.
| Rank | Model | Provider | Score | Percentile | Tags |
|---|---|---|---|---|---|
| 1 | | Alibaba | 92.7 | p99 | Code Assistant, Open Weight, Medium |
| 2 | | Mistral AI | 92.5 | p99 | Reasoning, Small, Code Assistant, Proprietary |