HumanEval
HumanEval measures the functional correctness of code generation on 164 hand-written Python programming problems.
About this test
- What it measures
  - Code generation quality and correctness (pass@k metric).
- How it was administered
  - Models generate code completions; solutions are executed against unit tests; pass@1 and pass@100 are reported.
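The pass@k metric above is usually computed with the unbiased estimator introduced alongside HumanEval: given n generated samples per problem of which c pass the unit tests, it estimates the probability that at least one of k randomly drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated for a problem
    c: number of samples that pass the unit tests
    k: samples drawn per evaluation
    Returns P(at least one of k draws is correct) = 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k-draw must
        # include at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 4 correct
print(pass_at_k(10, 4, 1))  # equals c/n = 0.4
```

The benchmark's reported score is this estimate averaged over all 164 problems; pass@1 with a single greedy sample reduces to the plain fraction of problems solved.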
Model rankings
Models ranked by score on this benchmark. Higher is better.
| Rank | Model | Provider | Score | Percentile | Tags |
|---|---|---|---|---|---|
| 1 | | Alibaba | 92.7 | p99 | Code Assistant, Open Weight, Medium |
| 2 | | Mistral AI | 92.5 | p99 | Reasoning, Small, Code Assistant, Proprietary |