MMLU
Massive Multitask Language Understanding (MMLU) evaluates broad knowledge across 57 subjects (STEM, humanities, social sciences, and more) with multiple-choice questions.
About this test
- What it measures: Broad multitask knowledge and reasoning across many domains.
- How it was administered: Multiple-choice; 4 options per question; 5-shot in-context examples; 15,908 questions.
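The 5-shot, 4-option administration described above can be sketched as follows. This is an illustrative prompt builder, not the official evaluation harness; the question data and the exact "Answer:" formatting are assumptions.

```python
# Sketch of a 5-shot multiple-choice prompt, in the style described above.
# The formatting conventions here are illustrative assumptions.

CHOICE_LETTERS = ["A", "B", "C", "D"]  # 4 options per question


def format_question(question, choices, answer=None):
    """Render one question with its 4 lettered options.

    For in-context examples (shots), the correct letter is appended;
    for the target question, the answer line is left open for the model.
    """
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(CHOICE_LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)


def build_prompt(shots, target_question, target_choices):
    """Concatenate the solved shots and the unsolved target question."""
    parts = [format_question(q, c, a) for q, c, a in shots]
    parts.append(format_question(target_question, target_choices))
    return "\n\n".join(parts)


# Example with dummy data: 5 solved shots, then the question to answer.
shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")] * 5
prompt = build_prompt(shots, "What is 3 + 3?", ["5", "6", "7", "8"])
```

The model's answer is then read off as the letter it produces after the final `Answer:` line.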
Model rankings
Models ranked by score on this benchmark. Higher is better.
| Rank | Model | Provider | Score | Percentile | Tags |
|---|---|---|---|---|---|
| 1 | | OpenAI | 91.8 | p99 | Text Generation, Reasoning, Proprietary |
| 2 | | DeepSeek | 90.8 | p99 | Text Generation, Reasoning, Open Weight, Large |