ARC-Challenge
The AI2 Reasoning Challenge (Challenge set) contains 2,590 grade-school science questions that both a retrieval-based algorithm and a word co-occurrence algorithm answered incorrectly.
About this test
- What it measures: science reasoning and common knowledge beyond simple retrieval.
- How it was administered: multiple-choice with 4-5 options; 0-shot or 25-shot prompting; accuracy metric.
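The setup described above (multiple-choice questions, optional few-shot examples, accuracy as the metric) can be sketched roughly as follows. This is a minimal illustration, not the harness used to produce the scores on this page: the `Question` class, the prompt layout, and the helper names `build_prompt` and `accuracy` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical container for one multiple-choice item."""
    stem: str
    options: dict  # option label -> option text, e.g. {"A": "..."}
    answer: str    # label of the correct option


def format_question(q, include_answer=False):
    """Render one question; the layout is an assumed, not official, format."""
    lines = [f"Question: {q.stem}"]
    lines += [f"{label}. {text}" for label, text in q.options.items()]
    lines.append(f"Answer: {q.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target, shots=()):
    """0-shot when `shots` is empty; k-shot when it holds k solved examples."""
    blocks = [format_question(s, include_answer=True) for s in shots]
    blocks.append(format_question(target))
    return "\n\n".join(blocks)


def accuracy(predictions, questions):
    """Percentage of questions whose predicted label matches the answer key."""
    if not questions or len(predictions) != len(questions):
        raise ValueError("need one prediction per question")
    correct = sum(p == q.answer for p, q in zip(predictions, questions))
    return 100.0 * correct / len(questions)
```

With this sketch, a 25-shot run would pass 25 solved questions as `shots`, while a 0-shot run passes none. The reported score is then the accuracy over the evaluated questions.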
Model rankings
Models ranked by score on this benchmark. Higher is better.
| Rank | Model | Provider | Score | Percentile | Tags |
|---|---|---|---|---|---|
| 1 | | OpenAI | 96.9 | — | Text Generation, Small, Multimodal, Reasoning, Proprietary |
| 2 | | Anthropic | 96.2 | — | Code Assistant, Small, Text Generation, Multimodal, Reasoning, Proprietary |