AI Model Benchmarks
Frontier models ranked by real-world performance. Scores sourced from official technical reports, Chatbot Arena, and peer-reviewed papers — not marketing copy.
17 models tracked
Updated Mar 20, 2026
- Highest Arena Elo: 1,443 (Gemini 2.5 Pro)
- Best GPQA: 87.7% (o3)
- Best SWE-bench: 71.7% (o3)
- Open-source models: 6 of 17
| Model | Provider | Type | Context | Input $/1M | Output $/1M | Arena Elo | MMLU | GPQA | MATH | HumanEval | SWE-bench | MGSM | MMLU-Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro (Mar 2025) | Google | Prop. | 1M | $1.25 | $10 | 1443 | 89.8% | 84% | 91.6% | 90% | 63.2% | 93.3% | — |
| o3 (Apr 2025) | OpenAI | Prop. | 200K | $10 | $40 | 1420 | 91.6% | 87.7% | 97.2% | 97.9% | 71.7% | — | — |
| o4-mini (Apr 2025) | OpenAI | Prop. | 200K | $1.10 | $4.40 | 1380 | 89% | 81.4% | 99.5% | 97.9% | 68.1% | — | — |
| Claude 3.7 Sonnet (Feb 2025) | Anthropic | Prop. | 200K | $3 | $15 | 1366 | 90.6% | 68% | 78.2% | 97.5% | 62.3% | 91.8% | — |
| Gemini 2.0 Flash (Jan 2025) | Google | Prop. | 1M | $0.10 | $0.40 | 1354 | 85.2% | 62.1% | 89.7% | 87% | 51.2% | 89.7% | — |
| o1 (Sep 2024) | OpenAI | Prop. | 128K | $15 | $60 | 1352 | 92.3% | 78% | 94.8% | 92.4% | 48.9% | 90.8% | — |
| DeepSeek-R1 (Jan 2025) | DeepSeek | ✓ Open | 128K | $0.55 | $2.19 | 1340 | 90.8% | 71.5% | 97.3% | 92.6% | 49.2% | 98.3% | — |
| DeepSeek-V3 (Dec 2024) | DeepSeek | ✓ Open | 128K | $0.27 | $1.10 | 1322 | 88.5% | 59.1% | 90.2% | 89.9% | 42% | 93.5% | — |
| Claude 3.5 Sonnet (Oct 2024) | Anthropic | Prop. | 200K | $3 | $15 | 1300 | 88.7% | 59.4% | 71.1% | 93.7% | 49% | 91.6% | — |
| GPT-4o (May 2024) | OpenAI | Prop. | 128K | $5 | $15 | 1287 | 88.7% | 53.6% | 76.6% | 90.2% | 49% | 90.5% | — |
| Qwen 2.5 72B (Sep 2024) | Alibaba | ✓ Open | 128K | — | — | 1268 | 85.4% | 49% | 83.1% | 86.6% | — | 92.9% | — |
| Llama 3.1 405B (Jul 2024) | Meta | ✓ Open | 128K | — | — | 1266 | 88.6% | 51.1% | 73.8% | 89% | — | 89% | — |
| Llama 3.3 70B (Dec 2024) | Meta | ✓ Open | 128K | — | — | 1257 | 86% | 50.5% | 77% | 88.4% | 33.4% | 91.1% | — |
| Gemini 1.5 Pro (Sep 2024) | Google | Prop. | 2M | $1.25 | $5 | 1256 | 85.9% | 46.2% | 67.7% | 84.1% | — | 89.2% | — |
| Mistral Large 2 (Jul 2024) | Mistral | Prop. | 128K | $2 | $6 | 1216 | 84% | 55.4% | 68% | 92.1% | — | 83.5% | — |
| Claude 3.5 Haiku (Nov 2024) | Anthropic | Prop. | 200K | $0.80 | $4 | 1180 | 85.5% | 41.6% | 69.2% | 88.1% | 40.6% | 87.7% | — |
| Mixtral 8x22B (Apr 2024) | Mistral | ✓ Open | 64K | — | — | 1124 | 77.8% | 46.7% | 41.8% | 75.1% | — | 78.6% | — |
Score colors: ■ Top tier · ■ Strong · ■ Average · ■ Below average · ■ Weak · — Not reported
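The per-1M-token list prices in the table convert to per-request costs with simple arithmetic. A minimal sketch (the function name and token counts are illustrative, not part of any API):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request, given list prices per 1M tokens."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

# Example with o3's list prices from the table: $10 in / $40 out per 1M tokens.
# 5,000 input tokens cost $0.05; 1,000 output tokens cost $0.04.
cost = request_cost(5_000, 1_000, 10.0, 40.0)
print(f"${cost:.2f}")  # prints "$0.09"
```

Note that reasoning models such as o3 bill hidden chain-of-thought tokens as output, so actual output counts can far exceed the visible response.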
Methodology: Benchmark scores are taken from official model cards and technical papers. Arena ELO reflects the live Chatbot Arena leaderboard. Scores may lag official releases by a few days. Open-source models use the instruct/chat variant unless otherwise noted. Cost figures are list prices as of March 2025 — actual costs may vary with caching/batch APIs.
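Arena Elo ratings are derived from pairwise human preference votes between anonymized models. The live leaderboard now fits a Bradley–Terry model over all votes, but the classic online Elo update it descends from is a useful mental model (the K-factor and ratings below are illustrative assumptions):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a head-to-head vote.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    Returns the new (rating_a, rating_b) pair; the update is zero-sum.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # A's predicted win rate
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins, so each rating moves by K/2 = 16.
a, b = elo_update(1000.0, 1000.0, 1.0)
print(a, b)  # prints "1016.0 984.0"
```

The 400-point scale means a 100-point gap predicts roughly a 64% win rate for the higher-rated model, which is why the spread from 1124 to 1443 in the table is a large practical difference.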