AI Model Benchmarks
Frontier models ranked by real-world performance. Scores sourced from official technical reports, Chatbot Arena, and peer-reviewed papers — not marketing copy.
17 models tracked
Updated Mar 20, 2026
- Highest Arena Elo: 1,443 (Gemini 2.5 Pro)
- Best GPQA: 87.7% (o3)
- Best SWE-bench: 71.7% (o3)
- Open-source models: 6 of 17
| Model | Provider | Type | Context | Input $/1M | Output $/1M | Arena Elo | MMLU | GPQA | MATH | HumanEval | SWE-bench | MGSM | MMLU-Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro (Mar 2025) | Google | Prop. | 1M | $1.25 | $10 | 1443 | 89.8% | 84% | 91.6% | 90% | 63.2% | 93.3% | — |
| o3 (Apr 2025) | OpenAI | Prop. | 200K | $10 | $40 | 1420 | 91.6% | 87.7% | 97.2% | 97.9% | 71.7% | — | — |
| o4-mini (Apr 2025) | OpenAI | Prop. | 200K | $1.10 | $4.40 | 1380 | 89% | 81.4% | 99.5% | 97.9% | 68.1% | — | — |
| Claude 3.7 Sonnet (Feb 2025) | Anthropic | Prop. | 200K | $3 | $15 | 1366 | 90.6% | 68% | 78.2% | 97.5% | 62.3% | 91.8% | — |
| Gemini 2.0 Flash (Jan 2025) | Google | Prop. | 1M | $0.10 | $0.40 | 1354 | 85.2% | 62.1% | 89.7% | 87% | 51.2% | 89.7% | — |
| o1 (Sep 2024) | OpenAI | Prop. | 128K | $15 | $60 | 1352 | 92.3% | 78% | 94.8% | 92.4% | 48.9% | 90.8% | — |
| DeepSeek-R1 (Jan 2025) | DeepSeek | ✓ Open | 128K | $0.55 | $2.19 | 1340 | 90.8% | 71.5% | 97.3% | 92.6% | 49.2% | 98.3% | — |
| DeepSeek-V3 (Dec 2024) | DeepSeek | ✓ Open | 128K | $0.27 | $1.10 | 1322 | 88.5% | 59.1% | 90.2% | 89.9% | 42% | 93.5% | — |
| Claude 3.5 Sonnet (Oct 2024) | Anthropic | Prop. | 200K | $3 | $15 | 1300 | 88.7% | 59.4% | 71.1% | 93.7% | 49% | 91.6% | — |
| GPT-4o (May 2024) | OpenAI | Prop. | 128K | $5 | $15 | 1287 | 88.7% | 53.6% | 76.6% | 90.2% | 49% | 90.5% | — |
| Qwen 2.5 72B (Sep 2024) | Alibaba | ✓ Open | 128K | — | — | 1268 | 85.4% | 49% | 83.1% | 86.6% | — | 92.9% | — |
| Llama 3.1 405B (Jul 2024) | Meta | ✓ Open | 128K | — | — | 1266 | 88.6% | 51.1% | 73.8% | 89% | — | 89% | — |
| Llama 3.3 70B (Dec 2024) | Meta | ✓ Open | 128K | — | — | 1257 | 86% | 50.5% | 77% | 88.4% | 33.4% | 91.1% | — |
| Gemini 1.5 Pro (Sep 2024) | Google | Prop. | 2M | $1.25 | $5 | 1256 | 85.9% | 46.2% | 67.7% | 84.1% | — | 89.2% | — |
| Mistral Large 2 (Jul 2024) | Mistral | Prop. | 128K | $2 | $6 | 1216 | 84% | 55.4% | 68% | 92.1% | — | 83.5% | — |
| Claude 3.5 Haiku (Nov 2024) | Anthropic | Prop. | 200K | $0.80 | $4 | 1180 | 85.5% | 41.6% | 69.2% | 88.1% | 40.6% | 87.7% | — |
| Mixtral 8x22B (Apr 2024) | Mistral | ✓ Open | 64K | — | — | 1124 | 77.8% | 46.7% | 41.8% | 75.1% | — | 78.6% | — |
Score colors: ■ Top tier · ■ Strong · ■ Average · ■ Below average · ■ Weak · — Not reported
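The per-1M-token list prices in the table convert to per-request costs with simple arithmetic. A minimal sketch (the function name and token counts are illustrative, not part of any API):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request, given list prices per 1M tokens."""
    return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

# Example with o3's list prices from the table: $10 in / $40 out per 1M tokens.
# 5,000 input tokens cost $0.05; 1,000 output tokens cost $0.04.
cost = request_cost(5_000, 1_000, 10.0, 40.0)
print(f"${cost:.2f}")  # prints "$0.09"
```

Note that reasoning models such as o3 bill hidden chain-of-thought tokens as output, so actual output counts can far exceed the visible response.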
Methodology: Benchmark scores are taken from official model cards and technical papers. Arena ELO reflects the live Chatbot Arena leaderboard. Scores may lag official releases by a few days. Open-source models use the instruct/chat variant unless otherwise noted. Cost figures are list prices as of March 2025 — actual costs may vary with caching/batch APIs.
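Arena Elo ratings are derived from pairwise human preference votes between anonymized models. The live leaderboard now fits a Bradley–Terry model over all votes, but the classic online Elo update it descends from is a useful mental model (the K-factor and ratings below are illustrative assumptions):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a head-to-head vote.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    Returns the new (rating_a, rating_b) pair; the update is zero-sum.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # A's predicted win rate
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins, so each rating moves by K/2 = 16.
a, b = elo_update(1000.0, 1000.0, 1.0)
print(a, b)  # prints "1016.0 984.0"
```

The 400-point scale means a 100-point gap predicts roughly a 64% win rate for the higher-rated model, which is why the spread from 1124 to 1443 in the table is a large practical difference.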