AI Model Benchmarks

Frontier models ranked by real-world performance. Scores are sourced from official technical reports, Chatbot Arena, and peer-reviewed papers, not marketing copy.

17 models tracked
Updated Mar 20, 2026

Highest Arena ELO: 1,443 (Gemini 2.5 Pro)
Best GPQA: 87.7% (o3)
Best SWE-bench: 71.7% (o3)
Open Source Models: 6 of 17 total

Model              Released  Provider   Type   Context  Input/1M  Output/1M  Arena ELO  MMLU   GPQA   MATH   HumanEval  SWE-bench  MGSM   MMLU-Pro
Gemini 2.5 Pro     Mar 2025  Google     Prop.  1M       $1.25     $10        1443       89.8%  84%    91.6%  90%        63.2%      93.3%  —
o3                 Apr 2025  OpenAI     Prop.  200K     $10       $40        1420       91.6%  87.7%  97.2%  97.9%      71.7%      —      —
o4-mini            Apr 2025  OpenAI     Prop.  200K     $1.1      $4.4       1380       89%    81.4%  99.5%  97.9%      68.1%      —      —
Claude 3.7 Sonnet  Feb 2025  Anthropic  Prop.  200K     $3        $15        1366       90.6%  68%    78.2%  97.5%      62.3%      91.8%  —
Gemini 2.0 Flash   Jan 2025  Google     Prop.  1M       $0.1      $0.4       1354       85.2%  62.1%  89.7%  87%        51.2%      89.7%  —
o1                 Sep 2024  OpenAI     Prop.  128K     $15       $60        1352       92.3%  78%    94.8%  92.4%      48.9%      90.8%  —
DeepSeek-R1        Jan 2025  DeepSeek   Open   128K     $0.55     $2.19      1340       90.8%  71.5%  97.3%  92.6%      49.2%      98.3%  —
DeepSeek-V3        Dec 2024  DeepSeek   Open   128K     $0.27     $1.1       1322       88.5%  59.1%  90.2%  89.9%      42%        93.5%  —
Claude 3.5 Sonnet  Oct 2024  Anthropic  Prop.  200K     $3        $15        1300       88.7%  59.4%  71.1%  93.7%      49%        91.6%  —
GPT-4o             May 2024  OpenAI     Prop.  128K     $5        $15        1287       88.7%  53.6%  76.6%  90.2%      49%        90.5%  —
Qwen 2.5 72B       Sep 2024  Alibaba    Open   128K     —         —          1268       85.4%  49%    83.1%  86.6%      —          92.9%  —
Llama 3.1 405B     Jul 2024  Meta       Open   128K     —         —          1266       88.6%  51.1%  73.8%  89%        —          89%    —
Llama 3.3 70B      Dec 2024  Meta       Open   128K     —         —          1257       86%    50.5%  77%    88.4%      33.4%      91.1%  —
Gemini 1.5 Pro     Sep 2024  Google     Prop.  2M       $1.25     $5         1256       85.9%  46.2%  67.7%  84.1%      —          89.2%  —
Mistral Large 2    Jul 2024  Mistral    Prop.  128K     $2        $6         1216       84%    55.4%  68%    92.1%      —          83.5%  —
Claude 3.5 Haiku   Nov 2024  Anthropic  Prop.  200K     $0.8      $4         1180       85.5%  41.6%  69.2%  88.1%      40.6%      87.7%  —
Mixtral 8x22B      Apr 2024  Mistral    Open   65.536K  —         —          1124       77.8%  46.7%  41.8%  75.1%      —          78.6%  —
Legend: — = not reported
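
The headline figures at the top of the page can be recomputed from the table rows. The following is a minimal Python sketch, assuming each row is held as a dictionary and a not-reported cell is stored as `None`. The `ROWS` subset, the field names, and the `best` helper are illustrative and are not part of the site.

```python
from typing import Optional

# A few rows copied from the table; None marks a score that is not reported.
ROWS = [
    {"model": "Gemini 2.5 Pro", "type": "Prop.", "elo": 1443, "gpqa": 84.0, "swe_bench": 63.2},
    {"model": "o3", "type": "Prop.", "elo": 1420, "gpqa": 87.7, "swe_bench": 71.7},
    {"model": "DeepSeek-R1", "type": "Open", "elo": 1340, "gpqa": 71.5, "swe_bench": 49.2},
    {"model": "Qwen 2.5 72B", "type": "Open", "elo": 1268, "gpqa": 49.0, "swe_bench": None},
]


def best(rows: list[dict], metric: str) -> Optional[dict]:
    """Return the row with the highest value for `metric`, skipping unreported cells."""
    reported = [r for r in rows if r.get(metric) is not None]
    return max(reported, key=lambda r: r[metric]) if reported else None


for metric in ("elo", "gpqa", "swe_bench"):
    top = best(ROWS, metric)
    print(f"best {metric}: {top['model']} ({top[metric]})")
```

Filtering out `None` before taking the maximum keeps a missing score from being treated as zero or raising a comparison error.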

Methodology: Benchmark scores are taken from official model cards and technical papers. Arena ELO reflects the live Chatbot Arena leaderboard. Scores may lag official releases by a few days. Open-source models use the instruct/chat variant unless otherwise noted. Cost figures are list prices in USD per 1M tokens as of March 2025; actual costs may vary with caching or batch APIs.
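
Assuming the Input/1M and Output/1M columns are USD list prices per one million tokens, the cost of a single request can be estimated from its token counts. The sketch below uses the o3 and Gemini 2.0 Flash prices from the table. The function name and the token counts are made up for illustration, and caching or batch discounts are ignored.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_1m: float, output_price_per_1m: float) -> float:
    """List-price cost in USD for one request, given per-1M-token prices."""
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000


# Hypothetical workload: 20,000 prompt tokens and 2,000 completion tokens.
# Prices are (input, output) in USD per 1M tokens, taken from the table.
PRICES = {"o3": (10.0, 40.0), "Gemini 2.0 Flash": (0.10, 0.40)}

for model, (input_price, output_price) in PRICES.items():
    cost = request_cost_usd(20_000, 2_000, input_price, output_price)
    print(f"{model}: ${cost:.4f}")
```

Because output tokens are priced several times higher than input tokens for most models listed, workloads with long completions cost more than the input price alone would suggest.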