
AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang et al.

4 min read · Agents

Core Insight

AgentBench shows that top commercial LLMs like GPT-4 can act as capable autonomous agents, significantly outpacing their open-source rivals.

Origin Story

arXiv preprint, 2023 · Tsinghua University · Xiao Liu, Hao Yu et al.

The Room

A group of ambitious minds at Tsinghua University huddled around a table, their discussions punctuated by the hum of computers. The frustration in the air was palpable: existing evaluations felt like piecemeal fixes, each probing a single skill in isolation. They craved a leap, something that would push large language models beyond their current limits.

The Bet

The team decided to challenge the status quo by evaluating how far LLMs could go as autonomous agents. It seemed audacious, almost reckless, at a time when the focus was on narrowly defined tasks. There was a moment when they hesitated, questioning whether the models could truly stand on their own. They pushed forward, driven by a vision of LLMs working autonomously in ways yet unexplored.

The Blast Radius

AgentBench landed just as autonomous agents like AutoGPT and BabyAGI were capturing the field's imagination, and it gave that wave something it had lacked: a rigorous, multi-environment yardstick for whether these agents actually work. Those developments have sparked new discussions about AI's potential roles. The authors, having made a significant mark, continue to drive the evolving agent landscape forward, inspiring others to explore the boundaries of autonomous systems.

AutoGPT · BabyAGI · LangChain

Knowledge Prerequisites

git blame for knowledge

To fully understand AgentBench: Evaluating LLMs as Agents, trace this dependency chain first. Papers in our library are linked — click to read them.

DIRECT PREREQ · IN LIBRARY
Training language models to follow instructions with human feedback

Understanding how language models are trained to align with human instructions is crucial for evaluating their performance as agents.

instruction-following · model alignment · human feedback
DIRECT PREREQ · IN LIBRARY
ReAct: Synergizing Reasoning and Acting in Language Models

This paper explores the integration of reasoning and acting, a central concept in assessing LLMs as agents; a minimal sketch of the pattern follows this list.

reasoning · acting · language models as agents
DIRECT PREREQ · IN LIBRARY
Toolformer: Language Models Can Teach Themselves to Use Tools

Exploring how language models interact with tools provides foundational knowledge for understanding agent capabilities in LLMs.

tool use in LLMs · self-teaching · interaction with external systems
DIRECT PREREQ · IN LIBRARY
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

This paper introduces techniques for enhancing reasoning in language models, relevant for evaluating LLMs as intelligent agents.

chain-of-thought prompting · reasoning in LLMs · guided thinking
DIRECT PREREQ · IN LIBRARY
Reflexion: Language Agents with Verbal Reinforcement Learning

Understanding verbal reinforcement learning in language agents contributes to evaluating their performance as capable agents.

verbal reinforcement · language agents · learning via feedback
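
Because ReAct is the most load-bearing prerequisite here, a compact sketch may help. The Thought/Action/Observation format below follows the ReAct paper's prompting pattern; the llm stub, the TOOLS registry, and the string parsing are simplified assumptions for illustration, not any real library's API.

    # Sketch of a ReAct-style loop: Thought -> Action -> Observation, repeated.
    # The llm() stub and TOOLS registry are illustrative assumptions.
    def llm(prompt: str) -> str:
        """Stand-in for a real chat-model call; returns the next Thought + Action."""
        return 'Thought: I can answer directly.\nAction: finish["Paris"]'

    TOOLS = {
        "search": lambda q: f"(top search snippet for {q!r})",
        "finish": lambda a: a,  # pseudo-tool that ends the episode with an answer
    }

    def react(question: str, max_steps: int = 5) -> str:
        prompt = f"Question: {question}\n"
        for _ in range(max_steps):
            step = llm(prompt)                 # model emits Thought + Action
            prompt += step + "\n"
            name, arg = step.rsplit("Action: ", 1)[1].split("[", 1)
            arg = arg.strip('"]')              # crude parse of tool["arg"]
            if name == "finish":
                return arg                     # reasoning chain reached an answer
            prompt += f"Observation: {TOOLS[name](arg)}\n"  # act, feed result back
        return "(no answer within step budget)"

    print(react("What is the capital of France?"))

AgentBench's interactive environments reward exactly this kind of interleaved reason-then-act behavior, which is why ReAct sits so high in the dependency chain.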

YOU ARE HERE

AgentBench: Evaluating LLMs as Agents

By the Numbers

  • 8 interactive environments
  • 27 LLMs evaluated
  • GPT-4: leading model
  • significant performance gap with open-source models

In Plain English

AgentBench evaluates LLMs' autonomy in decision-making across 8 interactive environments. GPT-4 leads the pack, with open-source models trailing well behind. This benchmark sets the stage for LLMs to undertake real-world tasks.
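
To make that concrete, here is a minimal sketch of the kind of multi-turn evaluation loop a benchmark like AgentBench runs: the model exchanges actions and observations with an environment for a bounded number of turns, and each environment scores the finished episode. Every name here (Environment, call_llm, run_episode) is an illustrative assumption, not AgentBench's actual API.

    # Minimal sketch of a multi-turn agent-evaluation loop.
    # Class and function names are illustrative assumptions, NOT AgentBench's API.
    from dataclasses import dataclass, field

    @dataclass
    class Environment:
        """One interactive task: holds state, emits observations, scores episodes."""
        name: str
        max_turns: int = 10
        history: list = field(default_factory=list)

        def reset(self) -> str:
            self.history.clear()
            return f"[{self.name}] task description and initial observation"

        def step(self, action: str) -> tuple[str, bool]:
            """Apply the agent's action; return (next observation, done flag)."""
            self.history.append(action)
            done = action.strip().lower().startswith("submit")
            return "observation after action", done

        def score(self) -> float:
            """Toy success metric; a real benchmark checks the actual task goal."""
            return 1.0 if self.history and "submit" in self.history[-1].lower() else 0.0

    def call_llm(transcript: str) -> str:
        """Placeholder for a real chat-completion call (GPT-4, an OSS model, ...)."""
        return "submit final answer"

    def run_episode(env: Environment) -> float:
        """Let the model act until it finishes or runs out of turns."""
        transcript = env.reset()
        for _ in range(env.max_turns):
            action = call_llm(transcript)           # model decides the next action
            observation, done = env.step(action)    # environment reacts
            transcript += f"\nACTION: {action}\nOBS: {observation}"
            if done:
                break
        return env.score()

    # Average success rate over the suite (AgentBench spans 8 such environments).
    envs = [Environment(n) for n in ("os", "db", "kg", "web")]
    print(sum(run_episode(e) for e in envs) / len(envs))

The real harness adds per-environment prompts, tool interfaces, and weighted scoring, but this turn-bounded interaction loop is the core shape being measured.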

Explained Through an Analogy

Imagine LLMs as novice chefs in a bustling kitchen; AgentBench tests whether they can follow the recipe and adapt to real-time challenges like a burnt pan. It's no longer about reading the recipe but about orchestrating a perfect dish amidst chaos.

