Back to Reading List
[Agents]·PAP-7WM647·March 17, 2026

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig et al.

4 min readAgentsTool Use

Core Insight

Current AI models barely scratch the surface in solving real-world software issues from GitHub.

Origin Story

arXiv preprintStanfordCarlos E. Jimenez, John Yang et al.

The Room

A small group of researchers at Stanford, 2023. They're surrounded by stacks of GitHub issues and the hum of coffee machines. The team feels the pressure — every day, developers grapple with an overwhelming number of unresolved tickets. They're driven by a shared frustration: current AI tools barely help with real-world software problems.

The Bet

The team decided to push the boundaries of AI's capabilities in software engineering. Their bold bet was that language models, still in their early days of understanding context, could be trained to resolve complex GitHub issues. There was a moment of doubt when they realized the sheer volume of data required. A weekend hackathon almost derailed them when the servers crashed under the data load.

The Blast Radius

Without this paper, the evolution of AI tools in software development would have lagged. Products like an advanced GitHub Copilot might not have been as effective, leaving developers with more manual work. The authors have since become key figures in AI for software engineering, with some leading new projects at major tech companies.

Enhanced GitHub CopilotAI-driven Issue Triage Tools

Knowledge Prerequisites

git blame for knowledge

To fully understand SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, trace this dependency chain first. Papers in our library are linked — click to read them.

DIRECT PREREQIN LIBRARY
Attention Is All You Need

Understanding the foundational architecture of Transformers is crucial before exploring how language models can resolve complex issues.

TransformersAttention MechanismSequence-to-Sequence Models
DIRECT PREREQIN LIBRARY
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT introduces key concepts in pre-training language models which are essential for understanding how models are adapted for specific tasks like issue resolution.

Pre-trainingBidirectional TransformersLanguage Understanding
DIRECT PREREQIN LIBRARY
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

This paper discusses techniques to improve reasoning in language models, which is directly applicable to analyzing and resolving real-world issues.

Chain-of-ThoughtReasoning in Language ModelsPrompting Techniques
DIRECT PREREQIN LIBRARY
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Understanding methods to enhance reasoning in language models is important for grasping the potential of models in solving GitHub issues.

Incentivized ReasoningReinforcement LearningLarge Language Models
DIRECT PREREQIN LIBRARY
AgentBench: Evaluating LLMs as Agents

Agent-based evaluation frameworks are critical for understanding the application of language models to real-world tasks.

Agent EvaluationLanguage Model ApplicationReal-World Tasks

YOU ARE HERE

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

By the Numbers

2,294

real GitHub issues in benchmark

1.96%

issues resolved by Claude 2

1.74%

issues resolved by GPT-4

12

popular Python repositories sourced

In Plain English

SWE-bench is a novel benchmark with 2,294 real tested on AI models. Surprisingly, Claude 2 resolves 1.96% and GPT-4 only 1.74% of issues, highlighting a significant capability gap.

Explained Through an Analogy

Imagine assembling a thousand-piece puzzle with only a single piece at a time visible; that's AI tackling real GitHub issues. It’s like trying to fix a watch while blindfolded, relying solely on touch and intuition.

Go deeper for $6/mo

Everything a PM needs to turn this paper into a competitive edge — in under 10 minutes.

  • 2-page deep-dive article
  • Highlighted key passages
  • Expert-mode reading layer
  • PM Action Plan — 3 moves
  • Use cases for your product
  • Meeting talking points
  • Interactive paper simulator
  • Test Your Edge quiz

Already subscribed?

Log in

How grounded is this content?

Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.

Source Richness100%

8 of 8 content fields populated. More fields = better-grounded generation.

Source Depth~245 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding4 / 4

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.