
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

2023

Carlos E. Jimenez, John Yang, Alexander Wettig et al.

4 min read · Agents · Tool Use

Core Insight

Current AI models barely scratch the surface in solving real-world software issues from GitHub.

By the Numbers

2,294

real GitHub issues in benchmark

1.96%

issues resolved by Claude 2

1.74%

issues resolved by GPT-4

12

popular Python repositories sourced

In Plain English

SWE-bench is a novel benchmark of 2,294 real GitHub issues used to test AI models. Surprisingly, Claude 2 resolves only 1.96% of issues and GPT-4 only 1.74%, highlighting a significant capability gap.

Knowledge Prerequisites

git blame for knowledge

To fully understand "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", trace this dependency chain first. Papers in our library are linked; click to read them.

DIRECT PREREQ · IN LIBRARY
Attention Is All You Need

Understanding the foundational architecture of Transformers is crucial before exploring how language models can resolve complex issues.

Transformers · Attention Mechanism · Sequence-to-Sequence Models
DIRECT PREREQ · IN LIBRARY
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT introduces key concepts in pre-training language models which are essential for understanding how models are adapted for specific tasks like issue resolution.

Pre-training · Bidirectional Transformers · Language Understanding
DIRECT PREREQ · IN LIBRARY
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

This paper discusses techniques to improve reasoning in language models, which is directly applicable to analyzing and resolving real-world issues.

Chain-of-Thought · Reasoning in Language Models · Prompting Techniques
DIRECT PREREQ · IN LIBRARY
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Understanding methods to enhance reasoning in language models is important for grasping the potential of models in solving GitHub issues.

Incentivized Reasoning · Reinforcement Learning · Large Language Models
DIRECT PREREQ · IN LIBRARY
AgentBench: Evaluating LLMs as Agents

Agent-based evaluation frameworks are critical for understanding the application of language models to real-world tasks.

Agent Evaluation · Language Model Application · Real-World Tasks

YOU ARE HERE

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

793 words · 4 min read · 13 sections · 15 concepts

Table of Contents

01

The World Before: AI's Struggle with Real-World Software Issues

78 words

Before SWE-bench, the capabilities of AI models in solving real-world software engineering problems were overestimated. While models like GPT-4 and Claude 2 showed promise on isolated tasks, they struggled with the complex, multi-file problems often encountered in real-world settings. This gap was largely due to the limitations of existing benchmarks, which failed to reflect the difficulty of such problems. The need for a realistic benchmark became evident as AI's potential to automate software engineering tasks remained largely untapped.

02

The Specific Failure: AI's Inability to Handle Complex Codebases

68 words

AI models, despite their advanced architectures, were unable to manage the intricacies of real-world software issues. These issues often span multiple files and require a deep understanding of extensive codebases. Existing models demonstrated a significant capability gap, as evidenced by their poor performance on SWE-bench tasks. This highlighted the need for a benchmark that could accurately assess AI's proficiency in realistic scenarios, beyond isolated or artificially generated problems.

03

The Key Insight: Bridging the Capability Gap

69 words

The realization that current AI models were falling short in real-world applications led to the development of SWE-bench. The key insight was the recognition of the disconnect between model performance on traditional benchmarks and the demands of real-world software engineering. By focusing on real GitHub issues, the authors of SWE-bench sought to provide a more accurate assessment of AI capabilities and drive improvements in model architecture and training.

04

Architecture Overview: Introducing SWE-bench

64 words

SWE-bench is a novel benchmark designed to test the ability of language models to solve real-world software engineering problems. It comprises 2,294 issues sourced from 12 popular Python repositories. Unlike prior benchmarks, SWE-bench requires models to understand and manipulate intricate codebases, offering a more realistic challenge. This framework translates practical GitHub issues into structured tasks for AI, providing a comprehensive evaluation of model capabilities.
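To make the task format concrete, here is a minimal sketch of a single task instance. The field names approximate the released SWE-bench dataset, but treat the exact schema as an assumption; the instance ID, commit hash, and diff snippets are hypothetical placeholders.

```python
# A minimal sketch of a single SWE-bench-style task instance.
# Field names approximate the released dataset; the ID, commit hash,
# and diff contents below are hypothetical placeholders.
task_instance = {
    "instance_id": "scikit-learn__scikit-learn-12345",  # hypothetical
    "repo": "scikit-learn/scikit-learn",    # one of the 12 source repos
    "base_commit": "abc123def",             # codebase snapshot to patch
    "problem_statement": "Ridge.fit crashes on sparse input ...",  # issue text
    "patch": "diff --git a/sklearn/linear_model/ridge.py ...",     # gold fix
    "test_patch": "diff --git a/sklearn/tests/test_ridge.py ...",  # PR's tests
    "FAIL_TO_PASS": ["test_ridge_sparse_fit"],  # must pass after the fix
    "PASS_TO_PASS": ["test_ridge_dense_fit"],   # must keep passing
}
```

A model sees the problem statement plus repository context and must emit a patch; the gold patch and tests are held out for evaluation.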

05

Deep Dive: Real GitHub Issues as a Benchmark

76 words

The heart of SWE-bench is its use of real GitHub issues. These issues present a genuine challenge, requiring models to engage with complex, multi-file problems that are representative of real-world software development. By using actual issues and pull requests from widely-used repositories, SWE-bench ensures that the tasks are relevant and challenging. This approach starkly contrasts with traditional benchmarks, which often rely on synthetic or isolated tasks that fail to capture the complexity of real software engineering.
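Because each issue sits inside a large repository, a model cannot be shown every file; the paper pairs each issue with retrieved code context (it reports a BM25 retrieval setting alongside an "oracle" setting). Below is a minimal sketch of BM25 file retrieval using the rank_bm25 package; the toy repository contents are placeholders.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy stand-in for a repository: path -> (placeholder) file contents.
files = {
    "sklearn/linear_model/ridge.py": "class Ridge solver sparse matrix fit",
    "sklearn/utils/validation.py": "check_array validate input dtype",
    "README.rst": "scikit-learn machine learning in Python",
}
paths, corpus = list(files.keys()), list(files.values())
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# Score every file against the issue text and keep the best match.
issue_text = "Ridge fit crashes on sparse matrix input"
scores = bm25.get_scores(issue_text.lower().split())
best = max(range(len(paths)), key=lambda i: scores[i])
print(paths[best])  # -> sklearn/linear_model/ridge.py
```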

06

Data Strategy: Sourcing and Structuring SWE-bench Tasks

60 words

SWE-bench's data strategy is foundational to its success as a benchmark. By sourcing real-world issues from popular Python repositories, SWE-bench ensures that the tasks it presents are relevant and challenging. This strategy contrasts with typical benchmarks that often use synthetic or isolated tasks, providing a more rigorous and realistic evaluation of a model's ability to resolve complex software engineering challenges.
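As a hedged sketch of what this attribute filtering might look like, assuming the selection criteria described in the paper (merged pull requests that resolve an issue and contribute tests); the class and predicate names are illustrative, not the benchmark's actual pipeline code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PullRequest:
    is_merged: bool               # fix was accepted upstream
    linked_issue: Optional[int]   # issue number the PR resolves, if any
    modifies_tests: bool          # PR adds or edits test files

def is_candidate_task(pr: PullRequest) -> bool:
    # Keep only PRs whose fix can later be verified by running the
    # tests the PR itself contributed.
    return pr.is_merged and pr.linked_issue is not None and pr.modifies_tests

prs = [
    PullRequest(True, 4321, True),   # kept: merged, linked, tested
    PullRequest(True, None, True),   # dropped: no linked issue
    PullRequest(False, 4322, True),  # dropped: never merged
]
print(sum(is_candidate_task(pr) for pr in prs))  # -> 1
```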

07

Structured Tasks: Transforming GitHub Issues for AI Evaluation

51 words

SWE-bench transforms practical GitHub issues into structured tasks that AI models can tackle. This transformation is key to evaluating a model's ability to understand and resolve real-world problems. By structuring these issues, SWE-bench provides a comprehensive framework for assessing model performance on tasks that are representative of real-world software engineering challenges.
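The structured format is what makes resolution checkable by execution. Here is a minimal sketch of that check, assuming a pytest-based repository; the commands are hypothetical simplifications of a real harness, which also manages per-repository environments:

```python
import subprocess

def resolves_issue(repo_dir: str, model_patch: str,
                   fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # 1. Apply the model-generated diff to the repository snapshot.
    apply = subprocess.run(["git", "apply", "-"], input=model_patch,
                           text=True, cwd=repo_dir)
    if apply.returncode != 0:
        return False  # patch does not even apply cleanly
    # 2. The issue's tests must now pass, and existing tests must not break.
    for test in fail_to_pass + pass_to_pass:
        if subprocess.run(["python", "-m", "pytest", test],
                          cwd=repo_dir).returncode != 0:
            return False
    return True
```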

08

Training Techniques: Preparing Models for SWE-bench

58 words

The paper highlights the need for improved training techniques that can better prepare language models for the complexities of real-world software engineering tasks. Current models, despite their advanced architectures, require further refinement to tackle the intricate challenges presented by SWE-bench. Future research should focus on developing training paradigms that enhance model proficiency in understanding and resolving multi-file issues.

09

Key Results: Benchmarking AI on SWE-bench

53 words

The key results from SWE-bench reveal a stark contrast between the perceived capabilities of advanced language models and their actual performance in software engineering contexts. With Claude 2 resolving only 1.96% and GPT-4 resolving 1.74% of tasks, these figures highlight the models' current limitations and underscore the complexity of real-world software engineering problems.
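The headline percentages are simply the share of the 2,294 instances whose held-out tests pass after a model's patch is applied. The instance count below is inferred from the reported 1.96% rate, not taken from the paper:

```python
def resolution_rate(outcomes: list[bool]) -> float:
    # Fraction of task instances fully resolved, as a percentage.
    return 100 * sum(outcomes) / len(outcomes)

# Roughly 45 of 2,294 instances resolved reproduces Claude 2's 1.96%.
outcomes = [True] * 45 + [False] * (2294 - 45)
print(f"{resolution_rate(outcomes):.2f}%")  # -> 1.96%
```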

10

Ablation Studies: Evaluating Model Components

42 words

The paper does not extensively cover ablation studies. However, the low performance rates of current models on SWE-bench suggest that further research is needed to identify which components of model architecture are most critical for improving performance on real-world software engineering tasks.

11

What This Changed: Impact on AI and Software Engineering

62 words

SWE-bench has significant implications for both AI research and software engineering. By highlighting the capability gap of current models, it opens up new avenues for research into model architectures and training paradigms better suited to handle complex software issues. For enterprise companies like GitHub and Atlassian, the results suggest a need to focus on augmenting human efforts rather than promising full automation.

12

Limitations & Open Questions: Addressing Unsolved Challenges

56 words

Despite its contributions, SWE-bench has limitations, such as the potential for model training on similar datasets to skew results. Open questions remain about what specific model improvements are needed to better handle the complexities of real-world software engineering. Future research should focus on addressing these challenges to enhance the efficacy of AI models in practical applications.

13

Why You Should Care: Product Implications for AI Developers

56 words

For AI developers and product managers, SWE-bench offers a reality check on the current capabilities of language models in software engineering. The benchmark results suggest that AI should be used to augment human capabilities rather than replace them, emphasizing the importance of developing products that leverage AI in a supportive role rather than promising full automation.

Experience It

Live Experiment

SWE-bench Benchmark

See SWE-bench in Action

Compare a model's attempts to resolve real-world GitHub issues with and without SWE-bench's structured task context. This highlights the capability gap in current models.

Notice how SWE-bench's structured task format helps the model produce more complete solutions by exposing the broader code context, unlike the baseline approach.


How grounded is this content?

Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.

Source Richness: 100%

8 of 8 content fields populated. More fields = better-grounded generation.

Source Depth: ~245 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding: 4 / 4

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability: 3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.