
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

2023

Carlos E. Jimenez, John Yang, Alexander Wettig et al.

4 min read · Agents · Tool Use

Core Insight

Current AI models barely scratch the surface in solving real-world software issues from GitHub.

By the Numbers

2,294

real GitHub issues in benchmark

1.96%

issues resolved by Claude 2

1.74%

issues resolved by GPT-4

12

popular Python repositories sourced

In Plain English

SWE-bench is a novel benchmark of 2,294 real GitHub issues used to test AI models. Surprisingly, Claude 2 resolves only 1.96% of issues and GPT-4 only 1.74%, highlighting a significant capability gap.

Knowledge Prerequisites

git blame for knowledge

To fully understand "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", trace this dependency chain first. Papers in our library are linked; click to read them.

DIRECT PREREQ · IN LIBRARY
Attention Is All You Need

Understanding the foundational architecture of Transformers is crucial before exploring how language models can resolve complex issues.

Transformers · Attention Mechanism · Sequence-to-Sequence Models
DIRECT PREREQ · IN LIBRARY
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT introduces key concepts in pre-training language models which are essential for understanding how models are adapted for specific tasks like issue resolution.

Pre-training · Bidirectional Transformers · Language Understanding
DIRECT PREREQ · IN LIBRARY
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

This paper discusses techniques to improve reasoning in language models, which is directly applicable to analyzing and resolving real-world issues.

Chain-of-Thought · Reasoning in Language Models · Prompting Techniques
DIRECT PREREQ · IN LIBRARY
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Understanding methods to enhance reasoning in language models is important for grasping the potential of models in solving GitHub issues.

Incentivized Reasoning · Reinforcement Learning · Large Language Models
DIRECT PREREQ · IN LIBRARY
AgentBench: Evaluating LLMs as Agents

Agent-based evaluation frameworks are critical for understanding the application of language models to real-world tasks.

Agent Evaluation · Language Model Application · Real-World Tasks

YOU ARE HERE

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

793 words · 4 min read · 13 sections · 15 concepts

Table of Contents

01

The World Before: AI's Struggle with Real-World Software Issues

78 words

Before SWE-bench, the capabilities of AI models in solving real-world software engineering problems were overestimated. While models like GPT-4 and Claude 2 showed promise on isolated tasks, they struggled with the complex, multi-file problems often encountered in real-world settings. This gap was largely due to the limitations of existing benchmarks, which failed to reflect the difficulty of such problems. The need for a realistic benchmark became evident as AI's potential to automate software engineering tasks remained largely untapped.

02

The Specific Failure: AI's Inability to Handle Complex Codebases

68 words

AI models, despite their advanced architectures, were unable to manage the intricacies of real-world software issues. These issues often span multiple files and require a deep understanding of extensive codebases. Existing models demonstrated a significant capability gap, as evidenced by their poor performance on SWE-bench tasks. This highlighted the need for a benchmark that could accurately assess AI's proficiency in realistic scenarios, beyond isolated or artificially generated problems.

03

The Key Insight: Bridging the Capability Gap

69 words

The realization that current AI models were falling short in real-world applications led to the development of SWE-bench. The key insight was the recognition of the disconnect between model performance on traditional benchmarks and the demands of real-world software engineering. By focusing on real GitHub issues, the authors of SWE-bench sought to provide a more accurate assessment of AI capabilities and drive improvements in model architecture and training.

04

Architecture Overview: Introducing SWE-bench

64 words

SWE-bench is a novel benchmark designed to test the ability of language models to solve real-world software engineering problems. It comprises 2,294 issues sourced from 12 popular Python repositories. Unlike prior benchmarks, SWE-bench requires models to understand and manipulate intricate codebases, offering a more realistic challenge. This framework translates practical GitHub issues into structured tasks for AI, providing a comprehensive evaluation of model capabilities.
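To make the task format concrete, here is a minimal sketch of a single task instance. The field names approximate the released SWE-bench dataset, but treat the exact schema as an assumption; the instance ID, commit hash, and diff snippets are hypothetical placeholders.

```python
# A minimal sketch of a single SWE-bench-style task instance.
# Field names approximate the released dataset; the ID, commit hash,
# and diff contents below are hypothetical placeholders.
task_instance = {
    "instance_id": "scikit-learn__scikit-learn-12345",  # hypothetical
    "repo": "scikit-learn/scikit-learn",    # one of the 12 source repos
    "base_commit": "abc123def",             # codebase snapshot to patch
    "problem_statement": "Ridge.fit crashes on sparse input ...",  # issue text
    "patch": "diff --git a/sklearn/linear_model/ridge.py ...",     # gold fix
    "test_patch": "diff --git a/sklearn/tests/test_ridge.py ...",  # PR's tests
    "FAIL_TO_PASS": ["test_ridge_sparse_fit"],  # must pass after the fix
    "PASS_TO_PASS": ["test_ridge_dense_fit"],   # must keep passing
}
```

A model sees the problem statement plus repository context and must emit a patch; the gold patch and tests are held out for evaluation.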

05

Deep Dive: Real GitHub Issues as a Benchmark

76 words

The heart of SWE-bench is its use of real GitHub issues. These issues present a genuine challenge, requiring models to engage with complex, multi-file problems that are representative of real-world software development. By using actual issues and pull requests from widely-used repositories, SWE-bench ensures that the tasks are relevant and challenging. This approach starkly contrasts with traditional benchmarks, which often rely on synthetic or isolated tasks that fail to capture the complexity of real software engineering.
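Because each issue sits inside a large repository, a model cannot be shown every file; the paper pairs each issue with retrieved code context (it reports a BM25 retrieval setting alongside an "oracle" setting). Below is a minimal sketch of BM25 file retrieval using the rank_bm25 package; the toy repository contents are placeholders.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy stand-in for a repository: path -> (placeholder) file contents.
files = {
    "sklearn/linear_model/ridge.py": "class Ridge solver sparse matrix fit",
    "sklearn/utils/validation.py": "check_array validate input dtype",
    "README.rst": "scikit-learn machine learning in Python",
}
paths, corpus = list(files.keys()), list(files.values())
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# Score every file against the issue text and keep the best match.
issue_text = "Ridge fit crashes on sparse matrix input"
scores = bm25.get_scores(issue_text.lower().split())
best = max(range(len(paths)), key=lambda i: scores[i])
print(paths[best])  # -> sklearn/linear_model/ridge.py
```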

06

Data Strategy: Sourcing and Structuring SWE-bench Tasks

60 words

SWE-bench's data strategy is foundational to its success as a benchmark. By sourcing real-world issues from popular Python repositories, SWE-bench ensures that the tasks it presents are relevant and challenging. This strategy contrasts with typical benchmarks that often use synthetic or isolated tasks, providing a more rigorous and realistic evaluation of a model's ability to resolve complex software engineering challenges.
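As a hedged sketch of what this attribute filtering might look like, assuming the selection criteria described in the paper (merged pull requests that resolve an issue and contribute tests); the class and predicate names are illustrative, not the benchmark's actual pipeline code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PullRequest:
    is_merged: bool               # fix was accepted upstream
    linked_issue: Optional[int]   # issue number the PR resolves, if any
    modifies_tests: bool          # PR adds or edits test files

def is_candidate_task(pr: PullRequest) -> bool:
    # Keep only PRs whose fix can later be verified by running the
    # tests the PR itself contributed.
    return pr.is_merged and pr.linked_issue is not None and pr.modifies_tests

prs = [
    PullRequest(True, 4321, True),   # kept: merged, linked, tested
    PullRequest(True, None, True),   # dropped: no linked issue
    PullRequest(False, 4322, True),  # dropped: never merged
]
print(sum(is_candidate_task(pr) for pr in prs))  # -> 1
```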

07

Structured Tasks: Transforming GitHub Issues for AI Evaluation

51 words

SWE-bench transforms practical GitHub issues into structured tasks that AI models can tackle. This transformation is key to evaluating a model's ability to understand and resolve real-world problems. By structuring these issues, SWE-bench provides a comprehensive framework for assessing model performance on tasks that are representative of real-world software engineering challenges.
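The structured format is what makes resolution checkable by execution. Here is a minimal sketch of that check, assuming a pytest-based repository; the commands are hypothetical simplifications of a real harness, which also manages per-repository environments:

```python
import subprocess

def resolves_issue(repo_dir: str, model_patch: str,
                   fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    # 1. Apply the model-generated diff to the repository snapshot.
    apply = subprocess.run(["git", "apply", "-"], input=model_patch,
                           text=True, cwd=repo_dir)
    if apply.returncode != 0:
        return False  # patch does not even apply cleanly
    # 2. The issue's tests must now pass, and existing tests must not break.
    for test in fail_to_pass + pass_to_pass:
        if subprocess.run(["python", "-m", "pytest", test],
                          cwd=repo_dir).returncode != 0:
            return False
    return True
```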

08

Training Techniques: Preparing Models for SWE-bench

58 words

The paper highlights the need for improved training techniques that can better prepare language models for the complexities of real-world software engineering tasks. Current models, despite their advanced architectures, require further refinement to tackle the intricate challenges presented by SWE-bench. Future research should focus on developing training paradigms that enhance model proficiency in understanding and resolving multi-file issues.

09

Key Results: Benchmarking AI on SWE-bench

53 words

The key results from SWE-bench reveal a stark contrast between the perceived capabilities of advanced language models and their actual performance in software engineering contexts. With Claude 2 resolving only 1.96% and GPT-4 resolving 1.74% of tasks, these figures highlight the models' current limitations and underscore the complexity of real-world software engineering problems.
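The headline percentages are simply the share of the 2,294 instances whose held-out tests pass after a model's patch is applied. The instance count below is inferred from the reported 1.96% rate, not taken from the paper:

```python
def resolution_rate(outcomes: list[bool]) -> float:
    # Fraction of task instances fully resolved, as a percentage.
    return 100 * sum(outcomes) / len(outcomes)

# Roughly 45 of 2,294 instances resolved reproduces Claude 2's 1.96%.
outcomes = [True] * 45 + [False] * (2294 - 45)
print(f"{resolution_rate(outcomes):.2f}%")  # -> 1.96%
```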

10

Ablation Studies: Evaluating Model Components

42 words

The paper does not extensively cover ablation studies. However, the low performance rates of current models on SWE-bench suggest that further research is needed to identify which components of model architecture are most critical for improving performance on real-world software engineering tasks.

11

What This Changed: Impact on AI and Software Engineering

62 words

SWE-bench has significant implications for both AI research and software engineering. By highlighting the capability gap of current models, it opens up new avenues for research into model architectures and training paradigms better suited to handle complex software issues. For enterprise companies like GitHub and Atlassian, the results suggest a need to focus on augmenting human efforts rather than promising full automation.

12

Limitations & Open Questions: Addressing Unsolved Challenges

56 words

Despite its contributions, SWE-bench has limitations, such as the potential for model training on similar datasets to skew results. Open questions remain about what specific model improvements are needed to better handle the complexities of real-world software engineering. Future research should focus on addressing these challenges to enhance the efficacy of AI models in practical applications.

13

Why You Should Care: Product Implications for AI Developers

56 words

For AI developers and product managers, SWE-bench offers a reality check on the current capabilities of language models in software engineering. The benchmark results suggest that AI should be used to augment human capabilities rather than replace them, emphasizing the importance of developing products that leverage AI in a supportive role rather than promising full automation.

Experience It

Live Experiment

SWE-bench Benchmark

See SWE-bench in Action

Compare a model's attempts to resolve real-world GitHub issues with and without SWE-bench's structured task context. This highlights the capability gap in current models.

Notice how SWE-bench's structured task format helps the model produce more complete solutions by exposing the broader code context, unlike the baseline approach.


How grounded is this content?

Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.

Source Richness: 100%

8 of 8 content fields populated. More fields = better-grounded generation.

Source Depth: ~245 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding: 4 / 4

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability: 3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.