
Evaluating Large Language Models Trained on Code

2021

Mark Chen, Jerry Tworek, Heewoo Jun et al.

4 min read · Agents · Tool Use

Core Insight

Fine-tuned on GitHub code, Codex solves 70.2% of HumanEval problems when allowed 100 samples per task, leaving GPT-3's 0% in the dust.

By the Numbers

70.2% · success rate with 100 samples
0% · GPT-3's success rate
28.8% · success rate with a single attempt
164 · HumanEval challenges

In Plain English

Codex, a GPT model fine-tuned on GitHub code, solves 70.2% of Python tasks when allowed 100 samples per task. This dwarfs GPT-3's 0% performance.

Knowledge Prerequisites

git blame for knowledge

To fully understand Evaluating Large Language Models Trained on Code, trace this dependency chain first. Papers in our library are linked — click to read them.

DIRECT PREREQ · IN LIBRARY
Attention Is All You Need

This paper introduces the transformer architecture, a foundational framework for understanding how language models, including those trained on code, process sequential data.

transformer architecture · self-attention · positional encodings
DIRECT PREREQ · IN LIBRARY
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Understanding BERT's architecture and bidirectional training is essential for grasping the evolution of large language models and their training methodologies.

bidirectional transformer · pre-training · masked language model
DIRECT PREREQ · IN LIBRARY
Training language models to follow instructions with human feedback

This paper discusses techniques for training models with human feedback, which is crucial for understanding how models handle tasks like code generation.

instruction tuning · human feedback · language model fine-tuning
DIRECT PREREQ · IN LIBRARY
GPT-4 Technical Report

The technical report on GPT-4 provides insights into large-scale language models, which is necessary for understanding advancements in models trained on code.

GPT architecture · scaling laws · few-shot learning
DIRECT PREREQ · IN LIBRARY
Competition-Level Code Generation with AlphaCode

This paper offers a detailed example of a language model specifically focused on code generation, a crucial aspect for evaluating language models trained on code.

code synthesis · language model evaluation · programming competitions

YOU ARE HERE

Evaluating Large Language Models Trained on Code

The Idea Graph

10 nodes · 9 edges
319 words · 2 min read · 6 sections · 10 concepts

Table of Contents

01

The Problem: GPT-3 Limitations in Code

49 words

Before the introduction of Codex, GPT-3, a prominent language model, struggled with programming tasks: it solved 0% of the Python challenges used in this paper's evaluation, highlighting its limitations in understanding programming languages. This failure stems from its training on general natural-language data, which lacks the specific syntax and structure of code.

02

Key Insight: Codex Breakthrough

53 words

The key insight behind the paper is the development of Codex, a breakthrough in language models fine-tuned specifically on code. Codex's success lies in its targeted training on programming languages, allowing it to achieve a 70.2% success rate on Python tasks. This insight demonstrates the power of specialization over generalization in AI training.

03

Method: Domain-Specific Training and Fine-Tuning with GitHub

44 words

Codex was developed by employing domain-specific fine-tuning, focusing on data from GitHub repositories. By fine-tuning the model specifically on code, it learned the syntax and nuances of programming languages. This approach allowed Codex to significantly surpass the performance of general language models like GPT-3.
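
Under the hood this is ordinary causal-language-model fine-tuning, just pointed at source code: the paper starts from GPT-family checkpoints and trains on roughly 159 GB of Python collected from public GitHub repositories. Below is a minimal sketch of that recipe using open-source tooling; the model name and the two-snippet corpus are placeholders, not the paper's actual setup.

```python
# Minimal sketch: fine-tune a causal LM on raw source code.
# Placeholders throughout; this is NOT the paper's training stack.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder for a GPT-3-scale checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stand-in for a large GitHub corpus: a couple of Python snippets.
corpus = [
    "def add(a, b):\n    return a + b\n",
    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n",
]
ds = Dataset.from_dict({"text": corpus})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = ds.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codex-sketch",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=1e-5),
    train_dataset=tokenized,
    # mlm=False => standard next-token (causal) objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Swapping in a real scraped corpus and a much larger checkpoint is what separates this toy loop from the actual Codex training run.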

04

Method: HumanEval Benchmark

45 words

To rigorously evaluate Codex, researchers used the HumanEval benchmark, a set of 164 hand-written programming challenges. This benchmark provided a robust framework for testing the practical coding abilities of AI models, ensuring that Codex's performance was not just theoretical but applicable to real programming scenarios.
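
Each HumanEval problem pairs a function signature and docstring (the prompt) with hidden unit tests, and a sampled completion counts as solved only if those tests pass when executed. The sketch below shows that check in its simplest form; the released harness (openai/human-eval) additionally sandboxes execution and enforces timeouts, and the toy problem here is illustrative rather than a real benchmark item.

```python
# Simplified HumanEval-style functional-correctness check (no sandboxing).
problem = {
    "prompt": 'def incr_list(l):\n    """Add 1 to every element of the list l."""\n',
    "test": "def check(candidate):\n    assert candidate([1, 2, 3]) == [2, 3, 4]\n",
    "entry_point": "incr_list",
}
completion = "    return [x + 1 for x in l]\n"  # what the model would generate

def passes(problem, completion):
    # Stitch prompt + completion + tests into one program and run it.
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace = {}
    try:
        exec(program, namespace)                                   # define function and tests
        namespace["check"](namespace[problem["entry_point"]])      # run the tests
        return True
    except Exception:
        return False

print(passes(problem, completion))  # True for this toy completion
```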

05

Results: Codex Success Rate and Contrast to GPT-3

54 words

Codex achieved remarkable results, solving 28.8% of problems on its first attempt and 70.2% with 100 samples per task. These results stand in stark contrast to GPT-3's 0% success rate, underscoring the effectiveness of Codex's domain-specific training approach. This contrast highlights the importance of training models with data that matches their intended application domain.
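
The 28.8% and 70.2% figures are pass@1 and pass@100: the probability that at least one of k sampled completions passes the problem's unit tests. Rather than literally drawing exactly k samples, the paper estimates pass@k by generating n ≥ k samples per task, counting the c correct ones, and applying a numerically stable unbiased estimator; a small sketch of that estimator:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c passed the tests."""
    if n - c < k:          # too few incorrect samples: every size-k draw has a correct one
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples drawn for one task, 60 of them pass the unit tests
print(pass_at_k(n=200, c=60, k=1))    # 0.30
print(pass_at_k(n=200, c=60, k=100))  # ~1.0: almost surely one of 100 draws is correct
```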

06

Impact: Transforming Developer Tools and the Future of Coding

74 words

Codex's success has profound implications for developer tools and the future of coding. Its ability to generate functional code enhances tools like IDEs and code-assist plugins, making AI-assisted coding more prevalent. GitHub Copilot, powered by Codex, exemplifies this shift, lowering the entry barrier for programming and boosting productivity across the tech industry. This transformation suggests a future where AI is integral to both learning and executing programming tasks, fostering innovation and accelerating development cycles.

Experience It

Live Experiment

Codex Fine-Tuning

See Codex's Code Mastery in Action

You will see how Codex, fine-tuned on GitHub code, dramatically outperforms a general-purpose language model in solving programming tasks.

Notice how Codex's specialized training on code allows it to solve programming tasks with high accuracy, unlike the general model which struggles with syntax and logic.
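
To reproduce a comparison like this offline, the core loop is simply to sample several completions at moderate temperature and inspect (or unit-test) each one. A hedged sketch with open-source tooling is below; the model name is a placeholder for any code-capable checkpoint, and this is not the hosted demo behind this page.

```python
# Sample several candidate completions for a function stub and print them.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute a code-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = 'def is_even(n):\n    """Return True if n is even."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,            # temperature sampling, as in the paper's pass@k setup
    temperature=0.8,
    max_new_tokens=32,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)

prompt_len = inputs["input_ids"].shape[1]
for i, out in enumerate(outputs):
    print(f"--- candidate {i} ---")
    print(tokenizer.decode(out[prompt_len:], skip_special_tokens=True))
```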


How grounded is this content?

Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.

Source Richness: 100%

8 of 8 content fields populated. More fields = better-grounded generation.

Source Depth: ~249 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding: 4 / 4

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability: 3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.