Back to Reading List
[Reasoning]·PAP-XUFI8B·March 17, 2026

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

ByteDance Seed, Qiying Yu, Zheng Zhang et al.

4 min readReasoningTrainingOpen Source

Core Insight

DAPO: Raising the bar in LLM training with open-source reinforcement learning breakthroughs.

Origin Story

arXiv preprintByteDanceQiying Yu, Zheng Zhang et al.

The Room

A small team at ByteDance, late nights in a bustling Beijing office. They are driven by a shared vision but frustrated by the limitations of current AI training paradigms. Reinforcement learning offers promise, but scaling it for large language models feels like trying to fit a square peg in a round hole.

The Bet

While others focused on refining transformers, this team took a gamble on combining reinforcement learning with large language models at an open-source level. They believed that democratizing access could accelerate AI advancements. There was a moment when the system crashed under the weight of its own ambition, casting doubt among the team.

The Blast Radius

Without this paper, open-source advancements in large-scale reinforcement learning might have stalled. Companies like ByteDance and others in Asia would lack a vital tool for AI innovation. The authors have since become respected voices in the AI community, pushing the envelope of what's possible with open-source AI.

ByteDance's LLM-PlusOpenSourceRL Toolkit

Knowledge Prerequisites

git blame for knowledge

To fully understand DAPO: An Open-Source LLM Reinforcement Learning System at Scale, trace this dependency chain first. Papers in our library are linked — click to read them.

DIRECT PREREQIN LIBRARY
Proximal Policy Optimization Algorithms

This paper lays the foundational reinforcement learning algorithms that are vital for understanding reinforcement learning as applied to large language models in DAPO.

policy optimizationreinforcement learningPPO algorithms
DIRECT PREREQIN LIBRARY
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Understanding this paper is crucial for learning how reinforcement learning techniques are applied specifically to improve reasoning in large language models.

reasoning capabilityreinforcement learninglarge language models
DIRECT PREREQIN LIBRARY
Training language models to follow instructions with human feedback

This paper introduces the concept of training models to follow human instructions, an essential aspect in enhancing the performance of reinforcement learning in DAPO.

instruction followinglanguage model traininghuman feedback
DIRECT PREREQIN LIBRARY
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

It provides insights into modeling language beyond standard LLM training by emphasizing direct preference signals, similar to what DAPO may utilize for reinforcement learning tasks.

preference optimizationreward modellanguage model objectives
DIRECT PREREQ

Reinforcement Learning for Language Models

This is a broad concept that encapsulates the integration of reinforcement learning into language models, a core theme in the DAPO system.

reinforcement learninglanguage modelsintegration techniques

YOU ARE HERE

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

By the Numbers

50

AIME 2024 benchmark score

5 points

improvement over DeepSeek-R1-Zero-32B

Qwen2.5-32B

model used for testing

In Plain English

DAPO enhances LLM training with four algorithmic innovations, setting a new benchmark by scoring 50 on AIME 2024. It beats DeepSeek-R1-Zero-32B by 5 points using Qwen2.5-32B and provides a comprehensive open-source framework.

Explained Through an Analogy

Imagine training an AI like tuning a race car engine. DAPO is the pit crew that tweaks each part for max speed without accidental burnout, ensuring every move on the track counts without excessive laps.

Go deeper for $6/mo

Everything a PM needs to turn this paper into a competitive edge — in under 10 minutes.

  • 2-page deep-dive article
  • Highlighted key passages
  • Expert-mode reading layer
  • PM Action Plan — 3 moves
  • Use cases for your product
  • Meeting talking points
  • Interactive paper simulator
  • Test Your Edge quiz

Already subscribed?

Log in

How grounded is this content?

Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.

Source Richness100%

8 of 8 content fields populated. More fields = better-grounded generation.

Source Depth~246 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding3 / 3

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.