[Reasoning] · PAP-XUFI8B · 2025 · March 17, 2026

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

2025

ByteDance Seed, Qiying Yu, Zheng Zhang et al.

4 min read · Reasoning · Training · Open Source

Core Insight

DAPO is a fully open-sourced reinforcement learning system for LLMs that combines four algorithmic techniques with a released training recipe and data to make large-scale RL training of reasoning models stable and efficient.

By the Numbers

50

AIME 2024 benchmark score

5 points

improvement over DeepSeek-R1-Zero-32B

Qwen2.5-32B

model used for testing

In Plain English

DAPO enhances LLM training with four algorithmic innovations, setting a new benchmark by scoring 50 on AIME 2024. It beats DeepSeek-R1-Zero-32B by 5 points using Qwen2.5-32B and provides a comprehensive open-source framework.

Knowledge Prerequisites

git blame for knowledge

To fully understand DAPO: An Open-Source LLM Reinforcement Learning System at Scale, trace this dependency chain first. Papers in our library are linked — click to read them.

DIRECT PREREQ · IN LIBRARY
Proximal Policy Optimization Algorithms

This paper introduces the foundational policy-optimization algorithm (PPO) that underpins how reinforcement learning is applied to large language models in DAPO.

policy optimization · reinforcement learning · PPO algorithms
DIRECT PREREQ · IN LIBRARY
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Understanding this paper is crucial for learning how reinforcement learning techniques are applied specifically to improve reasoning in large language models.

reasoning capability · reinforcement learning · large language models
DIRECT PREREQ · IN LIBRARY
Training language models to follow instructions with human feedback

This paper introduces the concept of training models to follow human instructions, an essential aspect in enhancing the performance of reinforcement learning in DAPO.

instruction following · language model training · human feedback
DIRECT PREREQ · IN LIBRARY
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

It shows how direct preference signals can optimize a language model without a separate reward model or RL loop, useful background for the reward-driven reinforcement learning that DAPO builds on.

preference optimization · reward model · language model objectives
DIRECT PREREQ

Reinforcement Learning for Language Models

This is a broad concept that encapsulates the integration of reinforcement learning into language models, a core theme in the DAPO system.

reinforcement learning · language models · integration techniques

YOU ARE HERE

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

The Idea Graph

16 nodes · 19 edges
1,031 words · 6 min read · 12 sections · 16 concepts

Table of Contents

01

The World Before: State-of-the-Art Challenges in LLM Training

112 words

Before the introduction of DAPO, the field of large language models (LLMs) was grappling with several significant challenges. Despite the impressive capabilities of models like GPT-3 and BERT, there were persistent issues related to training stability, efficiency, and output verbosity. Reinforcement learning, a powerful tool for training models, often led to unstable training processes, particularly when applied to LLMs. This instability was largely due to the inherent complexity of language tasks and the variability in sequence lengths encountered during training. Furthermore, zero-reward training noise, where many samples yield no learning signal, was a common problem, introducing inefficiencies that hampered the learning process. These challenges limited the practical applications of LLMs, especially in scenarios requiring precise and reliable reasoning capabilities.

02

The Specific Failure: Zero-Reward Training Noise and Variable-Length Sequences

96 words

A key technical problem that motivated the development of DAPO was the presence of zero-reward training noise. In reinforcement learning, particularly with LLMs, many training samples provide no reward signal, leading to wasted computational resources and inefficient learning. This noise complicates the training process, as the model struggles to distinguish between informative and non-informative samples. Additionally, the variability in sequence lengths posed another challenge. Longer sequences could disproportionately influence the learning process, causing instability in the model's outputs. Addressing these specific failures was crucial for enhancing the performance and applicability of LLMs in complex reasoning tasks.
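
To make the inefficiency concrete, here is a minimal, hypothetical sketch (not DAPO's actual code) of a group-normalized advantage in the style of GRPO-like training: when every response to a prompt receives the same reward, the advantages are all zero and the prompt contributes no gradient signal.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: each response's reward minus the
    group mean, divided by the group standard deviation (epsilon is
    an illustrative numerical-safety constant)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_advantages([0, 0, 0, 0]))  # all wrong -> [0. 0. 0. 0.], no signal
print(group_advantages([1, 1, 1, 1]))  # all right -> [0. 0. 0. 0.], no signal
print(group_advantages([1, 0, 0, 1]))  # mixed     -> approx [ 1. -1. -1.  1.], useful signal
```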

03

The Key Insight: Reinforcement Learning Enhancements for LLMs

98 words

The core insight behind DAPO was the realization that by enhancing reinforcement learning strategies, the major challenges in LLM training could be addressed effectively. Imagine if you could fine-tune a musical instrument to produce clearer notes; similarly, by refining the components of reinforcement learning, the learning process for LLMs could be optimized. This led to the development of four unique strategies: decoupled clipping, dynamic sampling, token-level loss stabilization, and nuanced reward shaping. Each of these strategies was designed to tackle specific issues, such as training noise and sequence variability, ultimately leading to more stable and efficient model training.

04

Architecture Overview: Decoupled Clipping for Entropy Management

101 words

Decoupled clipping is one of the key components of DAPO's architectural enhancements. In reinforcement learning, maintaining the policy's entropy is critical for balancing exploration and exploitation, two fundamental processes in learning. Traditionally, PPO-style training clips the policy update symmetrically around the old policy to prevent destabilizing steps, but a single tight clipping range also caps how much low-probability tokens can grow, which can drive entropy down and stall exploration. Decoupled clipping separates the lower and upper clipping bounds, allowing for more precise management of the model's exploration behavior. This separation ensures that the model maintains a healthy level of exploration without compromising stability, akin to adjusting the throttle and steering of a vehicle independently for better control.
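
As a rough illustration, here is a minimal PPO-style clipped surrogate with decoupled lower and upper clip bounds. The function name, tensor shapes, and epsilon values are assumptions for the sketch, not the paper's exact implementation.

```python
import torch

def decoupled_clip_surrogate(logp_new, logp_old, advantages,
                             eps_low=0.2, eps_high=0.28):
    """Per-token clipped policy-gradient surrogate (to be maximized).

    Standard PPO uses one epsilon on both sides of the clip range;
    decoupling them lets the upper bound be looser, so low-probability
    tokens with positive advantage have more room to increase, which
    helps keep exploration (entropy) from collapsing.
    """
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return torch.minimum(unclipped, clipped)               # same shape as the inputs
```

A trainer would average these per-token values (see the token-level loss sketch below) and minimize the negative of that average.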

05

Deep Dive: Dynamic Sampling to Eliminate Zero-Reward Training Noise

86 words

To address the challenge of zero-reward training noise, DAPO introduces dynamic sampling. This method adjusts the sampling strategy based on the reward distribution, ensuring that the model focuses on informative samples. Imagine a student who only studies the most relevant materials for an exam; dynamic sampling allows the model to 'study' the most valuable data points, enhancing learning efficiency. By filtering out uninformative samples, the model avoids the noise associated with zero-reward samples, leading to more effective use of computational resources and faster convergence during training.
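
A hedged sketch of the idea, assuming binary (0/1) correctness rewards and a group of sampled responses per prompt; `policy.generate` and `reward_fn` are hypothetical placeholders, not DAPO's API.

```python
def build_filtered_batch(prompts, policy, reward_fn,
                         group_size=8, target_batch=32):
    """Draw from a prompt pool larger than the target batch and keep
    only prompts whose sampled responses have mixed rewards (neither
    all-correct nor all-wrong), so every kept group carries a usable
    learning signal."""
    kept = []
    for prompt in prompts:
        responses = [policy.generate(prompt) for _ in range(group_size)]
        rewards = [reward_fn(prompt, r) for r in responses]
        mean_reward = sum(rewards) / group_size
        if 0.0 < mean_reward < 1.0:          # mixed outcomes only
            kept.append((prompt, responses, rewards))
        if len(kept) == target_batch:        # stop once the batch is full
            break
    return kept
```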

06

Deep Dive: Token-Level Loss Stabilization for Variable-Length Sequences

86 words

Token-level loss stabilization is another critical component of DAPO's method. In tasks involving variable-length sequences, how the loss is averaged is crucial for stable training. Traditional approaches average the loss per response first, so every response counts equally and each token in a long response contributes only a tiny share, which can let long, low-quality generations go under-penalized and skew the model's outputs. Token-level loss stabilization instead normalizes the loss across all tokens, similar to ensuring every player on a team has an equal say in the game strategy. This balance keeps the learning signal consistent across short and long responses, maintaining stability and fairness in the learning process.
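
The contrast can be shown with a small sketch (illustrative tensors, not the paper's code): sample-level averaging gives each response equal weight regardless of length, while token-level averaging weights every token equally.

```python
import torch

def sample_level_loss(token_losses, mask):
    """Average within each sequence first, then across sequences:
    every response counts equally, so each token of a long response
    contributes only a small fraction of the total."""
    per_seq = (token_losses * mask).sum(dim=1) / mask.sum(dim=1)
    return per_seq.mean()

def token_level_loss(token_losses, mask):
    """Average over all valid tokens in the batch: every token counts
    equally, regardless of which response it came from."""
    return (token_losses * mask).sum() / mask.sum()

# Two responses: one 2 tokens long, one 8 tokens long (padded to length 8).
losses = torch.tensor([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                       [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]])
mask = torch.tensor([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                     [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]])
print(sample_level_loss(losses, mask))  # 0.55: the short response's tokens dominate
print(token_level_loss(losses, mask))   # 0.28: each token weighted equally
```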

07

Deep Dive: Nuanced Reward Shaping to Curb Verbosity

74 words

Nuanced reward shaping addresses the issue of excessively wordy outputs in LLMs. By refining the reward signal, this strategy encourages the model to generate concise and relevant responses. Think of a teacher who rewards students not just for lengthy essays, but for clear and precise answers. By penalizing verbosity, nuanced reward shaping helps the model focus on delivering quality over quantity, which is particularly important for applications like virtual assistants that require efficient communication.
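
One way to picture this is a soft length penalty layered on top of the task reward. The thresholds below are illustrative assumptions, and the function is a sketch of the idea rather than DAPO's implementation.

```python
def length_shaped_reward(task_reward, response_length,
                         max_len=16384, soft_buffer=4096):
    """Leave responses shorter than (max_len - soft_buffer) untouched,
    ramp a penalty from 0 to 1 inside the buffer, and apply the full
    penalty to anything longer than max_len."""
    soft_start = max_len - soft_buffer
    if response_length <= soft_start:
        penalty = 0.0
    elif response_length <= max_len:
        penalty = (response_length - soft_start) / soft_buffer
    else:
        penalty = 1.0
    return task_reward - penalty
```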

08

Training & Data: Achieving Stability and Efficiency

79 words

The training process for DAPO involves implementing its four unique strategies to enhance stability and efficiency. By applying decoupled clipping, dynamic sampling, token-level loss stabilization, and nuanced reward shaping, the model achieves a stable and efficient training regimen. This stability is crucial, as it ensures that the model's performance is consistent and reliable across different tasks. The improvements in reasoning efficiency mean that the model can produce more accurate and concise outputs, essential for practical applications in AI-driven products.

09

Key Results: Benchmark Achievements and Model Comparisons

78 words

DAPO's advancements are evidenced by its performance on the AIME 2024 benchmark, where it achieved a score of 50. This score represents a significant improvement over the DeepSeek-R1-Zero-32B model, which scored 5 points lower. Such achievements demonstrate the effectiveness of DAPO's reinforcement learning strategies. The use of the Qwen2.5-32B model for testing further validates the improvements, as it shows that the enhancements are not solely dependent on model architecture but are a result of the novel training methods.

10

What This Changed: Open-Source Contributions and Future Potential

69 words

One of DAPO's most impactful contributions is its comprehensive open-source release, which includes the complete training recipe and data. This level of transparency allows other researchers to replicate and build upon DAPO's work, fostering further advancements in LLM technology. The open-source nature of DAPO not only democratizes access to cutting-edge methods but also encourages collaborative improvements, potentially leading to even more efficient and capable language models in the future.

11

Limitations & Open Questions: Areas for Further Exploration

80 words

While DAPO represents a significant advancement in LLM training, there are still limitations and open questions that require further exploration. For instance, the scalability of its methods to even larger models or more diverse datasets remains an area of interest. Additionally, the long-term effects of nuanced reward shaping on model behavior in varied contexts are yet to be fully understood. These open questions present opportunities for future research, as exploring these areas could lead to further breakthroughs in the field.

12

Why You Should Care: Implications for AI-Driven Products

72 words

The advancements made by DAPO have significant implications for AI-driven products. By improving the reasoning efficiency and stability of LLMs, products like virtual assistants and AI-based recommendation engines can enhance their interpretative and user interaction capabilities. This means more reliable and efficient AI applications, leading to faster adoption and integration into mainstream technology. For product managers and developers, understanding and leveraging DAPO's methods could result in more competitive and innovative AI solutions.

Experience It

Live Experiment

DAPO Reinforcement Learning

See DAPO in Action: Enhanced LLM Training

In this simulator, you'll see how DAPO's reinforcement learning techniques improve reasoning in language models. Compare responses to understand the impact of enhanced training methods.

Notice how DAPO's techniques lead to more concise and efficient reasoning, with reduced verbosity and improved clarity in the model's responses.


How grounded is this content?

Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.

Source Richness: 100%

8 of 8 content fields populated. More fields = better-grounded generation.

Source Depth: ~246 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding: 3 / 3

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability: 3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.