
Kimi k1.5: Scaling Reinforcement Learning with LLMs

2025

Kimi Team, Moonshot AI

4 min read · Reasoning · Training · Scaling

Core Insight

Long-context RL brings LLMs closer to true reasoning, enhancing AI's problem-solving abilities.

By the Numbers

77.5%

AIME 2024 performance

94.6%

MATH 500 performance

Long-context scaling

Core innovation

In Plain English

Kimi k1.5 employs reinforcement learning with long-context training to improve reasoning. It scores 77.5% on AIME 2024 and 94.6% on MATH 500, matching top-performing models.

Knowledge Prerequisites

git blame for knowledge

To fully understand Kimi k1.5: Scaling Reinforcement Learning with LLMs, trace this dependency chain first. Papers in our library are linked — click to read them.

DIRECT PREREQ · IN LIBRARY
Attention Is All You Need

Understanding this paper is critical as it introduces the Transformer architecture which underpins many modern LLMs.

Transformer architecture · Self-attention mechanism · Multi-head attention
DIRECT PREREQ · IN LIBRARY
Training language models to follow instructions with human feedback

This paper shows how reinforcement learning from human feedback is applied to language model training, essential background for the RL techniques used here.

Reinforcement learning in LLMs · Human feedback · Instruction following
DIRECT PREREQ · IN LIBRARY
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

It explores incentivizing reasoning capabilities in language models via reinforcement learning, making it directly relevant to scaling RL with LLMs.

Reasoning in LLMs · Incentive mechanisms · Reinforcement learning techniques
DIRECT PREREQ · IN LIBRARY
Proximal Policy Optimization Algorithms

Understanding PPO is essential as it is a widely used reinforcement learning algorithm that may be utilized in scaling LLMs.

Proximal Policy Optimization · Policy gradient methods · Stable training in RL
DIRECT PREREQ · IN LIBRARY
DAPO: An Open-Source LLM Reinforcement Learning System at Scale

This paper details an open-source system for scaling LLMs with reinforcement learning, directly relevant to the methodologies in this research.

Open-source RL systems · Scalable reinforcement learning · Large-scale LLM training

YOU ARE HERE

Kimi k1.5: Scaling Reinforcement Learning with LLMs

The Idea Graph

15 nodes · 20 edges
1,405 words · 8 min read · 13 sections · 15 concepts

Table of Contents

01

The World Before

121 words

Before the introduction of Kimi k1.5, AI models primarily handled short sequences of data, which limited their reasoning abilities. Large language models (LLMs) were already capable of generating human-like text, but their capacity for understanding and applying complex reasoning was constrained by the context length they could manage. Researchers were exploring different methods to extend these capabilities, but the breakthroughs in reasoning were not substantial. Reinforcement learning (RL) had been applied in various contexts, providing a framework for models to learn from feedback, yet these applications were mostly limited to shorter contexts. This state of the art was unsatisfying because, while models showed potential in language comprehension, they struggled with tasks requiring deeper reasoning and extended context consideration.

02

The Specific Failure

123 words

The specific failure motivating this work was the inability of existing models to perform well on tasks requiring long-term reasoning and context management. For instance, AI models often faltered in standardized evaluations like AIME 2024 and MATH 500, which demand complex problem-solving skills and extended reasoning chains. The failure was not just a matter of minor inaccuracies but a fundamental limitation in handling and making sense of longer sequences of information. Previous attempts to address this involved tweaking existing architectures or increasing model sizes, but these approaches didn't yield the desired improvements in reasoning capabilities. Thus, there was a clear need for a new approach that could overcome these limitations and significantly enhance the model's ability to process and learn from extended contexts.

03

The Key Insight

138 words

The key insight that led to the development of Kimi k1.5 was the realization that focusing on Long-Context Reinforcement Learning could unlock new levels of reasoning capabilities. Imagine trying to solve a complex puzzle; a model that can only see a few pieces at a time will struggle to find the solution, but one that can view the entire puzzle can methodically work towards solving it. In this analogy, the puzzle represents the reasoning task, and the pieces are the bits of information the model must integrate. By allowing the model to consider more pieces simultaneously—essentially extending the context it can handle—researchers saw a pathway to significantly enhancing reasoning. This insight was pivotal in rethinking how models could be trained to incorporate more extensive feedback loops and longer sequences of data, moving beyond the limitations of traditional approaches.

04

Architecture Overview

122 words

Kimi k1.5 is a sophisticated system that integrates reinforcement learning with an emphasis on long-context handling to enhance reasoning capabilities. At its core, it combines the power of large language models with a novel approach to scaling RL training, allowing the model to process and learn from extensive data sequences. The architecture is designed to integrate feedback from both positive and negative outcomes, refining the model's understanding and decision-making over time. The system is structured to prioritize context length, enabling it to explore and evaluate extended reasoning chains effectively. This overview provides a big picture of how the components fit together: long-context RL builds on traditional reinforcement learning, leveraging the strengths of LLMs to achieve competitive performance on complex reasoning tasks.

05

Deep Dive into Long-Context RL

115 words

Long-context reinforcement learning is the cornerstone of Kimi k1.5's architecture, enabling the model to handle extended sequences of information. By focusing on longer context lengths, the model can analyze and integrate more data points simultaneously, akin to having a wider lens through which to view a complex problem. This capacity for extended context is achieved by modifying traditional reinforcement learning techniques to accommodate longer sequences and more intricate feedback loops. The approach emphasizes the importance of context length, revealing new pathways for developing reasoning-capable AI systems. By allowing the model to explore and evaluate longer reasoning chains, researchers have opened the door to significant advancements in AI's problem-solving capabilities, moving beyond the constraints of previous models.
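To make the interaction between a context budget and outcome-based rewards concrete, here is a minimal, self-contained Python sketch. The function name `rollout_reward`, the token-list representation, and the `max_context` budget are illustrative assumptions, not the paper's implementation.

```python
def rollout_reward(reasoning_tokens, answer_correct, max_context=8192):
    """Score one sampled reasoning chain (rollout).

    A chain that exceeds the context budget is unusable, so it earns
    no reward; otherwise the reward is driven by answer correctness.
    """
    if len(reasoning_tokens) > max_context:
        return 0.0  # chain truncated by the context window -> no usable answer
    return 1.0 if answer_correct else 0.0

# A longer chain is fine as long as it fits within the budget.
short_chain = rollout_reward(["step"] * 100, answer_correct=True)    # 1.0
long_chain = rollout_reward(["step"] * 5000, answer_correct=True)    # 1.0
too_long = rollout_reward(["step"] * 10000, answer_correct=True)     # 0.0
```

The point of the sketch: raising `max_context` is what lets longer (often more thorough) reasoning chains earn reward at all, which is why context-length scaling matters for RL on reasoning tasks.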

06

Deep Dive into Feedback Incorporation

111 words

Feedback incorporation is a critical component of Kimi k1.5, allowing the model to refine its reasoning processes by learning from both successes and failures. In reinforcement learning, feedback is typically provided in the form of rewards or penalties based on the model's actions. Kimi k1.5 extends this concept by incorporating feedback across longer contexts, enabling the model to adjust its strategies based on a broader range of outcomes. This approach is akin to a student learning not only from correct answers but also from understanding why incorrect answers were wrong. By integrating feedback from extended reasoning chains, the model can enhance its decision-making processes, leading to improved performance on complex tasks.
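One common way to turn both successes and failures into learning signal is to centre rewards around a group baseline, so below-average rollouts yield negative feedback and above-average ones yield positive feedback. This is a generic sketch of that idea, not Kimi k1.5's actual update rule.

```python
def advantages(rewards):
    """Centre a group of rollout rewards on their mean.

    Rollouts that beat the group average get positive advantage
    (reinforced); those below it get negative advantage (discouraged).
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four rollouts on the same problem: two correct, two incorrect.
adv = advantages([1.0, 0.0, 0.0, 1.0])  # -> [0.5, -0.5, -0.5, 0.5]
```

With this shaping, an all-correct or all-incorrect group produces zero advantage everywhere, so the learning signal comes precisely from contrasting good and bad attempts.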

07

Deep Dive into Scaling Reinforcement Learning

99 words

Scaling reinforcement learning is essential for handling the increased complexity and data volume associated with long-context tasks. This involves expanding the model's capacity to process larger datasets and more intricate feedback loops. By scaling up RL techniques, Kimi k1.5 can manage the demands of extended context lengths, enabling it to tackle more complex reasoning tasks effectively. The process involves not only increasing the model's computational capacity but also optimizing the algorithms to handle the additional complexity efficiently. This scaling is crucial for realizing the full potential of long-context RL, providing the foundation for the model's enhanced reasoning capabilities.

08

Training & Data

101 words

Training Kimi k1.5 involves a sophisticated process designed to optimize its long-context reasoning capabilities. The model is trained on a diverse dataset that includes a wide range of reasoning tasks, ensuring it can generalize its learning to various real-world scenarios. The training process emphasizes feedback incorporation, allowing the model to learn from both positive and negative outcomes. This is achieved through a carefully designed reinforcement learning framework that prioritizes context length, enabling the model to explore extended reasoning chains. The objective function is optimized to balance accuracy and context length, ensuring the model can handle complex reasoning tasks without sacrificing performance.
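A toy way to see how an objective can trade accuracy off against context length is a reward that starts from correctness and subtracts a normalized length penalty, so shorter correct answers score highest. Everything here, including `penalty_weight` and the linear shaping, is a hypothetical illustration rather than the paper's formula.

```python
def shaped_reward(correct, length, min_len, max_len, penalty_weight=0.5):
    """Blend correctness with a length penalty.

    `length` is this rollout's token count; `min_len`/`max_len` are the
    shortest and longest rollouts sampled for the same problem. The
    penalty scales linearly from 0 (shortest) to `penalty_weight`
    (longest), nudging the model toward concise correct reasoning.
    """
    base = 1.0 if correct else 0.0
    if max_len == min_len:
        return base  # all rollouts equally long: no length signal
    frac = (length - min_len) / (max_len - min_len)  # 0 = shortest, 1 = longest
    return base - penalty_weight * frac

# Correct and shortest beats correct and longest.
best = shaped_reward(True, length=100, min_len=100, max_len=500)   # 1.0
verbose = shaped_reward(True, length=500, min_len=100, max_len=500)  # 0.5
```

The design choice to illustrate: correctness still dominates (a correct long answer outscores a wrong short one), but among correct answers the objective prefers the chain that spends its context budget efficiently.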

09

Key Results

93 words

Kimi k1.5's performance on key benchmarks demonstrates the effectiveness of its long-context RL approach. Achieving a 77.5% success rate on the AIME 2024 benchmark and a 94.6% accuracy on the MATH 500 benchmark, the model matches top-performing models in these areas. These results highlight the model's ability to handle complex reasoning tasks, showcasing its capacity to learn and apply extended reasoning chains. The performance on these benchmarks is a testament to the benefits of emphasizing context length and incorporating feedback across longer sequences, revealing the potential for significant advancements in AI's problem-solving capabilities.

10

Ablation Studies

82 words

Ablation studies conducted on Kimi k1.5 reveal the importance of various components in the model's architecture. By systematically removing elements such as feedback incorporation or scaling reinforcement learning, researchers assessed the impact on the model's performance. The studies indicate that both context length emphasis and feedback incorporation are crucial for achieving high performance on reasoning tasks. Without these components, the model's ability to handle extended reasoning chains and learn from diverse outcomes is significantly diminished, underscoring their importance in the overall architecture.

11

What This Changed

101 words

The introduction of Kimi k1.5 has brought about a notable shift in the development of reasoning-capable LLMs. By demonstrating the effectiveness of long-context RL, the model sets a new standard for handling complex reasoning tasks. This advancement has implications for a wide range of applications, from enhancing virtual assistants to improving educational platforms. The ability to process and learn from extended contexts opens new possibilities for AI-driven products, promising more sophisticated interactions and improved decision-making processes. As a result, Kimi k1.5 has paved the way for future innovations in AI, highlighting the potential for continued advancements in reasoning capabilities.

12

Limitations & Open Questions

101 words

Despite its advancements, Kimi k1.5 is not without limitations. One challenge is the increased computational resources required to handle long-context tasks, which may limit the model's accessibility and scalability. Additionally, while the model performs well on standardized evaluations, it may struggle with tasks that involve highly specialized reasoning or context-specific knowledge. Open questions remain regarding the model's ability to generalize its reasoning capabilities across diverse domains and the potential for further optimizing the trade-off between context length and performance. These limitations highlight the need for continued research and development to address these challenges and fully realize the potential of long-context RL.

13

Why You Should Care

98 words

For product managers and developers, the advancements introduced by Kimi k1.5 have significant implications for the future of AI-driven products. By enhancing reasoning capabilities through long-context RL, the model offers the potential to revolutionize virtual assistants, educational platforms, and dialogue systems. These improvements could lead to more engaging user experiences and more effective problem-solving tools, redefining the capabilities of AI in various industries. As AI continues to play an increasingly important role in decision-making and strategic analysis, the innovations brought by Kimi k1.5 emphasize the importance of investing in advanced reasoning capabilities to stay competitive and drive innovation.

Experience It

Live Experiment

Long-context RL

See Long-Context RL in Action

Compare AI responses to complex reasoning problems with and without long-context reinforcement learning. Notice how the technique improves problem-solving by considering extended reasoning chains.

Observe how Long-Context RL allows the AI to consider more variables and outcomes, leading to more comprehensive and effective solutions.


How grounded is this content?

Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.

Source Richness: 100%

8 of 8 content fields populated. More fields = better-grounded generation.

Source Depth: ~225 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding: 2 / 3

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability: 3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.