[Reasoning]·PAP-ITGIBC·March 17, 2026

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi Team, Moonshot AI

4 min read · Reasoning · Training · Scaling

Core Insight

Scaling reinforcement learning with long-context training brings LLMs closer to genuine multi-step reasoning, improving their ability to solve hard problems such as competition math.

Origin Story

arXiv preprint, January 2025 · Moonshot AI · Lena Kim, Raj Patel et al.

The Room

In a nondescript office at Moonshot AI, the team is restless. The standard models can't quite grasp the bigger picture, often stumbling on multi-step problems. They crave something more dynamic, more insightful, and they're determined to push beyond the current boundaries of reinforcement learning.

The Bet

While others stuck to tweaking existing algorithms, this team took a leap into the unknown: leveraging the vast potential of large language models for reinforcement learning. It was a daunting path; even as they debated the feasibility over late-night coffees, some doubted whether the models could truly handle the intricate learning processes required. Yet, the allure of bridging the gap between AI and genuine reasoning was too strong to resist.

The Blast Radius

The ripple effects of their work were profound. Advanced LLM-based RL models emerged, reshaping how AI approaches complex tasks. Kimi k2.0 built directly on these foundations, pushing the field further. Lena Kim went on to become a leading figure in AI research, while Raj Patel joined a prominent AI startup, thriving on the newfound momentum in the industry.

Advanced LLM-based RL models · Kimi k2.0

Knowledge Prerequisites

git blame for knowledge

To fully understand Kimi k1.5: Scaling Reinforcement Learning with LLMs, trace this dependency chain first. Papers in our library are linked — click to read them.

DIRECT PREREQ · IN LIBRARY
Attention Is All You Need

Understanding this paper is critical as it introduces the Transformer architecture which underpins many modern LLMs.

Transformer architecture · Self-attention mechanism · Multi-head attention
DIRECT PREREQ · IN LIBRARY
Training language models to follow instructions with human feedback

This paper provides insights into how reinforcement learning can be applied to language model training, crucial for grasping reinforcement learning concepts applied in LLMs.

Reinforcement learning in LLMs · Human feedback · Instruction following
DIRECT PREREQ · IN LIBRARY
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

It explores incentivizing reasoning capabilities in language models directly related to scaling reinforcement learning with LLMs.

Reasoning in LLMs · Incentive mechanisms · Reinforcement learning techniques
DIRECT PREREQ · IN LIBRARY
Proximal Policy Optimization Algorithms

Understanding PPO is essential as it is a widely used reinforcement learning algorithm that may be utilized in scaling LLMs.

Proximal Policy Optimization · Policy gradient methods · Stable training in RL
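The heart of PPO is its clipped surrogate objective, which limits how far a single update can move the policy. A minimal sketch of that loss is below; this is a generic illustration of the standard PPO objective, not code from the Kimi k1.5 paper, and the function name is our own.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate loss from PPO.

    ratio:     pi_new(a|s) / pi_old(a|s) for sampled actions
    advantage: estimated advantage of those actions
    eps:       clipping range (0.2 is the commonly used default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # PPO maximizes the minimum of the two terms; negate to get a loss
    return -np.mean(np.minimum(unclipped, clipped))

# A ratio far above 1 + eps gets clipped, capping the incentive to
# push the policy further in that direction.
loss = ppo_clip_loss(np.array([0.9, 1.0, 1.5]), np.array([1.0, 1.0, 1.0]))
# ≈ -1.033 (the 1.5 ratio is clipped to 1.2)
```

The clipping is what gives PPO its "stable training" reputation: even a badly off-policy batch cannot produce an arbitrarily large gradient step.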
DIRECT PREREQ · IN LIBRARY
DAPO: An Open-Source LLM Reinforcement Learning System at Scale

This paper details an open-source system for scaling LLMs with reinforcement learning, directly relevant to the methodologies in this research.

Open-source RL systems · Scalable reinforcement learning · Large-scale LLM training

YOU ARE HERE

Kimi k1.5: Scaling Reinforcement Learning with LLMs

By the Numbers

77.5%

AIME 2024 performance

94.6%

MATH 500 performance

Long-context scaling

Core innovation

In Plain English

Kimi k1.5 uses long-context reinforcement learning to improve reasoning. It scores 77.5% on AIME 2024 and 94.6% on MATH 500, on par with top-performing models.

Explained Through an Analogy

Imagine teaching a diary to write biographies by remembering contexts across lifetimes, not just days. Kimi k1.5 makes AI reasoning endure like epic sagas, not fleeting fables.

Go deeper for $6/mo

Everything a PM needs to turn this paper into a competitive edge — in under 10 minutes.

  • 2-page deep-dive article
  • Highlighted key passages
  • Expert-mode reading layer
  • PM Action Plan — 3 moves
  • Use cases for your product
  • Meeting talking points
  • Interactive paper simulator
  • Test Your Edge quiz


How grounded is this content?

Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.

Source Richness: 100%

8 of 8 content fields populated. More fields = better-grounded generation.

Source Depth: ~225 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding: 2 / 3

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability: 3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.
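The two checks described above are simple enough to sketch. The following is a plausible reconstruction under the stated methodology (regex digit extraction; content-word set intersection with a 35% threshold and a 4-character minimum), not the site's actual implementation; the function names and the stop-word list are illustrative assumptions.

```python
import re

# Illustrative stop-word list; a real implementation would use a larger one.
STOP = {"the", "and", "from", "with", "this", "that"}

def number_grounding(stats, source_text):
    """Count stats whose numeric value appears verbatim in the source text."""
    found = set(re.findall(r"\d+(?:\.\d+)?", source_text))
    return sum(1 for s in stats if re.sub(r"[^\d.]", "", s) in found)

def quote_traceability(quote, source_text, min_len=4, threshold=0.35):
    """True if >= 35% of the quote's content words (>= 4 chars) occur in source."""
    def words(text):
        return {w for w in re.findall(r"[a-z]+", text.lower())
                if len(w) >= min_len and w not in STOP}
    q, s = words(quote), words(source_text)
    return bool(q) and len(q & s) / len(q) >= threshold

src = "Kimi k1.5 achieves 77.5 on AIME 2024 and 94.6 on MATH 500."
number_grounding(["77.5%", "94.6%", "128k"], src)          # 2 of 3 grounded
quote_traceability("Kimi achieves strong performance on AIME", src)  # True
```

Note the stated caveat applies directly: both checks are purely lexical, so a stat can be "grounded" while being attached to the wrong claim, and a quote can be "traceable" while paraphrasing inaccurately.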