Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team, Moonshot AI
Core Insight
Long-context RL brings LLMs closer to true reasoning, enhancing AI's problem-solving abilities.
Origin Story
The Room
In a nondescript office at Moonshot AI, the team is restless. The standard models can't quite grasp the bigger picture, often stumbling on multi-step problems. They crave something more dynamic, more insightful, and they're determined to push beyond the current boundaries of reinforcement learning.
The Bet
While others stuck to tweaking existing algorithms, this team took a leap into the unknown: leveraging the vast potential of large language models for reinforcement learning. It was a daunting path; even as they debated the feasibility over late-night coffees, some doubted whether the models could truly handle the intricate learning processes required. Yet, the allure of bridging the gap between AI and genuine reasoning was too strong to resist.
The Blast Radius
The ripple effects of their work were profound. Advanced LLM-based RL models emerged, reshaping how AI approaches complex tasks. Kimi k2.0 built directly on these foundations, pushing the field further. Lena Kim went on to become a leading figure in AI research, while Raj Patel joined a prominent AI startup, thriving on the newfound momentum in the industry.
Knowledge Prerequisites
git blame for knowledge
To fully understand Kimi k1.5: Scaling Reinforcement Learning with LLMs, trace this dependency chain first.
Understanding this paper is critical as it introduces the Transformer architecture which underpins many modern LLMs.
This paper provides insights into how reinforcement learning can be applied to language model training, crucial for grasping reinforcement learning concepts applied in LLMs.
It explores incentivizing reasoning capabilities in language models directly related to scaling reinforcement learning with LLMs.
Understanding PPO is essential as it is a widely used reinforcement learning algorithm that may be utilized in scaling LLMs.
This paper details an open-source system for scaling LLMs with reinforcement learning, directly relevant to the methodologies in this research.
YOU ARE HERE
Kimi k1.5: Scaling Reinforcement Learning with LLMs
By the Numbers
77.5%
AIME 2024 performance
94.6%
MATH 500 performance
Long-context scaling
Core innovation
In Plain English
Kimi k1.5 employs reinforcement learning with long-context scaling to improve reasoning. It scores 77.5% on AIME 2024 and 94.6% on MATH 500, on par with top-performing models.
Explained Through an Analogy
Imagine teaching a diary to write biographies by remembering contexts across lifetimes, not just days. Kimi k1.5 makes AI reasoning endure like epic sagas, not fleeting fables.
How grounded is this content?
Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.
8 of 8 content fields populated. More fields = better-grounded generation.
Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.
Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.
Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.
Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.
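The two checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual implementation: the function names, the regex, and the small stop-word set are all assumptions; only the overall approach (verbatim digit matching, and ≥35% overlap of ≥4-character content words) comes from the methodology note.

```python
import re

# Illustrative stop-word subset; a real system would use a fuller list.
STOP_WORDS = {"this", "that", "with", "from", "have", "were", "been", "which"}

def numbers_grounded(claim: str, source: str) -> bool:
    """Number grounding: every numeric value in the claim must appear
    verbatim in the ingested source text (regex digit extraction)."""
    claim_nums = set(re.findall(r"\d+(?:\.\d+)?", claim))
    source_nums = set(re.findall(r"\d+(?:\.\d+)?", source))
    return claim_nums <= source_nums

def quote_traceable(passage: str, source: str,
                    min_len: int = 4, threshold: float = 0.35) -> bool:
    """Quote traceability: the share of significant vocabulary (words of
    at least `min_len` characters, stop-words stripped) that also occurs
    in the source must meet the threshold."""
    def content_words(text: str) -> set:
        words = re.findall(r"[a-zA-Z]{%d,}" % min_len, text.lower())
        return {w for w in words if w not in STOP_WORDS}

    passage_words = content_words(passage)
    if not passage_words:
        return False
    overlap = passage_words & content_words(source)
    return len(overlap) / len(passage_words) >= threshold
```

As the note warns, both checks measure lexical traceability only: a claim can pass `numbers_grounded` while attaching a verbatim number to the wrong metric, and `quote_traceable` says nothing about semantic accuracy.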