Weekly Research Digest
AI Papers for Product Managers
Seminal AI research, distilled into actionable product insights every week.
Go deeper for $6/mo
Full article, action plan, use cases, quiz — every paper, every week.
Personalize your feed — tell us what you work on and we'll rank papers by relevance.
Try before you subscribe
These papers are fully unlocked — deep dive, action plan, simulator, and quiz included. No account needed.
★DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
DeepSeek-R1 uses RL to supercharge reasoning in LLMs, rivaling OpenAI o1; its R1-Zero variant skips supervised fine-tuning entirely.
★LoRA: Low-Rank Adaptation of Large Language Models
Edward Hu, Yelong Shen, Phillip Wallis et al.
LoRA slashes trainable parameters by 10,000x and GPU memory by 3x while preserving quality on large language models.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao, Dian Yu, Jeffrey Zhao et al.
Tree of Thoughts enhances language models by enabling strategic, multi-path reasoning for complex problem solving.
★ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu et al.
ReAct fuses reasoning and acting in LLMs, enabling real-time interaction with external tools for superior results.
★Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans et al.
Chain-of-Thought Prompting elevates reasoning in LLMs, outperforming finetuned GPT-3 on complex math tasks.
★Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu et al.
Train AI with its own feedback to reduce the need for human labels and increase precision in behavior control.
★Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang et al.
InstructGPT outperforms GPT-3 using human feedback, showing size isn't everything in AI models.
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan et al.
Larger language models are more sample-efficient, achieving better results from less data at a fixed compute budget.
★Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar et al.
Transformers revolutionize AI by ditching recurrence and convolutions, shining with sheer parallelizable efficiency.
Reasoning
The science behind o1, o3 — the fastest-growing product line in AI.
OpenAI o3 System Card
OpenAI
o3 achieves human-level reasoning, setting new AI benchmarks and exceeding 99.8% of competitive programmers.
QwQ-32B: Embracing the Intelligence Era
Qwen Team, Alibaba Group
QwQ-32B matches 671B param models using RL, revolutionizing size-efficiency in AI reasoning.
Gemini 2.5 Pro Technical Report
Google DeepMind
Gemini 2.5 Pro tops major AI benchmarks with a novel thinking mode and unprecedented 1M token context.
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
ByteDance Seed, Qiying Yu, Zheng Zhang et al.
DAPO open-sources a complete large-scale RL training system for LLM reasoning, sharing the recipe behind state-of-the-art results end to end.
Claude 3.7 Sonnet: Extended Thinking
Anthropic
Claude 3.7 Sonnet redefines AI reasoning with extended thinking, outperforming the competition on complex tasks like coding.
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team, Moonshot AI
Long-context RL brings LLMs closer to true reasoning, enhancing AI's problem-solving abilities.
★DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
DeepSeek-R1 uses RL to supercharge reasoning in LLMs, rivaling OpenAI o1; its R1-Zero variant skips supervised fine-tuning entirely.
OpenAI o1: Learning to Reason with LLMs
OpenAI
OpenAI o1 redefines AI reasoning, matching PhD-level performance in science and programming challenges.
Scaling LLM Test-Time Compute Optimally
Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar
Smaller models can beat larger ones by optimizing test-time compute for problem difficulty.
Let's Verify Step by Step
Hunter Lightman, Vineet Kosaraju, Yura Burda et al.
Process supervision beats outcome supervision in AI reasoning accuracy—think 78.2% vs 72.4% success in math tasks.
Sparks of Artificial General Intelligence: Early Experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan et al.
GPT-4 edges closer to AGI, excelling in diverse tasks from law to vision.
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans et al.
Self-consistency in language models improves reasoning performance by over 17% on complex tasks.
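The core idea fits in a few lines: sample several reasoning paths, extract each path's final answer, and take a majority vote. This is an illustrative sketch with a hypothetical toy sampler standing in for an LLM sampled at nonzero temperature.

```python
from collections import Counter

# Self-consistency sketch: sample multiple reasoning paths, then
# majority-vote over their final answers instead of trusting one chain.
def self_consistent_answer(sample_path, n_samples=5):
    answers = [sample_path(i) for i in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for "sample a chain and parse its answer":
# three paths reach 11, two make an arithmetic slip.
def toy_sampler(seed):
    return [11, 11, 12, 11, 10][seed % 5]

print(self_consistent_answer(toy_sampler))  # majority answer: 11
```

The single most-sampled chain can still be wrong; voting across diverse chains is what delivers the paper's accuracy gains.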
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao, Dian Yu, Jeffrey Zhao et al.
Tree of Thoughts enhances language models by enabling strategic, multi-path reasoning for complex problem solving.
★Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans et al.
Chain-of-Thought Prompting elevates reasoning in LLMs, outperforming finetuned GPT-3 on complex math tasks.
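In practice, chain-of-thought is just prompt construction: the few-shot exemplar includes intermediate reasoning, nudging the model to show its work. A minimal sketch (the exemplar is adapted from the paper; the model call itself is out of scope):

```python
# One worked exemplar whose answer includes step-by-step reasoning.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    # Prepend the reasoning exemplar, then pose the new question.
    return f"{cot_exemplar}\nQ: {question}\nA:"

prompt = build_cot_prompt("A jug holds 4 liters. How many liters do 3 jugs hold?")
print(prompt)
```

Sent to a sufficiently large model, a prompt shaped like this elicits the same step-by-step style before the final answer.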
Multimodal
Vision, image generation, and omni-models — GPT-4o and Sora's foundations.
Llama 4: The Frontier of Multimodal Intelligence
Meta AI
Llama 4 sets new standards in open-source AI with powerful multimodal capabilities and unmatched context window.
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Google DeepMind
Gemini 1.5 Pro sets a new benchmark with near-perfect retrieval across millions of tokens.
GPT-4 Technical Report
OpenAI
GPT-4: Human-like performance on professional exams signals a new era of AI collaboration.
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann, Dominik Lorenz et al.
Latent space diffusion cuts AI image generation from 100s of GPU days to a fraction while retaining quality.
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu et al.
Whisper approaches human-level speech accuracy using vast weakly supervised audio data from the internet.
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc et al.
Flamingo redefines few-shot learning by outperforming extensively fine-tuned models with minimal task-specific data.
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol et al.
Hierarchical models boost image generation diversity without losing realism, even matching styles like a digital Picasso.
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy et al.
CLIP bridges vision and language, unlocking powerful image models without traditional labeled datasets.
Architecture
The bedrock — every AI PM is expected to speak fluently about these.
Phi-4 Technical Report
Marah Abdin, Jyoti Aneja, Harkirat Behl et al.
Phi-4 sets a new standard using synthetic data to match GPT-4o's STEM skills with fewer parameters.
DeepSeek-V3 Technical Report
DeepSeek-AI
DeepSeek-V3 matches GPT-4o with less compute; frontier AI on non-frontier budgets.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan et al.
Phi-3-mini puts a GPT-3.5 rival in your pocket, thanks to better data, not more parameters.
Mixtral of Experts
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux et al.
Mixtral 8x7B revolutionizes efficiency, beating Llama 2 70B while using only 12.9B parameters per token.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu, Tri Dao
Mamba models outpace Transformers with 5x throughput and linear scaling for long-sequence tasks.
Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias
Speculative decoding accelerates Transformer inference by 2-3x while guaranteeing output identical to the target model alone.
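A simplified greedy sketch of the idea (the paper's actual scheme uses rejection sampling over full distributions): a cheap draft model proposes k tokens, the target model verifies them, and we keep the longest agreeing prefix plus one corrected token.

```python
# Greedy speculative-decoding sketch. target_next/draft_next are
# hypothetical next-token functions; here they are toy lookups.
def speculative_step(target_next, draft_next, prefix, k=4):
    # 1. Draft model cheaply proposes k tokens.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)
    # 2. Target model verifies each proposal in order.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        t = target_next(ctx)
        if t == tok:
            accepted.append(tok)      # draft was right: keep it
            ctx.append(tok)
        else:
            accepted.append(t)        # target's correction ends the step
            break
    else:
        accepted.append(target_next(ctx))  # bonus token if all k accepted
    return accepted

# Toy models over a fixed string: the draft agrees most of the time.
target = lambda ctx: "speculative decoding works"[len(ctx)]
draft  = lambda ctx: "speculative decoding wurks"[len(ctx)]
print("".join(speculative_step(target, draft, [], k=4)))  # "specu"
```

When the draft is usually right, one target pass yields several tokens; when it's wrong, the target's own token is used, so quality never degrades.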
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, Noam Shazeer
Switch Transformers scale models to trillion parameters with efficient sparsity and faster pre-training.
Language Models are Few-Shot Learners
Tom Brown, Benjamin Mann, Nick Ryder et al.
GPT-3 scales up to 175 billion parameters, acing tasks with few examples and no fine-tuning.
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan et al.
Larger language models are more sample-efficient, achieving better results from less data at a fixed compute budget.
★BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
BERT revolutionizes NLP by learning context from both directions, improving accuracy across key benchmarks.
★Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar et al.
Transformers revolutionize AI by ditching recurrence and convolutions, shining with sheer parallelizable efficiency.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
FlashAttention speeds up Transformer training (15% end-to-end on BERT-large) and cuts attention memory from quadratic to linear, revolutionizing long-sequence efficiency.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus et al.
RAG models redefine NLP by combining retrieval and generation, achieving state-of-the-art boosts in open domain QA tasks.
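The retrieve-then-generate pattern is easy to sketch. This is an illustrative toy, not the paper's DPR-plus-BART stack: retrieval here is token overlap, and the generator is stubbed with a template where a real system would call a seq2seq model.

```python
# Minimal RAG pattern: retrieve the most relevant passage, then
# condition generation on it.
def retrieve(query, passages):
    # Toy retriever: rank passages by word overlap with the query.
    q = set(query.lower().split())
    return max(passages, key=lambda p: len(q & set(p.lower().split())))

def generate(query, passage):
    # Stand-in generator: a real system feeds both into an LLM.
    return f"Based on: '{passage}' -> answer to '{query}'"

corpus = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the tallest mountain.",
]
ctx = retrieve("where is the eiffel tower", corpus)
print(generate("where is the eiffel tower", ctx))
```

The product insight is the separation of concerns: knowledge lives in a swappable corpus, so answers stay current without retraining the model.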
Open Source
Open-weight models reshaping the competitive landscape and what ships in products.
Qwen2.5 Technical Report
Qwen Team, Alibaba Group
Qwen2.5-72B rivals GPT-4o, redefining open-source AI capabilities in STEM and multilingual tasks.
Gemma 2: Improving Open Language Models at a Practical Size
Google DeepMind
Gemma 2 matches bigger closed models in performance with smaller, efficient open architectures.
The Llama 3 Herd of Models
Meta AI
Llama 3 pushes boundaries with a massive 405B-parameter model supporting 128K token context.
Mistral 7B
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch et al.
Mistral 7B shatters barriers by outperforming larger models like Llama 2 13B with just 7 billion parameters.
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone et al.
Llama 2 outperforms open-source chat models, challenging its closed-source rivals in safety and dialogue optimization.
Alignment
Making models helpful, harmless, and honest — the core product differentiator.
GRPO: Group Relative Policy Optimization for Reasoning
DeepSeek-AI
GRPO replaces PPO's separate value model with group-relative baselines, roughly halving RL training resources; by 2025 it had become a standard approach for reasoning training.
★Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu et al.
Train AI with its own feedback to reduce the need for human labels and increase precision in behavior control.
★Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang et al.
InstructGPT outperforms GPT-3 using human feedback, showing size isn't everything in AI models.
Learning to Summarize with Human Feedback
Nisan Stiennon, Long Ouyang, Jeff Wu et al.
Reinforcement learning aligns AI summarization with human preferences, outperforming GPT-3.
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
PPO simplifies policy-gradient RL with a clipped objective, delivering strong performance with far less tuning; it went on to power RLHF pipelines across the industry.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell et al.
DPO aligns language models directly on preference data, matching RLHF quality without a separate reward model or an unstable RL training loop.
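The DPO loss for a single preference pair can be written in a few lines: raise the policy's margin on the chosen response relative to a frozen reference model, lower it on the rejected one. Inputs here are summed log-probabilities, and beta (an assumption for illustration, 0.1 is a common setting) controls deviation from the reference.

```python
import math

# DPO loss sketch for one (chosen, rejected) preference pair.
# pi_* are the policy's log-probs; ref_* are the reference model's.
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin): small when the policy prefers the chosen response.
    return -math.log(1 / (1 + math.exp(-margin)))

# Policy already leaning toward the chosen response: loss below log 2.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
# Indifferent policy: loss exactly log 2 (about 0.693).
print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
```

Because this is a plain supervised objective over preference pairs, training is as stable and simple as fine-tuning, which is the paper's whole pitch.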
Agents
The frontier — Operator, Deep Research, Codex agents.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez, John Yang, Alexander Wettig et al.
Current AI models barely scratch the surface in solving real-world software issues from GitHub.
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu, Gagan Bansal, Jieyu Zhang et al.
AutoGen enables multi-agent LLM applications through customizable, conversable agents that mix LLMs, tools, and human input.
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang et al.
Voyager sets a new standard in AI autonomy, unlocking Minecraft tech-tree milestones up to 15.3x faster than prior agents.
Generative Agents: Interactive Simulacra of Human Behavior
Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai et al.
Generative agents simulate life-like human behavior, making AI feel more authentic and engaging.
Reflexion: Language Agents with Verbal Reinforcement Learning
Noah Shinn, Federico Cassano, Ashwin Gopinath et al.
Reflexion enables language agents to learn from feedback without costly retraining, enhancing decision-making efficiency.
Competition-Level Code Generation with AlphaCode
Yujia Li, David Choi, Junyoung Chung et al.
AlphaCode ranks in top 54.3% of competitive programmers, showcasing AI's coding prowess.
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun et al.
Codex rewrites the future of code, solving 70.2% of problems with repeated sampling where GPT-3 solves essentially none.
AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang et al.
AgentBench shows LLMs like GPT-4 excel at acting autonomously, outpacing open-source rivals significantly.
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì et al.
Toolformer empowers language models to smartly use APIs, rivaling larger models’ performance with fewer resources.
★ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu et al.
ReAct fuses reasoning and acting in LLMs, enabling real-time interaction with external tools for superior results.
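The ReAct loop itself is simple: alternate free-form thoughts with tool-calling actions, folding each observation back into context. In this illustrative sketch the LLM is stubbed with a scripted policy and the tool names are hypothetical; a real agent would prompt a model for each thought/action pair.

```python
# ReAct-style agent loop: Thought -> Action -> Observation, repeated.
def react_loop(policy, tools, question, max_steps=5):
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            transcript.append(f"Answer: {arg}")
            return arg, transcript
        obs = tools[action](arg)  # act on the environment via a tool
        transcript.append(f"Action: {action}[{arg}]\nObservation: {obs}")
    return None, transcript

# Scripted stand-in for the model: look up a fact, then answer.
def scripted_policy(transcript):
    if not any(line.startswith("Action:") for line in transcript):
        return "I should look this up.", "lookup", "capital of France"
    return "The lookup answered it.", "finish", "Paris"

tools = {"lookup": lambda q: {"capital of France": "Paris"}[q]}
answer, _ = react_loop(scripted_policy, tools, "What is the capital of France?")
print(answer)  # Paris
```

The interleaving is the point: reasoning steps decide which tool to call, and grounded observations keep the reasoning honest.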
Scaling
How capabilities change as models grow, and what that means for planning products.
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani et al.
Larger language models develop unexpected skills, challenging our predictions and scaling strategies.
Training
How models are built — scaling laws, data mixtures, and what makes them capable.
★LoRA: Low-Rank Adaptation of Large Language Models
Edward Hu, Yelong Shen, Phillip Wallis et al.
LoRA slashes trainable parameters by 10,000x and GPU memory by 3x while preserving quality on large language models.
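The trick is concrete enough to sketch in a few lines of NumPy (illustrative, not the paper's code): freeze the pretrained weight W and train only two small matrices B and A, so the effective weight becomes W plus a scaled low-rank update.

```python
import numpy as np

# Minimal LoRA sketch: a frozen weight W (d_out x d_in) plus a trainable
# low-rank update (alpha/r) * B @ A with B (d_out x r) and A (r x d_in).
class LoRALinear:
    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
        self.A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small init
        self.B = np.zeros((d_out, r))               # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus low-rank path; with B = 0 at init,
        # the output matches the frozen base model exactly.
        return x @ self.W.T + (x @ self.A.T @ self.B.T) * self.scale

    def trainable_params(self):
        return self.A.size + self.B.size  # vs self.W.size for full tuning

layer = LoRALinear(d_in=512, d_out=512, r=4)
# A rank-4 adapter trains ~1.6% of this layer's parameters.
print(layer.trainable_params(), layer.W.size)
```

Scaled across every layer of a 175B-parameter model, that per-layer saving is where the headline 10,000x reduction comes from; adapters can also be merged into W for zero inference overhead.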
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch et al.
Training models with balanced size and tokens outperforms bloated giants like GPT-3 and Megatron.
Safety
Preparedness Framework and safety culture — central to every product decision.
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart et al.
GPT-3 beats chance by roughly 20 points on the MMLU benchmark yet remains far below expert-level accuracy.
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, Owain Evans
Larger AI models may not mean more truthful results, contradicting the bigger-is-better narrative.
Stay ahead of the curve.
Every Sunday, get the top AI papers distilled into actionable product insights. Full methodological breakdowns, industry impact analysis, and exclusive deep dives.