Applied · 40 min total

Ship AI Without the GPU Bill

4 papers on making models smaller, faster, and cheaper. Directly relevant to every PM deciding what to build on.

1
LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu et al.

LoRA cuts the number of trainable parameters for fine-tuning by up to 10,000x and GPU memory requirements by 3x while matching full fine-tuning quality on large language models.

Why this paper

Fine-tune any model with 10,000x fewer trainable parameters. The technique behind every custom model your team might build.
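
To make those numbers concrete, here is a minimal LoRA sketch in PyTorch: the pretrained weight is frozen and only a low-rank update is trained. The `LoRALinear` wrapper and the rank and alpha values are illustrative stand-ins, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update (illustrative)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights never change
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at step 0
        self.scale = alpha / rank

    def forward(self, x):
        # y = base(x) + scale * x @ A^T @ B^T; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# A 4096x4096 projection has ~16.8M weights; the LoRA update trains only 65,536.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 65536
```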

Efficiency · Training
2

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — coming soon

3x faster training and lower memory use — why production models like Llama 2, GPT-4, and Claude rely on custom attention kernels.
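
The core trick, for the curious, is tiling plus a running (online) softmax, so the full N x N score matrix is never materialized. Below is a toy PyTorch sketch of that idea under our own names (`tiled_attention`, `block`); the real speedup comes from fusing this loop into a single GPU kernel that stays in on-chip SRAM, which plain PyTorch cannot express.

```python
import torch

def tiled_attention(q, k, v, block=128):
    """Exact attention computed one K/V tile at a time (toy sketch, not fast)."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))  # running max per query row
    row_sum = torch.zeros(n, 1)                  # running softmax denominator
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                   # scores for this tile only
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        rescale = torch.exp(row_max - new_max)   # fix up earlier tiles' statistics
        p = torch.exp(s - new_max)
        row_sum = row_sum * rescale + p.sum(dim=-1, keepdim=True)
        out = out * rescale + p @ vb
        row_max = new_max
    return out / row_sum

# Matches standard attention without ever holding the full 1024x1024 score matrix.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
reference = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), reference, atol=1e-4)
```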

3
Mixtral of Experts

Albert Q. Jiang et al.

Pro · ~10 min

Mixtral 8x7B beats Llama 2 70B across most benchmarks while activating only 12.9B parameters per token.

Why this paper

Sparse expert routing: 47B total params but only 13B active per token. The key to cost-efficient scale.
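
Here is a minimal sparse-MoE sketch in PyTorch to show the routing pattern: a router scores every expert for each token, and only the top-2 experts actually run. The 8-expert, top-2 routing matches the paper's setup, but `SparseMoE` and the hidden sizes are toy stand-ins, not Mixtral's actual code or dimensions.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Top-k expert routing in the spirit of Mixtral (toy sizes)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                             # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)      # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):     # every expert exists in memory...
            for slot in range(self.top_k):
                mask = idx[:, slot] == i              # ...but only runs on its routed tokens
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# All 8 experts sit in memory (total params), but each token flows through
# only 2 of them (active params): the 47B-total vs 13B-active split in miniature.
moe = SparseMoE()
print(moe(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```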

Architecture · MoE · Efficiency
4
Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron et al.

Llama 2 outperforms open-source chat models on most benchmarks and, in human evaluations of helpfulness and safety, competes with its closed-source rivals.

Why this paper

Open source at scale. Llama 2 changed what's possible for product teams without $100k GPU budgets.
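
To see how low the barrier actually is, here is a hedged sketch of prompting a Llama 2 chat model through the Hugging Face transformers library. It assumes you have accepted Meta's license for the weights and installed transformers and accelerate; the [INST] wrapper is Llama 2's chat prompt convention.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated model: requires accepting Meta's license on the Hugging Face Hub first.
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs accelerate

prompt = "[INST] Summarize LoRA in one sentence. [/INST]"  # Llama 2 chat format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```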

Open Source · Safety

Unlock the full analysis for each paper

Deep-dive articles, expert annotations, PM action plans, and interactive experiments — all for $6/mo.
