Claim: Scaling RL (not just pretraining) will be the primary driver of AI capability gains by 2030

Sources

  • Scaling laws papers (Kaplan et al., 2020; Hoffmann et al., 2022)
  • OpenAI o1/o3 system cards
  • DeepSeek-R1 paper
  • Discussions in AI alignment and capabilities research communities

Evidence For

  • OpenAI’s o-series models show substantial gains from RL-based reasoning
  • DeepSeek-R1 paper: the R1-Zero variant shows that RL without prior supervised fine-tuning can induce reasoning behaviors
  • Pretraining data may hit a wall (finite internet data, legal constraints)
  • RL enables self-play and synthetic data generation, breaking the data bottleneck
  • RL directly optimizes for outcomes rather than next-token prediction
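To make the last point concrete, the two objectives can be written side by side. This is a standard textbook formulation, not taken from any of the sources above; the symbols (data distribution D, policy π_θ, prompt x, sampled response y, reward function R) are notation assumed here for illustration.

```latex
% Pretraining: maximize the likelihood of each next token in fixed data
\mathcal{L}_{\text{pretrain}}(\theta)
  = -\,\mathbb{E}_{x \sim D}\left[\sum_{t} \log p_\theta(x_t \mid x_{<t})\right]

% RL: maximize the expected reward of whole responses sampled from the model itself
J_{\text{RL}}(\theta)
  = \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\left[R(x, y)\right]
```

The first objective is bounded by the data the model imitates; the second depends only on a reward signal, which is why reward specification shows up as a key assumption later in this note.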

Evidence Against

  • Pretraining is the foundation; RL only works on top of strong pretrained models
  • RL is notoriously unstable and hard to scale reliably
  • Current RL techniques (RLHF, GRPO) are still brittle and reward-hackable
  • Gains from RL may saturate faster than pretraining gains
  • The most impressive capabilities still correlate strongly with pretraining compute
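As a reference for the GRPO bullet, here is a minimal sketch of the group-relative advantage that GRPO is built around, assuming one scalar reward per sampled completion. The function name, the `eps` term, and the example rewards are illustrative choices, not taken from any specific implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other completions for the same prompt.

    Assumes `rewards` holds one scalar score per sampled completion.
    `eps` (an assumption of this sketch) guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: better-than-average completions get positive advantage.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Identical scores: every advantage is zero, so the group gives no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The advantage sees only the scalar scores, so it cannot tell a genuinely correct completion from one that exploits a flaw in the reward function, which is one way to read the "reward-hackable" concern.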

Assumptions

  • That pretraining data scarcity will become a binding constraint
  • That RL stability issues can be solved at scale
  • That reward specification for general capabilities is tractable
  • That no fundamentally new paradigm emerges (e.g., active inference, new architectures)

Counterarguments

  • RL and pretraining are complementary, not competing; framing either one as the “primary driver” may be a false dichotomy
  • The real driver might be architecture innovations, not training methodology
  • Synthetic data from RL may degrade in quality over successive generations (model collapse)

My Current Confidence

55%

Confidence Log

Date        Confidence  Reason for Change
2025-01-01  55%         Initial: RL is clearly important, but “primary driver” is a strong claim
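
One possible way to keep future log entries disciplined is to update in odds form. The sketch below is only an illustration: the function name and the likelihood ratios are hypothetical, and the note itself does not prescribe any update rule.

```python
def update_confidence(prior, likelihood_ratio):
    """Return the posterior probability after a Bayesian update in odds form.

    prior: current confidence in the claim, strictly between 0 and 1.
    likelihood_ratio: how much more likely the new evidence is if the claim
        is true than if it is false (a judgment call, assumed by the user).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio <= 0:
        raise ValueError("likelihood_ratio must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical example: evidence judged twice as likely if the claim is true.
after_supporting = update_confidence(0.55, 2.0)

# Hypothetical example: evidence judged half as likely if the claim is true.
after_opposing = update_confidence(0.55, 0.5)
```

A likelihood ratio of 1 leaves the confidence unchanged, which is a useful check that a new entry reflects evidence and not mood.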