
Reinforcement Learning
From Atari to Robotics

13 years of breakthroughs that took RL from playing video games to aligning language models and training robots. Every milestone, every paradigm shift.

March 2026 | 20 min read | 14 milestones

The Timeline

Each node marks a moment that changed what was possible, spanning 13 years of compounding breakthroughs.

2013 · DeepMind

DQN — Playing Atari from Pixels

Deep Q-Network combined deep convolutional networks with Q-learning, learning directly from raw pixel input to achieve superhuman performance on multiple Atari 2600 games. First presented in 2013 and published in Nature in 2015, it is the work that started the deep RL era.

Proved neural networks could replace hand-crafted features in RL.

2015 · DeepMind

Double DQN & Dueling DQN

Double DQN addressed overestimation bias by decoupling action selection from evaluation. Dueling DQN separated value and advantage streams, improving policy evaluation in states with many similar-valued actions.

Established that architectural innovations could significantly boost sample efficiency.
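The decoupling idea is compact enough to show directly. Below is a minimal, illustrative sketch of the two targets, not a full agent: `q_online_next` and `q_target_next` are hypothetical per-action value lists for a single next state.

```python
# Sketch of the DQN vs. Double DQN targets (illustrative, not a full agent).
# q_online_next / q_target_next: hypothetical action-value lists for one next state.

def dqn_target(reward, q_target_next, gamma=0.99):
    # Vanilla DQN: the target network both selects and evaluates the action,
    # so overestimated values get selected and then used, creating upward bias.
    return reward + gamma * max(q_target_next)

def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99):
    # Double DQN: the online network selects the argmax action, the target
    # network evaluates it. Decoupling the two reduces the overestimation bias.
    a = max(range(len(q_online_next)), key=lambda i: q_online_next[i])
    return reward + gamma * q_target_next[a]

# Example: the target net spuriously inflates action 1; the online net prefers action 0.
q_online_next = [1.0, 0.5]
q_target_next = [0.9, 2.0]
print(dqn_target(0.0, q_target_next))                         # uses the inflated 2.0
print(double_dqn_target(0.0, q_online_next, q_target_next))   # uses 0.9 instead
```

In the example, vanilla DQN bootstraps from the inflated value while Double DQN bootstraps from the online network's preferred action, which is the whole point of the decoupling.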

2016 · DeepMind

A3C & AlphaGo

Asynchronous Advantage Actor-Critic (A3C) introduced parallel actor-learners for stable training. AlphaGo defeated Lee Sedol 4-1 using policy and value networks with Monte Carlo tree search — a watershed moment for AI.

Combined search + learning dominated the hardest classical game. Policy gradient methods matured.

2017 · OpenAI / DeepMind

PPO & AlphaZero

Proximal Policy Optimization became the default policy gradient algorithm — simple, stable, scalable. AlphaZero mastered chess, shogi, and Go from self-play alone with zero human knowledge.

PPO remains the backbone of most RL applications including RLHF. AlphaZero showed tabula rasa learning works.
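PPO's stability comes from one small trick: clipping the probability ratio. Here is an illustrative single-action sketch of the clipped surrogate objective, assuming the advantage has already been estimated (e.g. with GAE):

```python
import math

# Illustrative sketch of PPO's clipped surrogate objective for one action.
# Assumption: the advantage estimate is given (in practice, computed via GAE).

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)        # pi_new(a|s) / pi_old(a|s)
    clipped = max(min(ratio, 1 + eps), 1 - eps)  # clamp ratio to [1-eps, 1+eps]
    # Taking the min keeps the objective pessimistic: a large policy step
    # that inflates the ratio earns no extra credit beyond the clip range.
    return min(ratio * advantage, clipped * advantage)

# A 50% jump in action probability is credited as if it were only 20%:
print(ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0))  # 1.2, not 1.5
```

Because over-large steps are never rewarded, plain first-order SGD suffices, which is why PPO scaled so easily compared to trust-region predecessors.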

2018 · UC Berkeley / Schmidhuber

SAC & World Models

Soft Actor-Critic introduced entropy regularization for robust continuous control. World Models learned compressed representations of environments, enabling agents to "dream" and train in imagination.

SAC became the go-to for robotics. World Models sparked the model-based RL renaissance.
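SAC's entropy regularization can be sketched in a few lines. This is a toy illustration, with a small discrete action distribution standing in for SAC's actual Gaussian policy:

```python
import math

# Minimal sketch of SAC's maximum-entropy objective. Assumption: a tiny
# discrete action distribution stands in for SAC's actual Gaussian policy.

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def soft_objective(reward, action_probs, alpha=0.2):
    # SAC maximizes reward plus alpha times policy entropy, so the agent
    # is paid for staying stochastic: better exploration, more robust policies.
    return reward + alpha * entropy(action_probs)

# A uniform policy earns a larger entropy bonus than a near-deterministic one:
print(soft_objective(1.0, [1/3, 1/3, 1/3]))
print(soft_objective(1.0, [0.98, 0.01, 0.01]))
```

The temperature `alpha` trades off reward against randomness; in modern SAC implementations it is tuned automatically rather than fixed as here.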

2019 · DeepMind / OpenAI

MuZero & OpenAI Five

MuZero planned without knowing the rules — learning a dynamics model, reward model, and policy end-to-end. OpenAI Five defeated world champions at Dota 2, coordinating 5 agents over 45-minute games.

Model-based planning without ground-truth models. Multi-agent coordination at scale.

2020 · DeepMind / Google

Dreamer & Agent57

Dreamer v1/v2 trained policies entirely inside learned world models, achieving strong results with far fewer environment interactions. Agent57 was the first agent to outperform the human baseline on all 57 Atari games.

Closed the loop on Atari — superhuman across the full suite. Model-based RL became practical.

2021 · UC Berkeley

Decision Transformer

Reframed RL as sequence modeling: condition a transformer on desired returns, past states, and actions. No value functions, no policy gradients — just autoregressive prediction.

Bridged RL and large language models. Opened the door to offline RL at scale.
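The "RL as sequence modeling" reframing is easiest to see in the input layout. Below is a toy sketch of how a trajectory is tokenized; the states, actions, and rewards are hypothetical placeholders:

```python
# Toy sketch of Decision Transformer's input layout. Assumption: a tiny
# hypothetical trajectory with symbolic states/actions and scalar rewards.

def to_dt_sequence(states, actions, rewards):
    # Tokens are interleaved (return-to-go, state, action) triples; the
    # return-to-go shrinks by each reward as the trajectory unfolds.
    rtg = sum(rewards)
    seq = []
    for s, a, r in zip(states, actions, rewards):
        seq += [("rtg", rtg), ("state", s), ("action", a)]
        rtg -= r
    return seq

seq = to_dt_sequence(["s0", "s1"], ["a0", "a1"], rewards=[1.0, 2.0])
# seq starts with ("rtg", 3.0). At inference time you prepend the *desired*
# return and autoregressively predict the next action token.
```

Training is then plain next-token prediction over such sequences, which is exactly what lets transformer scaling machinery carry over to offline RL.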

2022 · OpenAI

RLHF Powers ChatGPT

Reinforcement Learning from Human Feedback (RLHF) used PPO to align GPT models with human preferences. ChatGPT launched and became the fastest-growing consumer app in history.

RL became the alignment mechanism for the entire LLM industry.

2023 · Google / NVIDIA

RT-2 & Eureka

RT-2 used vision-language models as robot policies, transferring web-scale knowledge to physical manipulation. Eureka used LLMs to automatically generate reward functions for dexterous manipulation tasks.

Foundation models entered robotics. Reward engineering automated by LLMs.

2024 · DeepSeek / OpenAI

GRPO & Reasoning Models

Group Relative Policy Optimization (GRPO) eliminated the critic network by using group-relative advantages, dramatically simplifying RL for LLMs. OpenAI o1 and DeepSeek-R1 demonstrated that RL could teach models to reason step-by-step.

RL for LLMs became simpler and more effective. Test-time compute scaling emerged.
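The core of GRPO's simplification fits in a few lines. This sketch assumes one prompt, a group of sampled completions, and a verifiable scalar reward for each:

```python
# Sketch of GRPO's group-relative advantage. Assumption: one prompt, a
# group of sampled completions, each scored by a verifiable reward.

def grpo_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # Each completion is baselined against its own group, so no learned
    # critic network is needed: the group statistics play that role.
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, reward 1.0 if correct else 0.0:
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # roughly [1, -1, -1, 1]
```

Correct completions get positive advantage, incorrect ones negative, and the PPO-style policy update then proceeds without the memory and compute cost of a value network.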

2025 · Multiple labs

Physical World Models & Sim-to-Real

Large-scale world models trained on video enabled sim-to-real transfer for manipulation and locomotion. Physical intelligence companies deployed RL-trained robots in warehouses and kitchens using foundation world models.

RL escaped simulation. Physical tasks became trainable at scale.

2026 · Industry-wide

Current State

RL is the fine-tuning mechanism for frontier LLMs (GRPO, REINFORCE++), the training paradigm for humanoid robotics, and the optimization layer for scientific discovery. The field has converged: foundation models provide priors, RL provides optimization.

RL is no longer a research niche — it is core infrastructure for AI.

Paradigm Shifts

The field didn't evolve linearly. Five distinct paradigm shifts redefined what RL meant and what it could do.

2013–2017

Value-Based to Policy Gradient

From
DQN, discrete actions, replay buffers
To
A3C, PPO, continuous control, on-policy learning

Acting greedily against a learned value function is brittle. Directly optimizing the policy is more stable and scales to continuous action spaces.

2018–2020

Model-Free to Model-Based

From
Millions of environment interactions
To
Learned dynamics models, training in imagination

Sample efficiency matters. Learning a model of the world and planning inside it dramatically reduces real-world data requirements.

2021–2022

RL as Optimization to RL as Sequence Modeling

From
Bellman equations, temporal difference learning
To
Transformers conditioned on return, offline RL

RL problems can be recast as supervised learning on trajectory data. This unlocks the scaling properties of transformers.

2022–2024

Game AI to LLM Alignment

From
Atari, Go, Dota 2
To
RLHF, RLAIF, GRPO for language models

The biggest impact of RL shifted from playing games to shaping how billions of people interact with AI.

2024–2026

Simulation to Physical Reality

From
Virtual environments, simulated physics
To
Sim-to-real transfer, foundation world models, deployed robots

World models learned from video close the sim-to-real gap. RL-trained robots work in unstructured real environments.

Current SOTA: Atari

Agent57 (2020) was the first agent to beat human baselines on all 57 Atari games. Current agents achieve superhuman scores by enormous margins.

| Game | Human baseline | Best agent | Ratio |
| --- | --- | --- | --- |
| Breakout | 31.8 | 864.0 | 27x |
| Pong | 14.6 | 21.0 | 1.4x |
| Space Invaders | 1,669 | 54,576 | 33x |
| Seaquest | 42,055 | 999,999 | 24x |
| Q*bert | 13,455 | 999,999 | 74x |
| Montezuma's Revenge | 4,753 | 12,200 | 2.6x |

Scores from published benchmarks. Montezuma's Revenge, once considered unsolvable for RL, has been cracked through exploration bonuses and Go-Explore.

Robotics: State of the Art

RL in robotics has shifted from sim-only curiosities to deployed systems. Three converging trends are driving this.

10x less real data needed

Foundation World Models

Video prediction models trained on internet-scale data provide physics priors. Robots learn manipulation in these learned simulators with 10-100x less real-world data.

< 1hr real-world fine-tuning

Sim-to-Real Transfer

Domain randomization, system identification, and learned adaptation modules close the reality gap. Policies trained in IsaacGym transfer to physical hardware with minimal fine-tuning.

2x generalization vs RT-1

Language-Conditioned Policies

RT-2 and successors use VLMs as policy backbones. Natural language instructions map to motor commands. Robots generalize to novel objects and tasks zero-shot.

RL for LLMs: RLHF to GRPO

The most impactful application of RL in 2024-2026 isn't games or robots — it's making language models useful, safe, and capable of reasoning.

RLHF (2022)

Train reward model from human preferences, optimize with PPO

Strengths: Proven at scale (ChatGPT, Claude)
Limitations: Reward model collapse, complex pipeline, high compute

DPO (2023)

Direct preference optimization without explicit reward model

Strengths: Simpler pipeline, stable training
Limitations: Less flexible, struggles with complex reasoning
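DPO's "no reward model" claim comes from its implicit reward, the log-ratio to a frozen reference model. A sketch of the per-pair loss, assuming summed log-probs of the chosen and rejected responses are already available:

```python
import math

# Sketch of the DPO loss for one preference pair. Assumption: summed
# log-probs of the chosen/rejected responses under the trained policy
# and a frozen reference model are given.

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # The implicit reward is the beta-scaled log-ratio to the reference;
    # the loss pushes the chosen response's implicit reward above the rejected one's.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# With no preference learned yet (margin 0) the loss is log(2); widening
# the margin in favor of the chosen response drives it toward 0.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
print(dpo_loss(-5.0, -9.0, -6.0, -7.0))
```

Because the whole pipeline reduces to a classification-style loss on preference pairs, DPO avoids RLHF's reward-model training and PPO rollout loop, which is exactly the "simpler pipeline" strength noted above.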

GRPO (2024)

Group samples, compute relative advantages within group, no critic needed

Strengths: Simple, scalable, enables reasoning (DeepSeek-R1)
Limitations: Requires verifiable rewards for best results

REINFORCE++ & Variants (2025-26)

Token-level credit assignment, process reward models, multi-turn RL

Strengths: Fine-grained optimization, agentic capabilities
Limitations: Active research, not fully standardized

What's Next

The frontiers of RL in 2026 and beyond. These are active research areas where breakthroughs are expected.

Multi-Agent Foundation Models

Active research

Training teams of agents that coordinate through emergent communication. Applications in traffic, supply chains, and collaborative robotics.

RL for Scientific Discovery

Early deployment

Optimizing molecular structures, protein folding strategies, and experimental designs. AlphaFold showed the potential; RL is the optimization layer.

Continuous Learning Agents

Active research

Agents that improve indefinitely in deployment without catastrophic forgetting. Combining RL with continual learning and memory architectures.

RL-Native Hardware

Emerging

Custom silicon for RL workloads: fast simulation, parallel rollouts, real-time inference for robotics control loops at 1kHz+.
