Reinforcement Learning

Training agents to make decisions? Benchmark your policies on game playing, continuous control, and offline learning tasks.

3 tasks, 3 datasets, 18 results

Reinforcement learning trains agents to make sequential decisions through interaction with environments. From game-playing breakthroughs to robotics control and RLHF for LLM alignment, RL has become a foundational technique across AI, though sample efficiency and sim-to-real transfer remain key challenges.

State of the Field (2025)

RLHF and RLVR (RL with Verifiable Rewards) are now standard for LLM alignment and reasoning: DeepSeek-R1, OpenAI o3, and Claude use RL-based training to improve instruction following and chain-of-thought reasoning
Offline RL matured significantly: Decision Transformer, IQL, and Cal-QL enable learning from static datasets without environment interaction, critical for healthcare, finance, and robotics where online exploration is costly or dangerous
Multi-agent RL scaled to complex coordination: OpenAI Five (Dota 2) and DeepMind's AlphaStar (StarCraft II) reached top human-level play in team and multi-unit settings, while MAPPO and QMIX provide practical frameworks for cooperative multi-agent problems
Sim-to-real transfer improved through domain randomization and system identification, but reliable zero-shot transfer to real robots remains unsolved for contact-rich manipulation tasks
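A verifiable reward, as in the RLVR item above, can be as simple as a programmatic check of the model's final answer. The sketch below is illustrative only: the function name `verifiable_reward` and the "Answer:" output convention are assumptions made for this example, not the reward used by any of the models named above.

```python
import re


def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last 'Answer: <value>' line matches the reference, else 0.0.

    Hypothetical convention: the model is prompted to end with 'Answer: <value>'.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    if not matches:
        return 0.0  # no parseable answer, so no reward
    predicted = matches[-1].strip()
    return 1.0 if predicted == reference.strip() else 0.0
```

Because the reward comes from a deterministic check rather than a learned reward model, it is harder for the policy to exploit, which is one reason this style of reward suits math and code tasks.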

Quick Recommendations

Game playing and simulation benchmarks

PPO or SAC with vectorized environments

PPO provides robust on-policy training for discrete and continuous action spaces. SAC offers better sample efficiency for continuous control. Both are well supported in Stable-Baselines3 and CleanRL.
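Much of PPO's robustness comes from its clipped surrogate objective, which limits how far one update can move the policy. A minimal pure-Python sketch follows. The function name and argument layout are assumptions for illustration, and it is not the Stable-Baselines3 or CleanRL implementation.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over a batch.

    new_logps / old_logps: log-probabilities of the taken actions under the
    current policy and the policy that collected the data.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum gives a pessimistic bound on the improvement.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, the objective stops rewarding the policy once the probability ratio exceeds 1 + clip_eps, so there is no incentive to push the update further.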

Robotics control (MuJoCo, real-world)

SAC for simulation, offline RL (IQL/Cal-QL) for real-world

SAC's entropy regularization provides robust exploration in simulation. For real robots, offline RL learns from demonstration data without risky online exploration.
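The entropy regularization mentioned above shows up directly in SAC's critic target: the next-state value is reduced by alpha times the log-probability of the sampled next action, so low-probability (exploratory) actions raise the target. The sketch below works on scalar values for one transition. The function and parameter names are assumptions for illustration.

```python
def sac_target(reward, done, gamma, q1_next, q2_next, next_logp, alpha):
    """Soft Bellman target: r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s')).

    q1_next, q2_next: target-critic values for the next state and sampled action.
    next_logp: log-probability of that sampled action under the current policy.
    alpha: entropy temperature (larger values favor more exploration).
    """
    soft_value = min(q1_next, q2_next) - alpha * next_logp
    # No bootstrapping past a terminal transition.
    return reward + gamma * (1.0 - float(done)) * soft_value
```

Taking the minimum of two critics is the clipped double-Q trick, which counters value overestimation.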

LLM alignment and reasoning improvement

RLHF with PPO or DPO (Direct Preference Optimization)

PPO-based RLHF remains a standard choice for frontier models. DPO simplifies the pipeline by eliminating the separate reward model and the RL sampling loop, often achieving comparable results with less infrastructure.
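To see why DPO needs no reward model, it helps to look at its loss for a single preference pair: it only requires sequence log-probabilities from the policy and from a frozen reference model. A minimal sketch follows. The argument names are assumptions, and in practice a library would compute this over batches of tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference by more than it raises the rejected one's. beta controls how strongly the policy is held close to the reference.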

Multi-agent coordination

MAPPO or QMIX

MAPPO scales PPO to multi-agent settings with centralized training and decentralized execution. QMIX provides value decomposition for cooperative tasks. Both handle partial observability.
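The value decomposition idea can be illustrated with a one-layer monotonic mixer: forcing the mixing weights to be non-negative guarantees that raising any agent's Q-value never lowers the joint value, so each agent can act greedily on its own Q-values. This is a deliberate simplification. The published QMIX mixer is a small network whose weights are produced by hypernetworks conditioned on the global state, and `raw_weights` and `bias` below merely stand in for those outputs.

```python
def monotonic_mix(agent_qs, raw_weights, bias):
    """One-layer monotonic mixer: Q_tot = sum_i |w_i| * Q_i + b.

    agent_qs: per-agent Q-values for the chosen actions.
    raw_weights, bias: stand-ins for state-conditioned hypernetwork outputs.
    """
    if len(agent_qs) != len(raw_weights):
        raise ValueError("need exactly one weight per agent")
    # abs() enforces dQ_tot/dQ_i >= 0, the monotonicity constraint.
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias
```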

Tasks & Benchmarks

Atari Games

Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play games such as Breakout from raw pixels, but the goalposts keep moving. Agent57 (2020) was the first agent to exceed the human baseline on all 57 games, and recent work like BBF and MEME shows that sample efficiency, not just final performance, is the new frontier. The benchmark's age is both its strength (more than a decade of comparable results) and its weakness (it doesn't capture the open-ended reasoning modern RL needs).

1 dataset, 9 results, SOTA tracked

Continuous Control

Continuous control (learning smooth motor commands in simulated physics) was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. SAC (2018) and TD3 (2018) became reliable baselines, but the field shifted toward harder locomotion (humanoid parkour, dexterous hands) and sim-to-real transfer after DeepMind's dm_control and NVIDIA's Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.

1 dataset, 9 results, SOTA tracked

Offline RL

Offline RL (learning policies from fixed datasets without further environment interaction) matters because most real-world domains (healthcare, robotics, autonomous driving) can't afford online exploration. CQL (2020) and IQL (2022) established strong baselines on the D4RL benchmark, but the field was disrupted by Decision Transformer (2021), which recast RL as sequence modeling. The latest wave uses pretrained language models as policy backbones, blurring the line between offline RL and in-context learning, with libraries like CORL tracking reproducibility across a range of algorithms.
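The sequence-modeling view can be made concrete: each logged trajectory is turned into a token stream of (return-to-go, state, action) triples, and a transformer is trained to predict the action given the preceding tokens. The sketch below shows only that data preparation step. The helper names and the tagged-tuple token format are assumptions for illustration.

```python
def returns_to_go(rewards, gamma=1.0):
    """Sum of future rewards from each timestep (undiscounted by default)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def to_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) for each timestep of a trajectory."""
    seq = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        seq.extend([("rtg", g), ("state", s), ("action", a)])
    return seq
```

At evaluation time the first return-to-go is set to a desired target return, and the model generates actions conditioned on achieving it.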

1 dataset, 0 results

Datasets and SOTA Results

Atari Games

Atari 2600 | Arcade Learning Environment (Atari 2600) | 2013

40000 (human-normalized-score) | Go-Explore
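For readers new to the metric, a human-normalized score is conventionally computed per game by rescaling the agent's raw score so that random play maps to 0 and the human baseline maps to 100. The sketch below shows that common convention. How the figure above was aggregated across games is not stated here, and the numbers in the usage note are made up for illustration.

```python
def human_normalized_score(agent, random_baseline, human_baseline):
    """100 * (agent - random) / (human - random), in percent."""
    if human_baseline == random_baseline:
        raise ValueError("human and random baselines must differ")
    return 100.0 * (agent - random_baseline) / (human_baseline - random_baseline)
```

For example, with hypothetical raw scores of 2 for random play and 30 for a human, an agent scoring 58 has a human-normalized score of 200.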

Continuous Control

MuJoCo | Multi-Joint dynamics with Contact | 2012

960 (average-return) | TD-MPC2 (317M params)

Offline RL

D4RL HalfCheetah-Medium-v2 | D4RL: Datasets for Deep Data-Driven Reinforcement Learning (halfcheetah-medium-v2) | 2020

Honest Takes

RL's biggest impact is inside LLMs, not robotics

The RL community spent decades on game-playing and robot control. The technology's largest real-world impact turned out to be RLHF for language model alignment. DeepSeek-R1-Zero showed that RL applied directly to a base model (without supervised fine-tuning) can elicit reasoning behavior. This is where RL delivers the most value today.

Sample efficiency is still embarrassing

State-of-the-art RL agents need millions of environment interactions to learn tasks a human figures out in minutes. Offline RL and world models help, but the fundamental sample efficiency gap means RL remains impractical for most real-world applications without simulation.

Sim-to-real is the real bottleneck for robotics RL

Papers show impressive MuJoCo results that fail on real hardware. Domain randomization helps but doesn't solve contact dynamics, sensor noise, and actuator delays. Until sim-to-real transfer is reliable, RL for physical robots will remain a research endeavor for most teams.
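Domain randomization, in its simplest form, means resampling simulator parameters at the start of every training episode so the policy cannot overfit to one set of physics. The sketch below is a hypothetical configuration sampler: the parameter names and ranges are assumptions for illustration and are not tied to MuJoCo or any specific simulator API.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    friction: float           # ground friction coefficient
    mass_scale: float         # multiplier on nominal link masses
    sensor_noise_std: float   # std of Gaussian noise added to observations
    action_delay_steps: int   # control steps between command and actuation


def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw one randomized simulator configuration (ranges are illustrative)."""
    return SimParams(
        friction=rng.uniform(0.5, 1.5),
        mass_scale=rng.uniform(0.8, 1.2),
        sensor_noise_std=rng.uniform(0.0, 0.05),
        action_delay_steps=rng.randint(0, 3),
    )
```

Widening these ranges makes the policy more robust but also more conservative, and no choice of ranges helps if the real system's contact behavior lies outside what the simulator can express at all.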
