Reinforcement Learning

Atari Games

Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pixels, but the goalposts keep moving. Agent57 (2020) was the first to achieve superhuman scores on all 57 games, and recent work like BBF and MEME shows that sample efficiency — not just final performance — is the new frontier. The benchmark's age is both its strength (decades of comparable results) and weakness (it doesn't capture the open-ended reasoning modern RL needs).

1 dataset · 16 results · Canonical metric: human-normalized-score
Canonical Benchmark

Atari 2600

Suite of 57 Atari 2600 games. Standard benchmark for deep reinforcement learning agents.
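
For orientation, here is a minimal sketch of interacting with one game from the suite, assuming the Gymnasium API with the ale-py Atari plugin (installable as gymnasium[atari]); the environment id and the random "agent" below are illustrative, not part of this benchmark's tooling.

```python
import gymnasium as gym

# Assumes the ale-py Atari plugin is installed (e.g. pip install "gymnasium[atari]").
# On some versions you may also need: import ale_py; gym.register_envs(ale_py)
env = gym.make("ALE/Breakout-v5")

obs, info = env.reset(seed=0)
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder random policy, not a trained agent
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated

print(f"Random-policy episode return: {episode_return}")
env.close()
```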

Primary metric: human-normalized-score
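
The metric is conventionally computed per game as (agent score - random score) / (human score - random score), then aggregated across games as a mean or median. Below is a short sketch under that assumption; the baseline numbers are made up for illustration, and real evaluations use the published random and human reference scores for all 57 games.

```python
def human_normalized_score(agent: float, random_baseline: float, human_baseline: float) -> float:
    """Per-game human-normalized score: 0 = random play, 1 = human-level."""
    return (agent - random_baseline) / (human_baseline - random_baseline)

# Illustrative raw scores per game: (agent, random baseline, human baseline).
# These values are placeholders, not the published reference scores.
games = {
    "breakout":  (400.0, 2.0, 30.0),
    "pong":      (20.0, -21.0, 15.0),
    "ms_pacman": (6000.0, 300.0, 6950.0),
}

per_game = {g: human_normalized_score(*scores) for g, scores in games.items()}
values = sorted(per_game.values())
mean_hns = sum(values) / len(values)
median_hns = values[len(values) // 2]  # simple median for an odd-length list

print(per_game)
print(f"mean HNS:   {mean_hns:.2f}")
print(f"median HNS: {median_hns:.2f}")
```

Note that leaderboards differ in whether they report the mean or the median across games; the mean is typically dominated by the handful of games where agents exceed human play by orders of magnitude.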

Top 10

Leading models on Atari 2600.

Rank  Model                          Human-normalized score  Year  Source
1     go-explore                     40000                   2025  paper
2     agent57                        4731                    2025  paper
3     MEME                           4087                    2026  paper
4     bbos-1                         1100                    2025
5     gdi-h3                         950                     2025
6     dreamerv3                      840                     2025  paper
7     muzero                         731                     2025  paper
8     EfficientZero V2               243                     2026  paper
9     rainbow-dqn                    231                     2025  paper
10    BBF (Bigger, Better, Faster)   225                     2026  paper

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Reinforcement Learning.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.
