Quick Answer: Robotics AI in 2025

The field is entering its foundation model era.

Largest benchmark:
Open X-Embodiment - 60+ datasets, 22 robot types, 527 skills
Best foundation model:
RT-2-X - 50% improvement over single-embodiment training
Best open-source model:
Octo / OpenVLA - trainable generalist policies (Apache 2.0)
Best simulation platform:
MuJoCo (accuracy) or Isaac Gym (speed/scale)
Hardest unsolved problem:
Long-horizon mobile manipulation (<50% on 5+ step tasks)
2025 deployment scale:
Tesla Optimus (thousands of units targeted), Figure AI robots in BMW production

Robotics AI Benchmarks 2025

From simulation to real-world deployment. The complete guide to benchmarks, foundation models, and the state of the art in robot learning.

Updated December 2025 | 15 min read

Why robotics AI matters now

50% - improvement from cross-embodiment training (RT-X vs. single-robot)
$675M - Figure AI funding round (2024) from OpenAI, NVIDIA, Bezos
1,000s - Tesla Optimus units targeted for 2025 factory deployment

How robot learning works

1. Collect demonstrations - teleoperation, motion capture, or simulation data
2. Train in simulation - MuJoCo or Isaac Gym for fast, safe iteration
3. Sim-to-real transfer - domain randomization, system identification, fine-tuning
4. Deploy and adapt - real-world fine-tuning, failure recovery
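
Steps 1 and 2 reduce to supervised learning on demonstration data. A minimal behavior-cloning sketch of that idea (illustrative only: the linear policy, toy dimensions, and synthetic "demonstrations" are placeholders, not any benchmark's real API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1, "collect demonstrations": synthetic (state, action) pairs standing
# in for teleoperated data - 32-dim states, 7-dim actions (a common arm DoF).
states = rng.normal(size=(1000, 32))
expert = rng.normal(size=(32, 7))                       # hidden expert mapping
actions = states @ expert + 0.01 * rng.normal(size=(1000, 7))

# Step 2, "train": fit a linear policy by least squares. Real systems use
# neural networks, but the supervised objective is the same regression.
policy, *_ = np.linalg.lstsq(states, actions, rcond=None)

# Imitation error on held-out states
test_states = rng.normal(size=(100, 32))
err = np.abs(test_states @ policy - test_states @ expert).mean()
print(f"mean action error on held-out states: {err:.4f}")
```

Swapping the least-squares fit for a neural network and the synthetic pairs for real teleoperated trajectories gives the standard imitation-learning recipe.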

The sim-to-real gap

The fundamental challenge: policies that work perfectly in simulation often fail on real hardware.

Physics mismatch - real friction and contact dynamics differ
Sensor noise - real cameras have artifacts
Actuator limits - motor delays, backlash
Environment variance - lighting, object textures
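
Domain randomization attacks this gap by training across many perturbed copies of the simulator, so the real world looks like just another sample from the training distribution. A schematic sketch (the parameter names and the ±30% range are illustrative, not tied to any particular simulator):

```python
import numpy as np

rng = np.random.default_rng(42)

# Nominal physics parameters the simulator would otherwise keep fixed
NOMINAL = {"friction": 1.0, "mass": 0.5, "motor_delay_ms": 10.0}

def randomize(nominal, rng, spread=0.3):
    """Sample a perturbed parameter set for one training episode."""
    return {k: v * rng.uniform(1 - spread, 1 + spread) for k, v in nominal.items()}

# Each episode sees a different "world"; a policy that succeeds across all
# of them is more likely to survive the mismatch with real hardware.
episodes = [randomize(NOMINAL, rng) for _ in range(5)]
for ep in episodes:
    print({k: round(v, 3) for k, v in ep.items()})
```

In practice the randomized set also covers sensor noise, latencies, and visual appearance, not just rigid-body parameters.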

Major benchmarks

Open X-Embodiment

Largest
Google DeepMind + 34 Labs

The largest open robotics dataset, enabling cross-embodiment transfer learning.

Type: Multi-robot dataset
Scale: 60+ datasets, 22 robot types, 527 skills
SOTA: RT-2-X, 50% improvement over single-dataset training

RoboSuite

Stanford / ARISE Initiative

Modular simulation framework on MuJoCo for reproducible manipulation research.

Type: Manipulation benchmark
Scale: 8 robots, 12 tasks, multiple controllers
SOTA: Diffusion Policy, 80%+ on complex tasks

M3Bench

Research Community

Benchmark for coordinating base movement with arm manipulation in 3D environments.

Type: Mobile manipulation
Scale: 30k tasks, 119 household scenes
SOTA: VLA models with motion planning

BARN Challenge

IEEE ICRA

Annual competition for autonomous robot navigation in constrained spaces.

Type: Navigation benchmark
Scale: 300 environments, varying difficulty
SOTA: Hybrid learning + planning approaches

DROID

Toyota Research / Berkeley

Large-scale in-the-wild robot demonstration dataset for imitation learning.

Type: Demonstration dataset
Scale: 76k trajectories, 564 scenes, 86 tasks
SOTA: Used to train Octo and RT-X

Foundation models

2024-2025 marks the emergence of generalist robot policies - models that can control multiple robot types.

Model | Organization | Type | Capabilities | Access
RT-2-X | Google DeepMind | Vision-language-action | Cross-embodiment transfer, spatial reasoning | Research only
Pi0 | Physical Intelligence | Generalist robot policy | Folding, assembly, multi-step tasks | Commercial
Isaac GR00T N1 | NVIDIA | Humanoid foundation model | Humanoid control, dexterous manipulation | NVIDIA ecosystem
Octo | Berkeley AI Research | Open-source generalist | Multi-robot control, fine-tunable | Open source (Apache 2.0)
OpenVLA | Stanford / TRI | Vision-language-action | 7B params, instruction following | Open source
Open source recommendation

Octo and OpenVLA are the best open-source options. Both can be fine-tuned on your own robot with limited data.

[Chart: Robotics SOTA evolution over time, 2017-2025]
[Chart: Foundation model comparison]

Simulation platforms

MuJoCo

DeepMind
Strengths: Accurate contact physics, stable simulation
Weaknesses: CPU-only, steeper learning curve
Best for: Contact-rich manipulation, research
License: Apache 2.0

Isaac Gym

NVIDIA
Strengths: GPU-accelerated (thousands of parallel environments)
Weaknesses: NVIDIA hardware required, less accurate contacts
Best for: RL at scale, locomotion
License: NVIDIA (free for research)

PyBullet

Erwin Coumans
Strengths: Easy to use, Python-native
Weaknesses: Less accurate physics, slower
Best for: Beginners, prototyping
License: zlib (open source)

Genesis

Stanford / CMU
Strengths: Differentiable, multi-GPU
Weaknesses: New (2024), smaller community
Best for: Cutting-edge research
License: Apache 2.0
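
The "thousands of parallel environments" claim is easiest to see with a vectorized toy: stepping many independent point-mass simulations as one batched array operation. This is pure NumPy on CPU (the point-mass dynamics and PD gains are made up for illustration); GPU simulators like Isaac Gym apply the same batching pattern on device:

```python
import numpy as np

N, DT = 4096, 0.01                          # 4096 parallel envs, 10 ms timestep
rng = np.random.default_rng(0)

pos = rng.uniform(-1.0, 1.0, size=(N, 2))   # batched state: one row per env
vel = np.zeros((N, 2))
target = np.zeros((N, 2))                   # drive every env to the origin

for _ in range(200):
    # One batched "physics" step advances all 4096 envs at once
    force = target - pos - 0.5 * vel        # simple PD controller toward target
    vel += DT * force
    pos += DT * vel

dist = np.linalg.norm(pos - target, axis=1)
print(f"envs within 0.5 of target: {(dist < 0.5).mean():.0%}")
```

Because every step is a single array operation over the batch, throughput scales with hardware parallelism rather than with a Python loop over environments.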

Task difficulty spectrum

Pick and Place

Entry

Grasp an object and move it to target

Benchmarks: RoboSuite Lift, Meta-World
SOTA: 95%+ success
Challenges: Generalization to novel objects

Dexterous Manipulation

Hard

In-hand manipulation with multi-finger grippers

Benchmarks: DexMV, DexArt, Shadow Hand
SOTA: ~70% on complex rotation
Challenges: High DoF, sim-to-real gap

Mobile Manipulation

Hard

Coordinate base navigation with arm manipulation

Benchmarks: M3Bench, BEHAVIOR-1K
SOTA: Active research frontier
Challenges: Whole-body coordination

Autonomous Navigation

Medium

Navigate in unknown or dynamic environments

Benchmarks: BARN Challenge, Habitat
SOTA: 90%+ in structured environments
Challenges: Dynamic obstacles

Contact-Rich Tasks

Hard

Tasks requiring precise force control

Benchmarks: FurnitureBench, Peg Insertion
SOTA: 60-80% depending on tolerance
Challenges: Force sensing, compliance

Long-Horizon Tasks

Very Hard

Multi-step tasks requiring planning

Benchmarks: CALVIN, LIBERO
SOTA: <50% on 5+ step chains
Challenges: Error propagation, memory
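
The sub-50% numbers on long chains are partly just arithmetic: if steps succeed roughly independently, per-step success rates multiply, so even a strong single-step policy collapses over a chain. A quick illustration (the 85% per-step rate is hypothetical, not a measured figure):

```python
# If each step succeeds independently with probability p, a k-step
# chain succeeds with probability p**k, so errors compound quickly.
p = 0.85                                   # hypothetical per-step success rate
for k in (1, 3, 5, 10):
    print(f"{k:2d}-step chain: {p**k:.0%}")   # 5 steps -> ~44%
```

This is why long-horizon benchmarks reward recovery behaviors and subgoal re-planning, not just raw per-step accuracy.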

How we got here

2017 - OpenAI Gym standardizes RL environments

Enabled reproducible research

2018 - MuJoCo becomes the dominant physics engine

Contact-rich simulation standard

2021 - DeepMind acquires MuJoCo, makes it free

Democratized robotics research

2022 - RT-1: first large-scale robot transformer

130k demonstrations, 700+ tasks

2023 - RT-2: vision-language-action model

Natural language to robot actions

2023 - Open X-Embodiment launched

60+ datasets, 34 labs, 22 robots

2024 - Foundation model era begins

Pi0, GR00T, generalist policies

2025 - Physical deployment scaling

Tesla Optimus, Figure AI production

Getting started: code examples

RoboSuite: Run a manipulation benchmark

Get started with robot manipulation in simulation

pip install robosuite mujoco
import robosuite as suite
import numpy as np

# Create the block-lifting task with a Franka Panda arm
env = suite.make(
    env_name="Lift",
    robots="Panda",
    has_renderer=True,
    use_camera_obs=False,
)

obs = env.reset()
for _ in range(500):
    # Random actions, just to verify the environment runs end to end
    action = np.random.randn(env.action_dim)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
env.close()

Load Open X-Embodiment data

Access the largest robotics dataset

pip install tensorflow-datasets
import tensorflow_datasets as tfds

# Stream the RT-1 ("fractal") subset directly from the public GCS bucket
dataset = tfds.load(
    'fractal20220817_data',
    split='train[:100]',
    data_dir='gs://gresearch/robotics'
)

for episode in dataset.take(1):
    steps = list(episode['steps'])  # each episode is itself a tf.data.Dataset
    print("Steps:", len(steps))
    print("Image:", steps[0]['observation']['image'].shape)
    # RT-1 stores actions as a dict of named fields, not a single tensor
    print("Action fields:", list(steps[0]['action'].keys()))

Run Octo: open-source generalist policy

Deploy a pretrained foundation model

pip install git+https://github.com/octo-models/octo.git "jax[cuda]"
from octo.model.octo_model import OctoModel
import jax
import numpy as np

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base")

# Dummy frame; replace with your robot's RGB camera image
camera_image = np.zeros((256, 256, 3), dtype=np.uint8)

# Octo expects leading batch and time-window dimensions on observations
observation = {
    "image_primary": camera_image[np.newaxis, np.newaxis],  # (1, 1, 256, 256, 3)
    "timestep_pad_mask": np.array([[True]]),
}

task = model.create_tasks(texts=["pick up the red block"])
action = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
print("Action:", action)  # batched end-effector delta actions

Frequently Asked Questions

What is the best robotics AI benchmark in 2025?

Open X-Embodiment is the largest and most comprehensive. For specific tasks: RoboSuite (manipulation), BARN Challenge (navigation), M3Bench (mobile manipulation).

Should I use MuJoCo or Isaac Gym?

MuJoCo for accuracy-critical tasks. Isaac Gym for RL at scale with massive parallelism on NVIDIA GPUs.

What's the state of the art in robot manipulation?

For single-task: Diffusion Policy (80%+ success). For generalist: RT-2-X and Pi0 represent the frontier.

How do I get started with robot learning research?

1. Start with RoboSuite. 2. Train on simple tasks (Lift, Stack). 3. Try Octo/OpenVLA for foundation model fine-tuning. 4. Experiment with sim-to-real.


Have robotics benchmark data?

We're expanding robotics coverage. Share benchmark results, models, or suggestions.