AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

arXiv:2606.05080 · Submitted Jun 4, 2026 · 0 benchmark results

Authors pending

Abstract

AutoLab comprises 36 expert-curated tasks for ultra-long-horizon, closed-loop optimization. The paper finds that persistence, not initial quality, is the dominant predictor of success.

Tasks

Results

No benchmark results recorded yet.


CodeSOTA extraction

AutoLab: task completion rate under wall-clock budget across 36 long-horizon optimization tasks
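
The metric named above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's scoring code: it assumes each task yields one run with a pass/fail outcome and a recorded wall-clock time, and that a task counts as completed only if it passed within the budget. The names `TaskRun` and `completion_rate`, and the sample data, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One run of one task (hypothetical record layout, not from the paper)."""
    task_id: str
    solved: bool          # did the run meet the task's success criterion?
    wall_clock_s: float   # wall-clock seconds elapsed when the run ended


def completion_rate(runs: list[TaskRun], budget_s: float) -> float:
    """Fraction of tasks solved within budget_s wall-clock seconds.

    Assumption: a run that succeeds after the budget counts as not completed.
    """
    if not runs:
        raise ValueError("completion_rate needs at least one run")
    completed = sum(1 for r in runs if r.solved and r.wall_clock_s <= budget_s)
    return completed / len(runs)


# Made-up example data, for illustration only.
runs = [
    TaskRun("task-a", True, 1200.0),
    TaskRun("task-b", True, 5400.0),   # solved, but over a 3600 s budget
    TaskRun("task-c", False, 3600.0),
    TaskRun("task-d", True, 3600.0),
]
rate = completion_rate(runs, budget_s=3600.0)
print(f"completion rate: {rate:.2f}")
```

Under this definition the same set of runs gives a different rate for each budget, so any reported figure should state the budget alongside it.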
