arXiv:2606.05080
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
Authors pending
Abstract
AutoLab comprises 36 expert-curated tasks for ultra-long-horizon, closed-loop optimization. The paper finds that persistence, not initial quality, is the dominant predictor of success.
Tasks
Results
No benchmark results recorded yet.
Benchmark results referencing this paper have not been added to the registry yet.
CodeSOTA extraction
Benchmark evidence
- AutoLab: task completion rate under a wall-clock budget across 36 long-horizon optimization tasks
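
The metric named above can be illustrated with a minimal sketch. It assumes that "completion rate under a wall-clock budget" means the fraction of tasks solved within a fixed time limit; the paper's exact definition is not given here. The names `TaskRun`, `solved_at_s`, and `completion_rate` are hypothetical and do not come from AutoLab.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TaskRun:
    """One attempt at one task (hypothetical record, not AutoLab's schema)."""
    task_id: str
    solved_at_s: Optional[float]  # wall-clock seconds to success; None if never solved


def completion_rate(runs: Sequence[TaskRun], budget_s: float) -> float:
    """Fraction of tasks solved at or before the wall-clock budget (assumed definition)."""
    if not runs:
        raise ValueError("need at least one task run")
    completed = sum(
        1 for r in runs
        if r.solved_at_s is not None and r.solved_at_s <= budget_s
    )
    return completed / len(runs)


# Made-up illustrative data, not results from the paper.
example_runs = [
    TaskRun("task-a", 1200.0),   # solved within budget
    TaskRun("task-b", 5400.0),   # solved, but after the budget
    TaskRun("task-c", None),     # never solved
    TaskRun("task-d", 3600.0),   # solved exactly at the budget
]
print(completion_rate(example_runs, budget_s=3600.0))  # prints 0.5
```

Counting a run that finishes exactly at the budget as completed is a choice made for this sketch; a stricter benchmark could use a strict inequality instead.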