arXiv:2605.15846
RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades
Authors pending
Abstract
RoadmapBench comprises 115 long-horizon coding tasks drawn from real version upgrades across 17 repositories.
Tasks
Results
No benchmark results recorded yet.
CodeSOTA extraction
Benchmark evidence
- Verify that Claude-Opus-4.7 resolves only 39.1% of RoadmapBench tasks.
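
As a quick sanity check on the figure above, the reported percentage can be converted back into a task count. The sketch below is illustrative only: it assumes the 39.1% rate is computed over all 115 tasks and rounded to one decimal place, neither of which this page confirms. The function names `resolve_rate` and `counts_matching` are hypothetical and not part of any RoadmapBench tooling.

```python
def resolve_rate(resolved: int, total: int) -> float:
    """Return the percentage of tasks resolved, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * resolved / total, 1)


def counts_matching(rate: float, total: int) -> list[int]:
    """List every resolved-task count whose rounded rate equals `rate`."""
    return [k for k in range(total + 1) if resolve_rate(k, total) == rate]


# Assumption: the rate is taken over all 115 tasks listed in the abstract.
TOTAL_TASKS = 115
REPORTED_RATE = 39.1

print(counts_matching(REPORTED_RATE, TOTAL_TASKS))  # prints [45]
```

Under those assumptions, exactly one count (45 of 115) is consistent with 39.1%, so a reproduction reporting a different number of resolved tasks would not match the claim.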