arXiv:2606.07462
Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle
Authors pending
Abstract
AARRI-Bench targets granular research behavior. The best system (Mini-SWE-Agent + Claude Opus 4.7) achieves only 68.3%, frequently missing subtle details that humans catch.
Tasks
Results
No benchmark results recorded yet.
CodeSOTA extraction
Benchmark evidence
Link this paper to benchmark rows, datasets, model cards, and reproduced results as evidence is extracted.