arXiv:2605.12925
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
Authors pending
Abstract
A process-level analysis of 2,614 SWE-agent trajectories finds that 10.7% of passing runs are 'Lucky Passes': runs that pass despite chaotic behavior.
Tasks
Results
No benchmark results referencing this paper have been added to the registry yet.
CodeSOTA extraction
Benchmark evidence
- Verify the AgentLens-Bench 'Lucky Pass' rate (10.7% of passing runs) and whether model rankings shift by 5 or more positions when models are ranked by quality score instead of pass rate.
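The check described above could be scripted roughly as follows. This is a minimal sketch, not the paper's method: the record keys `passed` and `lucky`, the helper names, and all toy numbers are assumptions made up for illustration, and how a run gets labeled a Lucky Pass is left to whatever criterion the paper defines.

```python
from typing import Dict, List


def lucky_pass_rate(runs: List[dict]) -> float:
    """Fraction of passing runs flagged as lucky.

    Assumes each run is a dict with hypothetical boolean keys
    'passed' and 'lucky'; the labeling itself is out of scope here.
    """
    passing = [r for r in runs if r["passed"]]
    if not passing:
        return 0.0
    return sum(1 for r in passing if r["lucky"]) / len(passing)


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """1-based rank per model, highest score first, ties broken by name."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i + 1 for i, model in enumerate(ordered)}


def rank_shifts(pass_rate: Dict[str, float],
                quality: Dict[str, float]) -> Dict[str, int]:
    """Rank under pass rate minus rank under quality score.

    A negative value means the model ranks worse under the quality score.
    """
    by_pass, by_quality = ranks(pass_rate), ranks(quality)
    return {m: by_pass[m] - by_quality[m] for m in by_pass}


def big_movers(shifts: Dict[str, int], threshold: int = 5) -> List[str]:
    """Models whose rank changes by at least `threshold` positions."""
    return sorted(m for m, s in shifts.items() if abs(s) >= threshold)


# Toy data, invented for illustration only (not from the paper).
toy_runs = [
    {"passed": True, "lucky": False},
    {"passed": True, "lucky": True},
    {"passed": False, "lucky": False},
    {"passed": True, "lucky": False},
]
toy_pass_rate = {"m1": 0.9, "m2": 0.8, "m3": 0.7, "m4": 0.6, "m5": 0.5, "m6": 0.4}
toy_quality = {"m1": 0.1, "m2": 0.8, "m3": 0.7, "m4": 0.6, "m5": 0.5, "m6": 0.4}

toy_shifts = rank_shifts(toy_pass_rate, toy_quality)
print(lucky_pass_rate(toy_runs))
print(big_movers(toy_shifts))
```

With real data, the rate would be compared against the reported 10.7%, and `big_movers` would list any models whose position changes by 5 or more between the two rankings.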