arXiv:2603.14465
AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen et al.
Abstract
The benchmark contains 1,000 trajectories with 8,509 human step-level annotations (89.1% annotator agreement).
Ternary labeling captures exploration; process signals complement outcome supervision for test-time scaling.
Tasks
Results
No benchmark results recorded yet.
CodeSOTA extraction
Benchmark evidence
- AgentProcessBench: Step-level accuracy of process reward models vs. outcome supervision (extract from Table 3).
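For readers preparing a reproduction, here is a minimal sketch of how step-level accuracy against ternary human labels could be scored. The label encoding (1, 0, -1), the function names, and the toy data are assumptions made for illustration. They are not taken from the paper or its Table 3, and the paper's exact metric definition may differ.

```python
from collections import Counter
from typing import Sequence

# Hypothetical ternary encoding (an assumption, not the paper's):
#   1 = step helps, 0 = neutral or exploratory, -1 = step is harmful
LABELS = (1, 0, -1)


def step_accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of steps whose predicted label equals the human label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of steps")
    if not gold:
        raise ValueError("no steps to score")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_label_accuracy(gold: Sequence[int], pred: Sequence[int]) -> dict[int, float]:
    """Accuracy restricted to the steps carrying each human label."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    # Skip labels that never occur in gold to avoid dividing by zero.
    return {label: hits[label] / totals[label] for label in LABELS if totals[label]}


# Toy data: human labels and model predictions for six steps.
gold = [1, 1, 0, -1, 1, 0]
pred = [1, 0, 0, -1, 1, 1]

print(f"overall: {step_accuracy(gold, pred):.3f}")
for label, acc in per_label_accuracy(gold, pred).items():
    print(f"label {label:+d}: {acc:.3f}")
```

Reporting the per-label breakdown alongside the overall number is a design choice of this sketch: with three classes, a model that rarely predicts the neutral label can still score well overall, and the breakdown makes that visible.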