AcademiClaw: When Students Set Challenges for AI Agents

arXiv:2605.02661 | Submitted May 4, 2026 | 35 benchmark results

Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo et al.

Abstract

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively.

Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis.
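As a rough illustration of how a multi-dimensional rubric could be collapsed into a single task score, the sketch below takes a weighted average over rubric dimensions. The dimension names, weights, and the 0.6 pass threshold are hypothetical and not taken from the paper; the six scoring techniques the benchmark actually combines are not described in this abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One rubric dimension; names and weights here are illustrative only."""
    name: str
    weight: float   # relative importance of this dimension
    score: float    # grader output in [0, 1]


def task_score(items: list[RubricItem]) -> float:
    """Weighted average of rubric scores, normalized by total weight."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("rubric needs positive total weight")
    return sum(item.weight * item.score for item in items) / total_weight


def passed(items: list[RubricItem], threshold: float = 0.6) -> bool:
    """Hypothetical pass rule: the weighted score meets the threshold."""
    return task_score(items) >= threshold


# Made-up rubric for a single task, purely for demonstration.
example = [
    RubricItem("correctness", weight=3.0, score=1.0),
    RubricItem("completeness", weight=2.0, score=0.5),
    RubricItem("report_quality", weight=1.0, score=0.0),
]
```

With these made-up numbers, task_score(example) is 4/6, about 0.67, so the task would pass at a 0.6 threshold and fail at 0.7.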

Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal.

We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.

Tasks

• Agentic AI

Results

35 results reproduced from this paper.

#	Model	Vendor	Benchmark	Value	SOTA	Date
01	Claude Opus 4.6	Anthropic	AcademiClaw	71.9%	#1	2026-05-04
02	Claude Sonnet 4.6	Anthropic	AcademiClaw	68.3%	-	2026-05-04
03	GPT-5.4	OpenAI	AcademiClaw	65.6%	-	2026-05-04
04	Qwen3.5-397B-A17B†	Alibaba	AcademiClaw	64.7%	-	2026-05-04
05	Gemini-3.1-Pro	Google	AcademiClaw	64.3%	-	2026-05-04
06	MiniMax M2.7	MiniMax	AcademiClaw	63.1%	-	2026-05-04
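
The spread across the six rows can be checked with a few lines of arithmetic. The sketch below hard-codes the values from the table; the variable names are illustrative, and the table does not state which metric "Value" reports.

```python
# Values copied from the results table above (percent).
scores = {
    "Claude Opus 4.6": 71.9,
    "Claude Sonnet 4.6": 68.3,
    "GPT-5.4": 65.6,
    "Qwen3.5-397B-A17B": 64.7,
    "Gemini-3.1-Pro": 64.3,
    "MiniMax M2.7": 63.1,
}

best = max(scores, key=scores.get)    # highest-scoring model
worst = min(scores, key=scores.get)   # lowest-scoring model

# Round to one decimal to match the table's precision.
spread = round(scores[best] - scores[worst], 1)
mean = round(sum(scores.values()) / len(scores), 1)

print(f"{best} leads {worst} by {spread} points; mean is {mean}%")
```

The gap between first and last place is 8.8 points, and the mean over the six models is 66.3%.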

Benchmark evidence

6 models from this paper.

1 dataset from this paper.
