AcademiClaw: When Students Set Challenges for AI Agents
Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively.
Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis.
Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal.
We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.
35 results reproduced from this paper.
| # | Model | Vendor | Benchmark | Value | SOTA | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | Claude Opus 4.6 | Anthropic | AcademiClaw | 71.9% | #1 | 2026-05-04 | source ↗ |
| 02 | Claude Sonnet 4.6 | Anthropic | AcademiClaw | 68.3% | | 2026-05-04 | source ↗ |
| 03 | GPT-5.4 | OpenAI | AcademiClaw | 65.6% | | 2026-05-04 | source ↗ |
| 04 | Qwen3.5-397B-A17B† | Alibaba | AcademiClaw | 64.7% | | 2026-05-04 | source ↗ |
| 05 | Gemini 3.1 Pro | Google | AcademiClaw | 64.3% | | 2026-05-04 | source ↗ |
| 06 | MiniMax M2.7 | MiniMax | AcademiClaw | 63.1% | | 2026-05-04 | source ↗ |
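
The abstract reports a pass rate (55% for the best model), while the Value column above lists higher percentages; the page does not say which aggregate Value represents. As a minimal sketch of how two such aggregates can diverge on the same results, the Python below computes both a mean rubric score and a threshold-based pass rate. Everything in it is an assumption made for illustration: the `TaskResult` type, the rubric dimension names and weights, the per-task scores, and the 0.75 pass threshold are hypothetical and are not taken from the AcademiClaw code or data.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record: a rubric score normalized to [0, 1]."""
    task_id: str
    score: float


def weighted_rubric_score(dimension_scores: dict[str, float],
                          weights: dict[str, float]) -> float:
    """Combine per-dimension scores into one score via a weighted average.

    The dimensions and weights are illustrative; the benchmark's real
    rubric structure is not described on this page.
    """
    total_weight = sum(weights[name] for name in dimension_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(score * weights[name]
                       for name, score in dimension_scores.items())
    return weighted_sum / total_weight


def mean_score(results: list[TaskResult]) -> float:
    """Average rubric score across tasks."""
    return sum(r.score for r in results) / len(results)


def pass_rate(results: list[TaskResult], threshold: float) -> float:
    """Fraction of tasks whose score meets or exceeds the threshold."""
    passed = sum(1 for r in results if r.score >= threshold)
    return passed / len(results)


if __name__ == "__main__":
    # Made-up scores for four hypothetical tasks.
    results = [
        TaskResult("task-a", 1.0),
        TaskResult("task-b", 0.75),
        TaskResult("task-c", 0.5),
        TaskResult("task-d", 0.25),
    ]
    print(f"mean score: {mean_score(results):.3f}")
    print(f"pass rate:  {pass_rate(results, threshold=0.75):.3f}")
```

With these made-up scores the mean is higher than the pass rate, because partial credit on failed tasks raises the mean but does not count toward passing. Whether that explains the gap between the abstract and the table here is not established by the source.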