This day in AI

Recent paper calendar for people tracking useful AI shifts.

CodeSOTA scans each day's arXiv listings for papers that could become useful benchmark rows, model-selection evidence, or product-facing research notes. This page keeps the daily trail visible instead of burying it in one-off reports.

New benchmarks for agentic routing, long-horizon software development, memory evaluation, and healthcare workflows reveal persistent gaps in frontier models, while dynamic layer routing offers a path to more efficient LLM inference.

Big picture

  • Agentic routing and delegation benchmarks surface fundamental limitations in current orchestration methods
  • Long-horizon software development and memory evaluation tasks expose sharp drops in performance as context grows
  • Domain-specific benchmarks for healthcare and engineering construction highlight the need for specialized evaluation
  • System-level efficiency gains from dynamic layer routing demonstrate progress in adaptive inference

Benchmarks to extract

  • Verify that the highest success rate on TwinRouterBench is 64.8% for computer-use models.
  • Verify that routing fidelity-at-1 on DecisionBench ranges from 7.5% to 29.5% across conditions.
  • Verify that Claude-Opus-4.7 resolves only 39.1% of RoadmapBench tasks.
  • Verify that the average accuracy across all systems on MINTEval is 27.9%.
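
One way to keep these checks honest is to hold each figure as a pending claim until the paper's tables confirm it. The sketch below is illustrative only: the `Claim` record, the `check` helper, and the tolerance are hypothetical names and choices, not part of any described CodeSOTA pipeline. The values are the abstract-level numbers listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    """Hypothetical record for one abstract-level number awaiting verification."""
    benchmark: str
    metric: str
    reported: float              # value as stated in the abstract, in percent
    status: str = "unverified"


def check(claim: Claim, extracted: Optional[float], tol: float = 0.05) -> Claim:
    """Mark a claim verified only if the PDF-extracted value matches the abstract."""
    if extracted is None:
        return claim             # nothing extracted yet, so it stays unverified
    matches = abs(extracted - claim.reported) <= tol
    claim.status = "verified" if matches else "mismatch"
    return claim


claims = [
    Claim("TwinRouterBench", "highest success rate", 64.8),
    Claim("DecisionBench", "fidelity-at-1, low end", 7.5),
    Claim("DecisionBench", "fidelity-at-1, high end", 29.5),
    Claim("RoadmapBench", "Claude-Opus-4.7 resolve rate", 39.1),
    Claim("MINTEval", "average accuracy", 27.9),
]
```

A claim moves out of "unverified" only when a value pulled from the paper itself is passed to `check`, which mirrors the triage-versus-evidence distinction in the method notes.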

Papers and links

Benchmark · arXiv:2605.18859

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

Step-level LLM routing benchmark with static and dynamic tracks for agentic workflows

Benchmark · arXiv:2605.19099

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Emergent delegation evaluation across GAIA, BFCL, and tau-bench

Benchmark · arXiv:2605.15846

RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

115 long-horizon coding tasks from real version upgrades across 17 repos

Benchmark · arXiv:2605.18565

MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

15.6k QA pairs over long contexts averaging 138.8k tokens for multi-target memory

Benchmark · arXiv:2605.16679

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

End-to-end healthcare workflow automation with 20 apps and 87 MCP tools

System · arXiv:2510.12773

Dr.LLM: Dynamic Layer Routing in LLMs

Dynamic layer routing with MCTS-supervised per-layer routers for efficient LLM inference
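
Each entry above pairs a category tag with an identifier. Assuming those identifiers are arXiv IDs (the page describes an arXiv scan but does not say so for each entry), a short Python sketch can split the tag from the ID and build the standard `https://arxiv.org/abs/<id>` link. The `PaperRef` type and `parse_entry` function are hypothetical names for illustration.

```python
import re
from typing import NamedTuple, Optional


class PaperRef(NamedTuple):
    tag: str        # category label such as "Benchmark" or "System"
    arxiv_id: str   # assumed to be an arXiv identifier
    url: str


# Tag, optional separator and "arXiv:" prefix, then an ID like 2605.18859.
ENTRY = re.compile(
    r"^(?P<tag>[A-Za-z]+)[^0-9A-Za-z]*(?:arXiv:)?\s*(?P<id>\d{4}\.\d{4,5})$"
)


def parse_entry(line: str) -> Optional[PaperRef]:
    """Return a PaperRef for a tag-plus-ID line, or None for any other line."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    return PaperRef(m["tag"], m["id"], f"https://arxiv.org/abs/{m['id']}")
```

The pattern accepts the tag and ID either fused or separated, so it tolerates small formatting differences between entries.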

Method note

Sampled 60 of 460 entries, prioritizing benchmarks and systems with the strongest quantitative signal from a deterministic candidate ranking. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
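
The note does not say how the deterministic candidate ranking is computed. As a rough illustration only, the Python sketch below scores an abstract by how many numeric tokens it contains (a stand-in for "quantitative signal") and takes the top k with a fixed tie-break so reruns select the same papers. `quant_signal` and `top_k` are hypothetical helpers, and the scoring rule is an assumption.

```python
import re
from typing import List, Tuple

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def quant_signal(abstract: str) -> int:
    """Assumed proxy for quantitative signal: count of numeric tokens."""
    return len(NUMBER.findall(abstract))


def top_k(candidates: List[Tuple[str, float]], k: int) -> List[str]:
    """Return the k highest-scoring ids; ties break on id so order is stable."""
    ranked = sorted(candidates, key=lambda c: (-c[1], c[0]))
    return [paper_id for paper_id, _ in ranked[:k]]


# Synthetic stand-in for a 460-entry batch; ids and scores are placeholders.
batch = [(f"paper-{i:03d}", float(i % 7)) for i in range(460)]
sample = top_k(batch, 60)
```

Sorting on the pair (negative score, id) instead of sampling at random is what makes the selection repeatable: the same batch always yields the same 60 papers regardless of input order.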

The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.

Big picture

  • Runtime agent safety is moving from prompt policies to action interception, MCP monitoring, and host-side controls.
  • Evaluation is shifting toward process-aware tasks: tool trajectories, delivered artifacts, multimodal verification, and human-validated rubrics.
  • Self-improving agents are becoming governed systems with rollback, canary tests, experience graphs, and explicit lifecycle controls.
  • Reasoning work is converging on sparse credit assignment: find the decision tokens or reasoning steps that actually steer the answer.

Benchmarks to extract

  • TOBench for tool-using agent rows
  • ADR-Bench and SLEIGHT-Bench for agent security
  • WebGameBench for coding-agent delivery
  • LinAlg-Bench, CAM-Bench, and GIM for reasoning diagnostics

Papers and links

Benchmark · arXiv:2605.16909

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

100 executable tasks, 27 MCP servers, 324 tools, and closed-loop multimodal verification for end-to-end tool use.

Safety · arXiv:2605.17380

ADR: An Agentic Detection and Response System

Production-style monitoring for MCP agent activity with ADR-Bench covering 302 tasks and 17 attack techniques.

Benchmark · arXiv:2605.17637

WebGameBench: Requirement-to-Application Evaluation for Coding Agents

Evaluates browser-accessible delivered games, separating minimum working delivery from excellent requirement satisfaction.

Benchmark · arXiv:2605.16675

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes

660 SymPy-certified linear algebra problems plus a failure taxonomy for diagnosing mathematical reasoning.

Reasoning · arXiv:2605.16874

Reasoning Can Be Restored by Correcting a Few Decision Tokens

Claims reasoning failures concentrate in a small number of early tokens, useful for intervention and evaluation design.

Agent · arXiv:2605.17721

EXG: Self-Evolving Agents with Experience Graphs

Turns successes and failures into graph memory, giving self-evolving agents a more inspectable substrate.

Method note

Full arXiv /new batch collected on May 19. The LLM scout covered all new submissions; deterministic benchmark detection covered the full batch. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
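
The detection rule itself is not described here. A minimal Python sketch of what a deterministic screen could look like is a name pattern applied to every title: the regular expression below flags capitalized names ending in "Bench", "Eval", or "Gym", a suffix list guessed from the benchmark names on this page. `BENCH_NAME` and `detect` are hypothetical names.

```python
import re
from typing import List

# Assumed heuristic: a capitalized token ending in a benchmark-style suffix.
BENCH_NAME = re.compile(r"\b[A-Z][A-Za-z0-9-]*(?:Bench|Eval|Gym)\b")


def detect(title: str) -> List[str]:
    """Return benchmark-style names found in a paper title, in order."""
    return BENCH_NAME.findall(title)


titles = [
    "TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents",
    "LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes",
    "ADR: An Agentic Detection and Response System",
]
flagged = {t: detect(t) for t in titles}
```

A title-only rule like this misses benchmarks that are introduced only in the abstract, such as ADR-Bench in the third title, which is one reason a scout pass and a deterministic screen complement each other.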

Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.

Big picture

  • Real-world web agents are getting eval environments that look more like actual SaaS and commerce workflows.
  • Agent architecture work is becoming cost-aware: context, hierarchy, reasoning depth, and monitoring are treated as budgeted design choices.
  • Formal methods are appearing as runtime guardrails for LLM systems rather than only offline verification work.
  • Medical and robotics papers are packaging open systems around concrete downstream workflows instead of generic model releases.

Benchmarks to extract

  • ShopGym and SaaS-Bench for web-agent task pages
  • PAGER/PAGE Bench for long-form or page-level agent evaluation
  • ToxiAlert-Bench and RoadmapBench from the deterministic screen
  • VLA-AD and RTL-BenchMT for embodied and hardware-facing rows

Papers and links

Benchmark · arXiv:2605.15777

SaaS-Bench

A practical benchmark direction for agents operating across SaaS workflows, useful for procurement-style agent evaluation.

Benchmark · arXiv:2605.16116

ShopGym

High-priority e-commerce agent environment surfaced by both the scout and deterministic screen.

Safety · arXiv:2605.16198

Formal Methods Meet LLMs

Runtime monitoring and formal constraints as a concrete control layer for LLM agents.

Agent · arXiv:2605.16217

Argus Deep Research Agents

Deep-research agent work that matches CodeSOTA's interest in paper-to-evidence workflows.

System · arXiv:2605.16215

Fully Open Meditron

Open medical AI system surfaced as a practical extraction target for model, data, and benchmark claims.

Method note

The dated May 18 arXiv /recent section was collected in full. The full LLM run was stopped in favor of a subsample: a 60-paper scout plus deterministic screening across all 200 summaries. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.

Big picture

  • Agent orchestration papers are moving toward explicit graphs, patterns, and inspectable memory instead of one-off prompt chains.
  • Education and industrial benchmarks are becoming stronger examples of domain-specific agent evaluation.
  • Safety and governance papers are trying to separate the task being performed from the governance process wrapped around it.
  • Reasoning papers continue to probe symbolic structure, attributes, and limits of model-based inference.

Benchmarks to extract

  • EntityBench, ClawForge, EduAgentBench, and Herculean
  • PDI-Bench, Collider-Bench, and XDomainBench
  • EduFrameTrap for sycophancy and education-agent failure modes
  • SimPersona for persona or simulation-agent evaluation

Papers and links

Safety · arXiv:2605.14744

Governance-task decoupling

Useful framing for separating operational task success from oversight and governance quality.

Method note

This entry draws on the existing May 15 batch and the reports already in the local paper pipeline. Claims here are abstract- or report-level triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.

Next extraction pass

Turn the calendar into benchmark evidence, not just reading notes.

The valuable follow-up is to pull benchmark tables, model lists, task definitions, and failure taxonomies from the strongest papers. That gives CodeSOTA rows users can compare, not merely links they can browse.