TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
Step-level LLM routing benchmark with static and dynamic tracks for agentic workflows
This day in AI
CodeSOTA scans each day's arXiv listings for papers that could become useful benchmark rows, model-selection evidence, or product-facing research notes. This page keeps the daily trail visible instead of burying it in one-off reports.
Calendar
Wednesday
New benchmarks for agentic routing, long-horizon software development, memory evaluation, and healthcare workflows reveal persistent gaps in frontier models, while dynamic layer routing offers a path to more efficient LLM inference.
460 entries, including 70 new submissions
Tuesday
The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.
900 entries, including 142 new submissions
Monday
Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.
200 entries, sampled LLM scout plus full deterministic screen
Friday
Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.
445 entries, including 100 new submissions
Wednesday
New benchmarks for agentic routing, long-horizon software development, memory evaluation, and healthcare workflows reveal persistent gaps in frontier models, while dynamic layer routing offers a path to more efficient LLM inference.
Big picture
Benchmarks to extract
Papers and links
Step-level LLM routing benchmark with static and dynamic tracks for agentic workflows
Emergent delegation evaluation across GAIA, BFCL, and tau-bench
115 long-horizon coding tasks from real version upgrades across 17 repos
15.6k QA pairs over long contexts averaging 138.8k tokens for multi-target memory
End-to-end healthcare workflow automation with 20 apps and 87 MCP tools
Dynamic layer routing with MCTS-supervised per-layer routers for efficient LLM inference
Method note
Sampled 60 of 460 entries, prioritizing benchmarks and systems with the strongest quantitative signal from a deterministic candidate ranking. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
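The sampling step described in this note can be sketched roughly as follows. This is a minimal illustration, not CodeSOTA's actual ranking: the `Entry` type, the keyword list, the scoring weights, and the function names are all assumptions made for the example. The only properties taken from the note are that the ranking is deterministic and that a fixed budget of top candidates goes to the LLM scout.

```python
import re
from dataclasses import dataclass

# Hypothetical signal terms; the note does not say which signals are used.
BENCHMARK_TERMS = ("benchmark", "tasks", "evaluation", "dataset", "leaderboard")
# Matches numeric tokens such as "115", "15.6k", or "138.8k".
NUMBER = re.compile(r"\d[\d,.]*k?")


@dataclass(frozen=True)
class Entry:
    arxiv_id: str
    abstract: str


def signal_score(entry: Entry) -> int:
    """Crude quantitative-signal score: keyword hits weigh double numeric hits."""
    text = entry.abstract.lower()
    keyword_hits = sum(text.count(term) for term in BENCHMARK_TERMS)
    numeric_hits = len(NUMBER.findall(text))
    return 2 * keyword_hits + numeric_hits


def select_for_scout(entries, budget):
    """Rank by score, break ties by id so the order is reproducible, keep the top `budget`."""
    ranked = sorted(entries, key=lambda e: (-signal_score(e), e.arxiv_id))
    return ranked[:budget]
```

With a budget of 60 and a batch of 460 entries, `select_for_scout(entries, 60)` would return the subset sent to the scout. Breaking ties on the identifier is what keeps repeated runs over the same batch identical.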
Tuesday
The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.
Big picture
Benchmarks to extract
Papers and links
100 executable tasks, 27 MCP servers, 324 tools, and closed-loop multimodal verification for end-to-end tool use.
Production-style monitoring for MCP agent activity with ADR-Bench covering 302 tasks and 17 attack techniques.
Evaluates browser-accessible delivered games, separating minimum working delivery from excellent requirement satisfaction.
660 SymPy-certified linear algebra problems plus a failure taxonomy for diagnosing mathematical reasoning.
Claims reasoning failures concentrate in a small number of early tokens, useful for intervention and evaluation design.
Turns successes and failures into graph memory, giving self-evolving agents a more inspectable substrate.
Method note
Full arXiv /new batch collected on May 19. The LLM scout covered all new submissions; deterministic benchmark detection covered the full batch. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Monday
Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.
Big picture
Benchmarks to extract
Papers and links
A practical benchmark direction for agents operating across SaaS workflows, useful for procurement-style agent evaluation.
High-priority e-commerce agent environment surfaced by both the scout and deterministic screen.
Agent architecture signal around designing and evaluating practical multi-step systems.
Runtime monitoring and formal constraints as a concrete control layer for LLM agents.
Deep-research agent work that matches CodeSOTA's interest in paper-to-evidence workflows.
Open medical AI system surfaced as a practical extraction target for model, data, and benchmark claims.
Method note
The May 18 section of arXiv /recent was collected in full. The full LLM run was stopped in favor of a subsample: a 60-paper scout plus deterministic screening across all 200 summaries. Abstract-level summaries are triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Friday
Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.
Big picture
Benchmarks to extract
Papers and links
Agent orchestration and reliability signal around graph-structured coordination.
Agent memory cold-start work; useful for understanding reusable memory in deployed agents.
High-priority benchmark candidate from the Friday benchmark report.
Education-agent evaluation target with obvious CodeSOTA task-page value.
Entity-centered benchmark surfaced as a strong extraction candidate.
Useful framing for separating operational task success from oversight and governance quality.
Method note
This uses the existing May 15 batch and reports already in the local paper pipeline. Claims here are abstract- or report-level triage, not final benchmark claims; papers with tables or model rankings still need PDF/source extraction before they become CodeSOTA evidence rows.
Next extraction pass
The valuable follow-up is to pull benchmark tables, model lists, task definitions, and failure taxonomies from the strongest papers. That gives CodeSOTA rows users can compare, not merely links they can browse.
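One way to picture a comparable row is the sketch below. It is not CodeSOTA's schema: the `EvidenceRow` fields, the helper name, and the assumption that a higher metric value is better are all hypothetical, chosen only to show why extracted tables are easier to compare than links.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRow:
    """Hypothetical shape for one extracted result; all field names are assumptions."""
    benchmark: str
    model: str
    metric: str
    value: float
    source: str  # where the number came from, e.g. a table label in the PDF


def best_per_benchmark(rows):
    """Return the top row for each (benchmark, metric) pair, assuming higher is better."""
    best = {}
    for row in rows:
        key = (row.benchmark, row.metric)
        if key not in best or row.value > best[key].value:
            best[key] = row
    return best
```

Keeping `source` on every row means each comparison can be traced back to the table it was extracted from, which matches the requirement that abstract-level claims are not treated as evidence.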