This day in AI

Recent paper calendar for people tracking useful AI shifts.

CodeSOTA scans each day's arXiv listings for papers that could become useful benchmark rows, model-selection evidence, or product-facing research notes. This page keeps the daily trail visible instead of burying it in one-off reports.

Calendar

Research days worth revisiting

Wednesday

May 20, 2026

New benchmarks for agentic routing, long-horizon software development, memory evaluation, and healthcare workflows reveal persistent gaps in frontier models, while dynamic layer routing offers a path to more efficient LLM inference.

460 entries, including 70 new submissions

Tuesday

May 19, 2026

The day was dominated by agent runtime security, process-aware benchmarks, self-improving agent systems, and sparse reasoning credit assignment.

900 entries, including 142 new submissions

Monday

May 18, 2026

Monday's useful signal was practical agent infrastructure: SaaS and shopping environments, formal monitoring, research-agent scaffolding, and vertical medical systems.

200 entries, sampled LLM scout plus full deterministic screen

Friday

May 15, 2026

Friday's signal was about making agent behavior legible: orchestration graphs, memory cold starts, design patterns, education agents, and benchmarked governance workflows.

445 entries, including 100 new submissions

Big picture

Agentic routing and delegation benchmarks surface fundamental limitations in current orchestration methods
Long-horizon software development and memory evaluation tasks expose sharp drops in performance as context grows
Domain-specific benchmarks for healthcare and engineering construction highlight the need for specialized evaluation
System-level efficiency gains from dynamic layer routing demonstrate progress in adaptive inference

Benchmarks to extract

Verify that the highest success rate on TwinRouterBench is 64.8% for computer-use models.
Verify that routing fidelity-at-1 on DecisionBench ranges from 7.5% to 29.5% across conditions.
Verify that Claude-Opus-4.7 resolves only 39.1% of RoadmapBench tasks.
Verify that the average accuracy across all systems on MINTEval is 27.9%.
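
One way to keep these checks from getting lost is to record each claim as a structured row that stays unverified until someone confirms it against the paper. The sketch below is only an illustration: the `ClaimToVerify` class, its field names, and the 0.05-point tolerance are assumptions made for this example, not part of any CodeSOTA tooling. The figures are the ones listed above and are still unverified.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ClaimToVerify:
    """Hypothetical record for one number that still needs checking."""
    benchmark: str
    metric: str
    low: float               # lower bound of the claimed value, in percent
    high: float              # upper bound (equal to low for a single value)
    subject: Optional[str] = None
    verified: bool = False   # stays False until checked against the paper

    def matches(self, reported: float, tol: float = 0.05) -> bool:
        """True if a value read from the paper falls inside the claimed range."""
        return self.low - tol <= reported <= self.high + tol


# Figures copied from the list above; none of them has been verified yet.
CLAIMS = [
    ClaimToVerify("TwinRouterBench", "highest success rate (%)", 64.8, 64.8,
                  subject="computer-use models"),
    ClaimToVerify("DecisionBench", "routing fidelity-at-1 (%)", 7.5, 29.5),
    ClaimToVerify("RoadmapBench", "tasks resolved (%)", 39.1, 39.1,
                  subject="Claude-Opus-4.7"),
    ClaimToVerify("MINTEval", "average accuracy across all systems (%)",
                  27.9, 27.9),
]


def pending(claims: List[ClaimToVerify]) -> List[str]:
    """Names of benchmarks whose claims are still unverified."""
    return [c.benchmark for c in claims if not c.verified]


if __name__ == "__main__":
    for name in pending(CLAIMS):
        print("still to verify:", name)
```

Keeping a range (`low`, `high`) rather than a single value lets the DecisionBench claim sit in the same structure as the single-number claims.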

Papers and links

Benchmark · arXiv:2605.18859

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

Step-level LLM routing benchmark with static and dynamic tracks for agentic workflows

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

Dr.LLM: Dynamic Layer Routing in LLMs

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

ADR: An Agentic Detection and Response System

WebGameBench: Requirement-to-Application Evaluation for Coding Agents

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes

Reasoning Can Be Restored by Correcting a Few Decision Tokens

EXG: Self-Evolving Agents with Experience Graphs

SaaS-Bench

ShopGym

FORGE

Formal Methods Meet LLMs

Argus Deep Research Agents

Fully Open Meditron

GraphBit

PREPING

ClawForge

EduAgentBench

EntityBench

Governance-task decoupling

Turn the calendar into benchmark evidence, not just reading notes.