arXiv:2606.03203
MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
Authors pending
Abstract
The benchmark covers 18 clinical scenarios across 10 domains. The best closed-source model reaches 54.2% strict success, while open-source agents average 2.5%.
Tasks
Results
No benchmark results recorded yet.
CodeSOTA extraction
Benchmark evidence
Link this paper to benchmark rows, datasets, model cards, and reproduced results as evidence is extracted.