Medical

Building healthcare AI? Find benchmarks for medical imaging, disease diagnosis, clinical text processing, and drug discovery.

4 tasks · 15 datasets · 71 results

Medical AI has reached an inflection point with 22% of healthcare organizations deploying domain-specific AI (7x YoY growth). The landscape spans diagnostic imaging, clinical NLP, drug discovery, and FDA-approved applications, with foundation models and transformers achieving clinician-level performance while facing critical generalizability challenges.

State of the Field (2025)

  • GPT-4o achieves 90.4% accuracy on USMLE questions, Med-PaLM 2 reaches 92.6% expert evaluation score, matching clinician-level performance on medical question answering
  • Vision Transformers with Grad-CAM explainability outperform CNNs across breast cancer, brain tumors, and retinal imaging; Atten-Nonlocal Unet achieves 84-91% Dice scores on multi-organ segmentation
  • BoltzGen enables generative protein design for drug discovery; AlphaFold predicted 200M+ protein structures; FDA approved 40+ AI devices in 2024-2025 including radiology, pathology, and ultrasound tools
  • External validation reveals concerning generalizability issues: models maintain 85%+ sensitivity but specificity drops by up to 24 percentage points across sites, and GPT-4V shows a 46.8% hallucination rate on pathology detection

Quick Recommendations

Medical imaging segmentation (organs, tumors, anatomical structures)

Atten-Nonlocal Unet or MedSAM foundation model

Atten-Nonlocal Unet achieves 84-91% Dice scores across Synapse/ACDC/AVT with attention mechanisms for long-range dependencies. For broader generalization, MedSAM provides general-purpose segmentation across modalities with minimal task-specific fine-tuning.
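The Dice scores cited for these segmentation models compare a predicted mask against the ground truth: twice the overlap divided by the total size of both masks. A minimal set-based sketch (the `dice_score` helper is illustrative, not from any of the papers above):

```python
def dice_score(pred, target):
    """Dice similarity coefficient for binary masks.

    pred, target: iterables of foreground voxel indices
    (equivalently, the nonzero positions of 0/1 masks).
    Returns a value in [0, 1]; 1.0 means perfect overlap.
    """
    pred, target = set(pred), set(target)
    if not pred and not target:
        # Both masks empty: conventionally treated as perfect agreement.
        return 1.0
    return 2 * len(pred & target) / (len(pred) + len(target))
```

For example, masks `{1, 2, 3, 4}` and `{3, 4, 5, 6}` share two voxels out of eight total, giving a Dice of 0.5; the 84-91% figures above are this quantity averaged over structures and cases.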

Clinical question answering and decision support

GPT-4o or Med-PaLM 2

GPT-4o achieves 90.4% USMLE accuracy (92.7% diagnostic, 88.8% management) with multimodal capabilities. Med-PaLM 2 reaches 92.6% expert evaluation vs 92.9% for clinicians. Both outperform medical student baseline (59.3%) and handle complex clinical vignettes.

Medical image classification (cancer detection, disease identification)

DINO Vision Transformer with Grad-CAM

Self-supervised DINO outperforms CNNs across breast cancer, skin lesions, brain tumors, and retinal imaging. Grad-CAM provides spatially precise, class-discriminative explanations essential for clinical adoption. Works even with limited labeled data via transfer learning.
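Grad-CAM's class-discriminative maps come from weighting a convolutional layer's activation maps by the globally pooled gradients of the class score, then keeping only positive evidence. A minimal numpy sketch of that computation, assuming the activations and gradients have already been extracted from the network (the `grad_cam` function name is illustrative):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap (Selvaraju et al., 2017).

    activations, gradients: arrays of shape (channels, H, W) -- a conv
    layer's feature maps and the gradients of the target class score
    with respect to them. Returns an (H, W) map normalized to [0, 1].
    """
    # Channel importance: global-average-pool each gradient map.
    weights = gradients.mean(axis=(1, 2))              # shape (C,)
    # Weighted sum of activation maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    cam = np.maximum(cam, 0)                           # ReLU: positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The resulting heatmap is upsampled to the input resolution and overlaid on the image, which is what makes the explanation spatially precise enough for clinical review.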

Clinical NLP and information extraction from unstructured notes

LLM-Augmented BiLSTM-BERT framework

Structured LLM augmentation improves strict NER F1 from 81.2% to 81.8% on i2b2-2012, relation extraction from 82.8% to 84.1% on N2C2-2018. Handles lengthy clinical documents exceeding standard transformer limits while preserving drug-dosage and condition-symptom relationships.
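The strict NER F1 quoted here credits a prediction only when both the span boundaries and the entity type match a gold entity exactly. A minimal sketch, assuming entities are represented as `(start, end, type)` tuples (a common i2b2-style evaluation convention; the helper name is illustrative):

```python
def strict_ner_f1(gold, pred):
    """Strict (exact-match) NER F1.

    gold, pred: collections of (start, end, type) tuples. A predicted
    entity is a true positive only if an identical tuple exists in gold.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this metric, predicting a drug mention with the right type but an off-by-one span scores zero for that entity, which is why strict F1 gains of under a point (81.2% to 81.8%) can still be meaningful on clinical text.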

Drug discovery and protein design

BoltzGen for binder design, AlphaFold for structure prediction

BoltzGen enables generative design of novel protein binders for arbitrary targets (validated across 26 diverse cases). AlphaFold provides foundational structure prediction for 200M+ proteins with custom annotations. This represents a step change from structure prediction to ab initio functional protein design.

Radiology workflow automation (imaging analysis, reporting)

FDA-cleared AI-Rad Companion or Claude 3.5 Sonnet for clinical tasks

AI-Rad Companion (FDA-cleared March 2025) handles organ segmentation for radiotherapy planning. For broader clinical workflows, Claude 3.5 Sonnet achieves 70% success on Stanford MedAgentBench (retrieving patient data, ordering tests, prescribing medications via FHIR APIs).

Pathology and whole-slide image analysis

Graph Neural Networks (DeepTFtyper architecture)

GNNs model spatial tissue relationships and topology critical for histopathology. DeepTFtyper predicts molecular subtypes (SCLC-A/N/P/Y) from H&E slides alone with AUC >0.70, enabling molecular-informed treatment selection without separate molecular testing.

Privacy-preserving multi-institutional model development

Federated Learning with Bayesian uncertainty quantification

Enables collaborative training while keeping patient data at local institutions. Bayesian approaches provide predictive uncertainty across federated settings, improving inference quality vs standard aggregation. Essential for GDPR/HIPAA compliance and rare disease research with distributed data.
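The standard aggregation step in federated learning is FedAvg (McMahan et al., 2017): each institution trains on its own patients, and a central server averages the resulting parameters weighted by local dataset size, so raw records never leave the site. A minimal pure-Python sketch with scalar parameters for illustration:

```python
def fedavg(client_weights, client_sizes):
    """Federated averaging of model parameters.

    client_weights: one parameter list per institution (same length each).
    client_sizes: number of local training examples per institution.
    Returns the size-weighted average parameter list -- only these
    parameters, never patient data, are sent to the server.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]
```

Bayesian variants replace this point average with aggregation over parameter distributions, which is what yields the predictive uncertainty mentioned above.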

Rare disease diagnosis with limited training data

RareScale LLM framework

Achieves 88.8% candidate generation performance and 17%+ improvement in Top-5 accuracy across 575 rare diseases vs baseline black-box LLMs. Specialized prompt engineering and evaluation strategies work even with inherently limited training examples.

Patient deterioration and survival prediction

DeepHit model for survival analysis

Achieves concordance index 0.94 and one-year AUC 0.89, substantially outperforming Cox proportional hazards. Effectively integrates temporal information and dynamic patient characteristics for longitudinal outcome prediction with missing data.
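The concordance index quoted here measures how often, across comparable patient pairs, the model assigns higher risk to the patient who experiences the event earlier. A minimal sketch of Harrell's C-index for right-censored data (function name illustrative; production work would use a survival library):

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index.

    times: observed follow-up times; events: 1 if the event occurred,
    0 if censored; risks: predicted risk scores (higher = worse).
    A pair is comparable only if the earlier time is an actual event;
    tied risk predictions count as half-concordant.
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[j] < times[i]:
            i, j = j, i                     # patient i now has the earlier time
        if times[i] == times[j] or not events[i]:
            continue                        # censored-first or tied-time pairs are skipped
        comparable += 1
        if risks[i] > risks[j]:
            concordant += 1
        elif risks[i] == risks[j]:
            concordant += 0.5
    return concordant / comparable if comparable else 0.0
```

A C-index of 0.5 is chance-level ranking and 1.0 is perfect, so the 0.94 reported for DeepHit indicates near-perfect risk ordering on its evaluation cohort.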

Tasks & Benchmarks


Disease Classification

  • ABIDE I (2012): 98 accuracy, Plymouth DL Model
  • ABIDE II (2017): 93 AUC, DeepASD
  • COVID-19 Image Data Collection (2020): 94.7 AUROC, DenseNet-121 (Chest X-ray)
  • CheXpert (2019): 93 AUROC, CheXpert AUC Maximizer
  • MIMIC-CXR (2019): 89.2 AUROC, CheXzero
  • NIH ChestX-ray14 (2017): 85.8 AUROC, TorchXRayVision
  • PadChest (2020): 84.6 AUROC, TorchXRayVision
  • RSNA Pneumonia Detection (2018): 88.5 AUROC, DenseNet-121 (Chest X-ray)
  • VinDr-CXR (2022): 91.2 AUROC, RAD-DINO

Medical Image Segmentation

  • ACDC (2017): 92.65 mean DSC, MedNeXt-L
  • BTCV (2015): 85.38 mean DSC, STU-Net-H
  • BraTS 2023 (2023): 0.90 mean Dice (WT/TC/ET), MedNeXt-L
  • Synapse Multi-Organ CT (2015): 86.45 mean DSC, SegMamba

Clinical NLP

Drug Discovery

Honest Takes

Multimodal models are failing medical imaging

GPT-4V achieves 100% on imaging modality identification but only 35.2% on pathology detection with 46.8% hallucination rates. Adding images to text-optimized models sometimes decreases accuracy. Current vision-language architectures don't preserve the specialized visual-spatial reasoning required for clinical diagnosis.

FDA approval doesn't mean it works in your hospital

External validation shows median AUC drops by 0.03, with specificity degrading by up to 24 percentage points despite FDA clearance. Most approved devices lack comprehensive multi-site validation and age/sex subgroup performance data. One trauma CNN maintained 85% sensitivity, but its specificity crashed from 94% to 70% on older patients.

We have 4 real-world LLM deployments vs thousands of papers

Only four published studies (2024-2025) describe actual LLM implementations in clinical workflows, despite thousands of benchmark papers. The translation gap is massive. Most papers show what models could do, not what they actually do reliably in clinical environments with missing data, workflow constraints, and liability.

Foundation models are becoming clinical infrastructure

Rather than building task-specific models, organizations are deploying general foundation models like MedSAM and adapting them locally through fine-tuning. Open-source releases like DeepSeek-V3 (62.67% clinical task accuracy) and BoltzGen are democratizing access, reducing vendor lock-in for resource-limited settings.

