Medical

Building healthcare AI? Find benchmarks for medical imaging, disease diagnosis, clinical text processing, and drug discovery.

4 tasks · 15 datasets · 71 results

Medical AI has reached an inflection point with 22% of healthcare organizations deploying domain-specific AI (7x YoY growth). The landscape spans diagnostic imaging, clinical NLP, drug discovery, and FDA-approved applications, with foundation models and transformers achieving clinician-level performance while facing critical generalizability challenges.

State of the Field (2025)

  • GPT-4o achieves 90.4% accuracy on USMLE questions, Med-PaLM 2 reaches 92.6% expert evaluation score, matching clinician-level performance on medical question answering
  • Vision Transformers with Grad-CAM explainability outperform CNNs across breast cancer, brain tumors, and retinal imaging; Atten-Nonlocal Unet achieves 84-91% Dice scores on multi-organ segmentation
  • BoltzGen enables generative protein design for drug discovery; AlphaFold predicted 200M+ protein structures; FDA approved 40+ AI devices in 2024-2025 including radiology, pathology, and ultrasound tools
  • External validation reveals concerning generalizability issues: models maintain 85%+ sensitivity but specificity drops by up to 24 percentage points across sites, and GPT-4V shows a 46.8% hallucination rate on pathology detection

Quick Recommendations

Medical imaging segmentation (organs, tumors, anatomical structures)

Atten-Nonlocal Unet or MedSAM foundation model

Atten-Nonlocal Unet achieves 84-91% Dice scores across Synapse/ACDC/AVT with attention mechanisms for long-range dependencies. For broader generalization, MedSAM provides general-purpose segmentation across modalities with minimal task-specific fine-tuning.
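The Dice scores cited for these segmentation models compare a predicted mask against the ground truth: twice the overlap divided by the total size of both masks. A minimal set-based sketch (the `dice_score` helper is illustrative, not from any of the papers above):

```python
def dice_score(pred, target):
    """Dice similarity coefficient for binary masks.

    pred, target: iterables of foreground voxel indices
    (equivalently, the nonzero positions of 0/1 masks).
    Returns a value in [0, 1]; 1.0 means perfect overlap.
    """
    pred, target = set(pred), set(target)
    if not pred and not target:
        # Both masks empty: conventionally treated as perfect agreement.
        return 1.0
    return 2 * len(pred & target) / (len(pred) + len(target))
```

For example, masks `{1, 2, 3, 4}` and `{3, 4, 5, 6}` share two voxels out of eight total, giving a Dice of 0.5; the 84-91% figures above are this quantity averaged over structures and cases.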

Clinical question answering and decision support

GPT-4o or Med-PaLM 2

GPT-4o achieves 90.4% USMLE accuracy (92.7% diagnostic, 88.8% management) with multimodal capabilities. Med-PaLM 2 reaches 92.6% expert evaluation vs 92.9% for clinicians. Both outperform medical student baseline (59.3%) and handle complex clinical vignettes.

Medical image classification (cancer detection, disease identification)

DINO Vision Transformer with Grad-CAM

Self-supervised DINO outperforms CNNs across breast cancer, skin lesions, brain tumors, and retinal imaging. Grad-CAM provides spatially precise, class-discriminative explanations essential for clinical adoption. Works even with limited labeled data via transfer learning.
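Grad-CAM's class-discriminative maps come from weighting a convolutional layer's activation maps by the globally pooled gradients of the class score, then keeping only positive evidence. A minimal numpy sketch of that computation, assuming the activations and gradients have already been extracted from the network (the `grad_cam` function name is illustrative):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap (Selvaraju et al., 2017).

    activations, gradients: arrays of shape (channels, H, W) -- a conv
    layer's feature maps and the gradients of the target class score
    with respect to them. Returns an (H, W) map normalized to [0, 1].
    """
    # Channel importance: global-average-pool each gradient map.
    weights = gradients.mean(axis=(1, 2))              # shape (C,)
    # Weighted sum of activation maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    cam = np.maximum(cam, 0)                           # ReLU: positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The resulting heatmap is upsampled to the input resolution and overlaid on the image, which is what makes the explanation spatially precise enough for clinical review.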

Clinical NLP and information extraction from unstructured notes

LLM-Augmented BiLSTM-BERT framework

Structured LLM augmentation improves strict NER F1 from 81.2% to 81.8% on i2b2-2012, relation extraction from 82.8% to 84.1% on N2C2-2018. Handles lengthy clinical documents exceeding standard transformer limits while preserving drug-dosage and condition-symptom relationships.
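The strict NER F1 quoted here credits a prediction only when both the span boundaries and the entity type match a gold entity exactly. A minimal sketch, assuming entities are represented as `(start, end, type)` tuples (a common i2b2-style evaluation convention; the helper name is illustrative):

```python
def strict_ner_f1(gold, pred):
    """Strict (exact-match) NER F1.

    gold, pred: collections of (start, end, type) tuples. A predicted
    entity is a true positive only if an identical tuple exists in gold.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this metric, predicting a drug mention with the right type but an off-by-one span scores zero for that entity, which is why strict F1 gains of under a point (81.2% to 81.8%) can still be meaningful on clinical text.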

Drug discovery and protein design

BoltzGen for binder design, AlphaFold for structure prediction

BoltzGen enables generative design of novel protein binders for arbitrary targets (validated across 26 diverse cases). AlphaFold provides foundational structure prediction for 200M+ proteins with custom annotations. This represents a step change from structure prediction to ab initio functional protein design.

Radiology workflow automation (imaging analysis, reporting)

FDA-cleared AI-Rad Companion or Claude 3.5 Sonnet for clinical tasks

AI-Rad Companion (FDA-cleared March 2025) handles organ segmentation for radiotherapy planning. For broader clinical workflows, Claude 3.5 Sonnet achieves 70% success on Stanford MedAgentBench (retrieving patient data, ordering tests, prescribing medications via FHIR APIs).

Pathology and whole-slide image analysis

Graph Neural Networks (DeepTFtyper architecture)

GNNs model spatial tissue relationships and topology critical for histopathology. DeepTFtyper predicts molecular subtypes (SCLC-A/N/P/Y) from H&E slides alone with AUC >0.70, enabling molecular-informed treatment selection without separate molecular testing.

Privacy-preserving multi-institutional model development

Federated Learning with Bayesian uncertainty quantification

Enables collaborative training while keeping patient data at local institutions. Bayesian approaches provide predictive uncertainty across federated settings, improving inference quality vs standard aggregation. Essential for GDPR/HIPAA compliance and rare disease research with distributed data.
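The standard aggregation step in federated learning is FedAvg (McMahan et al., 2017): each institution trains on its own patients, and a central server averages the resulting parameters weighted by local dataset size, so raw records never leave the site. A minimal pure-Python sketch with scalar parameters for illustration:

```python
def fedavg(client_weights, client_sizes):
    """Federated averaging of model parameters.

    client_weights: one parameter list per institution (same length each).
    client_sizes: number of local training examples per institution.
    Returns the size-weighted average parameter list -- only these
    parameters, never patient data, are sent to the server.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]
```

Bayesian variants replace this point average with aggregation over parameter distributions, which is what yields the predictive uncertainty mentioned above.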

Rare disease diagnosis with limited training data

RareScale LLM framework

Achieves 88.8% candidate generation performance and 17%+ improvement in Top-5 accuracy across 575 rare diseases vs baseline black-box LLMs. Specialized prompt engineering and evaluation strategies work even with inherently limited training examples.

Patient deterioration and survival prediction

DeepHit model for survival analysis

Achieves concordance index 0.94 and one-year AUC 0.89, substantially outperforming Cox proportional hazards. Effectively integrates temporal information and dynamic patient characteristics for longitudinal outcome prediction with missing data.
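The concordance index quoted here measures how often, across comparable patient pairs, the model assigns higher risk to the patient who experiences the event earlier. A minimal sketch of Harrell's C-index for right-censored data (function name illustrative; production work would use a survival library):

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index.

    times: observed follow-up times; events: 1 if the event occurred,
    0 if censored; risks: predicted risk scores (higher = worse).
    A pair is comparable only if the earlier time is an actual event;
    tied risk predictions count as half-concordant.
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[j] < times[i]:
            i, j = j, i                     # patient i now has the earlier time
        if times[i] == times[j] or not events[i]:
            continue                        # censored-first or tied-time pairs are skipped
        comparable += 1
        if risks[i] > risks[j]:
            concordant += 1
        elif risks[i] == risks[j]:
            concordant += 0.5
    return concordant / comparable if comparable else 0.0
```

A C-index of 0.5 is chance-level ranking and 1.0 is perfect, so the 0.94 reported for DeepHit indicates near-perfect risk ordering on its evaluation cohort.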

Tasks & Benchmarks


Disease Classification

  • ABIDE I (2012): 98 accuracy, Plymouth DL Model
  • ABIDE II (2017): 93 AUC, DeepASD
  • COVID-19 Image Data Collection (2020): 94.7 AUROC, DenseNet-121 (Chest X-ray)
  • CheXpert (2019): 93 AUROC, CheXpert AUC Maximizer
  • MIMIC-CXR (2019): 89.2 AUROC, CheXzero
  • NIH ChestX-ray14 (2017): 85.8 AUROC, TorchXRayVision
  • PadChest (2020): 84.6 AUROC, TorchXRayVision
  • RSNA Pneumonia Detection (2018): 88.5 AUROC, DenseNet-121 (Chest X-ray)
  • VinDr-CXR (2022): 91.2 AUROC, RAD-DINO

Medical Image Segmentation

  • ACDC (2017): 92.65 mean DSC, MedNeXt-L
  • BTCV (2015): 85.38 mean DSC, STU-Net-H
  • BraTS 2023 (2023): 0.90 mean Dice (WT/TC/ET), MedNeXt-L
  • Synapse Multi-Organ CT (2015): 86.45 mean DSC, SegMamba

Clinical NLP

Drug Discovery

Honest Takes

Multimodal models are failing medical imaging

GPT-4V achieves 100% on imaging modality identification but only 35.2% on pathology detection with 46.8% hallucination rates. Adding images to text-optimized models sometimes decreases accuracy. Current vision-language architectures don't preserve the specialized visual-spatial reasoning required for clinical diagnosis.

FDA approval doesn't mean it works in your hospital

External validation shows median AUC drops by 0.03, with specificity degrading by up to 24 percentage points despite FDA clearance. Most approved devices lack comprehensive multi-site validation and age/sex subgroup performance data. One trauma CNN maintained 85% sensitivity, but its specificity crashed from 94% to 70% on older patients.

We have 4 real-world LLM deployments vs thousands of papers

Only four published studies (2024-2025) describe actual LLM implementations in clinical workflows, despite thousands of benchmark papers. The translation gap is massive. Most papers show what models could do, not what they actually do reliably in clinical environments with missing data, workflow constraints, and liability.

Foundation models are becoming clinical infrastructure

Rather than building task-specific models, organizations are deploying general foundation models like MedSAM and adapting them locally through fine-tuning. Open-source releases like DeepSeek-V3 (62.67% clinical task accuracy) and BoltzGen are democratizing access, reducing vendor lock-in for resource-limited settings.

