Medical
Building healthcare AI? Find benchmarks for medical imaging, disease diagnosis, clinical text processing, and drug discovery.
Medical AI has reached an inflection point: 22% of healthcare organizations now deploy domain-specific AI, a sevenfold year-over-year increase. The landscape spans diagnostic imaging, clinical NLP, drug discovery, and FDA-approved applications, with foundation models and transformers achieving clinician-level performance while facing critical generalizability challenges.
State of the Field (2025)
- GPT-4o achieves 90.4% accuracy on USMLE questions, Med-PaLM 2 reaches 92.6% expert evaluation score, matching clinician-level performance on medical question answering
- Vision Transformers with Grad-CAM explainability outperform CNNs across breast cancer, brain tumors, and retinal imaging; Atten-Nonlocal Unet achieves 84-91% Dice scores on multi-organ segmentation
- BoltzGen enables generative protein design for drug discovery; AlphaFold predicted 200M+ protein structures; FDA approved 40+ AI devices in 2024-2025 including radiology, pathology, and ultrasound tools
- External validation reveals concerning generalizability issues: models maintain 85%+ sensitivity, but specificity drops by up to 24 percentage points across sites, and GPT-4V shows a 46.8% hallucination rate on pathology detection
Quick Recommendations
Medical imaging segmentation (organs, tumors, anatomical structures)
Atten-Nonlocal Unet or MedSAM foundation model
Atten-Nonlocal Unet achieves 84-91% Dice scores across Synapse/ACDC/AVT with attention mechanisms for long-range dependencies. For broader generalization, MedSAM provides general-purpose segmentation across modalities with minimal task-specific fine-tuning.
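The Dice scores cited above measure overlap between predicted and ground-truth segmentation masks. A minimal sketch of the metric on flattened binary masks (the function name and toy masks are illustrative, not from either model's codebase):

```python
def dice_score(pred, target):
    """Dice similarity coefficient between two binary masks,
    given as flat sequences of 0/1 values of equal length."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * intersection / total

pred   = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
print(dice_score(pred, target))  # 2*2 / (3+3) ≈ 0.667
```

A Dice score of 0.84-0.91 thus means roughly 84-91% voxel-level overlap with the reference annotation, which is why it is the standard headline metric for organ segmentation.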
Clinical question answering and decision support
GPT-4o or Med-PaLM 2
GPT-4o achieves 90.4% USMLE accuracy (92.7% diagnostic, 88.8% management) with multimodal capabilities. Med-PaLM 2 reaches 92.6% expert evaluation vs 92.9% for clinicians. Both outperform medical student baseline (59.3%) and handle complex clinical vignettes.
Medical image classification (cancer detection, disease identification)
DINO Vision Transformer with Grad-CAM
Self-supervised DINO outperforms CNNs across breast cancer, skin lesions, brain tumors, and retinal imaging. Grad-CAM provides spatially precise, class-discriminative explanations essential for clinical adoption. Works even with limited labeled data via transfer learning.
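Grad-CAM's core step is simple: weight each channel of a convolutional (or patch-token) feature map by the spatially averaged gradient of the class score, sum, and keep only positive evidence. A hedged sketch on synthetic arrays (in practice the feature maps and gradients come from hooks on the model's last attention or conv block, which this example omits):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap: weight each channel's activation map by the
    spatially averaged gradient of the target class score, then ReLU
    and normalize. Both inputs have shape (C, H, W)."""
    weights = gradients.mean(axis=(1, 2))              # alpha_c, shape (C,)
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0)                           # keep positive class evidence only
    if cam.max() > 0:
        cam /= cam.max()                               # normalize to [0, 1] for overlay
    return cam

# Synthetic example: 4 channels over an 8x8 feature grid
rng = np.random.default_rng(0)
fmaps = rng.random((4, 8, 8))
grads = rng.random((4, 8, 8))
heatmap = grad_cam(fmaps, grads)
print(heatmap.shape)  # (8, 8)
```

The resulting heatmap is upsampled onto the input image, which is what gives clinicians the spatially precise, class-discriminative explanation described above.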
Clinical NLP and information extraction from unstructured notes
LLM-Augmented BiLSTM-BERT framework
Structured LLM augmentation improves strict NER F1 from 81.2% to 81.8% on i2b2-2012, relation extraction from 82.8% to 84.1% on N2C2-2018. Handles lengthy clinical documents exceeding standard transformer limits while preserving drug-dosage and condition-symptom relationships.
Drug discovery and protein design
BoltzGen for binder design, AlphaFold for structure prediction
BoltzGen enables generative design of novel protein binders for arbitrary targets (validated across 26 diverse cases). AlphaFold provides foundational structure prediction for 200M+ proteins with custom annotations. This represents a step change from structure prediction to ab initio functional protein design.
Radiology workflow automation (imaging analysis, reporting)
FDA-cleared AI-Rad Companion or Claude 3.5 Sonnet for clinical tasks
AI-Rad Companion (FDA-cleared March 2025) handles organ segmentation for radiotherapy planning. For broader clinical workflows, Claude 3.5 Sonnet achieves 70% success on Stanford MedAgentBench (retrieving patient data, ordering tests, prescribing medications via FHIR APIs).
Pathology and whole-slide image analysis
Graph Neural Networks (DeepTFtyper architecture)
GNNs model spatial tissue relationships and topology critical for histopathology. DeepTFtyper predicts molecular subtypes (SCLC-A/N/P/Y) from H&E slides alone with AUC >0.70, enabling molecular-informed treatment selection without separate molecular testing.
Privacy-preserving multi-institutional model development
Federated Learning with Bayesian uncertainty quantification
Enables collaborative training while keeping patient data at local institutions. Bayesian approaches provide predictive uncertainty across federated settings, improving inference quality vs standard aggregation. Essential for GDPR/HIPAA compliance and rare disease research with distributed data.
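The aggregation at the heart of federated learning can be sketched as a FedAvg round: each institution trains locally and shares only parameter vectors, which the coordinator averages weighted by cohort size. A minimal sketch with illustrative numbers (the Bayesian uncertainty layer mentioned above is omitted here):

```python
def federated_average(site_weights, site_sizes):
    """One FedAvg aggregation round: average model parameters from each
    institution, weighted by local dataset size. Raw patient data never
    leaves the sites; only parameter vectors are shared."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(dim)
    ]

# Three hospitals with different cohort sizes share 2-parameter models
local_models = [[0.2, 1.0], [0.4, 0.8], [0.6, 0.6]]
cohort_sizes = [100, 300, 600]
global_model = federated_average(local_models, cohort_sizes)
print(global_model)  # larger cohorts pull the global model toward their parameters
```

Because only aggregated parameters cross institutional boundaries, this pattern is what makes GDPR/HIPAA-compliant multi-site training feasible.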
Rare disease diagnosis with limited training data
RareScale LLM framework
Achieves 88.8% candidate generation performance and 17%+ improvement in Top-5 accuracy across 575 rare diseases vs baseline black-box LLMs. Specialized prompt engineering and evaluation strategies work even with inherently limited training examples.
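Top-5 accuracy, the metric cited above, is the fraction of cases where the true diagnosis appears anywhere in the model's top five ranked candidates. A minimal sketch (disease names and rankings are invented for illustration):

```python
def top_k_accuracy(ranked_predictions, true_labels, k=5):
    """Fraction of cases where the true diagnosis appears in the
    model's top-k ranked candidate list."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)

# Toy example: ranked rare-disease candidates for three patients
preds = [
    ["Fabry", "Gaucher", "Pompe", "MPS-I", "Wilson"],
    ["Gaucher", "Pompe", "Fabry", "Wilson", "MPS-I"],
    ["Wilson", "MPS-I", "Pompe", "Gaucher", "Fabry"],
]
truth = ["Pompe", "Huntington", "Fabry"]
print(top_k_accuracy(preds, truth, k=5))  # 2 of 3 true diagnoses in top 5
```

Top-k ranking is the right framing for rare disease work: the model's job is to surface the correct diagnosis for clinician review, not to commit to a single answer.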
Patient deterioration and survival prediction
DeepHit model for survival analysis
Achieves concordance index 0.94 and one-year AUC 0.89, substantially outperforming Cox proportional hazards. Effectively integrates temporal information and dynamic patient characteristics for longitudinal outcome prediction with missing data.
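The concordance index (Harrell's C) measures how well predicted risk scores order patients by event time: among comparable pairs, it is the fraction where the patient who failed first received the higher risk score. A minimal sketch with illustrative data (ties credited 0.5, censored patients only usable as the later member of a pair):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index. A pair (i, j) is comparable when patient i has
    an observed event before patient j's follow-up time; it is concordant
    when the model assigns i the higher risk. Ties count 0.5."""
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else 0.0

times  = [2.0, 5.0, 3.0, 8.0]   # follow-up times (years)
events = [1, 1, 0, 1]           # 1 = event observed, 0 = censored
risks  = [0.9, 0.4, 0.5, 0.1]   # model risk scores
print(concordance_index(times, events, risks))  # perfectly ordered -> 1.0
```

A C-index of 0.5 is random ordering and 1.0 is perfect, so DeepHit's reported 0.94 indicates near-perfect ranking of patients by risk.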
Tasks & Benchmarks
Disease Classification
Diagnosing diseases from medical images or data.
Medical Image Segmentation
Segmenting organs and abnormalities in medical images.
Clinical NLP
Processing clinical notes and medical text.
Drug Discovery
Predicting molecular properties and drug interactions.
Honest Takes
Multimodal models are failing medical imaging
GPT-4V achieves 100% on imaging modality identification but only 35.2% on pathology detection with 46.8% hallucination rates. Adding images to text-optimized models sometimes decreases accuracy. Current vision-language architectures don't preserve the specialized visual-spatial reasoning required for clinical diagnosis.
FDA approval doesn't mean it works in your hospital
External validation shows median AUC drops by 0.03, with specificity degrading by up to 24 percentage points despite FDA clearance. Most approved devices lack comprehensive multi-site validation and age/sex subgroup performance data. One trauma CNN maintained 85% sensitivity, but its specificity crashed from 94% to 70% on older patients.
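The sensitivity/specificity split in that trauma CNN example is easy to see from confusion-matrix counts. A sketch with illustrative numbers chosen to match the cited rates (these are not the study's actual counts):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Internal validation site: 85/100 positives caught, 94/100 negatives cleared
sens_in, spec_in = sensitivity_specificity(tp=85, fn=15, tn=94, fp=6)
# External site (older cohort): sensitivity holds, specificity collapses
sens_ext, spec_ext = sensitivity_specificity(tp=85, fn=15, tn=70, fp=30)
print(sens_in, spec_in)    # 0.85 0.94
print(sens_ext, spec_ext)  # 0.85 0.7
```

Note what the specificity collapse means operationally: the false-positive count jumps from 6 to 30 per 100 negatives, a fivefold increase in unnecessary workups even though headline sensitivity is unchanged.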
We have 4 real-world LLM deployments vs thousands of papers
Only four published studies (2024-2025) describe actual LLM implementation in clinical workflows, despite thousands of benchmark papers. The translation gap is massive. Most papers show what models could do, not what they actually do reliably in clinical environments with missing data, workflow constraints, and liability.
Foundation models are becoming clinical infrastructure
Rather than building task-specific models, organizations are deploying general foundation models like MedSAM and adapting them locally through fine-tuning. Open-source releases like DeepSeek-V3 (62.67% clinical task accuracy) and BoltzGen are democratizing access, reducing vendor lock-in for resource-limited settings.