
MedHELM: Testing AI Medical Assistants in Real-World Healthcare Tasks

Published: Nature Medicine, January 2026
Link: https://doi.org/10.1038/s41591-025-04151-2

Executive Summary (The Dinner Table Version)

Researchers at Stanford created the first comprehensive "real-world driving test" for medical AI chatbots, evaluating them on 37 benchmarks built around actual healthcare tasks such as writing clinical notes, answering patient questions, and coding medical bills, rather than only multiple-choice medical exams. The advanced "reasoning" models (DeepSeek R1 and OpenAI's o3-mini) performed best overall, but Claude 3.5 achieved nearly identical results at 15% lower cost, and every model struggled with quantitative medical calculations and administrative tasks like billing. Perhaps most concerning, some AI models that excel at general tasks dropped sharply in medical contexts: Gemini 2.0 Flash fell 42 percentile points and GPT-4o fell 24 points when moving from general benchmarks to medical ones.

Authors & Institutions

Lead Authors (equal contribution):

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell - Stanford University School of Medicine

Senior Authors:

Nigam H. Shah (Stanford) - supervising author
Michael A. Pfeffer (Stanford Health Care)
Percy Liang (Stanford Computer Science)
Other senior contributors with ties to Microsoft and Anthropic

Participating Institutions:

Stanford University School of Medicine
Stanford Health Care
Microsoft Corporation
Multiple academic medical centers across 4 institutions

Clinician Validators: 29 practicing physicians across 14 medical specialties

Conflicts of Interest

Notable Disclosures:

Multiple Microsoft employees as co-authors (Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini)
Eric Horvitz - senior Microsoft executive with stock holdings
Several authors have consulting relationships with AI companies (OpenAI, Google, health tech startups)
Some authors own stock/options in healthcare AI companies
Philip Chung consults for OpenAI (unrelated work)
Funding from NIH, Stanford Bio-X, and various grants

Important Note: The study evaluated models from companies with which some authors have relationships, although the evaluation was systematic and comparative.

Strengths (What They Did Well)

Clinician validation of taxonomy: 29 practicing doctors from 14 specialties confirmed that the taxonomy's 121 tasks reflect real medical work rather than academic test questions (96.7% agreement).
Real patient data: Used actual electronic health records from MIMIC and Stanford Health Care in 14 benchmarks, capturing the messy reality of clinical documentation rather than sanitized textbook cases.
Comprehensive coverage: First benchmark to test all 5 major healthcare domains, including the often-ignored "administration and workflow" domain (billing, scheduling, referrals), where staff spend a large share of their time.
Novel evaluation method: Created an "LLM jury" of 3 AI judges to evaluate open-ended tasks (like writing clinical notes) and validated it against human clinician ratings. The jury's agreement with clinicians matched or exceeded the agreement among the human doctors themselves.
Cost-performance analysis: Included real-world deployment costs ($800-1,850 per model evaluation), showing Claude 3.5 delivers near-top performance at 15% lower cost than reasoning models.
Safeguards against test-data leakage: 14 of 37 datasets are kept private to prevent AI companies from training on test data. The authors will evaluate submitted models in their own secure environment to maintain integrity.
Private hold-out sets: Following machine learning best practices by keeping some test data completely private to measure true generalization, not just memorization of public benchmarks.

Weaknesses (Where to Push Back)

Uneven coverage: 15 of 22 task categories have only one benchmark each, so findings in those categories cannot be cross-checked against a second test. It is like judging someone's driving from a single parallel parking attempt.
LLM jury validated on only 2 benchmarks: The authors compared their AI evaluation method with human doctors on just 56 examples from 2 tasks, then applied it across all 13 open-ended benchmarks. Would you trust a grading system proven on only 2 assignments?
Cherry-picked validation tasks: The two validated benchmarks (clinical notes and patient Q&A) are public and potentially easier. The authors did not validate the jury on the harder private benchmarks, where the AI judges might disagree more with humans.
No human performance baselines: For most tasks, the paper does not report how well human doctors, nurses, or administrators perform. Is a 66% win rate good or terrible? The comparison is AI against AI, not AI against a human standard.
Modest agreement with clinicians (ICC=0.47): The AI jury's agreement with human doctors (ICC=0.47) was barely better than the doctors' agreement with each other (ICC=0.43). This reflects genuine subjectivity, but it also means the "ground truth" for medical judgments is fuzzy.
Limited to text tasks only: The benchmark excludes image interpretation (radiology, pathology, dermatology), physical examination, and the multimodal diagnostic reasoning that defines much of clinical medicine.
Reasoning models cost roughly 2x more: DeepSeek R1 and o3-mini performed best but cost $1,850 and $1,761 per evaluation, versus $815-940 for faster models (about 1.9x to 2.3x). The paper does not adequately explore whether the performance gain justifies doubling costs for most use cases.
Statistical power varies wildly: Some benchmarks have 86 questions and others have 1,000. The paper acknowledges this but does not adjust its interpretation, so "weak performance" on a small benchmark might just be noise.
All models struggled on quantitative tasks: Every model struggled with medical calculations and billing codes, but the paper does not investigate why in depth. Is it training data, model architecture, or something fundamental about current LLMs?
No temporal validation: All evaluations were run at a single point in time, so the study does not test whether models maintain performance over time or across different patient populations (age, ethnicity, language, socioeconomic status).
Billing code benchmark may not reflect practice: The benchmark assigns ICD-10 codes from discharge summaries. In practice, certified medical coders do this job with specialized software, not doctors using chatbots, so it is unclear whether this task is the right test.
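
The sample-size concern above can be made concrete with a rough calculation. The following is a minimal sketch, not the paper's method: it assumes an accuracy-style metric scored on independent questions and uses the normal approximation to a binomial proportion at the worst case of 50% accuracy. The function name half_width is hypothetical; only the benchmark sizes (86 and 1,000) come from the summary above.

```python
from math import sqrt
from statistics import NormalDist


def half_width(n_questions: int, accuracy: float = 0.5, confidence: float = 0.95) -> float:
    """Approximate half-width of a confidence interval for an accuracy score.

    Assumption (not from the paper): questions are independent and the
    metric is a simple proportion, so the normal approximation applies.
    """
    # Two-sided critical value; about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(accuracy * (1 - accuracy) / n_questions)


# Compare the smallest and largest benchmark sizes mentioned above.
for n in (86, 1000):
    print(f"n={n}: +/- {100 * half_width(n):.1f} percentage points")
```

Under these assumptions, a score on an 86-question benchmark carries an uncertainty of roughly 10 to 11 percentage points in either direction, against about 3 points for a 1,000-question benchmark. A gap of several points between two models on a small benchmark may therefore not be meaningful.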

The Bottom Line

This is the most comprehensive real-world test of medical AI to date, moving beyond "can it pass medical school exams?" to "can it do the actual job?" The validation with practicing clinicians and use of real patient data are genuine strengths. However, the uneven coverage, limited validation of the evaluation method, and absence of human performance comparisons mean we still can't definitively say whether these AIs are "good enough" for clinical use. We can only say which ones are better than others.

Dinner Table Talking Point: "It's like Consumer Reports finally testing cars on real roads instead of just checking if they know traffic laws—but they only tested a few road types and didn't tell us how human drivers perform on the same routes."
