Mass General Brigham Study Finds 21 Frontier LLMs Fail Early Clinical Reasoning More Than 80 Percent of the Time

Overview

A new study from Mass General Brigham’s MESH Incubator reports that 21 frontier large language models, including the latest versions of ChatGPT, Claude, DeepSeek, Gemini, and Grok, failed to generate appropriate differential diagnoses more than 80 percent of the time when fed patient information the way a clinician would actually encounter it. The paper, published in JAMA Network Open on April 13, 2026, introduces a new evaluation framework called PrIME-LLM that scores models across the full arc of clinical reasoning rather than only on final-answer accuracy, according to the Mass General Brigham announcement distributed through EurekAlert.

What We Know

The research was led by Arya Rao and corresponding author Marc Succi of the MESH Incubator at Mass General Brigham, with Harvard Medical School affiliations, and appears in JAMA Network Open under DOI 10.1001/jamanetworkopen.2026.4003, as detailed in the EurekAlert release. The team tested 21 general-purpose frontier models on 29 published clinical cases, disclosing information in stages — history, physical examination, then laboratory results — to mirror how a physician sees a patient, according to Euronews Health.

PrIME-LLM scores models across four stages of clinical reasoning: generating a differential diagnosis, ordering appropriate tests, arriving at a final diagnosis, and planning treatment, per the Mass General Brigham release on EurekAlert. All models achieved more than 90 percent accuracy on final diagnoses when provided with complete case information, but every model failed to produce appropriate differential diagnoses in more than 80 percent of cases during the early, information-poor phase, the release states.

Aggregate PrIME-LLM scores, which combine performance across all four reasoning stages, ranged from 64 percent for Gemini 1.5 Flash at the low end to 78 percent for Grok 4 and GPT-5 at the top, according to the EurekAlert summary. Newer model versions outperformed older ones incrementally, but none reached a threshold the authors consider acceptable for unsupervised clinical-grade deployment, the same release notes.

In comments reported by Euronews, Succi said that “off-the-shelf large language models are not ready for unsupervised clinical-grade deployment” and emphasized the need for a human in the loop and close supervision before any clinical use, as quoted by Euronews Health. Rao’s work on the study was supported by a National Institute of General Medical Sciences training grant (T32GM144273), the EurekAlert release discloses.

What We Don’t Know

The study used 29 standardized vignettes drawn from published clinical cases rather than real electronic health record data, and the authors have not released per-model breakdowns of differential-diagnosis failure rates beyond the headline figure, based on the information available in the EurekAlert release. It is also unclear from the available coverage how much of the gap could be closed by retrieval-augmented or fine-tuned medical variants of the same base models, since PrIME-LLM was applied to off-the-shelf general-purpose versions, per the Mass General Brigham announcement.

Euronews notes the authors’ framing that the systems tend to rush to conclusions without sufficient context, but the coverage does not quantify how differently the models behave when explicitly instructed to withhold judgment until more data arrives, according to Euronews Health.

Context

The finding lands amid a broader policy push around AI oversight in high-stakes domains. The Machine Herald has previously reported on California’s executive order requiring AI safety certifications for state vendors, and the PrIME-LLM benchmark gives regulators and hospital systems a concrete, staged methodology they can cite when drawing lines between decision-support tools and autonomous clinical agents.