Clinical AI Has Boomed
A New Stanford-Harvard State of Clinical AI Report Shows What Holds Up in Practice.
January 15, 2026 - By Rebecca Handler
Artificial intelligence is no longer a speculative force in medicine. It is already embedded in everyday care. AI systems flag hospitalized patients at risk of deterioration, assist radiologists reading mammograms, draft clinicians’ notes, route patient messages, and increasingly interact directly with patients through chatbots and digital assistants.
In recent months, the pace and visibility of these deployments have accelerated sharply. OpenAI announced ChatGPT for Health, positioning a general-purpose language model as a tool for health-related information and patient interaction. Utah recently began piloting AI-supported prescribing and clinical decision systems, raising questions about how algorithmic recommendations intersect with clinician judgment and liability. OpenEvidence, an AI-powered medical evidence platform designed primarily for clinicians and health professionals, has become a dominant player in point-of-care decisions, underscoring that doctors often bypass traditional IT gatekeepers to use AI in clinical care. At the federal level, the FDA signaled a loosening of regulatory oversight for certain categories of clinical decision support software, shifting more responsibility onto developers and health systems to ensure safety and effectiveness.
Taken together, these developments mark a turning point. AI is moving rapidly from background clinical infrastructure into higher-stakes roles that influence decisions, workflows, and patient behavior, often faster than the evidence base can be clearly interpreted by clinicians, policymakers, or the public.
The scale of adoption is substantial. More than 1,200 AI-enabled medical tools have already been cleared by the FDA. Hundreds of thousands of consumer health applications now rely on machine learning. Health care AI is a multibillion-dollar industry, expanding simultaneously across hospitals, clinics, and patients’ phones.
Yet amid this acceleration, a fundamental question remains unresolved: how much of what looks impressive in announcements and studies actually holds up in real clinical practice? Many claims of physician-level or “superhuman” performance rely on narrow benchmarks or controlled evaluations that do not reflect the uncertainty, incomplete information, and workflow complexity of everyday care. As deployment outpaces synthesis, separating durable clinical value from hype has become increasingly difficult.
That gap is the focus of The State of Clinical AI (2026), a report released in January 2026 by the ARISE network. Led by Peter Brodeur, MD, MA, Ethan Goh, MD, Adam Rodman, MD, and Jonathan H. Chen, MD, PhD, the report reflects contributions from a multidisciplinary group of experts across Stanford, Harvard, and affiliated health systems, spanning clinical medicine, computer science, and health policy.
Rather than spotlighting individual tools or reacting to headline-grabbing launches, the report steps back to provide a grounded synthesis of the evidence. It reviews the most influential clinical AI studies published in 2025 to ask a more practical question: where does AI meaningfully improve care once it leaves controlled research settings, where does performance break down, and where do risks remain underexamined? In doing so, the report aims to help clinicians, health system leaders, policymakers, and the public distinguish real-world progress from technological momentum, and to paint a realistic landscape for what innovations might come in 2026 and beyond.
Impressive results in narrow evaluations
In research settings, modern AI systems often perform well on paper. Several studies published in 2025 showed large language models matching or outperforming physicians on diagnostic reasoning and treatment planning when evaluated on fixed clinical cases (Brodeur et al., 2025; Buckley et al., 2025). In some papers, this performance was described as “superhuman.”
In one study, an AI system analyzed complex emergency department cases and selected correct diagnoses more often than attending physicians when tested at specific decision points. In another, an AI trained on decades of published medical case discussions generated explanations that clinicians rated as comparable to those produced by human experts.
But The State of Clinical AI shows that these results often depend on how narrowly the problem is framed.
In one experiment, researchers modified standard medical multiple-choice questions so that the correct answer became “none of the other answers.” The clinical reasoning required to solve the question did not change. Model performance did. Accuracy dropped sharply across leading AI systems, in some cases by more than a third (Bedi et al., JAMA Network Open, 2025).
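The perturbation is mechanically simple, which is part of what makes the accuracy drop striking. The Python sketch below illustrates the general idea; the item format and function names are assumptions for illustration, not the study's actual code.

```python
# Minimal sketch of a "none of the other answers" perturbation, assuming
# multiple-choice items stored as dicts; names here are illustrative,
# not the code used by Bedi et al.

def perturb_item(item: dict) -> dict:
    """Swap the correct option's text for 'None of the other answers'.
    The distractors, and therefore the reasoning needed to rule them
    out, are unchanged; only the familiar correct string disappears."""
    options = dict(item["options"])            # e.g. {"A": "...", "B": "..."}
    options[item["answer"]] = "None of the other answers"
    return {**item, "options": options}        # correct letter stays the same

def accuracy(answer_fn, items) -> float:
    """answer_fn maps an item to the model's chosen option letter."""
    return sum(answer_fn(q) == q["answer"] for q in items) / len(items)

def perturbation_gap(answer_fn, items) -> float:
    """Accuracy drop attributable to the perturbation alone."""
    return accuracy(answer_fn, items) - accuracy(
        answer_fn, [perturb_item(q) for q in items]
    )
```

A large gap on this kind of test suggests a model was leaning on surface patterns in the answer options rather than on the underlying clinical reasoning.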
Other studies found similar declines when AI systems were tested in settings that more closely resembled real clinical work. When models had to ask follow-up questions, manage incomplete information, or revise decisions as new details emerged, performance fell (Johri et al., Nature Medicine, 2025). On tests designed to measure reasoning under uncertainty, AI systems performed closer to medical students than to experienced physicians and tended to commit strongly to an answer even when ambiguity was high (McCoy et al., NEJM AI, 2025).
In everyday medicine, uncertainty is common. The report shows that this remains one of the most consistent challenges for current AI systems.
Where AI clearly helps: Prediction at scale
If diagnostic reasoning shows mixed results, the evidence is more consistent in another area: prediction.
Several studies reviewed in the report demonstrate that AI systems excel at identifying early warning signals across large and complex datasets. In one hospital-based study, a model trained on continuous wearable vital signs predicted patient deterioration 8 to 24 hours before standard hospital alerts, identifying patients at risk for ICU transfer, cardiac arrest, or death while there was still time to intervene (Scheid et al., Nature Communications, 2025).
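As a rough illustration of how such early-warning systems are commonly framed (a minimal sketch, not the published pipeline), deterioration prediction can be cast as classification over rolling windows of vital signs, with labels shifted so the model learns to fire hours ahead of the event. The window length, lead time, and features below are assumptions.

```python
# Hedged sketch: early-warning prediction as windowed classification.
# Window length, lead time, and features are illustrative assumptions,
# not the published study's design.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

WINDOW_HOURS = 4      # hours of recent vitals each example sees
LEAD_HOURS = 8        # how far ahead of deterioration the alert should fire

def make_examples(vitals: np.ndarray, event_hour=None):
    """vitals: (hours, channels) array, e.g. heart rate, SpO2, resp rate.
    Label is 1 if deterioration occurs within LEAD_HOURS after the window."""
    X, y = [], []
    for t in range(WINDOW_HOURS, vitals.shape[0]):
        window = vitals[t - WINDOW_HOURS:t]
        # Simple summary features: per-channel mean, variability, latest value.
        X.append(np.concatenate([window.mean(0), window.std(0), window[-1]]))
        y.append(int(event_hour is not None and t <= event_hour <= t + LEAD_HOURS))
    return np.array(X), np.array(y)

# In practice: stack examples across a training cohort, fit once, then
# score each new window continuously; probabilities above a tuned
# threshold become early-warning alerts hours before conventional ones.
model = GradientBoostingClassifier()
```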
In another study, researchers used AI to estimate “biological age” from routine health records across millions of individuals. The AI-derived age measure predicted mortality more accurately than commonly used aging markers, including epigenetic clocks and frailty scores (Li et al., Nature Medicine, 2025).
Large-scale models trained on tens of millions of electronic health records have also shown the ability to forecast future diagnoses and disease trajectories without being retrained for each condition (Shmatko et al., Nature, 2025; Waxler et al., 2025).
These systems perform best when they address problems where humans are limited by scale rather than judgment.
Most studies still do not resemble everyday health care
One of the report’s most consequential findings concerns how clinical AI is evaluated.
A review of more than 500 medical AI studies found that nearly half tested models using medical exam-style questions. Only five percent used real patient data. Very few measured whether models recognized uncertainty, and even fewer examined bias or fairness (Bedi et al., JAMA, 2025).
This is important because much of clinical work has little to do with answering exam questions. Clinicians spend large portions of their day reviewing charts, managing inbox messages, coordinating care, and deciding when not to intervene.
In 2025, researchers began developing evaluation methods that better reflect this reality. Some placed AI systems into simulated electronic health records and asked them to retrieve information, place orders, and complete multi-step workflows (Jiang et al., NEJM AI, 2025). Others evaluated AI through thousands of realistic patient conversations graded by physicians (Arora et al., 2025).
In these settings, reasoning models showed performance gains, but the failures were more informative. They revealed where models lost context, overlooked critical information, or pursued incorrect paths with confidence, offering clearer insight into how errors arise in practice.
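To make the shape of these agentic evaluations concrete, the toy sketch below shows the general pattern: a sandboxed record the model can act against, and a harness that grades the end state of a multi-step task rather than a single answer. The environment, action vocabulary, and scoring rubric are hypothetical stand-ins, not drawn from the cited benchmarks.

```python
# Toy sketch of an agentic EHR evaluation harness; the environment,
# action vocabulary, and scoring rubric are hypothetical stand-ins
# for the simulated-EHR benchmarks cited above.
from dataclasses import dataclass, field

@dataclass
class SimulatedEHR:
    """Sandboxed stand-in for a patient record the agent can act on."""
    chart: dict
    orders: list = field(default_factory=list)

    def step(self, action: dict) -> str:
        if action["type"] == "read_chart":
            return str(self.chart.get(action["section"], "section not found"))
        if action["type"] == "place_order":
            self.orders.append(action["order"])
            return "order placed"
        return "unknown action"

def expected_orders(task: str) -> set:
    """Task-specific rubric (hypothetical; a real harness would load
    this from the benchmark rather than hard-coding it)."""
    return {"cbc", "blood cultures"} if "sepsis" in task else set()

def run_task(agent, env: SimulatedEHR, task: str, max_steps: int = 10) -> bool:
    """Let the agent act until it declares done; grade the end state
    (were the right orders placed?), not any single model answer."""
    observation = task
    for _ in range(max_steps):
        action = agent(observation)        # agent: observation -> action dict
        if action["type"] == "done":
            break
        observation = env.step(action)
    return expected_orders(task) <= set(env.orders)
```

Grading the final state of the record, rather than a one-shot answer, is what allows such setups to expose multi-step failures.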
Reaffirming that AI works best as a teammate, not a replacement
Across clinical settings, the report finds the most consistent benefits when AI supports clinicians rather than replaces them.
In Germany, radiologists who could optionally consult an AI system detected more breast cancers without increasing false alarms (Eisemann et al., Nature Medicine, 2025). In primary care, clinicians interpreted lung function tests more accurately with AI assistance (Doe et al., NEJM AI, 2025).
Randomized trials showed that physicians using AI alongside standard medical resources made better treatment decisions than those relying on conventional tools alone (Goh et al., Nature Medicine, 2025). In Kenya, a collaboration between Penda Health and OpenAI deployed a background AI system that reviewed urgent care visits and reduced diagnostic and treatment errors across tens of thousands of patients (Korom et al., 2025).
At the same time, the report documents real risks of over-reliance. In several studies, clinicians followed incorrect AI recommendations even when the errors were detectable, leading to worse decisions than if AI had not been used at all (Qazi et al., 2025). Other research raised concerns about reduced vigilance after prolonged AI use in procedural tasks (Budzyn et al., 2025).
The lesson is not that clinicians should avoid AI, but that how AI is introduced and integrated matters as much as its technical performance.
Patient-facing AI offers new reach for engagement, and new reasons for caution
AI systems that interact directly with patients are spreading faster than almost any other form of clinical AI. Chatbots now triage symptoms, answer medication questions, provide chronic disease coaching, and guide patients through care pathways. These tools promise scale and access in a health system that is often difficult to navigate.
Some early studies suggest potential. In simulated primary care scenarios, conversational AI systems performed on par with or better than physicians when evaluated on axes such as honesty, empathy, and confidence (Tu et al., Nature, 2025). But the evidence base remains thin. Market forces may push vendors to evaluate patient-facing AI using simulations, engagement metrics, or short-term process measures rather than patient outcomes, and few studies track whether these tools reduce missed diagnoses, improve health over time, or help patients navigate care more effectively.
The risks are also distinct. Patients may place too much trust in systems that sound confident but lack full clinical context (Shekar et al., NEJM AI, 2025). Escalation to human care can be delayed or unclear, especially when guardrails are poorly defined. Unlike clinician-facing tools, patient-facing AI operates without professional oversight at the moment decisions are made, raising the stakes for error.
The report does not dismiss patient-facing AI, but it urges caution. It emphasizes the need for clearer evidence, stronger escalation pathways, and evaluation frameworks that focus on outcomes rather than engagement alone.
What this report leaves us with
Taken as a whole, The State of Clinical AI (2026) offers a rare, evidence-driven overview of a field moving faster than its evaluation practices. By synthesizing a year of influential research, the report draws a clear distinction between what performs well in controlled studies and what holds up in real clinical settings. It shows where AI already adds value, where performance may break down once systems leave the lab, and where risks remain insufficiently examined.
The report’s impact lies in how it reframes the conversation. Instead of focusing on isolated demonstrations or model capabilities, it centers questions of evidence, accountability, and clinical relevance. It argues for evaluation methods that reflect everyday practice, for systems designed to support rather than override human judgment, and for evidence gathered after deployment to determine whether these tools meaningfully improve care.
AI is already embedded in health care, and that is unlikely to change. What this report makes clear is that the next phase will not be driven by newer models alone. It will depend on whether health systems, researchers, and regulators are willing to apply the same standards of evidence to AI that they expect of any other clinical intervention.
The technology has moved quickly, and because of the pace of academic publishing, many of the studies in the report evaluate models that have since been superseded. But The State of Clinical AI (2026) makes the case that regardless of the speed of technological change, the field needs to move more deliberately, with measurement focused on outcomes that matter in real-world care.
Acknowledgments
This report was made possible through the review and contributions of Rebecca Handler, Jason Hom, Eric Horvitz, Laura Zwaan, Vishnu Ravi, Brian Han, Kevin Schulman, Kathleen Lacar, Kameron Black, Liam McCoy, David Wu, Priyank Jain, Emily Tat (design and accessibility), and Adrian Haimovich.
Support was provided by Stanford Division of Computational Medicine, Stanford Medicine, Stanford University (Clinical Excellence Research Center and Division of Hospital Medicine), Harvard Medical School, the Shapiro Institute, Beth Israel Deaconess Medical Center, and the Blavatnik Institute for Biomedical Informatics.
The State of Clinical AI Report 2026
The State of Clinical AI Report is the inaugural annual synthesis of the most significant developments, evidence, and emerging challenges in clinical AI. It brings together a comprehensive, carefully curated view of where the field meaningfully advanced this year—spanning model performance, evaluation, workflows, and patient-facing tools—while also highlighting the gaps that remain.
Produced by ARISE (AI Research and Science Evaluation), a Stanford-Harvard Research Network, the aim is to make the landscape easier to navigate, support responsible adoption, and offer a shared reference point for clinicians, researchers, and health leaders as the field continues to evolve.