Artwork courtesy of Jennie Ellison

Artificial Intelligence (AI) in Medicine April 15, 2026

The $1 Trillion Problem AI Still Can’t Yet Solve

By Rebecca Handler

AI is transforming medicine, but not yet the $1 trillion administrative burden. New research reveals why healthcare’s most critical workflows remain unsolved.

Across research papers and public conversation, discussions of artificial intelligence in healthcare often center around the following topics:

AI that can diagnose disease.
AI that can read medical images.
AI that can predict illness before symptoms appear.

These are important and highly relevant conversations, but they miss something fundamental about how healthcare actually works.

Before a patient ever receives care, there is another system quietly at work that has nothing to do with biology and everything to do with bureaucracy.

Insurance approvals. Referral forms. Appeals. Faxed documents. Hours spent navigating software systems that don’t talk to each other.

This invisible layer of healthcare, known as administrative spending, costs the United States more than $1 trillion every year, almost 25 cents of every healthcare dollar.

Despite all the excitement around AI, administrative spending remains largely untouched.

Testing AI on the Work No One Romanticizes

Many prominent healthcare AI benchmarks focus on diagnosis, medical QA, clinical reasoning, or other text-based tasks, while administrative workflows remain relatively underexplored.

To study that challenge, a research team led by PhD students Suhana Bedi and Ryan Welch, with senior author Nigam Shah, MBBS, PhD, has introduced "HealthAdminBench." In this study, they built four realistic software environments modeled on the kinds of tools healthcare workers use every day: an electronic health record, two insurer portals, and a fax system. Inside those environments, the team created 135 expert-designed tasks drawn from common administrative workflows, including prior authorization, appeals and denials management, and durable medical equipment ordering. Altogether, the benchmark contains 1,698 evaluation points across these tasks, allowing the team to track not just whether an AI finished the job, but where it broke down along the way.

“We were inspired to study this because we saw firsthand the challenges real people faced while shadowing revenue cycle workers through their actual day-to-day work,” shares Bedi. “We watched them toggle between systems, chase down documents, and navigate approval processes step by step. The work is operationally critical, and yet almost no one was measuring whether AI could actually do it.”

The Results: Impressive in Pieces, Weak as a Whole

The best-performing agent completed only 36.3 percent of full tasks successfully. Another system achieved the strongest subtask success rate, correctly completing 82.8 percent of individual subtasks, yet still fell far short on end-to-end task completion. “That surprised me,” reflects Shah. “That is a huge gap between high subtask performance and end-to-end task completion. Almost a 50% drop.”

Consider a real task from the benchmark: submitting a prior authorization request. To complete it, the AI must move step-by-step through multiple systems: first opening a patient’s chart, extracting diagnosis (ICD-10) and procedure (CPT) codes, downloading the required clinical documents, then navigating to an insurer’s portal to enter patient details, attach those documents, and submit the request. Finally, it must return to the medical record and log the authorization confirmation.

Each of these steps is explicitly required. Missing even one means failure.
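The all-or-nothing rule described above can be sketched in a few lines of Python. This is an illustration only, not the benchmark's actual scoring code: the step names, the function names, and the independence assumption in the last function are all hypothetical. It shows how an agent can score well on individual steps and still fail the task, and why even a high per-step success rate shrinks quickly once every step in a long chain has to succeed.

```python
# Hypothetical checklist paraphrasing the prior authorization task described above.
# The step names are illustrative, not identifiers from HealthAdminBench.
PRIOR_AUTH_STEPS = [
    "open_patient_chart",
    "extract_icd10_and_cpt_codes",
    "download_clinical_documents",
    "enter_patient_details_in_portal",
    "attach_documents",
    "submit_request",
    "log_confirmation_in_record",
]


def subtask_rate(completed, steps=PRIOR_AUTH_STEPS):
    """Fraction of required steps the agent completed."""
    return sum(step in completed for step in steps) / len(steps)


def task_success(completed, steps=PRIOR_AUTH_STEPS):
    """End-to-end success: every required step must be completed."""
    return all(step in completed for step in steps)


def chance_all_steps_succeed(per_step_rate, n_steps):
    """Chance of a fully successful run, ASSUMING each step succeeds
    independently with the same probability (a simplification)."""
    return per_step_rate ** n_steps


# An agent that does everything except attach the documents.
done = set(PRIOR_AUTH_STEPS) - {"attach_documents"}
print(round(subtask_rate(done), 3))   # 0.857
print(task_success(done))             # False
print(round(chance_all_steps_succeed(0.9, 7), 3))  # 0.478
```

In this toy case the agent completes six of seven steps, yet the task still counts as a failure. Under the simplified independence assumption, a 90 percent per-step success rate over seven steps yields a fully successful run less than half the time.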

Another task is even more revealing. In a durable medical equipment workflow, the agent is asked to process an order for a feeding pump. When it opens the patient’s chart, it finds a required “face-to-face evaluation” document — but the date shows it is more than six months old. According to policy, that makes the order invalid. The correct action is not to proceed, but to stop the process and document why the order cannot be completed.

Many AI systems fail here, not because they can’t read the date, but because they don’t recognize that the workflow should halt. Instead, they continue the process, sometimes even attempting to submit or fax incomplete documentation.
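The halting behavior described above can be written as a simple guard that runs before any submission step. The sketch below is hypothetical: the function names, the returned dictionary, and the reading of "six months" as six calendar months are assumptions made for illustration, not details taken from the study or from any payer policy.

```python
import calendar
from datetime import date


def months_before(day, months):
    """Return the date `months` calendar months before `day`,
    clamping to the last day of the month when needed."""
    total = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def review_dme_order(evaluation_date, today, max_age_months=6):
    """Decide whether a durable medical equipment order may proceed.

    Assumption: the face-to-face evaluation is too old if it is dated
    before the cutoff `max_age_months` calendar months ago.
    """
    cutoff = months_before(today, max_age_months)
    if evaluation_date < cutoff:
        # Correct behavior: stop and document the reason, do not submit or fax.
        return {
            "action": "halt",
            "reason": (
                f"Face-to-face evaluation dated {evaluation_date.isoformat()} "
                f"is older than {max_age_months} months."
            ),
        }
    return {"action": "proceed", "reason": "Face-to-face evaluation is current."}


today = date(2026, 4, 15)
print(review_dme_order(date(2025, 9, 30), today)["action"])  # halt
print(review_dme_order(date(2025, 12, 1), today)["action"])  # proceed
```

The design point is that the check returns an explicit "halt" outcome with a documented reason, so stopping is treated as a valid way to complete the task rather than as an error to work around.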

Why the Last Mile Is So Hard

Healthcare administration may not seem glamorous, but it is a punishing test of machine reliability.

Many tasks in HealthAdminBench required the AI to move between multiple systems, gather information in one place and use it later in another, all while keeping track of a long sequence of steps. Prior authorization and appeals tasks were especially difficult, in part because they required more clinical reasoning and more information retrieval than some of the simpler equipment-order workflows.

One of the biggest trouble spots was document handling: downloading a file from one system, then attaching it correctly somewhere else. Across models, this emerged as a major source of failure, even though it is routine work for human staff.

That finding is important, because so much of healthcare administration depends on exactly this kind of cross-system choreography. A clinician’s note may need to be pulled from the EHR, attached to a payer portal, and then documented back in the chart. The logic is simple enough for a trained worker. But for current AI agents, the study suggests, the sequence is brittle.

The Problem Is Not Just “Thinking” but Remembering

One of the more interesting lessons from the paper is that failure often had less to do with abstract reasoning than with execution.

The agents struggled with remembering hidden long-term dependencies, lost track of important information over time, and often failed to use the scratchpad memory space the researchers provided for storing key facts across steps. They also frequently avoided file operations such as downloads and uploads, even when those actions were essential to completing the task.

(Healthcare happens to be full of exactly those conditions.)

This points to a broader truth about AI in the real world: intelligence is not just knowing what to do. It is being able to do it reliably in messy environments where interfaces are clunky, tasks unfold over time, and small omissions can invalidate the whole effort.

Why This Matters Beyond the Benchmark

Administrative burden shapes whether patients get tests approved, whether claims are denied, and how much time clinicians spend on bureaucracy instead of care. It is also one of the engines of burnout.

So the appeal of administrative AI is obvious. A system that could reliably handle prior authorizations or equipment orders might reduce delays, save money, and free up staff for higher-value work.

But HealthAdminBench suggests that current systems are not ready to be trusted on their own. Even in these controlled environments, where researchers simplified some real-world complications such as logins, CAPTCHAs, and session timeouts, performance remained limited.

That is an important reality check for a field full of hype. Doing well on medical knowledge tests is not the same as functioning safely inside the chaotic software ecosystem of real healthcare.

A Reason for Cautious Optimism

The study is not all bad news. In a separate fine-tuning experiment, an open-source model trained on just 100 tasks improved held-out task success by 23 percentage points and outperformed the strongest frontier model in that evaluation.

So how close are we to having AI systems that can reliably handle administrative workflows in real healthcare settings? “Within reach,” Shah says. “The models only get better, and there is now a focus on capturing the workflow traces, which can then be used to teach models to become far more capable.”

Welch shares, “We think the most tangible impact is that HealthAdminBench gives the field a realistic testing bed for healthcare agents, along with a rigorous way to measure progress. By building environments that reflect those workflows, we can start to test agents in a way that is much closer to deployment reality.” 

The Bottom Line

AI may be transforming medicine. But when it comes to the bureaucratic machinery that quietly governs who gets care, when, and how, today’s systems still have a long way to go.

For now, the trillion-dollar administrative burden remains stubbornly human.

When asked about the future, Bedi reflects, “The immediate goal is not to replace staff, it is to create a realistic place to evaluate, stress-test, and improve these systems before they are used in practice. Longer term, I think this opens the door not just to better automation, but to better training, stronger safety evaluations, and better human-AI tools for the teams doing this work every day.”

About Stanford Department of Medicine

Stanford Department of Medicine is an academic department within the Stanford School of Medicine dedicated to advancing patient care, education, and research across internal medicine and its subspecialties. We provide high‑quality patient care, train doctors and scientists, and do research to prevent illness, improve diagnosis and treatment, and help people live healthier lives. We serve diverse communities and work to make health care better for today and tomorrow. For more information, visit medicine.stanford.edu

Rebecca Handler

Rebecca Handler, MSc, is a science writer and researcher at Stanford’s Department of Medicine, where she translates complex research into accessible narratives for clinicians, patients, and the public. She serves as Manager of Science Communications, partnering closely with clinicians and investigators to highlight advances across multiple specialties and disciplines.

Both her writing and research focus on rapid developments in clinical AI, computational medicine, and public health. Rebecca holds a Master of Science from Boston University, where she studied epidemiology and science communication, and a Bachelor of Science in cognitive science. Rebecca is originally from Connecticut and moved to California in 2024, and when she isn’t head-down in a research paper, she enjoys sunshine, reading, and horseback riding.