A new study in npj Artificial Intelligence, a Nature Portfolio journal, offers one of the most comprehensive evaluations to date of large language models (LLMs) for emergency care, examining how these systems perform in dynamic scenarios that mirror real emergency department (ED) workflows. The study, “The Role of Large Language Models in Emergency Care: A Comprehensive Benchmarking Study,” moves beyond standard medical question‑answering to assess performance as patient information evolves in real time.
Why it matters
Emergency departments nationwide continue to face rising volumes, staffing shortages, and increasing clinical complexity, highlighting the need for trustworthy, well‑tested AI decision‑support tools. “Emergency medicine is not a static knowledge test,” said Dr. Alexander Fortenko, Director of Clinical Innovation in the Weill Cornell Medicine Department of Emergency Medicine. “When we pressure-test models for clinical use, our testing must mimic real clinical scenarios as much as possible. In this study, we put these leading models to the test in terms of their ability to perform dynamic reasoning and provide safe and accurate clinical guidance.”
“Real‑world benchmarking like this is essential if clinicians and patients are going to trust AI as part of clinical decision‑making,” said Dr. Rahul Sharma, the Barbara and Stephen Friedman Professor and Chair of the Department of Emergency Medicine, Weill Cornell Medicine and NewYork-Presbyterian/Weill Cornell Medical Center. “Our focus is always on innovation that strengthens clinical decision‑making, enhances safety, and delivers the highest quality care to every patient we serve.”
How the study tested models
Researchers developed a two‑part evaluation: a large knowledge benchmark tailored to emergency medicine, and a set of simulated clinical cases that reveal information step by step, from triage through diagnostics, to mirror actual ED workflows. Eight emergency physicians scored model outputs for accuracy, safety, and clinical relevance across core ED tasks such as triage scoring, case summarization, investigative questioning, management planning, and differential diagnosis.
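The staged evaluation described above can be illustrated with a minimal sketch: the model is re-queried at each stage with all information revealed so far, rather than being given the full case at once. The case data, stage names, and `query_model` stub below are hypothetical placeholders, not the authors' harness; a real harness would call an actual LLM API at that point.

```python
# Hypothetical staged ("dynamic") case for illustration only; not from the study.
CASE_STAGES = [
    ("triage", "58-year-old with chest pain, BP 92/60, HR 118"),
    ("history", "pain radiating to the back, sudden onset"),
    ("diagnostics", "CT angiogram shows aortic dissection"),
]

def query_model(prompt: str) -> str:
    """Stand-in for an LLM call; a real harness would query a model API here."""
    return f"assessment after {prompt.count('|') + 1} stage(s)"

def run_dynamic_case(stages):
    """Reveal the case stage by stage, recording the model's answer each time."""
    revealed, outputs = [], []
    for name, info in stages:
        revealed.append(info)          # accumulate everything seen so far
        prompt = " | ".join(revealed)  # re-prompt with the growing case
        outputs.append((name, query_model(prompt)))
    return outputs

for stage, answer in run_dynamic_case(CASE_STAGES):
    print(stage, "->", answer)
```

Scoring each stage's answer separately is what lets such a benchmark detect models whose accuracy degrades (or improves) as case complexity grows.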
Key findings
- Among top‑tier models, factual medical knowledge is beginning to converge.
- Performance diverges when adaptive, multi‑stage clinical reasoning is required.
- One model maintained or improved performance as case complexity increased.
- Most models showed a tendency toward under‑triage in high‑acuity cases.
“Our goal is not autonomous AI in the emergency department. It is dependable decision support. That means models that reason transparently, recognize their limits, and stay within guardrails designed to protect patients,” said Longsha Liu, CEO of Vita Innovations, who collaborated on the study.
Implications and collaboration
The authors emphasize that these systems are not designed to replace clinicians; rather, when deployed with structured oversight, they may support triage consistency, documentation efficiency, and diagnostic hypothesis generation. The study demonstrates the importance of rigorous, domain‑specific evaluation before clinical integration. The work represents a broad collaboration among investigators from Vita Innovations, Weill Cornell Medicine and NewYork-Presbyterian/Weill Cornell Medical Center, Stanford University School of Medicine, the University of Virginia School of Medicine, the University of Minnesota, Atrium Health Wake Forest Baptist, Hawaii Pacific Health, Summus Global, Qualified Health, the University of British Columbia, and the University of California, Irvine. Future efforts will pursue prospective, real‑time clinical evaluations and explore hybrid safety architectures that pair AI reasoning with structured oversight.
Read the study: https://doi.org/10.1038/s44387-026-00078-2