Simulated evaluation of large language model stepwise diagnostic reasoning with real-world chest pain encounters and Bayesian networks

Written on 24/02/2026
by Conrad W Safranek

BMC Med Inform Decis Mak. 2026 Feb 24. doi: 10.1186/s12911-026-03381-9. Online ahead of print.

ABSTRACT

BACKGROUND: Real-world evaluation of large language models (LLMs) as clinical diagnostic aids is limited by the reliance on static vignettes and retrospective data, which inadequately reflect the dynamic, iterative nature of clinical decision-making and may overestimate LLMs' performance. Here, we benchmark GPT-4o in a stepwise simulated diagnostic setting with real-world clinical data, comparing its diagnostic accuracy and information-seeking strategy with Bayesian-network-derived optimal policies and observed physician practice.

METHODS: We assessed GPT-4o across 500 emergency department (ED) chest-pain encounters drawn from a cohort of 202,632 cases spanning three EDs. A Bayesian network (BN) trained on the structured cohort data imputed clinical variables not collected in the original encounter, creating a more robust simulation environment; the BN also enabled derivation of mutual-information-optimal query pathways. GPT-4o sequentially requested information from 136 structured clinical variables under three prompting regimes that varied in disease-prevalence cues and diagnostic category constraints. Diagnostic decisions encompassed one of seven predefined emergent conditions or Other Diagnosis. We measured diagnostic accuracy under each prompting strategy and calculated rank-biased overlap with the BN-optimal pathway to benchmark the LLM's information-seeking behavior.
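The mutual-information-optimal pathway described above can be sketched greedily: at each step, query the variable whose observation is expected to reduce uncertainty about the diagnosis the most. A minimal toy illustration for a single binary condition and two hypothetical binary findings (the prevalence and conditional probabilities below are invented for illustration, not taken from the paper's BN):

```python
import math

# Toy example (illustrative only; not the paper's Bayesian network).
# One binary "emergent condition" D and two binary findings, each with
# an assumed conditional probability of being positive given D / not-D.
p_d = 0.02                                            # hypothetical prevalence
findings = {
    "troponin_elevated": {True: 0.80, False: 0.05},   # P(finding=1 | D), P(finding=1 | not D)
    "chest_wall_tender": {True: 0.10, False: 0.30},
}

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_f_given_d):
    """I(D; F): expected reduction in uncertainty about D from observing F."""
    h_d = entropy([p_d, 1 - p_d])
    h_d_given_f = 0.0
    for f in (1, 0):
        # Likelihoods of this finding value under D and not-D
        lik_d = p_f_given_d[True] if f else 1 - p_f_given_d[True]
        lik_nd = p_f_given_d[False] if f else 1 - p_f_given_d[False]
        p_f = lik_d * p_d + lik_nd * (1 - p_d)        # marginal P(F=f)
        posterior = lik_d * p_d / p_f                 # Bayes: P(D | F=f)
        h_d_given_f += p_f * entropy([posterior, 1 - posterior])
    return h_d - h_d_given_f

# Greedy "optimal pathway": rank queries by information gain about D.
ranked = sorted(findings, key=lambda f: mutual_information(findings[f]),
                reverse=True)
print(ranked)  # troponin first: far more informative about D than tenderness
```

In the study's full setting the same principle applies over 136 variables and eight diagnostic categories, with the BN supplying the joint distribution; this sketch only shows why a discriminative lab ranks ahead of a weakly informative exam finding.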

RESULTS: Across the full chest-pain cohort, life-threatening etiologies accounted for only 2.14% of encounters (from 1.04% for acute coronary syndrome to 0.01% for esophageal rupture). With baseline prompting, GPT-4o systematically over-predicted rare conditions (sensitivity 79.3%; specificity 45.2%); adding prevalence cues or removing diagnostic category constraints increased specificity (to 83.0% and 94.7%, respectively) and reduced false alarms by 107 and 140 per 500 encounters, but at the cost of poor sensitivity (30.4% and 8.8%). Rank-biased overlap between GPT-4o's information-seeking sequence and the Bayesian-network mutual-information optimum was low across diagnoses (range 0.060-0.097), and the model diverged from clinician behavior by requesting fewer vitals ([Formula: see text]-fold) and labs ([Formula: see text]-fold) while requesting over 30% more imaging data.
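The rank-biased overlap (RBO) statistic reported above compares two ranked lists with heavier weight on agreement at the top ranks, which suits query sequences where the first few requests matter most. A minimal sketch of the truncated form of the metric (the example query orders are invented for illustration):

```python
def rank_biased_overlap(s, t, p=0.9):
    """Truncated rank-biased overlap between ranked lists s and t.

    At each depth d, measures the fraction of shared items in the two
    prefixes; the persistence parameter p geometrically down-weights
    deeper ranks, so top-of-list agreement dominates the score.
    """
    k = min(len(s), len(t))
    total = 0.0
    for d in range(1, k + 1):
        overlap = len(set(s[:d]) & set(t[:d]))  # shared items in top-d
        total += (p ** (d - 1)) * overlap / d
    return (1 - p) * total

# Hypothetical query orders: an LLM's request sequence vs. a
# mutual-information-optimal sequence from a Bayesian network.
llm_order = ["ecg", "troponin", "ct_angio", "d_dimer"]
bn_order  = ["troponin", "ecg", "d_dimer", "cxr"]
print(round(rank_biased_overlap(llm_order, bn_order), 3))
```

Identical lists score near the maximum for the chosen truncation depth, while disjoint lists score 0; values in the 0.06-0.10 range, as reported for GPT-4o, indicate very little top-weighted agreement with the BN-optimal ordering.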

CONCLUSIONS: In this simulated assessment, GPT-4o demonstrated diagnostic biases toward rare conditions and differed substantially from normative probabilistic models and physician practice patterns. These discrepancies could lead to unnecessary over-triage and resource utilization. Integrating LLMs within more rigorous probabilistic frameworks and calibrating them to realistic disease prevalences may be essential for effectively harnessing their potential as clinical decision-support tools.

PMID:41735989 | DOI:10.1186/s12911-026-03381-9