EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. Large language models (LLMs) are increasingly used to support these decisions because of their strong language capabilities, broad biomedical knowledge, and efficiency, yet their reliability on real-world clinical decision tasks remains insufficiently understood.

To evaluate CDM models, especially LLM-based ones, a practical medical decision benchmark should be constructed through an automated yet reliable pipeline so that it achieves both scale and quality. Moreover, grounding a CDM benchmark in real patient electronic health records (EHRs) better supports evaluation on practical CDM tasks that require substantive biomedical knowledge and clinical inference.

To fill these gaps, we introduce EHRBench, an automated and reliable EHR-grounded benchmark for evaluating LLM-based clinical decision-making at scale. To ensure scalability and reliability, EHRBench is constructed through an EHR-LLM-KB (knowledge base) interaction pipeline. For efficiency, we use a specialized LLM to automatically convert encounter-level EHR trajectories into structured templates, which are then deterministically instantiated into QA items.
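As a rough illustration of what deterministic instantiation could look like, the sketch below turns one structured template into a QA item whose option order depends only on the template's content. Everything here is an assumption made for illustration: the `Template` fields, the question stems, the multiple-choice format, and the `instantiate` function are hypothetical and are not taken from the EHRBench pipeline itself.

```python
import hashlib
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical structured template derived from one encounter."""
    task: str           # "diagnosis", "treatment", or "prognosis"
    context: str        # short summary of the encounter
    answer: str         # entity recorded in the EHR
    distractors: tuple  # plausible but incorrect entities


# Illustrative question stems, one per task (assumed wording).
QUESTION_STEMS = {
    "diagnosis": "Which diagnosis best explains this presentation?",
    "treatment": "Which treatment is most appropriate?",
    "prognosis": "Which outcome is most likely?",
}


def instantiate(template: Template) -> dict:
    """Build a QA item; the same template always yields the same item."""
    # Seed the shuffle from the template content, not from global state,
    # so repeated runs produce an identical option order.
    digest = hashlib.sha256(repr(template).encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))

    options = [template.answer, *template.distractors]
    rng.shuffle(options)

    return {
        "task": template.task,
        "question": f"{template.context} {QUESTION_STEMS[template.task]}",
        "options": options,
        "answer_index": options.index(template.answer),
    }


example = Template(
    task="diagnosis",
    context="Fever and productive cough for three days.",
    answer="Pneumonia",
    distractors=("Asthma", "Migraine", "Gastroesophageal reflux disease"),
)
item = instantiate(example)
```

Seeding the shuffle from a hash of the template, rather than from a global random state, is one simple way to keep generation reproducible across runs and machines.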

In parallel, we apply systematic KB-based verification and enrichment to filter out hallucinated or ambiguous relations and thereby improve reliability. Using this pipeline, we construct nearly 1M (960,067) QA items spanning three core clinical decision tasks that require inference: diagnosis, treatment, and prognosis.
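KB-based verification can be pictured as a membership check of LLM-proposed relations against a trusted set of triples. The sketch below is a minimal version under stated assumptions: the `Relation` type, the `normalize` helper, the toy knowledge base, and the exact-match criterion are all hypothetical; the verification and enrichment actually used for EHRBench are not specified here and may be considerably richer.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Relation(NamedTuple):
    """Hypothetical (head, relation, tail) triple proposed by an LLM."""
    head: str
    relation: str
    tail: str


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so surface variants match."""
    return " ".join(term.lower().split())


def verify_relations(
    candidates: Iterable[Relation],
    kb: Iterable[Tuple[str, str, str]],
) -> Tuple[List[Relation], List[Relation]]:
    """Split candidates into those supported by the KB and those not."""
    kb_index = {(normalize(h), r, normalize(t)) for h, r, t in kb}

    kept, rejected = [], []
    for cand in candidates:
        key = (normalize(cand.head), cand.relation, normalize(cand.tail))
        # Unsupported relations are treated as possible hallucinations.
        (kept if key in kb_index else rejected).append(cand)
    return kept, rejected


# Toy knowledge base (illustrative only).
toy_kb = [("metformin", "treats", "type 2 diabetes")]

proposed = [
    Relation("Metformin", "treats", "Type 2  Diabetes"),
    Relation("Metformin", "treats", "Migraine"),
]
kept, rejected = verify_relations(proposed, toy_kb)
```

Returning the rejected relations alongside the kept ones makes it possible to audit what the filter removed instead of discarding it silently.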

We benchmark more than 30 representative LLMs on EHRBench and provide detailed analyses of their performance and robustness. The results show consistent capability trends across settings, which further validates the reliability of EHRBench and highlights actionable gaps that remain on the path toward clinically reliable LLM systems.
