MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

Abstract: Medical concept extraction from electronic health records underpins many downstream applications, yet it remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives.

Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they focus predominantly on explicitly stated concepts rather than implicit ones.

We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification.
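
To make the formulation concrete, one benchmark instance can be pictured as a note split into sentences, a candidate concept, a binary verification label, and the indices of the supporting sentences. The following is a minimal sketch only: the class name `NoteConceptPair`, its field names, and the example note are hypothetical illustrations, not the released MedicalBench schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteConceptPair:
    """Hypothetical container for one note-concept verification instance."""

    note_sentences: tuple[str, ...]  # the note, pre-split into sentences
    concept: str                     # candidate concept, e.g. an ICD-10 description
    label: bool                      # True if the note supports the concept
    evidence: frozenset[int] = frozenset()  # indices of supporting sentences

    def evidence_text(self) -> list[str]:
        """Return the supporting sentences in note order."""
        return [self.note_sentences[i] for i in sorted(self.evidence)]


# Invented example of an implicit positive: the concept is never named,
# but two sentences together imply it.
example = NoteConceptPair(
    note_sentences=(
        "Patient admitted with shortness of breath.",
        "Started on furosemide with good urine output.",
        "Echo showed an ejection fraction of 30%.",
    ),
    concept="Heart failure",
    label=True,
    evidence=frozenset({1, 2}),
)
```

Keeping evidence as sentence indices rather than free text makes the second task a set-comparison problem, which is one plausible reading of "sentence-level evidence identification".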

Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments.

We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts.
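
The two tasks can be scored separately. The sketch below assumes F1 on the binary verification decision for task (1) and set-overlap precision, recall, and F1 over sentence indices for task (2). The abstract does not name its metrics, so these choices and the function names `binary_f1` and `evidence_prf` are assumptions for illustration.

```python
def binary_f1(gold: list[bool], pred: list[bool]) -> float:
    """F1 of the positive class over note-concept verification decisions."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evidence_prf(gold: set[int], pred: set[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 between gold and predicted sentence indices."""
    overlap = len(gold & pred)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy inputs: four verification decisions and one evidence prediction.
task1_score = binary_f1([True, True, False, False], [True, False, True, False])
task2_scores = evidence_prf({3, 7}, {3, 8, 9})
```

Scoring evidence independently of the label lets a model be credited for a correct decision while still being penalized for pointing at the wrong sentences, which is the interpretability aspect the second task targets.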

We further show that performance is largely invariant to note length, indicating that MedicalBench measures reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction. It offers a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.