ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

Abstract: Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer into a wrong one. EpiKG attaches an assertion label and a temporality tag to every fact in a patient knowledge graph, then routes retrieval by question intent.

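
To make the design concrete, here is a minimal sketch of assertion-aware, intent-routed retrieval. The field names, intent labels, and routing rules are illustrative assumptions, not EpiKG's actual schema: the point is only that each fact carries an assertion label and a temporality tag, and retrieval filters on them by question intent, so a negated or family-history mention cannot masquerade as a present patient finding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """One KG fact with assertion and temporality metadata (hypothetical schema)."""
    text: str
    assertion: str    # e.g. "present" | "absent" | "possible" | "family"
    temporality: str  # e.g. "current" | "historical"

# Toy patient knowledge graph.
KG = [
    Fact("metformin prescribed", "present", "current"),
    Fact("denies chest pain", "absent", "current"),
    Fact("mother had breast cancer", "family", "historical"),
]

# Intent -> admissibility predicate: one routing rule per question intent.
ROUTES = {
    "current_meds":   lambda f: f.assertion == "present" and f.temporality == "current",
    "family_history": lambda f: f.assertion == "family",
}

def retrieve(intent: str) -> list[str]:
    """Return only facts admissible under the question's intent."""
    return [f.text for f in KG if ROUTES[intent](f)]

print(retrieve("current_meds"))    # ['metformin prescribed']
print(retrieve("family_history"))  # ['mother had breast cancer']
```

Note how "denies chest pain" is never retrievable as a positive finding: its `absent` assertion fails every route that asks for present facts.
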
ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation isolates each component of EpiKG across six LLMs (Claude Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items.

The author-blind primary endpoint, a leave-author-out paired exact McNemar test on 50 unanimous-strict items rated by two external physicians, yields +22.0 percentage points (95 percent Newcombe CI [+5.1, +31.5], p=0.0192). The architectural novelty, intent-aware KG-RAG over a Contriever dense-RAG baseline (C2b to C4g_kw on the change-excluded n=362 endpoint), gains +8.84 percentage points (paired McNemar p=1.79e-3), rising to +12.43 percentage points under oracle intent.

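
The exact McNemar test used for the paired endpoints reduces to a two-sided binomial test on the discordant pairs (items where exactly one system is correct). A minimal sketch, with hypothetical discordant counts b=15 wins and c=4 losses chosen to be numerically consistent with the reported +22.0 points on 50 items (the paper does not state the actual counts):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value: binomial tail over discordant pairs.

    b = pairs where system A is correct and B wrong; c = the reverse.
    Concordant pairs carry no information and are ignored.
    """
    n = b + c
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(2 * tail, 1.0)

# Hypothetical: 15 wins vs 4 losses among 50 paired items
# -> net gain (15 - 4) / 50 = +22.0 percentage points.
print(round(mcnemar_exact(15, 4), 4))  # 0.0192
```

With b=c the statistic is maximally non-significant (p capped at 1.0), which is why only the discordant margin, not overall accuracy, drives the test.
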
Sensitivity analyses agree directionally: the three-rater physician majority gives +24.0 percentage points (subject to single-author circularity); a deterministic keyword reproducibility proxy gives +39.5 percentage points. Across the six models, the gain shrinks as the LLM-alone baseline rises (beta=-1.123, r=-0.921, p=0.009). With n=6, this pattern looks more like regression to the mean than evidence that structured encoding substitutes for model size.

Physician adjudication identified 56 percent of auto-generated reference answers as defective, a methodological finding indicating that NLP-pipeline clinical-QA benchmarks require physician adjudication to be usable. ClinicalBench, the frozen evaluator, the three-rater adjudication data, and the EpiKG output stack are publicly released.
