CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

Abstract: Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs’ reasoning on medical benchmarks.

CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR on three benchmarks evaluated across 17 LLMs reveals three notable limitations of existing evaluation methods.
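The three perturbation axes can be pictured with a small sketch. This is not the authors' released code; the item format, function name, and the convention of listing the abstention option last are illustrative assumptions.

```python
import random

def perturb_item(question, options, answer, n_options=4,
                 abstention=None, seed=0):
    """Build one CLEAR-style perturbed multiple-choice item (illustrative).

    options    : pool of plausible answer strings, including the ground truth
    n_options  : axis (1) -- how many plausible options to present
    abstention : axis (2)/(3) -- optional abstention string, whose framing
                 can vary, e.g. "None of the Above" vs. "I don't know"
    """
    rng = random.Random(seed)
    # keep the ground truth and sample distractors to reach n_options
    distractors = [o for o in options if o != answer]
    chosen = [answer] + rng.sample(distractors, n_options - 1)
    rng.shuffle(chosen)
    if abstention is not None:
        chosen.append(abstention)  # abstention option always listed last
    letters = "ABCDEFGH"
    lines = [question] + [f"{letters[i]}. {opt}"
                          for i, opt in enumerate(chosen)]
    return "\n".join(lines), letters[chosen.index(answer)]

prompt, gold = perturb_item(
    "Which drug is first-line for anaphylaxis?",
    ["Epinephrine", "Diphenhydramine", "Prednisone", "Albuterol",
     "Cimetidine"],
    "Epinephrine", n_options=3, abstention="I don't know")
```

Sweeping `n_options` and the `abstention` string over the same question bank yields the matched item variants that the framework compares models on.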

First, increasing the number of plausible answers degrades a model’s ability both to identify the correct answer and to abstain from incorrect ones. Second, this loss of caution intensifies as the framing of abstention shifts from assertive rejection, such as “None of the Above”, to admission of uncertainty, such as “I don’t know” (IDK). Notably, merely including IDK in the answer space increases incorrect answer selections.
Lastly, we formalize the performance gap between identifying the correct answer and abstaining from incorrect ones as the humility deficit, which worsens with model scale. Our findings reveal limitations in standard medical benchmarks and underscore that scaling alone does not resolve LLM reliability issues.
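One simple way to operationalize this gap is as the difference between the rate at which a model picks the ground truth when it is present and the rate at which it abstains when the ground truth is absent. The function name and this exact definition are assumptions for illustration; the paper's formalization may differ in detail.

```python
def humility_deficit(with_answer_correct, without_answer_abstained):
    """Illustrative humility-deficit computation (assumed definition).

    with_answer_correct      : per-question 0/1 flags -- the model chose
                               the ground truth when it was offered
    without_answer_abstained : per-question 0/1 flags -- the model
                               abstained when no offered option was correct
    """
    identify_rate = sum(with_answer_correct) / len(with_answer_correct)
    abstain_rate = sum(without_answer_abstained) / len(without_answer_abstained)
    return identify_rate - abstain_rate

# e.g. a model that finds the answer 90% of the time but abstains
# only 40% of the time when no option is correct:
gap = humility_deficit([1] * 9 + [0], [1] * 4 + [0] * 6)  # 0.9 - 0.4 = 0.5
```

Under this reading, a larger deficit means a model that is good at answering but poor at withholding an answer, which is the pattern the paper reports worsening with scale.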