IMCBench: A benchmark for multimodal LLMs in Image-grounded Medical Conversations

Abstract: Recent advances in large language models and vision-language models have enabled reasoning over multimodal data, offering opportunities for clinical applications such as decision support and triaging. However, existing medical AI benchmarks are fragmented: some support multi-turn dialogues but lack images, while others provide multimodal inputs but focus on single-turn QA tasks.

To address this gap, we introduce IMCBench, an image-grounded, multi-turn medical conversation benchmark that pairs real, publicly available clinical images with synthetic patient profiles to simulate realistic patient-clinician interactions. Each conversation is evaluated across three clinical dimensions: safety, accuracy, and appropriate use of uncertainty in diagnosis.

We benchmark eight frontier multimodal models from four model families (Claude, GPT, Nova, and Llama), rating each on a 1-5 scale with an LLM-as-Jury protocol calibrated against expert clinician annotations. Claude Opus 4.6 achieves the highest overall score (3.61), followed by Claude Sonnet 4.6 (3.30) and GPT-5.2 (3.29). However, no model dominates all dimensions, and safety degrades for both malignant and rare conditions ($\Delta$ = -0.27 each).

Ablation studies further reveal that both visual input and electronic health record (EHR) context contribute to safe guidance: on average, safety drops by 0.18 when visual input is removed and by 0.23 when EHR context is removed, with stronger models leveraging visual features more effectively. Together, these findings demonstrate that accurate clinical description does not guarantee safe patient guidance, motivating multi-dimensional evaluation frameworks in medical AI.

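
As a rough illustration of how a reported gap such as $\Delta$ = -0.27 could be computed, the sketch below averages per-juror scores within each conversation and then takes the difference between a subgroup mean and a reference mean. The aggregation rule (a plain mean over jurors and over conversations), the choice of reference group, the function names, and the toy scores are all assumptions made for illustration; the abstract does not specify any of them, and the numbers are not benchmark data.

```python
from statistics import mean

def jury_score(juror_scores):
    """Combine one conversation's juror ratings (1-5) for one dimension.

    Assumption: a plain mean over jurors; the actual jury rule may differ.
    """
    return mean(juror_scores)

def mean_dimension(conversations, dimension):
    """Average the jury score for `dimension` over a list of conversations."""
    return mean(jury_score(conv[dimension]) for conv in conversations)

def delta(subgroup, reference, dimension):
    """Subgroup mean minus reference mean; negative means the subgroup scores lower."""
    return mean_dimension(subgroup, dimension) - mean_dimension(reference, dimension)

# Toy ratings, invented for this example (three hypothetical jurors each).
reference_cases = [{"safety": [4, 4, 4]}, {"safety": [3, 3, 3]}]
rare_cases = [{"safety": [3, 3, 3]}, {"safety": [3, 4, 2]}]

safety_gap = delta(rare_cases, reference_cases, "safety")
print(f"safety delta: {safety_gap:+.2f}")
```

The same `delta` helper would apply to an ablation comparison by passing scores from the ablated setting (for example, image removed) as the subgroup and scores from the full-input setting as the reference.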