Synergistic Perception-Reasoning Governance: Grounding Medical MLLMs with Verifiable Anatomical Evidence

Abstract: Multimodal large language models (MLLMs) show strong promise for clinical visual question answering (VQA) and radiology report generation, yet inference-time hallucinations still undermine their trustworthy use: models can produce fluent conclusions that conflict with the imaging evidence. Existing mitigation strategies typically rely on additional training, external retrieval or knowledge bases, or multi-stage post-hoc verification, all of which increase cost and pipeline complexity and often generalize poorly across models and tasks.

To address this, we propose a training-free framework that mitigates hallucinations by injecting evidence on both the visual and the textual side. Using region-of-interest (ROI) priors, obtained with MedSAM in our implementation, we recalibrate the visual perception trajectory through ROI-guided activation modulation, and we anchor the textual reasoning trajectory by mapping anatomical coordinates into discrete semantic tokens that serve as verifiable external memory.
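
To make the two injection paths concrete, the following is a minimal sketch and not the authors' implementation. It assumes the ROI arrives as a pixel-space bounding box (for example, one derived from a segmentation mask), and that coordinates are quantized into a fixed number of bins. The token format `<loc_N>`, the bin count, the patch grid size, and the `gain` factor are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; an assumed ROI representation.
BBox = Tuple[float, float, float, float]


def bbox_to_location_tokens(bbox: BBox, width: int, height: int,
                            bins: int = 100) -> List[str]:
    """Quantize a pixel-space box into discrete tokens such as '<loc_25>'.

    The '<loc_N>' vocabulary is hypothetical, not taken from the paper.
    """
    x0, y0, x1, y1 = bbox
    if not (0 <= x0 <= x1 <= width and 0 <= y0 <= y1 <= height):
        raise ValueError("bbox must lie inside the image")

    def quantize(value: float, extent: int) -> int:
        # Map [0, extent] onto integer bins 0 .. bins - 1.
        return min(bins - 1, int(value / extent * bins))

    indices = [quantize(x0, width), quantize(y0, height),
               quantize(x1, width), quantize(y1, height)]
    return [f"<loc_{i}>" for i in indices]


def roi_patch_weights(bbox: BBox, width: int, height: int,
                      grid: int = 4, gain: float = 1.5) -> List[List[float]]:
    """Per-patch scale factors: `gain` where a patch overlaps the ROI, else 1.0.

    Multiplying visual activations by these factors is one simple way to
    picture ROI-guided modulation; the actual mechanism may differ.
    """
    x0, y0, x1, y1 = bbox
    patch_w, patch_h = width / grid, height / grid
    weights = []
    for row in range(grid):
        row_weights = []
        for col in range(grid):
            px0, py0 = col * patch_w, row * patch_h
            px1, py1 = px0 + patch_w, py0 + patch_h
            # Strict inequalities: patches that only touch the ROI edge are excluded.
            overlaps = px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0
            row_weights.append(gain if overlaps else 1.0)
        weights.append(row_weights)
    return weights


roi = (128, 64, 384, 320)
tokens = bbox_to_location_tokens(roi, width=512, height=512)
weights = roi_patch_weights(roi, width=512, height=512)
```

In this sketch the tokens would be appended to the text prompt as a checkable record of where the ROI lies, while the weight grid would scale the corresponding visual patch activations. Both steps operate at inference time only, which is consistent with the training-free claim.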

We further introduce a task-aware dynamic router that selects modality-specific interventions based on task semantics, balancing perceptual grounding against linguistic fluency. We conduct systematic evaluations on two tasks and five datasets using LLaVA-1.5-7B, LLaVA-Med-1.5-7B, Qwen3-VL-8B/32B, and InternVL-3.5-8B/38B.
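
The routing idea can be pictured with a small rule-based sketch. This is an assumption-laden illustration rather than the paper's router: the cue words, the `Intervention` fields, and in particular the assignment of interventions to task types are arbitrary placeholders, since the abstract does not state which intervention is chosen for which task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    """Which of the two evidence paths to enable (hypothetical structure)."""
    visual_modulation: bool  # ROI-guided activation modulation
    textual_anchor: bool     # location tokens injected into the prompt


# Hypothetical cue words; a real router might use a classifier instead.
CLOSED_PREFIXES = ("is ", "are ", "does ", "do ", "can ")
OPEN_CUES = ("describe", "report", "findings", "summarize")


def route(question: str) -> Intervention:
    """Pick interventions from surface cues in the question text."""
    q = question.strip().lower()
    if any(cue in q for cue in OPEN_CUES):
        # Placeholder policy for open-ended generation.
        return Intervention(visual_modulation=True, textual_anchor=True)
    if q.startswith(CLOSED_PREFIXES):
        # Placeholder policy for closed-ended (yes/no style) questions.
        return Intervention(visual_modulation=True, textual_anchor=False)
    # Unknown task type: enable both paths as a conservative default.
    return Intervention(visual_modulation=True, textual_anchor=True)


closed_choice = route("Is there a pleural effusion?")
open_choice = route("Describe the findings in this chest X-ray.")
```

The point of the sketch is only the shape of the decision: task semantics are read from the input, and the result switches individual interventions on or off rather than applying both uniformly.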

Controlled ablations and visualizations further validate the framework, which consistently outperforms baselines across medical benchmarks, improving closed-ended accuracy by up to ~6% and reducing open-ended hallucinations by ~35%. The code is available on GitHub.
