CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

Abstract: Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction.

We present CaVe-VLM-CoT, a modular, reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier. When an ungrounded claim is detected, structured feedback is routed back to the Extractor for targeted re-retrieval.
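The closed loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy evidence store `KB`, the function names (`extract`, `retrieve`, `solve`, `inject_citations`, `verify`, `run_pipeline`), the keyword-matching logic, and the `max_rounds` cap are all hypothetical stand-ins chosen only to show how verification failures might be fed back to the Extractor.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Toy evidence store standing in for a real retrieval index (assumption).
KB: Dict[str, str] = {
    "photosynthesis": "Plants convert light into chemical energy.",
    "chlorophyll": "Chlorophyll absorbs red and blue light.",
}


@dataclass
class Step:
    text: str
    citations: List[str] = field(default_factory=list)


def extract(question: str, feedback: List[str]) -> List[str]:
    """Extractor: derive query terms from the question plus any feedback."""
    text = " ".join([question, *feedback]).lower()
    return [key for key in KB if key in text]


def retrieve(queries: List[str]) -> Dict[str, str]:
    """Retriever: look up evidence for each query term."""
    return {q: KB[q] for q in queries}


def solve(question: str, evidence: Dict[str, str]) -> List[Step]:
    """Solver: a fixed two-step answer standing in for a VLM's reasoning."""
    return [
        Step("Photosynthesis turns light into chemical energy."),
        Step("Chlorophyll absorbs the light."),
    ]


def inject_citations(steps: List[Step], evidence: Dict[str, str]) -> List[Step]:
    """Citation Injector: attach the evidence ids that each step mentions."""
    for step in steps:
        step.citations = [k for k in evidence if k in step.text.lower()]
    return steps


def verify(steps: List[Step]) -> List[str]:
    """Verifier: return the text of every step that has no citation."""
    return [s.text for s in steps if not s.citations]


def run_pipeline(question: str, max_rounds: int = 3) -> Tuple[List[Step], int]:
    """Run the loop until every step is cited or the round cap is reached."""
    feedback: List[str] = []
    steps: List[Step] = []
    for round_idx in range(1, max_rounds + 1):
        evidence = retrieve(extract(question, feedback))
        steps = inject_citations(solve(question, evidence), evidence)
        feedback = verify(steps)  # ungrounded steps go back to the Extractor
        if not feedback:
            return steps, round_idx
    return steps, max_rounds


steps, rounds = run_pipeline("How does photosynthesis work?")
```

In this toy run the first round retrieves evidence only for the term found in the question, so the second step is left uncited. Its text is passed back as feedback, the Extractor widens the query, and the second round grounds both steps.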

Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding.
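As a rough illustration of what a composite of this kind could look like, the sketch below computes a weighted average over the five named components. The abstract does not state the actual weights or aggregation rule, so the equal default weights, the normalization by the weight sum, the component key names, and the function name `composite_score` are all assumptions made for illustration only.

```python
from typing import Mapping, Optional

# Component names taken from the abstract; the key spellings are hypothetical.
COMPONENTS = (
    "accuracy",
    "citation_precision",
    "citation_recall",
    "attribution",
    "evidence_grounding",
)


def composite_score(
    values: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Weighted average of component scores, each expected in [0, 1].

    Equal weights are an assumption; the real CaVeScore weighting
    is not given in the abstract.
    """
    if weights is None:
        weights = {name: 1.0 for name in COMPONENTS}
    total = sum(weights[name] for name in COMPONENTS)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    for name in COMPONENTS:
        if not 0.0 <= values[name] <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return sum(weights[name] * values[name] for name in COMPONENTS) / total


# Illustrative inputs only; these are not results from the paper.
example = {name: 0.5 for name in COMPONENTS}
score = composite_score(example)
```

Normalizing by the weight sum keeps the result in the same [0, 1] range as the components, so it can be reported as a percentage alongside accuracy.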

Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and a CaVeScore of 56.6% on ScienceQA, and 55.2% accuracy and a CaVeScore of 35.7% on MMMU (30 subjects).
