CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

Abstract: Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction.

We present CaVe-VLM-CoT, a modular, reflection-based agentic RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier. Whenever an ungrounded claim is detected, structured feedback is routed back to the Extractor for targeted re-retrieval.
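
The closed loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy knowledge base `KB`, the function names, the keyword-matching rules, and the `max_rounds` cap are hypothetical stand-ins, not the framework's actual components or API. It only shows the control flow in which uncited steps are fed back to extraction and trigger another retrieval round.

```python
from dataclasses import dataclass, field

# Toy knowledge base standing in for a real retrieval index (hypothetical data).
KB = {
    "leaf": "A leaf contains chlorophyll.",
    "chlorophyll": "Chlorophyll absorbs red and blue light and reflects green.",
}


@dataclass
class Step:
    text: str
    citations: list = field(default_factory=list)


def extract(question, feedback):
    """Extractor stand-in: queries are the question plus any verifier feedback."""
    return [question] + feedback


def retrieve(queries):
    """Retriever stand-in: return KB passages whose key occurs in any query."""
    return {k: v for k, v in KB.items() if any(k in q.lower() for q in queries)}


def solve(question, evidence):
    """Solver stand-in: a fixed two-step chain (a real system would call a VLM)."""
    return ["The leaf contains chlorophyll.", "Chlorophyll reflects green light."]


def inject(step_texts, evidence):
    """Citation Injector stand-in: cite each evidence id mentioned in the step."""
    return [Step(t, [k for k in evidence if k in t.lower()]) for t in step_texts]


def verify(steps):
    """Verifier stand-in: report every step that carries no citation."""
    return [s.text for s in steps if not s.citations]


def run_closed_loop(question, max_rounds=3):
    """Repeat the five stages until all steps are cited or the round cap is hit."""
    feedback, steps = [], []
    for round_idx in range(1, max_rounds + 1):
        evidence = retrieve(extract(question, feedback))
        steps = inject(solve(question, evidence), evidence)
        feedback = verify(steps)
        if not feedback:  # every step is grounded, so stop early
            break
    return steps, round_idx


steps, rounds = run_closed_loop("Why is a leaf green?")
```

In this toy run the first round retrieves only the `leaf` passage, so the second step is left uncited; its text is returned as feedback, which pulls in the `chlorophyll` passage on the second round and lets the loop terminate.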

Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics spanning all five stages. The suite is anchored by CaVeScore, a composite metric that weights accuracy, citation precision and recall, attribution, and evidence grounding.
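
To make the idea of a weighted composite concrete, the following sketch combines the five named components into one score. The abstract does not state the weights or the exact aggregation, so the normalized weighted average, the equal default weights, the component key names, and the input values below are all assumptions chosen for illustration; this is not the definition of CaVeScore.

```python
# Component names follow the abstract; the key spellings are hypothetical.
COMPONENTS = (
    "accuracy",
    "citation_precision",
    "citation_recall",
    "attribution",
    "evidence_grounding",
)


def composite_score(metrics, weights=None):
    """Weighted average of the component metrics, each expected in [0, 1].

    Equal weights are an assumption; the actual weighting is not given
    in the abstract.
    """
    if weights is None:
        weights = {name: 1.0 for name in COMPONENTS}
    missing = [name for name in COMPONENTS if name not in metrics]
    if missing:
        raise ValueError(f"missing component metrics: {missing}")
    total_weight = sum(weights[name] for name in COMPONENTS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * metrics[name] for name in COMPONENTS)
    return weighted_sum / total_weight


# Made-up inputs for illustration only, not results from the paper.
example = {
    "accuracy": 0.8,
    "citation_precision": 0.5,
    "citation_recall": 0.5,
    "attribution": 0.6,
    "evidence_grounding": 0.6,
}
score = composite_score(example)
```

A composite of this shape explains why a system can score well on accuracy yet noticeably lower overall: weak citation or grounding components pull the average down even when the final answers are correct.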

Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and a CaVeScore of 56.6% on ScienceQA, and 55.2% accuracy and a CaVeScore of 35.7% on MMMU (30 subjects).
