The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation

Abstract: While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, in which even perfectly accurate "oracle" context causes an otherwise capable model to abandon an initially correct prediction.

Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systematic suppression of visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance.
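As an illustration of how such attention diagnostics can be computed, the following is a minimal sketch. The exact definitions of $M_{vis}$ and $S_{vis}$ are not given here; the sketch assumes $M_{vis}$ is the total attention weight a query token places on image tokens, and $S_{vis}$ is an entropy-based peakedness score over the visual slice (both definitions are assumptions for illustration):

```python
import math

def visual_attention_stats(attn_row, visual_idx):
    """Illustrative visual attention metrics for one query token.

    attn_row   -- attention distribution over all key tokens (sums to 1)
    visual_idx -- indices of the visual (image) tokens in the sequence
    """
    vis = [attn_row[i] for i in visual_idx]
    m_vis = sum(vis)  # attention mass: total weight on image tokens
    if m_vis == 0 or len(vis) < 2:
        return m_vis, 0.0
    # Sharpness: inverted normalized entropy of the renormalized visual
    # slice -- 1.0 means all mass on one image token (peaked),
    # 0.0 means uniform (diffuse). This definition is an assumption.
    p = [v / m_vis for v in vis]
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    s_vis = 1.0 - entropy / math.log(len(vis))
    return m_vis, s_vis
```

Under this reading, recorruption would show up as both numbers dropping once retrieved text is prepended: the mass shifts to text tokens and whatever visual attention remains becomes diffuse.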

Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model’s textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors.
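The mechanics of BAIR are not spelled out here; the following is only a minimal sketch of the general kind of parameter-free, inference-time attention reweighting described, with a hypothetical boost factor, penalty strength, and edge-penalty schedule:

```python
def intervene_attention(attn_row, visual_idx, text_idx,
                        vis_boost=2.0, pos_penalty=0.5):
    """Rescale one attention row: amplify visual tokens, apply a
    position-aware penalty to textual tokens near the context
    boundaries, then renormalize. All constants and the penalty
    schedule are illustrative assumptions, not the paper's values."""
    n_text = len(text_idx)
    out = list(attn_row)
    for i in visual_idx:
        out[i] *= vis_boost  # restore visual saliency
    for rank, i in enumerate(text_idx):
        # Penalty is strongest at the first and last text positions,
        # where positional copying bias concentrates, and fades
        # linearly toward the middle of the retrieved context.
        edge = 1.0 - min(rank, n_text - 1 - rank) / max(n_text - 1, 1) * 2
        out[i] *= 1.0 - pos_penalty * max(edge, 0.0)
    total = sum(out)
    return [v / total for v in out]
```

Because the intervention only rescales existing attention weights at inference time, it adds no trainable parameters, consistent with the parameter-free framing above.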

Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.
