Ground Then Rank: Revisiting Knowledge-Based VQA with Training-Free Entity Identification

Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires grounding visual queries in external knowledge that goes beyond what is directly observable in the image. While recent multi-modal large language models (MLLMs) show strong perceptual abilities, they struggle on KB-VQA tasks that require grounding at both the fine-grained entity level and the evidence level.

Most existing multi-modal retrieval-augmented generation (MM-RAG) methods tightly couple entity discrimination and section-level evidence ranking in a single re-ranking stage, which leads to high cost and limited generalization. In this work, we revisit existing MM-RAG solutions from a workflow perspective and argue that both entity-level and fact-level grounding are key bottlenecks.

We observe that although MLLMs often fail at open-ended entity naming, they identify the correct entity more reliably when selecting from a small set of candidate names. Based on this insight, we propose a simple, training-free identify-before-answer (IBA) framework that decouples entity identification from section-level re-ranking.

Our approach prompts an MLLM to select high-confidence entities using only candidate names, and then applies an off-the-shelf textual re-ranker to select evidence. Experiments on Encyclopedic-VQA and InfoSeek show that our method consistently outperforms fine-tuned multi-modal re-ranking baselines while reducing training and inference complexity.
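
The two-stage workflow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`build_selection_prompt`, `parse_selection`, `identify_then_rank`, `token_overlap`) is hypothetical, the prompt wording is an assumption, and the toy data is for illustration only. The MLLM is represented by a caller-supplied function that receives only the prompt (a real call would also pass the image), the "high-confidence" criterion is not modeled, and a simple token-overlap score stands in for the off-the-shelf textual re-ranker.

```python
import re
from typing import Callable, Dict, List, Tuple


def build_selection_prompt(question: str, names: List[str]) -> str:
    """Closed-set prompt: the model chooses among numbered candidate names only."""
    options = "\n".join(f"{i}. {name}" for i, name in enumerate(names, start=1))
    return (
        f"Question: {question}\n"
        f"Candidate entities:\n{options}\n"
        "Reply with the numbers of the entities shown in the image."
    )


def parse_selection(reply: str, names: List[str]) -> List[str]:
    """Map numbers in the reply back to names; drop out-of-range picks and repeats."""
    picked: List[str] = []
    for token in re.findall(r"\d+", reply):
        index = int(token) - 1
        if 0 <= index < len(names) and names[index] not in picked:
            picked.append(names[index])
    return picked


def token_overlap(query: str, passage: str) -> float:
    """Toy relevance score (assumption): number of shared lowercase word tokens."""
    query_words = set(re.findall(r"\w+", query.lower()))
    passage_words = set(re.findall(r"\w+", passage.lower()))
    return float(len(query_words & passage_words))


def identify_then_rank(
    question: str,
    candidates: Dict[str, List[str]],      # entity name -> its text sections
    ask_mllm: Callable[[str], str],        # hypothetical MLLM call (image omitted)
    score: Callable[[str, str], float] = token_overlap,
    top_k: int = 3,
) -> List[Tuple[str, str, float]]:
    """Stage 1: pick entities by name only. Stage 2: rank their sections by text score."""
    names = list(candidates)
    selected = parse_selection(ask_mllm(build_selection_prompt(question, names)), names)
    scored = [
        (name, section, score(question, section))
        for name in selected
        for section in candidates[name]
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


# Toy example with a stubbed MLLM that always answers "1".
toy_candidates = {
    "Eiffel Tower": [
        "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
        "It was completed in 1889 for the World's Fair.",
    ],
    "Tokyo Tower": ["Tokyo Tower was completed in 1958."],
}
toy_result = identify_then_rank(
    "When was this tower completed?", toy_candidates, ask_mllm=lambda prompt: "1"
)
for entity, section, relevance in toy_result:
    print(entity, relevance, section)
```

Because entity selection sees only names and evidence ranking sees only text, either stage can be swapped independently in this sketch, for example by replacing `token_overlap` with a pretrained cross-encoder.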

Additional analyses reveal that the improvements arise not only from better entity identification, but also from selecting more informative evidence once the correct entity is fixed. Our implementation is publicly available to support reproducibility.
