Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting

Abstract: Knowledge-Based Visual Question Answering (KB-VQA) aims to evaluate whether Visual Language Models (VLMs) can retrieve, ground, and reason over external structured knowledge beyond visual evidence. In practice, answer accuracy is widely adopted as the primary evaluation metric, implicitly treating correctness as a proxy for knowledge-grounded reasoning. For existing KB-VQA benchmarks, however, this proxy rests on three critical assumptions that are often overlooked and that flaws in the benchmarks themselves can invalidate: (i) the annotated answer must be derivable from the associated knowledge base, (ii) the question must be well-posed with sufficient constraints, and (iii) the visual setting must meaningfully require grounded disambiguation.

In this work, we show that these assumptions are systematically violated in existing KB-VQA benchmarks. Our audit reveals a substantial number of instances with missing or contradicted answers and underspecified questions, which make accuracy a misleading metric. Furthermore, we find that existing datasets rely on visually trivial, single-entity scenes that bypass the need for sophisticated visual-to-knowledge mapping. We demonstrate that, even when model architectures are controlled, these flaws distort model rankings and lead to overestimates of reasoning capability.

To address these issues, we introduce (1) a principled audit-and-repair protocol that restores answer derivability and question clarity, and (2) a controlled multi-entity augmentation protocol that introduces visual ambiguity to challenge initial retrieval and grounded reasoning. Re-evaluation under the corrected and augmented settings yields markedly different performance trends. Our findings call for rethinking evaluation protocols and for designing more interaction-aware KB-VQA benchmarks that prioritize verifiable reasoning over simple matching.

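As a rough illustration of what auditing answer derivability and question clarity could look like, the sketch below classifies a benchmark item against a toy knowledge base. It is not the protocol used in this work: the `Item` fields, the `audit_item` function, the four labels, and the dictionary-based knowledge base are all hypothetical, and treating "several knowledge-base objects for one entity and relation" as a sign of an underspecified question is an assumption made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical benchmark item (field names are assumptions)."""
    entity: str    # knowledge-base entity the question is about
    relation: str  # knowledge-base relation the question asks for
    answer: str    # annotated gold answer


def audit_item(item, kb):
    """Classify one item against a toy KB.

    kb maps (entity, relation) -> set of object strings.
    Returns one of: "missing", "contradicted", "underspecified", "derivable".
    """
    objects = kb.get((item.entity, item.relation), set())
    normalized = {o.strip().lower() for o in objects}
    gold = item.answer.strip().lower()

    if not normalized:
        # The KB holds no fact that could support any answer.
        return "missing"
    if gold not in normalized:
        # The KB holds facts, but none matches the annotated answer.
        return "contradicted"
    if len(normalized) > 1:
        # Assumption: several valid objects suggest the question
        # lacks the constraints needed to single out one answer.
        return "underspecified"
    return "derivable"


# Toy knowledge base with made-up entities and values.
toy_kb = {
    ("tower_a", "architect"): {"Alice"},
    ("tower_a", "height"): {"330 m"},
    ("museum_b", "architect"): {"Bob", "Carol"},
}

items = [
    Item("tower_a", "architect", "alice"),
    Item("tower_a", "height", "300 m"),
    Item("tower_a", "opened", "1889"),
    Item("museum_b", "architect", "Carol"),
]

labels = [audit_item(it, toy_kb) for it in items]
print(labels)
```

Items labelled anything other than "derivable" would be candidates for repair or removal before accuracy is reported. A real audit would also need entity linking and more robust answer normalization than the lowercase string match used here.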