Mirage Probes: How Vision Models Fake Visual Understanding

Abstract: Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats it as a single failure mode. We argue that it is two.

Using Mirage Probes, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations at residual-stream, MLP, post-attention, and attention-head sites in two open-source VLMs. A Naive Bayes text baseline cannot recover this signal, which rules out surface lexical confounds.
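To make "linearly decodable" concrete, the following is a minimal sketch of a linear probe trained on paired activations. It is not the authors' implementation: the activations are synthetic, and the function names (`make_paired_activations`, `fit_linear_probe`, `probe_accuracy`), the dimensions, and the signal strength are all assumptions chosen for illustration. In practice the rows of `X` would be activations read from one site of a VLM.

```python
import numpy as np

def make_paired_activations(n_images, dim, signal=3.0, seed=0):
    """Synthetic stand-in for VLM activations (an assumption, not real data).

    Each image contributes one mirage and one non-mirage example that share
    the same per-image content and differ along a single hidden direction.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    base = rng.normal(size=(n_images, dim))            # content shared within a pair
    noise = 0.5 * rng.normal(size=(2, n_images, dim))  # per-example variation
    mirage = base + signal * direction + noise[0]
    grounded = base - signal * direction + noise[1]
    X = np.concatenate([mirage, grounded])
    y = np.concatenate([np.ones(n_images), np.zeros(n_images)])
    groups = np.concatenate([np.arange(n_images), np.arange(n_images)])
    return X, y, groups

def fit_linear_probe(X, y, lr=0.1, steps=500, l2=1e-2):
    """L2-regularised logistic regression fitted by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def probe_accuracy(w, b, X, y):
    """Fraction of examples on the correct side of the probe's hyperplane."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

if __name__ == "__main__":
    X, y, groups = make_paired_activations(n_images=400, dim=32)
    train = groups < 300  # split by image so no pair straddles train and test
    w, b = fit_linear_probe(X[train], y[train])
    print("held-out probe accuracy:", probe_accuracy(w, b, X[~train], y[~train]))
```

Splitting by image rather than by example keeps both members of a contrastive pair on the same side of the train/test boundary, so the probe cannot score well by memorising per-image content.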

Cross-benchmark separability patterns, together with a novel Prior Harnessing Index (PHI) that measures how much a model can answer from text alone, expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded.
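The abstract does not give the formula for PHI, so the snippet below should not be read as its definition. It is a hypothetical stand-in, named `text_only_recovery_ratio`, that illustrates the general idea of quantifying how much of a model's above-chance accuracy survives when the image is withheld. The normalisation by chance level and all argument names are assumptions.

```python
def text_only_recovery_ratio(acc_text_only, acc_with_image, chance):
    """Hypothetical illustration only, NOT the paper's PHI definition.

    Returns the share of above-chance accuracy that the model keeps when it
    sees the question without the image. A value near 1 means the benchmark
    is largely answerable from text; a value near 0 means the image matters.
    """
    gain_with_image = acc_with_image - chance
    if gain_with_image <= 0:
        raise ValueError("with-image accuracy must exceed chance")
    return (acc_text_only - chance) / gain_with_image

if __name__ == "__main__":
    # Made-up numbers for a four-way multiple-choice benchmark (chance = 0.25).
    print(text_only_recovery_ratio(acc_text_only=0.55, acc_with_image=0.85, chance=0.25))
```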

The distinction has direct consequences for mitigation: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model’s visual representations rather than in its text. Faithful visual grounding will require interventions at the representational level.