PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

Abstract: This paper studies multi-turn visual reasoning and observes that multimodal large language models (MLLMs) repeatedly fail to localize the target, which leads to long, redundant trajectories. We attribute this failure to the entanglement of reasoning and perception within a single model: the MLLM reasons and localizes simultaneously, and inaccurate localization triggers additional reasoning turns that bloat the trajectory.

To solve this problem, we propose PixelEyes, a multi-turn visual reasoning agent that explicitly decouples reasoning from perception: the reasoner decides what to look for, while a specialized perception tool answers where it is. Specifically, PixelEyes introduces two components. 1) Mask-guided Visual Search: a referring segmentation model is invoked to provide mask-precise localization, freeing the reasoner from the need to compensate for imprecise grounding. 2) Semantic-region Breadth-first Search (BFS): to eliminate redundant loops caused by repeatedly cropping incorrect sub-regions, we organize exploration as a breadth-first search over semantic regions.
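
The breadth-first exploration described above can be illustrated with a minimal sketch. This is not the PixelEyes implementation: the `Region` tree, the `bfs_search` function, and the `found_in` callback (a stand-in for a call to the perception tool) are all hypothetical names chosen for illustration. The point it shows is that a queue-based traversal visits each semantic region at most once, so an incorrect sub-region is never cropped twice.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Region:
    """Hypothetical semantic region; children are finer-grained sub-regions."""
    name: str
    children: List["Region"] = field(default_factory=list)


def bfs_search(
    root: Region, found_in: Callable[[Region], bool]
) -> Tuple[Optional[Region], List[str]]:
    """Visit regions level by level until `found_in` reports the target.

    `found_in` is an assumed placeholder for the perception tool.
    Returns the matching region (or None) and the visit order.
    """
    queue = deque([root])
    visited: List[str] = []
    while queue:
        region = queue.popleft()
        visited.append(region.name)
        if found_in(region):
            return region, visited
        # Enqueue sub-regions; in a tree, each one is reached exactly once.
        queue.extend(region.children)
    return None, visited


# Toy scene: the target is only resolvable once the "monitor" region is inspected.
scene = Region("image", [
    Region("shelf", [Region("top shelf"), Region("bottom shelf")]),
    Region("desk", [Region("monitor"), Region("keyboard")]),
])
hit, order = bfs_search(scene, lambda r: r.name == "monitor")
print(hit.name, order)
```

In this toy scene the coarse regions are checked before any fine-grained one, and the search stops as soon as the target is found, without revisiting a region.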

To internalize these capabilities, we construct the PixelEyes-6K dataset by resynthesizing expert trajectories from existing data, which explicitly embeds our mask-guided search and BFS logic into the model. We further introduce Pinpoint-Bench, a zero-hint visual search benchmark in which no location cues are provided in the question. Its instance-level masks and bounding boxes separate localization failures from reasoning failures, enabling fine-grained analysis of failure modes such as inattentional blindness. Recent state-of-the-art MLLMs and visual reasoning agents leave large headroom on Pinpoint-Bench, demonstrating its quality and difficulty. Code and models are open-sourced.
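
One way ground-truth boxes could be used to separate the two failure types is sketched below. This is an illustrative assumption, not the Pinpoint-Bench evaluation protocol: the function names, the `(x1, y1, x2, y2)` box format, and the IoU threshold of 0.5 are all hypothetical.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classify_outcome(
    pred_box: Optional[Box],
    gt_box: Box,
    answer_correct: bool,
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the paper
) -> str:
    """Label a sample as correct, a localization failure, or a reasoning failure."""
    if answer_correct:
        return "correct"
    if pred_box is None or box_iou(pred_box, gt_box) < iou_threshold:
        # The model never looked at the right place.
        return "localization_failure"
    # The model looked at the right place but still answered incorrectly.
    return "reasoning_failure"


gt = (100.0, 100.0, 140.0, 140.0)
print(classify_outcome((0.0, 0.0, 40.0, 40.0), gt, answer_correct=False))
print(classify_outcome((102.0, 101.0, 141.0, 139.0), gt, answer_correct=False))
```

Under these assumptions, a wrong answer with a poorly overlapping box is attributed to localization, while a wrong answer with a well-overlapping box is attributed to reasoning.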
