Do vision-language models search like humans? Reasoning tokens as a reaction-time analog in classic visual-search paradigms

Abstract: Visual search has been one of the most productive paradigms in the study of visual attention: the way reaction time scales with the number of items distinguishes parallel, “pop-out” search from serial, attention-demanding search. I ask whether vision-language models (VLMs) exhibit the same behavioral signatures.

I adapt four classic paradigms, namely feature versus conjunction search, spatial-configuration (T-vs-L) search, enumeration, and the tilted/vertical search asymmetry, and present them to current frontier and mid-tier models. Because a single model call has no reaction time, I use the number of reasoning (“thinking”) tokens a model spends per trial as a within-model analog of search effort, and I compare the results against a large public human benchmark (Wolfe et al., 2010).

The models reproduce several human signatures: effort stays flat across set sizes in feature search but climbs with set size in conjunction search; frontier models hold accuracy where mid-tier models collapse to chance; and a resolution control shows that the conjunction cost reflects genuine search rather than difficulty resolving small shapes.
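
As a minimal sketch of how an effort slope of this kind could be estimated, assume each trial is logged as a (condition, set size, reasoning-token count) record. The slope is then the ordinary least-squares fit of tokens against set size within each condition. The function names and the toy numbers below are illustrative assumptions, not the analysis code or data behind these results.

```python
from collections import defaultdict


def effort_slope(trials):
    """Least-squares slope of reasoning tokens against set size.

    `trials` is a list of (set_size, reasoning_tokens) pairs.
    The result is in tokens per additional display item.
    """
    n = len(trials)
    if n < 2:
        raise ValueError("need at least two trials")
    mean_x = sum(x for x, _ in trials) / n
    mean_y = sum(y for _, y in trials) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in trials)
    if sxx == 0:
        raise ValueError("need at least two distinct set sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in trials)
    return sxy / sxx


def slopes_by_condition(records):
    """Group (condition, set_size, tokens) records and fit one slope each."""
    grouped = defaultdict(list)
    for condition, set_size, tokens in records:
        grouped[condition].append((set_size, tokens))
    return {cond: effort_slope(trials) for cond, trials in grouped.items()}


# Made-up numbers for illustration only; NOT results from this study.
toy_records = [
    ("feature", 4, 100), ("feature", 8, 100), ("feature", 16, 100),
    ("conjunction", 4, 140), ("conjunction", 8, 180), ("conjunction", 16, 260),
]

if __name__ == "__main__":
    for cond, slope in slopes_by_condition(toy_records).items():
        print(f"{cond}: {slope:.2f} tokens per item")
```

Under this reading, a slope near zero corresponds to the flat, pop-out pattern, and a clearly positive slope corresponds to effort that grows with set size. The same grouping could be keyed on target-present versus target-absent trials to compare those two slopes.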

They also diverge from humans in informative ways. The target-present effort slope exceeds the target-absent slope, reversing the human ordering; enumeration remains accurate where humans would lose count; and a reasoning model with adaptive deliberation declines to deliberate on detection tasks altogether. As a result, the same search task shows up as an effort gradient in one model and as an accuracy cliff in another.

I argue that psychophysical paradigms, applied behaviorally, are a sharp and inexpensive probe of machine visual cognition, and that the points of divergence are as informative as the points of agreement.
