ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Abstract: Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, and unlike in other domains, adding history has yielded little or no performance improvement.

We address this inefficiency by introducing ReVision, a method for training multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots and drops the redundant ones, while preserving the spatial structure required by the model.
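
To make the idea of cross-screenshot patch comparison concrete, the following is a minimal sketch and not the ReVision selector itself. It swaps the learned selector for a fixed cosine-similarity threshold, and the names `select_patches`, `compress`, and `threshold` are hypothetical. It assumes each screenshot is already encoded as an (H, W, D) grid of patch representations, and it keeps the grid coordinates of surviving patches as one possible way to retain spatial structure.

```python
import numpy as np

def select_patches(prev, curr, threshold=0.95):
    """Return a boolean keep-mask over the patch grid of `curr`.

    prev, curr: arrays of shape (H, W, D), one D-dim representation per patch.
    A patch is kept when its cosine similarity to the patch at the same grid
    position in the previous screenshot is below `threshold`.
    NOTE: a fixed threshold is an illustrative stand-in for a learned selector.
    """
    dot = (prev * curr).sum(axis=-1)
    norms = np.linalg.norm(prev, axis=-1) * np.linalg.norm(curr, axis=-1)
    similarity = dot / (norms + 1e-8)  # epsilon avoids division by zero
    return similarity < threshold

def compress(curr, keep):
    """Gather kept patches together with their original (row, col) positions."""
    rows, cols = np.nonzero(keep)                # row-major order
    tokens = curr[rows, cols]                    # shape (K, D)
    positions = np.stack([rows, cols], axis=1)   # shape (K, 2)
    return tokens, positions

# Toy example: a 4x6 grid of 8-dim patches where only two patches change.
rng = np.random.default_rng(0)
prev = rng.normal(size=(4, 6, 8))
curr = prev.copy()
curr[1, 2] = -prev[1, 2]   # simulate a changed region
curr[3, 5] = -prev[3, 5]   # simulate another changed region

keep = select_patches(prev, curr)
tokens, positions = compress(curr, keep)
print(f"kept {len(tokens)} of {keep.size} patches at {positions.tolist()}")
```

Because the kept tokens carry their original grid coordinates, a downstream model could in principle still apply position-dependent encodings to them; how ReVision actually preserves spatial structure is not specified in this abstract.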

Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens.
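
As a rough back-of-the-envelope illustration of what a 46% token reduction could mean for a fixed budget, the sketch below computes how much more pre-reduction content fits in the same number of tokens. It assumes the average reduction stays constant as more history is added, which the abstract does not claim; `history_capacity_gain` is a hypothetical helper name.

```python
def history_capacity_gain(reduction: float) -> float:
    """Factor by which more (pre-reduction) tokens fit into a fixed budget.

    reduction: fraction of tokens removed, e.g. 0.46 for a 46% reduction.
    ASSUMPTION: the reduction rate is uniform regardless of history length.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)

gain = history_capacity_gain(0.46)
print(f"{gain:.2f}x")  # prints "1.85x"
```

Under that assumption, a budget that previously held a given amount of visual history would hold roughly 1.85 times as much after redundancy removal.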

With this improved efficiency, we revisit the role of history in CUAs and find that, once redundancy is removed, performance continues to improve as more past observations are incorporated. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but is instead a consequence of inefficient token representations.
