MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives. Their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld, which serve both as evaluation tools and as training environments for reinforcement learning.

However, macOS remains underserved in this landscape: the only existing benchmark, macOSWorld, covers a narrow slice of first-party applications with simpler tasks, and it runs on x86 virtual machines that are incompatible with Apple Silicon.

We introduce MacArena, a benchmark of 421 manually verified tasks spanning 50 applications. It combines a curated port of OSWorld tasks, content sourced from macOSWorld, and 49 new macOS-native tasks, all running on Apple’s native Virtualization framework on Apple Silicon.

We argue that macOS presents distinct GUI challenges beyond what Linux-based benchmarks capture, and our evaluation supports this claim: strong model performance on existing benchmarks can reflect familiarity with task distributions rather than genuine cross-platform GUI competence.

Notably, model rankings invert between ported and macOS-native tasks, with a leading model trailing by over 26% on the MacArena subset. This suggests that macOS poses a genuinely harder environment for current GUI agents.