PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

Abstract: Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred.

We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers.
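
The combination of deterministic routing, bounded GUI delegation, and an auditable trace can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the PhoneHarness implementation: the names `Action`, `TraceEntry`, `Harness`, `handlers`, and `gui_budget`, the surface labels, and the stub handlers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    surface: str  # "gui", "cli", or "tool" (hypothetical labels)
    payload: str


@dataclass
class TraceEntry:
    step: int
    surface: str
    payload: str
    result: str


@dataclass
class Harness:
    handlers: Dict[str, Callable[[str], str]]
    gui_budget: int = 3  # assumed cap on delegated GUI steps
    gui_used: int = 0
    trace: List[TraceEntry] = field(default_factory=list)

    def execute(self, action: Action) -> str:
        # Deterministic routing: the surface label alone selects the handler.
        if action.surface not in self.handlers:
            result = "error: unknown surface"
        elif action.surface == "gui" and self.gui_used >= self.gui_budget:
            # Bounded delegation: refuse GUI steps once the budget is spent.
            result = "error: gui budget exhausted"
        else:
            if action.surface == "gui":
                self.gui_used += 1
            result = self.handlers[action.surface](action.payload)
        # Every attempt, including refusals, is recorded for later auditing.
        self.trace.append(
            TraceEntry(len(self.trace) + 1, action.surface, action.payload, result)
        )
        return result


def demo() -> List[TraceEntry]:
    # Stub handlers stand in for real GUI, CLI, and host-side tool backends.
    harness = Harness(
        handlers={
            "gui": lambda p: f"gui ok: {p}",
            "cli": lambda p: f"cli ok: {p}",
            "tool": lambda p: f"tool ok: {p}",
        },
        gui_budget=1,
    )
    plan = [
        Action("cli", "list files"),
        Action("gui", "tap Send"),
        Action("gui", "tap Confirm"),
        Action("tool", "check inbox"),
    ]
    for action in plan:
        harness.execute(action)
    return harness.trace
```

In this sketch, calling `demo()` returns a four-entry trace in which the second GUI action is refused because the assumed budget of one step is already spent. A refusal is logged like any other step, so a verifier could later check both what ran and what was blocked.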

On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness setting by 12.9 percentage points. PhoneHarness and PhoneHarness Bench play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only on visual GUI control.
