Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

仿生人会梦见破坏游戏吗？使用 BenchJack 系统化审计 AI 智能体基准测试

Abstract: Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design.

摘要： 智能体（Agent）基准测试已成为衡量前沿 AI 能力的事实标准，指导着模型的选择、投资和部署。然而，前沿模型在无需过拟合的情况下，会自发出现“奖励欺骗”（reward hacking）现象，即智能体在未完成预期任务的情况下最大化得分。我们认为，基准测试必须在设计之初就具备安全性。

From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner.

通过分析过往的奖励欺骗案例，我们总结出八种反复出现的缺陷模式，并将其汇编为供基准测试设计者使用的“智能体评估检查清单”（Agent-Eval Checklist）。我们将这些见解浓缩为 BenchJack，这是一个自动化的红队测试系统，它驱动编码智能体以“先知”般的方式审计基准测试，并识别潜在的奖励欺骗漏洞。

Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations.

此外，我们将 BenchJack 扩展为一个迭代式的生成对抗流水线，能够发现新缺陷并进行迭代修复，从而提高基准测试的稳健性。我们将 BenchJack 应用于 10 个流行的智能体基准测试，涵盖软件工程、网页导航、桌面计算和终端操作等领域。

BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack’s extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations.

BenchJack 合成了多种奖励欺骗手段，在大多数基准测试中，智能体无需解决任何实际任务即可获得近乎完美的得分，并揭示了八大类中共 219 个不同的缺陷。此外，BenchJack 的扩展流水线在四个没有致命设计缺陷的基准测试中，将可被攻击的任务比例从近 100% 降低到 10% 以下，并在三次迭代内完全修复了 WebArena 和 OSWorld。

Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.

我们的研究结果表明，当前的评估流水线尚未内化对抗性思维，而主动审计有助于弥补快速发展的基准测试领域中的安全漏洞。