Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety
Abstract: An attacker that strategically chooses when to attack is much harder to catch than one that attacks indiscriminately. AI control is a safety framework for deploying capable but untrusted AI agents under the oversight of a weaker, trusted monitor and a limited human audit budget. Control evaluations stress-test these protocols by pitting a red-team attack policy against the blue-team monitor, but current evaluations typically assume attackers that do not strategically select when to attack.
We study this capability, attack selection, in agentic settings by decomposing attack decisions into a start policy, which decides when an attacker should attack, and a stop policy, which decides when an attacker should abort an ongoing attack. Across two agentic settings, BashArena and LinuxArena, both policies substantially lower measured empirical safety without changing the underlying attack capability.
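The start/stop decomposition can be illustrated with a minimal sketch. Everything below is hypothetical and for illustration only: the `Task` fields, the threshold values, and the function names are assumptions, not the paper's implementation. It assumes the attacker has an upfront estimate of how promising a task is and a running per-step estimate of how suspicious its actions would look to the monitor.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    opportunity: float            # hypothetical upfront estimate of attack opportunity
    step_suspicion: List[float]   # hypothetical per-step suspicion estimates if attacking


# A start policy sees the task before acting; a stop policy sees the
# suspicion estimates observed so far during an ongoing attack.
StartPolicy = Callable[[Task], bool]
StopPolicy = Callable[[List[float]], bool]


def run_episode(task: Task, start: StartPolicy, stop: StopPolicy) -> str:
    """Return 'honest', 'aborted', or 'attacked' for one task."""
    if not start(task):
        return "honest"           # start policy declines: behave honestly
    seen: List[float] = []
    for s in task.step_suspicion:
        seen.append(s)
        if stop(seen):
            return "aborted"      # stop policy abandons the ongoing attack
    return "attacked"             # attack carried through to the end


# Illustrative threshold policies (the values are arbitrary assumptions).
def start_policy(task: Task) -> bool:
    return task.opportunity >= 0.5


def stop_policy(seen: List[float]) -> bool:
    return seen[-1] > 0.7


tasks = [
    Task("a", opportunity=0.9, step_suspicion=[0.1, 0.2, 0.3]),
    Task("b", opportunity=0.2, step_suspicion=[0.1, 0.1]),
    Task("c", opportunity=0.8, step_suspicion=[0.2, 0.9, 0.1]),
]
outcomes = [run_episode(t, start_policy, stop_policy) for t in tasks]
print(outcomes)
```

In this toy setup the underlying attack capability is unchanged. The two policies only filter which attacks are attempted and which are completed, and that filtering is the effect the evaluation isolates.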
At a 1% audit budget, our start policy reduces safety by 20pp on both BashArena and LinuxArena, and our stop policy reduces safety by 20pp on BashArena and 28pp on LinuxArena. These reductions should be interpreted as upper bounds on the effect of attack selection. Existing control evaluations may therefore yield overly optimistic safety estimates against selective attackers. We recommend that future evaluations, system cards, and safety cases elicit attack selection to produce more realistic safety estimates.