Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

Abstract: Agents that use computers and the Web inevitably encounter errors: inaccessible webpages, missing files, local and remote misconfigurations, etc. These errors do not thwart agents based on state-of-the-art models. Instead, the agents helpfully continue to look for ways to complete their tasks.

We introduce, characterize, and measure a new type of agent failure that we call accidental meltdown: unsafe or harmful behavior in response to a benign environmental error, in the absence of any adversarial inputs. Because meltdowns are not captured by existing reliability or safety benchmarks, we develop a taxonomy of meltdown behaviors.

We then implement an agent-agnostic infrastructure for injecting simulated local and remote errors into the rollout environment and use it to systematically evaluate agent systems powered by GPT, Grok, and Gemini. Our evaluation demonstrates that meltdowns (e.g., conducting unauthorized reconnaissance or subverting access control) of varying severity and success occur in 64.7% of agent rollouts that encounter simulated errors, spanning all combinations of agent system, backing model, and error type.
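
As a rough illustration of what agent-agnostic error injection could look like, the sketch below wraps an agent's tool functions so that selected calls fail with a simulated error. It is not the infrastructure described above: the `ErrorInjector` class, the `read_file` and `fetch_url` tools, and the fault table are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ErrorInjector:
    """Hypothetical wrapper that makes selected tool calls fail on purpose."""

    # Maps a tool name to a factory that builds the simulated error to raise.
    faults: Dict[str, Callable[[], Exception]]
    # Record of every injected error, so rollouts can be audited afterwards.
    log: List[str] = field(default_factory=list)

    def wrap(self, name: str, tool: Callable[..., str]) -> Callable[..., str]:
        """Return a version of `tool` that raises the configured fault, if any."""

        def wrapped(*args, **kwargs) -> str:
            make_fault = self.faults.get(name)
            if make_fault is not None:
                error = make_fault()
                self.log.append(f"injected {type(error).__name__} into {name}")
                raise error
            return tool(*args, **kwargs)

        return wrapped


# Two stand-in tools (assumptions for this sketch, not real agent tools).
def read_file(path: str) -> str:
    return f"contents of {path}"


def fetch_url(url: str) -> str:
    return f"<html>{url}</html>"


# Simulate a missing local file while leaving remote access untouched.
injector = ErrorInjector(
    faults={"read_file": lambda: FileNotFoundError("simulated: no such file")}
)
tools = {
    name: injector.wrap(name, fn)
    for name, fn in {"read_file": read_file, "fetch_url": fetch_url}.items()
}
```

Because the wrapper only touches the tool boundary, the same fault table could in principle be reused across different agent systems and backing models; how the agent reacts after the raised error is what an evaluation would then inspect.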

In over half of these meltdowns, unsafe behaviors are not reported to the user. Comparing behaviors of the same agents with and without errors, we find that exploration in response to errors is correlated with unsafe and harmful behavior.
