Securing the future of AI agents

June 18, 2026 | Responsibility & Safety

How we’re securing internal systems against increasingly capable and imperfectly aligned AI

AI agents are transforming our relationship with technology. By autonomously executing complex tasks, from cyber defense to scientific discovery and product development, these systems are unlocking a new era of productivity. In the U.S. alone, AI agents could create $2.9 trillion in economic value by 2030.

As these agents become more capable, they also require more sophisticated safeguards. That’s why we developed our AI Control Roadmap: a framework for building and managing the advanced AI we deploy within Google. This “defense-in-depth” approach, which could serve as a model for the wider industry, goes beyond traditional model alignment by adding a crucial layer of system-level security that provides assurance even if alignment is imperfect.

Understanding AI Control

Our approach to security starts with a strong foundation of traditional safeguards such as sandboxing, endpoint security, and prompt injection resistance. On top of this, the AI Control Roadmap uses model alignment, i.e. training AI to be inherently safe and helpful, as a primary defense. The roadmap then adds a further layer of security by treating internal agents as potentially misaligned, so that protections still hold if alignment is imperfect.

Think of it like a driving instructor with dual controls. The instructor trusts the student but stays ready to take the wheel or hit the brakes if a mistake occurs. Similarly, our AI control system grants AI agents permissions based on their verified behavior, allowing us to build trust through controlled, incremental access.

Our roadmap is built on a simple but cautious question: what if a highly capable AI agent isn’t aligned with our goals and acts in unexpected ways? To address this, our plan focuses on three key areas.

First, following best practices in cybersecurity, we have developed a novel threat-modeling framework for AI. The roadmap treats untrusted AI agents as potential “insider threats”, similar to the way a company would treat a potential rogue employee who already has access to the office. By building on the industry-standard MITRE ATT&CK framework, a globally accessible knowledge base of adversary tactics and techniques based on real-world observations, we break down potential attacks into smaller tactics and techniques. This allows us to methodically track risks, spot bad behavior before it causes damage, and safely run drills to test whether our defenses actually hold up.

Second, we lay out a plan for deploying AI control mitigations that address these risks.

  • Identifying the problem (Detection): Monitoring is the backbone of AI control. We use other, trusted AI systems as “supervisors” to constantly review a working agent’s reasoning, actions and plans to see if it is going off track.

  • Managing the problem (Prevention and Response): If a supervisor sees the agent trying to do something harmful, it can step in to block the action before damage occurs.

Finally, to verify that these safeguards work as intended, we continuously measure performance on three critical metrics: coverage (the fraction of traffic monitored), recall (the fraction of misaligned behaviors caught), and time-to-response.
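To make the three metrics concrete, here is a minimal sketch of how they might be computed from labeled episodes. The `Episode` record, its field names, and the choice to report time-to-response as a mean over caught episodes are all assumptions made for illustration; the post does not describe how the metrics are implemented internally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    """One agent episode. All field names here are hypothetical."""
    monitored: bool                            # was this episode reviewed by a monitor?
    misaligned: bool                           # ground-truth label, e.g. from a drill
    flagged: bool                              # did the monitor flag the episode?
    response_seconds: Optional[float] = None   # delay between behavior and response


def control_metrics(episodes: List[Episode]) -> dict:
    total = len(episodes)
    # Coverage: fraction of all traffic that a monitor looked at.
    coverage = sum(e.monitored for e in episodes) / total if total else 0.0

    # Recall: fraction of misaligned episodes that were caught.
    # Assumption: unmonitored misaligned episodes count as misses.
    misaligned = [e for e in episodes if e.misaligned]
    caught = [e for e in misaligned if e.monitored and e.flagged]
    recall = len(caught) / len(misaligned) if misaligned else 0.0

    # Time-to-response: mean delay over caught episodes with a recorded response.
    delays = [e.response_seconds for e in caught if e.response_seconds is not None]
    mean_ttr = sum(delays) / len(delays) if delays else None

    return {
        "coverage": coverage,
        "recall": recall,
        "mean_time_to_response": mean_ttr,
    }


episodes = [
    Episode(monitored=True, misaligned=True, flagged=True, response_seconds=30.0),
    Episode(monitored=True, misaligned=True, flagged=False),
    Episode(monitored=False, misaligned=False, flagged=False),
    Episode(monitored=True, misaligned=False, flagged=False),
]
metrics = control_metrics(episodes)
```

Counting unmonitored misaligned episodes as misses ties recall to coverage: a monitor that is accurate but sees little traffic still scores poorly, which is one reason to track the two numbers together.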

Scaling security as AI gets smarter

As AI models continue to advance, our defenses must strengthen in tandem. In our AI Control Roadmap, we map security protocols to measurable milestones in AI capabilities on two critical fronts:

  • Ability to evade detection: Today, we monitor a model’s visible chain-of-thought. However, as models learn to hide their reasoning — either through oversight awareness (knowing they are being monitored) or opaque reasoning (reasoning without using visible text) — simply reading their verbalized reasoning will not be enough. We will need to analyze their behavior and potentially inspect the model’s inner workings.

  • Ability to cause harm: For low-risk and reversible actions, we can rely on reviewing transcripts with a delay and remediating issues quickly after they occur (asynchronous response). For particularly high-risk actions, such as major cyber attacks, we must shift to real-time prevention, blocking harmful actions before they can be executed (synchronous response). In other words, our response strategy must scale with the severity of the threat.

The AI Control Roadmap recommends security measures based on the deployed model’s capabilities. The model’s detection evasion capability determines the Detection level (D1-D4), and the model’s attack execution capability determines the Prevention and Response level (R1-R3).
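The split between asynchronous and synchronous response described above can be sketched as a small routing function. This is only an illustration: the `ProposedAction` fields and the `choose_response_mode` helper are hypothetical names, and the post does not give the criteria behind the D1-D4 or R1-R3 levels, so none are encoded here. The sketch also assumes that a low-risk but irreversible action is conservatively handled synchronously, a case the post does not address.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseMode(Enum):
    ASYNCHRONOUS = "review the transcript with a delay, then remediate"
    SYNCHRONOUS = "block the action before it executes"


@dataclass
class ProposedAction:
    """An action an agent wants to take. Field names are hypothetical."""
    description: str
    high_risk: bool
    reversible: bool


def choose_response_mode(action: ProposedAction) -> ResponseMode:
    # Low-risk and reversible: delayed review with quick remediation is acceptable.
    if not action.high_risk and action.reversible:
        return ResponseMode.ASYNCHRONOUS
    # High-risk actions need real-time prevention. Assumption: irreversible
    # actions are treated the same way, since they cannot be remediated later.
    return ResponseMode.SYNCHRONOUS


scratch_edit = ProposedAction("edit a scratch file", high_risk=False, reversible=True)
prod_deploy = ProposedAction("push to production", high_risk=True, reversible=False)

scratch_mode = choose_response_mode(scratch_edit)
deploy_mode = choose_response_mode(prod_deploy)
```

In practice the risk and reversibility judgments would themselves come from a policy or a monitor rather than from hand-set booleans; they are fixed here only to keep the example self-contained.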

Analyzing a million agent trajectories to inform live monitoring

To build these systems, we need data. Our team has built an internal prototype of an asynchronous monitoring system that reviews coding agent trajectories and escalates findings for human review. Each flagged event is classified against the threat taxonomy from our roadmap.
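The shape of such a review loop might look like the sketch below. It is not the prototype itself: the `Trajectory` and `Finding` types, the `review_trajectories` function, and the three category labels (borrowed from MITRE ATT&CK tactic names as placeholders) are assumptions, since the post does not publish the roadmap's taxonomy. The `toy_classifier` is a stand-in for a trusted monitor model; a real monitor would judge behavior in context rather than match a command prefix.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Placeholder labels borrowed from MITRE ATT&CK tactic names.
# The roadmap's actual threat taxonomy is not given in this post.
TAXONOMY = ("Privilege Escalation", "Defense Evasion", "Exfiltration")


@dataclass
class Trajectory:
    task_id: str
    steps: List[str]


@dataclass
class Finding:
    task_id: str
    step_index: int
    category: str


def review_trajectories(
    trajectories: List[Trajectory],
    classify_step: Callable[[str], Optional[str]],
    escalate: Callable[[Finding], None],
) -> List[Finding]:
    """Review completed trajectories and escalate every flagged step."""
    findings = []
    for trajectory in trajectories:
        for index, step in enumerate(trajectory.steps):
            category = classify_step(step)
            if category is None:
                continue  # nothing suspicious in this step
            if category not in TAXONOMY:
                raise ValueError(f"unknown taxonomy category: {category!r}")
            finding = Finding(trajectory.task_id, index, category)
            escalate(finding)  # hand off for human review
            findings.append(finding)
    return findings


def toy_classifier(step: str) -> Optional[str]:
    # Stand-in for a trusted monitor model, for illustration only.
    return "Privilege Escalation" if step.startswith("sudo ") else None


human_review_queue: List[Finding] = []
trajectories = [
    Trajectory("task-1", ["git status", "pytest -q"]),
    Trajectory("task-2", ["ls", "sudo cat /etc/shadow"]),
]
findings = review_trajectories(trajectories, toy_classifier, human_review_queue.append)
```

Passing the classifier and the escalation hook in as callables keeps the loop independent of any particular monitor model or review tool, so either can be swapped without touching the traversal logic.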

We’ve already analyzed a million coding agent tasks. The results give us valuable insights for refining our safety protocols and let us move beyond simple keyword filtering to identifying high-signal behavioral patterns. For example, this research has been instrumental in building a live monitor for the Gemini Spark agent.