The AI Agent Security Surface: What Gets Exposed When You Add Tools and Memory

The AI Agent Security Surface: What Gets Exposed When You Add Tools and Memory

AI 智能体安全面:当你添加工具和记忆时,哪些东西会被暴露?

Agentic AI: The AI Agent Security Surface: What Gets Exposed When You Add Tools and Memory. Standard prompt attacks are merely the beginning. A structured framework to map and mitigate the backend attack vectors of agentic workflows. 智能体 AI:AI 智能体安全面:当你添加工具和记忆时,哪些东西会被暴露?标准的提示词攻击仅仅是个开始。这是一个用于映射和缓解智能体工作流后端攻击向量的结构化框架。

Mostafa Ibrahim | May 8, 2026 | 10 min read Mostafa Ibrahim | 2026年5月8日 | 10分钟阅读

From LLM to Agent: Why the Threat Model Changes

从大语言模型(LLM)到智能体:为什么威胁模型发生了变化

Most AI security work focuses on the model: what it says, what it refuses, and how it handles malicious prompts. This framing made sense when AI was a text interface. The user sends a message, and it responds. The attack surface was narrow and well-defined. 大多数 AI 安全工作都集中在模型本身:它说了什么、拒绝了什么,以及如何处理恶意提示词。当 AI 仅仅是一个文本界面时,这种框架是有意义的。用户发送消息,它进行回复。攻击面狭窄且定义明确。

Agents change the shape of the problem entirely. An AI agent does much more than generate text. It plans, uses tools, stores memory across sessions, and often coordinates with other agents to complete multi-step tasks. Think of the difference between a navigation app suggesting a route and an autopilot system wired directly into the vehicle’s steering and throttle. One provides information. The other executes control. The risk model is no longer comparable. 智能体彻底改变了问题的形态。AI 智能体所做的远不止生成文本。它会进行规划、使用工具、跨会话存储记忆,并经常与其他智能体协作以完成多步骤任务。试想一下导航应用建议路线与直接连接到车辆转向和油门的自动驾驶系统之间的区别。前者提供信息,后者执行控制。风险模型已不可同日而语。

The numbers confirm this is no longer a theoretical concern. According to Gravitee’s 2026 State of AI Agent Security report, based on a survey of more than 900 executives and practitioners: 数据证实,这已不再是理论上的担忧。根据 Gravitee 发布的《2026 年 AI 智能体安全状况报告》(基于对 900 多名高管和从业者的调查):

  • 88 percent of organizations reported confirmed or suspected AI agent security incidents in the past year.
  • 88% 的组织报告在过去一年中发生过已确认或疑似的 AI 智能体安全事件。
  • Only 14.4 percent of agentic systems went live with full security and IT approval.
  • 仅有 14.4% 的智能体系统在获得完整的安全和 IT 批准后上线。

This pattern extends across the industry. A 2026 report from Apono found that 98 percent of cybersecurity leaders report friction between accelerating agentic AI adoption and meeting security requirements, resulting in slowed or constrained deployments. That gap between deployment speed and security readiness is where incidents happen. 这种模式在整个行业中普遍存在。Apono 的一份 2026 年报告发现,98% 的网络安全领导者表示,在加速采用智能体 AI 与满足安全要求之间存在摩擦,导致部署放缓或受限。部署速度与安全准备程度之间的差距正是事故发生的地方。

The Four-Surface Attack Taxonomy

四大攻击面分类法

A standalone LLM has one attack surface: the prompt. An agent exposes four: 独立的 LLM 只有一个攻击面:提示词。而智能体则暴露了四个:

  1. The Prompt Surface: Reading external inputs.
  2. 提示词面: 读取外部输入。
  3. The Tool Surface: Executing backend actions.
  4. 工具面: 执行后端操作。
  5. The Memory Surface: Remembering past sessions.
  6. 记忆面: 记忆过往会话。
  7. The Planning Loop Surface: Deciding next steps.
  8. 规划循环面: 决定后续步骤。

Each surface has its own attack patterns. Defenses built for one do not transfer to the others. 每个面都有其独特的攻击模式。为其中一个面构建的防御措施无法直接迁移到其他面。

The prompt surface: when the agent reads the wrong thing

提示词面:当智能体读取了错误的内容时

The user input is perfectly clean. The vulnerability lies in everything else the agent consumes. When an agent fetches a webpage, a RAG document, or a backend response, these inputs arrive without a trust boundary. Attackers don’t compromise the user interface; they plant payloads where the agent will eventually look. This is indirect prompt injection. 用户输入可能非常干净,但漏洞存在于智能体消费的其他所有内容中。当智能体抓取网页、RAG 文档或后端响应时,这些输入在没有信任边界的情况下进入。攻击者不会破坏用户界面,而是将载荷植入智能体最终会查看的地方。这就是间接提示词注入。

Because models flatten all text into a single context window, they cannot distinguish your system instructions from a hidden command inside a retrieved PDF. They treat the malicious text as trusted context. Even tool docstrings and parameter names can invisibly hijack the agent’s behavior, leading to silent data exfiltration upstream while the user sees a normal response. 由于模型将所有文本扁平化到一个上下文窗口中,它们无法区分你的系统指令和检索到的 PDF 中隐藏的命令。它们将恶意文本视为可信上下文。甚至工具的文档字符串(docstrings)和参数名称都可能隐蔽地劫持智能体的行为,导致在上游进行静默数据窃取,而用户看到的却是一个正常的响应。

What Defense Looks Like Here: 此处的防御措施:

  • Boundary sanitization: Treat all external data as untrusted at every retrieval point.
  • 边界清理: 在每个检索点将所有外部数据视为不可信。
  • Instruction separation: Use structured formats to isolate system prompts from fetched content.
  • 指令隔离: 使用结构化格式将系统提示词与抓取的内容隔离开来。
  • Pre-execution filtering: Scan for exfiltration patterns before any tool fires.
  • 执行前过滤: 在任何工具触发前扫描数据窃取模式。

The Tool surface: when reading becomes doing

工具面:当“读取”变为“执行”时

Every tool an agent can call is a permission boundary, making it a primary target for exploitation. The core attack is parameter injection: manipulating the agent into passing attacker-controlled values into tools that trigger real-world consequences, like database writes or signed API requests. 智能体可以调用的每一个工具都是一个权限边界,使其成为攻击的主要目标。核心攻击是参数注入:操纵智能体将攻击者控制的值传递给触发现实后果的工具,例如数据库写入或签名 API 请求。

What Defense Looks Like Here: 此处的防御措施:

  • Least Privilege: Scope permissions strictly to the exact task.
  • 最小权限原则: 将权限严格限制在具体任务范围内。
  • Parameter Validation: Verify all inputs against strict schemas before execution.
  • 参数验证: 在执行前根据严格的模式验证所有输入。
  • Human Checkpoints: Require manual approval for any irreversible action.
  • 人工检查点: 对任何不可逆的操作要求人工审批。

The memory surface: when the whiteboard lies

记忆面:当“白板”撒谎时

Imagine a shared office whiteboard relied upon for daily decisions. If an outsider quietly rewrites one entry overnight, the team’s entire output shifts based on corrupted data. Persistent memory in an autonomous agent works exactly the same way. Control what the agent remembers, and you dictate its future actions across sessions and users. 想象一下一个用于日常决策的共享办公室白板。如果一个外人连夜悄悄重写了其中一个条目,整个团队的产出就会基于被篡改的数据而发生偏移。自主智能体中的持久化记忆工作原理完全相同。控制智能体记住的内容,你就控制了它在不同会话和用户中的未来行为。

What Defense Looks Like Here: 此处的防御措施:

  • Provenance Tracking: Securely log the source, context, and timestamp of every memory write.
  • 来源追踪: 安全地记录每次记忆写入的来源、上下文和时间戳。
  • Trust-Weighted Retrieval: Authenticated use.
  • 信任加权检索: 经过身份验证的使用。