AgentWall: A Runtime Safety Layer for Local AI Agents

The safety of autonomous AI agents is increasingly recognized as a critical open problem. As agents transition from passive text generators to active actors capable of executing shell commands, modifying files, calling APIs, and browsing the web, the consequences of unsafe or adversarially manipulated behavior become immediate and tangible.

Existing AI safety work has focused primarily on model alignment and input filtering, but these approaches do not address what happens at the moment an agent’s intent becomes a real action on a real machine. This gap is especially acute in local environments, where developers run agents against their own filesystems, credentials, and infrastructure with little runtime control.

This paper introduces AgentWall, a runtime safety and observability layer for local AI agents. AgentWall intercepts every proposed agent action before it reaches the host environment, evaluates it against an explicit declarative policy, requires human approval for sensitive operations, and records a complete execution trail for audit and replay.
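
The intercept, evaluate, approve, and record loop described above can be illustrated with a minimal Python sketch. This is not AgentWall's actual API or policy format: the names `Decision`, `Rule`, and `Gate`, the glob-based rule matching, the first-match-wins ordering, and the deny-by-default fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from fnmatch import fnmatchcase
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ASK = "ask"  # sensitive: defer to a human approver


@dataclass(frozen=True)
class Rule:
    """One declarative policy entry (hypothetical format)."""
    tool: str      # glob matched against the tool name
    target: str    # glob matched against the path, command, or URL
    decision: Decision


@dataclass
class Gate:
    """Checks every proposed action before it may reach the host."""
    rules: list[Rule]
    approve: Callable[[str, str], bool]   # human-approval callback
    default: Decision = Decision.DENY     # assumption: fail closed
    trail: list[dict] = field(default_factory=list)

    def evaluate(self, tool: str, target: str) -> Decision:
        # First matching rule wins, so specific rules must come first.
        for rule in self.rules:
            if fnmatchcase(tool, rule.tool) and fnmatchcase(target, rule.target):
                return rule.decision
        return self.default

    def check(self, tool: str, target: str) -> bool:
        decision = self.evaluate(tool, target)
        if decision is Decision.ASK:
            allowed = self.approve(tool, target)
        else:
            allowed = decision is Decision.ALLOW
        # Record every decision, allowed or not, for later audit.
        self.trail.append({
            "tool": tool,
            "target": target,
            "decision": decision.value,
            "allowed": allowed,
        })
        return allowed


# Example policy: protect SSH keys, allow other reads, ask before shell use.
policy = [
    Rule("read_file", "*/.ssh/*", Decision.DENY),
    Rule("read_file", "*", Decision.ALLOW),
    Rule("shell", "*", Decision.ASK),
]

# A non-interactive approver that rejects everything stands in for a real prompt.
gate = Gate(rules=policy, approve=lambda tool, target: False)
gate.check("read_file", "/home/u/project/README.md")
gate.check("read_file", "/home/u/.ssh/id_rsa")
gate.check("shell", "rm -rf /tmp/x")
```

In this sketch the caller would execute the action only when `check` returns `True`. Because denied and unapproved actions are appended to `trail` as well, the log reflects everything the agent attempted, not only what ran.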

AgentWall is implemented as a policy-enforcing Model Context Protocol (MCP) proxy and a native OpenClaw plugin, and it works across Claude Desktop, Cursor, Windsurf, Claude Code, and OpenClaw with a single install command. We present the design, architecture, threat model, and policy model of AgentWall, and report 92.9% policy enforcement accuracy with sub-millisecond overhead across 14 benchmark tests.
