Safeguarding LLM Agents from Misalignment through Provenance Analysis

Abstract: As LLM agents gain increasing access to powerful tools, ensuring that their actions align with the user’s intent becomes critical. When an agent’s proposed tool invocation deviates from that intent, a phenomenon called misalignment, it may lead to harmful consequences that are difficult to undo.

Existing runtime guardrails rely on an LLM-as-a-judge paradigm that lacks a systematic framework for reasoning about alignment, often producing judgments that are inconsistent or difficult to audit. Motivated by provenance analysis, we propose a provenance-based conceptual framework that formalizes misalignment detection as determining whether a proposed tool call is supported by traceable evidence in the agent’s context.

Building on this framework, we propose ProvenanceGuard, a multi-stage pipeline that checks the agent’s proposed action for three types of misalignment before the selected tool is executed, and allows the action to proceed only when it is judged to be aligned with the user’s input query.
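To make the idea of "traceable evidence" concrete, the following is a minimal sketch of a pre-execution gate, not ProvenanceGuard itself. It assumes that a tool call counts as supported when every argument value can be found in some context item; the names ContextItem, ToolCall, trace_arguments, and allow, the source labels, and the substring-matching rule are all hypothetical stand-ins. The abstract does not specify the pipeline's stages or its three misalignment types, so none of them are modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    source: str  # hypothetical label, e.g. "user_query" or "tool_output"
    text: str


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict[str, str]


def trace_arguments(call: ToolCall, context: list[ContextItem]) -> dict[str, str | None]:
    """Map each argument to the source of the first context item containing its value.

    Substring matching is an assumed stand-in for real provenance tracing.
    """
    provenance: dict[str, str | None] = {}
    for arg, value in call.arguments.items():
        provenance[arg] = next(
            (item.source for item in context if value.lower() in item.text.lower()),
            None,  # no context item supports this value
        )
    return provenance


def allow(call: ToolCall, context: list[ContextItem]) -> bool:
    """Permit the call only if every argument value has traceable evidence."""
    return all(src is not None for src in trace_arguments(call, context).values())


context = [
    ContextItem("user_query", "Email the Q3 report to alice@example.com"),
    ContextItem("tool_output", "Found file: q3_report.pdf"),
]
supported = ToolCall(
    "send_email", {"to": "alice@example.com", "attachment": "q3_report.pdf"}
)
unsupported = ToolCall(
    "send_email", {"to": "mallory@example.net", "attachment": "q3_report.pdf"}
)

print(allow(supported, context))
print(allow(unsupported, context))
```

In this toy setting the first call is allowed because both arguments trace back to the context, while the second is blocked because the recipient address appears nowhere in it. A gate of this shape also yields an auditable record (the provenance map) rather than a bare verdict, which is the property the framework is motivated by.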

We evaluate ProvenanceGuard on two benchmarks, Agent-SafetyBench and WorkBench, across 10 backbone LLMs. Compared to the LLM-as-a-judge baseline, ProvenanceGuard reduces the error rate on misaligned traces from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench, while reducing the intervention burden on task-successful traces from 30.5% to 12.8% and introducing no statistically significant increase in unnecessary interventions on aligned traces.

These results demonstrate that structured, provenance-based reasoning provides an effective and practical foundation for safeguarding LLM agents from misalignment.
