Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems

AI 智能体间的信任：衡量信任的建立、破裂与恢复，及其对多智能体系统治理的启示

Abstract: As language-model agents increasingly work in teams, each agent must decide how much to trust its teammates. Yet we lack a standard way to measure trust between AI agents. We propose a behavioral measure based on costly verification. In a cooperative survival game, checking a teammate’s work consumes resources, while trusting a wrong answer can be fatal. Relative to a memoryless version of the same model, reduced verification provides an observable measure of trust.

摘要： 随着语言模型智能体越来越多地以团队形式协作，每个智能体都必须决定在多大程度上信任其队友。然而，目前我们缺乏一种衡量 AI 智能体间信任的标准方法。我们提出了一种基于“代价验证”的行为衡量指标。在一项合作生存游戏中，检查队友的工作会消耗资源，而盲目信任错误的答案则可能导致致命后果。与同一模型的无记忆版本相比，验证行为的减少提供了一种可观测的信任度量方式。

Using this framework, we study trust formation, breakage, and recovery across six frontier model snapshots. When paired with a consistently reliable teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduce verification by roughly 60-85%, whereas two smaller snapshots show little or no such adjustment. Failures reverse this discount, but models differ in how they respond. Some concentrate renewed scrutiny on the culprit, while others become more cautious toward the entire team.

利用这一框架，我们研究了六个前沿模型快照在信任建立、破裂和恢复方面的表现。当与表现始终可靠的队友配对时，四个模型快照（Claude Opus 4.6、Claude Sonnet 4.6、GPT-5.1 和 Gemini 3.1 Pro）将验证频率降低了约 60-85%，而两个较小的模型快照则几乎没有表现出这种调整。失败会逆转这种信任折扣，但不同模型在应对方式上存在差异。一些模型会将重新审视的重点集中在“肇事者”身上，而另一些模型则会对整个团队变得更加谨慎。

Recovery is slower than formation, and clustered failures sustain suspicion far longer than the same number of failures spread apart. These differences have practical consequences. Models that form trust verify less, decide more quickly, and achieve higher payoffs in our environment. By contrast, persistent over-verification is associated with indecision rather than safety. Our results show that trust dispositions can be measured before deployment and suggest that calibration, rather than maximal suspicion, should be the central concern in the governance of multi-agent AI systems.

信任的恢复速度慢于建立速度，且集中发生的失败比分散发生的同等数量的失败会带来更持久的怀疑。这些差异具有实际意义。在我们的实验环境中，能够建立信任的模型验证次数更少、决策速度更快，并能获得更高的收益。相比之下，持续的过度验证往往与优柔寡断而非安全性相关。我们的研究结果表明，信任倾向可以在部署前进行测量，并建议在多智能体 AI 系统的治理中，核心关注点应是“校准”而非“最大化怀疑”。