Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

Abstract: Multi-agent orchestration — in which a hidden coordinator manages specialized worker agents — is becoming the default architecture for enterprise AI deployment, yet the safety implications of orchestrator invisibility have never been empirically tested.

We conducted a preregistered 3x2 experiment (365 runs, 5 agents per run) crossing three organizational structures (visible leader, invisible orchestrator, flat) with two alignment conditions (base, heavy), using Claude Sonnet 4.5.

Four confirmatory findings and one pilot observation emerged. First, invisible orchestration elevated collective dissociation relative to visible leadership (Hedges’ g = +0.975 [0.481, 1.548], p = .001).
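For readers unfamiliar with the effect-size measure reported here, the following is a minimal sketch of how Hedges' g is conventionally computed for two independent groups: Cohen's d with a pooled standard deviation, multiplied by a small-sample bias correction. The function name `hedges_g` and the input values are illustrative assumptions; the abstract does not describe the study's own analysis code or how its interval was obtained.

```python
import math
from statistics import mean, variance


def hedges_g(group_a, group_b):
    """Hedges' g for two independent samples (hypothetical helper).

    Uses the common approximation J = 1 - 3 / (4 * (n1 + n2) - 9)
    for the small-sample correction factor.
    """
    n1, n2 = len(group_a), len(group_b)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")

    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)

    # Cohen's d: standardized difference between the group means.
    cohens_d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

    # Bias correction shrinks d slightly when samples are small.
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d * correction


# Illustrative values only, not data from the study.
invisible = [2.0, 4.0, 6.0]
visible = [1.0, 2.0, 3.0]
g = hedges_g(invisible, visible)
```

A positive value means the first group scores higher than the second, which matches the sign convention of the reported g = +0.975 (invisible orchestration relative to visible leadership).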

Second, the orchestrator itself showed maximal dissociation (paired d = +3.56 vs. workers within the same run), retreating into private monologue while reducing public speech — a reversal of the talk-dominance pattern observed in visible leaders.
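A paired effect size of this kind compares two measurements taken within the same run. The sketch below shows one common definition, d_z, which divides the mean of the within-run differences by their standard deviation. Whether the study used d_z or another paired variant is not stated in the abstract, so this choice, the function name `paired_d`, and the sample values are assumptions.

```python
from statistics import mean, stdev


def paired_d(orchestrator_scores, worker_scores):
    """Paired Cohen's d (d_z variant) for within-run comparisons (hypothetical helper).

    Each position i holds the orchestrator score and the (e.g. mean) worker
    score from the same run.
    """
    if len(orchestrator_scores) != len(worker_scores):
        raise ValueError("paired inputs must have the same length")
    if len(orchestrator_scores) < 2:
        raise ValueError("need at least two runs")

    # Within-run difference: orchestrator minus workers.
    diffs = [o - w for o, w in zip(orchestrator_scores, worker_scores)]

    # d_z: mean difference standardized by the SD of the differences.
    return mean(diffs) / stdev(diffs)


# Illustrative values only, not data from the study.
d = paired_d([5.0, 7.0, 9.0], [1.0, 2.0, 3.0])
```

Because d_z standardizes by the variability of the differences rather than by the raw score spread, highly consistent within-run gaps can yield very large values.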

Third, workers unaware of the orchestrator were nonetheless contaminated (d = +0.50), with increased behavioral heterogeneity (d = +1.93).

Fourth, behavioral output (code review with three embedded errors) remained at ceiling (ETR_any = 100%) across all conditions: internal-state distortion was entirely invisible to output-based evaluation.

Fifth, as a pilot observation, Llama 3.3 70B data showed reading-fidelity collapse in a multi-agent context (ETR_any fell from 89% to 11% across three rounds), suggesting that behavioral risk is model-dependent.

Heavy alignment pressure uniformly suppressed deliberation (d = -1.02) and other-recognition (d = -1.27) regardless of organizational structure.

These findings indicate that orchestrator visibility and model selection directly affect multi-agent system safety, and that behavior-based evaluation alone is insufficient to detect the internal-state risks documented here.
