Macro-Action Based Multi-Agent Instruction Following through Value Cancellation
Title: Macro-Action Based Multi-Agent Instruction Following through Value Cancellation
Abstract: Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions.
We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by cancelling the value contribution of the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy.
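To make the distinction from reward shaping concrete, the sketch below contrasts a naive Bellman target, which bootstraps under whatever instruction becomes active next, with a boundary-corrected target that bootstraps under the current objective. All names here (`value_fn`, `gamma`, the function signatures) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
def naive_td_target(reward, next_state, cur_instr, next_instr,
                    value_fn, gamma=0.99):
    """Standard backup: bootstraps under the next instruction context,
    coupling value estimates across instruction contexts."""
    return reward + gamma * value_fn(next_state, next_instr)


def corrected_td_target(reward, next_state, cur_instr, next_instr,
                        value_fn, gamma=0.99):
    """Boundary-corrected backup (in the spirit of MAVIC, as an
    assumed sketch): at an instruction switch, cancel the incoming
    objective's value and bootstrap under the current objective,
    preserving the macro-action's continuation value."""
    instr = cur_instr if next_instr != cur_instr else next_instr
    return reward + gamma * value_fn(next_state, instr)
```

With a toy tabular `value_fn`, the two targets agree when the instruction is unchanged and diverge exactly at a switch, which is where the abstract locates the inconsistency.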
We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Cite as: arXiv:2605.12655 [cs.AI]