A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry

Abstract: We study runtime human oversight of an AI agent when private information runs in both directions: the human privately knows her reward function, while the AI privately knows the quality of the action it proposes. This is the kind of asymmetry that arises naturally when an autonomous robot or software agent has inspected a situation its human supervisor cannot directly assess.

Building on Cooperative Inverse Reinforcement Learning (CIRL) and the Oversight Game, we introduce a contextual-bandit team game with two-sided asymmetric information and a play/ask/trust/oversee interface. The bandit structure removes physical state transitions and thereby yields exact one-shot characterizations that would remain conjectural in the full POMDP setting, though the common belief remains a dynamically controlled state across rounds.

We give two one-shot characterizations, a team optimum and a behaviorally natural myopic rule, whose gap is a slab of avoidable harm: a region in which the AI privately knows the proposed action is harmful and shutdown would help, yet a myopic human, trusting her prior, declines to oversee. We show this gap is the price of non-credible oversight communication, and give a partial analysis of how it resolves dynamically over repeated rounds through passive learning and active signaling with a one-period-lagged oversight response.
