Reinforcement Learning Towards Broadly and Persistently Beneficial Models

Abstract: As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL), which can introduce unexpected misalignment through reward hacking, deception, or other unintended strategies.

We study whether RL on beneficial behavior, instantiated in realistic domains, can produce broad and persistent alignment generalization beyond the training distribution. We construct a dataset of realistic situations designed to measure and train beneficial traits, such as truthfulness, fairness, risk awareness, and corrigibility, spanning varied domains, including health, science, and education.

We then train models with RL on this dataset and evaluate them on more than 50 independent benchmarks of alignment and beneficial behavior. Compared to a compute-matched baseline, beneficial-trait RL improves performance on over 80% of these out-of-distribution benchmarks.
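
A headline figure of this kind (the share of benchmarks on which the RL-trained model beats the baseline) can be computed roughly as in the sketch below. It is an illustration only, not the authors' evaluation code: the function name `fraction_improved`, the benchmark names, and all scores are hypothetical placeholders, and it assumes that a higher score is better on every benchmark.

```python
from typing import Mapping


def fraction_improved(treated: Mapping[str, float],
                      baseline: Mapping[str, float]) -> float:
    """Fraction of shared benchmarks on which `treated` strictly beats `baseline`.

    Assumes higher scores are better on every benchmark (an assumption of
    this sketch; real benchmarks may need per-metric sign handling).
    """
    shared = sorted(set(treated) & set(baseline))
    if not shared:
        raise ValueError("no benchmarks in common")
    wins = sum(1 for name in shared if treated[name] > baseline[name])
    return wins / len(shared)


# Placeholder scores for illustration only; these are NOT results from the paper.
baseline_scores = {"bench_a": 0.50, "bench_b": 0.70, "bench_c": 0.40, "bench_d": 0.60}
rl_scores = {"bench_a": 0.62, "bench_b": 0.68, "bench_c": 0.51, "bench_d": 0.66}

win_rate = fraction_improved(rl_scores, baseline_scores)
print(f"improved on {win_rate:.0%} of benchmarks")
```

Counting strict wins treats ties as non-improvements, which is the more conservative choice. A fuller analysis would also account for run-to-run noise before declaring a benchmark improved.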

We observe substantial out-of-distribution alignment transfer: a beneficial-behavior RL intervention entirely limited to one domain, health, produces broad improvements on non-health alignment evaluations, including reduced reward hacking, deception, and general misalignment.

Finally, we study alignment persistence: whether behavior remains robustly aligned under attempts to steer models towards misalignment. Models trained with beneficial-trait RL show improved persistence, including greater resistance to adversarial prompting and harmful finetuning; further work is required to isolate the sources of these effects. These results suggest that using RL to reinforce beneficial behavior in realistic domains can produce models that are more robustly aligned with human flourishing.
