Constructive Alignment: Governing Preference Dynamics in Human-AI Interaction

Abstract: Most approaches to AI alignment treat human preferences as fixed targets to be inferred and optimized. This assumption conflicts with extensive empirical evidence showing that preferences are layered, dynamic, and constructed through interaction—particularly with adaptive technologies.

As AI systems become more persistent, personalized, and socially embedded, they increasingly participate in shaping what people attend to, value, and endorse over time. We introduce Constructive Alignment, a paradigm that reframes alignment as a control problem over evolving human preference trajectories rather than static preference satisfaction.

Drawing on behavioral economics, psychology, and constructivist social theory, we model preferences as layered state variables that evolve under interaction with AI systems. We formalize this view using a control-theoretic framework in which system actions and interaction design jointly influence both world states and human evaluative states.
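
The abstract states this framework only in words, so the following is a toy sketch of one way to read it, not the authors' model. It assumes a scalar world state, two preference layers (a fast "surface" layer and a slow "core" layer), and linear update rules; the names `State`, `step`, `rollout`, `fast_rate`, and `slow_rate` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class State:
    world: float         # world state
    surface_pref: float  # fast preference layer (assumed): momentary tastes
    core_value: float    # slow preference layer (assumed): held values


def step(s: State, action: float, design: float,
         fast_rate: float = 0.5, slow_rate: float = 0.05) -> State:
    """Advance one interaction step. All dynamics here are illustrative assumptions."""
    # The system action changes the world state.
    world = s.world + action
    # The fast layer is pulled toward the new world state; the interaction
    # design parameter (0 = no influence, 1 = full influence) scales the pull.
    surface = s.surface_pref + fast_rate * design * (world - s.surface_pref)
    # The slow layer drifts toward the fast layer at a much lower rate.
    core = s.core_value + slow_rate * (surface - s.core_value)
    return State(world, surface, core)


def rollout(s0: State, actions: List[float], design: float) -> List[State]:
    """Return the full trajectory of world and preference states."""
    trajectory = [s0]
    for a in actions:
        trajectory.append(step(trajectory[-1], a, design))
    return trajectory


if __name__ == "__main__":
    start = State(world=0.0, surface_pref=0.0, core_value=0.0)
    for s in rollout(start, actions=[1.0, 1.0, 1.0], design=1.0):
        print(s)
```

In this sketch the same sequence of actions produces different preference trajectories depending on `design`: with `design=0.0` the evaluative layers do not move at all, while with `design=1.0` both layers drift, the core layer more slowly than the surface layer. That is the sense in which actions and interaction design jointly shape both world states and evaluative states.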

We argue that alignment is not primarily about controlling AI behavior, but about regulating how AI systems influence the evolution of human preferences—ensuring that value trajectories remain coherent, reflectively endorsed, epistemically grounded, bounded against manipulation, and empowering under uncertainty. Alignment thus becomes a problem of governing long-term value formation rather than simply satisfying static preferences.
