After Orthogonality: Virtue-Ethical Agency and AI Alignment


Preface

This essay argues that rational people don’t have goals, and that rational AIs shouldn’t have goals. Human actions are rational not because we direct them at some final ‘goals,’ but because we align actions to practices[1]: networks of actions, action-dispositions, action-evaluation criteria, and action-resources that structure, clarify, develop, and promote themselves.

If we want AIs that can genuinely support, collaborate with, or even comply with human agency, AI agents’ deliberations must share a “type signature” with the practices-based logic we use to reflect and act. I argue that these issues matter not just for aligning AI to grand ethical ideals like human flourishing, but also for aligning AI to core safety-properties like transparency, helpfulness, harmlessness, or corrigibility.

Concepts like ‘harmlessness’ or ‘corrigibility’ are unnatural — brittle, unstable, arbitrary — for agents who’d interpret them in terms of goals or rules, but natural for agents who’d interpret them as dynamics in networks of actions, action-dispositions, action-evaluation criteria, and action-resources. While the issues this essay tackles tend to sprawl, one theme that reappears over and over is the relevance of the formula ‘promote x x-ingly.’

I argue that this formula captures something important about both meaningful human life-activity (art is the artistic promotion of art, romance is the romantic promotion of romance) and real human morality (to care about kindness is to promote kindness kindly, to care about honesty is to promote honesty honestly).

I start by asking: What follows for AI alignment if we take the concept of eudaimonia — active, rational human flourishing — seriously? I argue that the concept of eudaimonia doesn’t simply point to a desired state or trajectory of the world that we should set as an AI’s optimization target, but rather points to a structure of deliberation different from standard consequentialist[2] rationality.

I then argue that this form of rational activity and valuing, which I call eudaimonic rationality[3], is a useful or even necessary framework for the agency and values of human-aligned AIs. These arguments are based both on the dangers of a “type mismatch” between human flourishing as an optimization target and consequentialist optimization as a form, and on certain material advantages that eudaimonic rationality plausibly possesses, compared to deontological and consequentialist agency, with regard to stability and safety.

The concept of eudaimonia, I argue, suggests a form of rational activity without a strict distinction between means and ends, or between ‘instrumental’ and ‘terminal’ values. In this model of rational activity, a rational action is an element of a valued practice in roughly the same sense that a note is an element of a melody, a time-step is an element of a computation, and a moment in an organism’s cellular life is an element of that organism’s self-subsistence and self-development.[4]

My central claim is that our intuitions about the nature of human flourishing are implicitly intuitions that eudaimonic rationality can be functionally robust in a sense highly critical to AI alignment. More specifically, I argue that in light of our best intuitions about the nature of human flourishing, it’s plausible that eudaimonic rationality is a natural form of agency, and that eudaimonic rationality is effective even by the lights of certain consequentialist approximations of its values.

I then argue that if our goal is to align AI in support of human flourishing, and if it is furthermore plausible that eudaimonic rationality is natural and efficacious, then many classical AI safety considerations and ‘paradoxes’ of AI alignment speak in favor of trying to instill AIs with eudaimonic rationality.

Throughout this essay, I will sometimes explicitly and often implicitly be asking whether some form of agency or rationality or practice is natural. The sense of ‘natural’ I’m calling on is certainly related to the senses used in various virtue-ethical traditions, but the interest I take in it is less immediately normative and more material or technical.

While I have no reductive definition at hand, the intended meaning of ‘natural’ is related to stability, coherence, relative non-contingency, ease of learning, lower algorithmic complexity, convergent cultural evolution, hypothetical convergent cultural evolution across different hypothetical rational-animal species, potential convergent evolution between humans and neural-network-based AI, and targetability by ML training processes.

While I will also make many direct references to AI alignment, this question of material naturalness is where the real alignment-critical action takes place: if we learn that certain exotic-sounding forms of agency, rationality, or practice are both themselves natural and make the contents of our all-too-human values natural in turn, then we have learned about good, relatively safe, and relatively easy targets for AI alignment.

Readers may find the following section-by-section overview useful for navigating the essay: Part I presents a class of cases of rational deliberation that are very different from the Effective Altruism-style optimization[5] many in the AI-alignment world treat as the paradigm of rational deliberation. I call this class of rational deliberations ‘eudaimonic rationality,’ and identify it with the form of rationality that guides a mathematician or an artist or a friend when they reflect on what to do in mathematics or in art or in friendship.

Part II looks at the case of research mathematics (via an account by Terry Tao) as an example of eudaimonic rationality at work. What does a mathematician try to do in math? I say she tries to be mathematically excellent, which involves promoting mathematical excellence through mathematical excellence, and that this structure is closely related to why ‘mathematical excellence’ can even be a concept.