What Do People Actually Want From AI? Mapping Preference Plurality
Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people’s preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons.
Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%.
Furthermore, the same words hide divergent meanings: when people describe what they mean by “truthfulness”, they reveal distinct, potentially incompatible, epistemological bases. Some ask for sourced claims, some for expert opinions, and some even for unpopular views.
Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some respondents desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do “by default” versus “if requested”) that binary comparisons cannot capture.
These findings expose fundamental problems in current alignment practices. When 49% of respondents request truthfulness but define it differently, a single reward model is unlikely to capture that variation. The persistence of high hallucination rates in well-funded models, despite users’ clear demands for accuracy, suggests that current methods fail to identify actual preferences.
This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.