Re-Centering Humans in LLM Personalization

Abstract: Despite growing interest in personalization, most evaluations of the personalization abilities of large language models (LLMs) have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data.

We collect 550 human conversations, along with human judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919 judgments), and incorporating relevant attributes into a personalized response (1,101 judgments).

Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments about which attributes are relevant, and generate personalized responses that humans judge to be no better than generic responses, even though LLM judges widely rate them as better.

We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly.

Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.
