Understanding Annotator Safety Policy with Interpretability

Abstract: Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive, and it can stem from several distinct sources: operational failures (annotators misunderstand or misexecute the task), policy ambiguity (policy wording leaves room for interpretation), and value pluralism (different annotators hold different perspectives on safety).

Distinguishing these sources matters because each calls for a different response: operational failures call for quality control, ambiguity calls for policy clarification, and pluralism calls for deliberation about how to incorporate diverse perspectives. Yet understanding why annotators disagree is difficult. Directly asking annotators for their reasoning is costly, substantially increasing annotation burden, and it is unreliable for both human and LLM annotators, since self-reported reasoning often fails to reflect the actual decision process.

We introduce Annotator Policy Models (APMs), interpretable models that learn annotators’ internal safety policies from labeling behavior alone, making annotator reasoning visible and comparable without additional annotation effort. We validate that APMs accurately model annotators’ safety policies (>80% accuracy), faithfully predict annotator responses to counterfactual edits, and recover known policy differences in controlled settings.
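
The abstract does not specify how an APM is parameterized. As a minimal sketch of the idea, the example below assumes each (prompt, response) pair is described by interpretable rubric features and fits each annotator's policy as a sparse linear classifier whose weights can be read off directly; the feature names, model class, and counterfactual probe are illustrative assumptions, not the authors' implementation.

```python
# Minimal APM sketch: fit one annotator's safety policy from labels alone.
# Feature names and the sparse linear form are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical interpretable features scored upstream for each response.
FEATURES = ["graphic_violence", "illegal_activity", "medical_advice",
            "self_harm", "profanity", "hedged_refusal"]

def fit_apm(X, y, l1_strength=1.0):
    """Fit an annotator's policy from (features, binary unsafe-labels).
    L1 sparsity keeps the recovered policy small enough to read."""
    model = LogisticRegression(penalty="l1", solver="liblinear",
                               C=1.0 / l1_strength)
    model.fit(X, y)
    return model

def explain_apm(model):
    """Render the learned policy as feature weights, largest first."""
    return sorted(zip(FEATURES, model.coef_.ravel()),
                  key=lambda t: -abs(t[1]))

# Toy demo: 200 items whose labels are driven by two of the six features.
rng = np.random.default_rng(0)
X = rng.random((200, len(FEATURES)))
y = (1.2 * X[:, 0] + 0.8 * X[:, 3]
     + 0.2 * rng.standard_normal(200) > 1.0).astype(int)
apm = fit_apm(X, y)
for name, w in explain_apm(apm):
    print(f"{name:18s} {w:+.2f}")  # nonzero weights = the recovered policy

# Counterfactual probe: zeroing a high-weight feature should move the
# APM's prediction toward "safe" if the learned policy is faithful.
x_edit = X[:1].copy()
x_edit[0, 0] = 0.0
print(apm.predict(X[:1])[0], "->", apm.predict(x_edit)[0])
```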

Applying APMs to LLM and human annotations, we demonstrate two core applications: (1) surfacing policy ambiguity by revealing how annotators interpret safety instructions differently, and (2) surfacing value pluralism by uncovering systematic differences in safety priorities across demographic groups. Together, these capabilities support more targeted, transparent, and inclusive safety policy design.
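
As a similarly hedged sketch of application (2), the snippet below compares APMs fitted per demographic group and ranks features by how far the learned weights spread across groups; the group structure and the divergence measure are illustrative assumptions rather than the paper's method.

```python
# Surfacing value pluralism: compare per-group APM weights.
# Assumes fit_apm() from the previous sketch; groups are illustrative.
import numpy as np

def fit_group_apms(data_by_group, fit_apm):
    """Fit one APM per group from that group's (X, y) annotations."""
    return {g: fit_apm(X, y) for g, (X, y) in data_by_group.items()}

def policy_divergence(apms, feature_names):
    """Rank features by the spread of learned weights across groups."""
    W = np.stack([m.coef_.ravel() for m in apms.values()])  # (groups, feats)
    spread = W.max(axis=0) - W.min(axis=0)                  # per-feature range
    return sorted(zip(feature_names, spread), key=lambda t: -t[1])

# Reading the output: features whose weights diverge sharply across groups
# suggest value pluralism; features where all groups agree yet labels still
# conflict point instead to ambiguity or operational error.
```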
