Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Abstract: Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only a single, holistic annotation, which results in severe label noise: an image that excels in some dimensions but is deficient in others is labeled simply as the winner or the loser.

We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data.
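As a rough illustration of the distinction between consistent and conflicting pairs, the sketch below splits preference pairs by whether the labeled winner is at least as good as the loser on every dimension. The abstract does not say how consistency is measured, so the per-dimension scores, the names `is_consistent` and `split_by_consensus`, and the numbers are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: dimension name -> (winner score, loser score).
# The dimension names follow the abstract; the scores are made up.
Pair = Dict[str, Tuple[float, float]]


def is_consistent(pair: Pair) -> bool:
    """True when the labeled winner is at least as good on every dimension."""
    return all(winner >= loser for winner, loser in pair.values())


def split_by_consensus(pairs: List[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Return (clean, noisy): consistent pairs and conflicting pairs."""
    clean = [p for p in pairs if is_consistent(p)]
    noisy = [p for p in pairs if not is_consistent(p)]
    return clean, noisy


pairs = [
    # Winner is better on all three dimensions: treated as clean.
    {"aesthetics": (0.9, 0.4), "detail": (0.8, 0.5), "alignment": (0.7, 0.6)},
    # Winner is worse on detail fidelity: treated as noisy.
    {"aesthetics": (0.9, 0.4), "detail": (0.3, 0.8), "alignment": (0.6, 0.5)},
]
clean, noisy = split_by_consensus(pairs)
```

In the second pair, a single binary label pushes the model toward an image that is worse on detail fidelity, which is the kind of conflicting signal the paragraph above describes.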

Our method first trains on a consensus-filtered clean subset, then uses the resulting model as an implicit classifier to generate pseudo-labels for the noisy set, which drive iterative refinement. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training.
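The pseudo-labeling step can be sketched as follows, assuming the standard Diffusion-DPO implicit reward, in which an image scores higher when the trained model denoises it with lower error than the frozen reference model. The abstract does not give the exact classifier or selection rule, so the function names, the `beta` scale, and the confidence `threshold` are illustrative assumptions rather than the authors' settings.

```python
import math
from typing import Optional


def implicit_reward(err_policy: float, err_ref: float, beta: float = 1.0) -> float:
    """Implicit reward from denoising errors (assumed Diffusion-DPO form).

    A lower error than the reference model gives a higher reward.
    """
    return -beta * (err_policy - err_ref)


def pseudo_label(reward_a: float, reward_b: float,
                 threshold: float = 0.8) -> Optional[str]:
    """Return "a" or "b" when the model is confident, otherwise None.

    The threshold is a hypothetical confidence cutoff, not a value
    taken from the paper.
    """
    # Bradley-Terry probability that image a is preferred over image b.
    p_a = 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))
    if p_a >= threshold:
        return "a"
    if 1.0 - p_a >= threshold:
        return "b"
    return None  # too uncertain: leave the pair unlabeled this round


# Made-up denoising errors for one noisy pair.
r_a = implicit_reward(err_policy=0.10, err_ref=0.30, beta=10.0)
r_b = implicit_reward(err_policy=0.25, err_ref=0.20, beta=10.0)
label = pseudo_label(r_a, r_b)
```

Pairs that receive a confident label would join the training set for the next round, while pairs that return `None` stay in the noisy pool.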

We will release our code and models at: this https URL
