Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Abstract: Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only a single holistic annotation per comparison, which introduces severe label noise: an image that excels in some dimensions but falls short in others is simply marked as the winner or the loser.

We show theoretically that compressing multi-dimensional preferences into binary labels produces conflicting gradient signals that misguide Direct Preference Optimization (DPO) for diffusion models. To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting pairs as noisy unlabeled data.
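
As an illustration of the clean/noisy split described above, the following Python sketch treats a pair as consistent when the annotated winner is at least as good as the loser on every dimension. The per-dimension scores, the dimension names, and the helper names `is_consistent` and `split_pairs` are assumptions made for this example; the abstract does not specify how consistency is measured.

```python
from typing import Dict, List, Tuple

# Hypothetical per-dimension scores for one image, e.g. {"aesthetics": 0.9, ...}.
Scores = Dict[str, float]
Pair = Tuple[Scores, Scores]  # (annotated winner, annotated loser)


def is_consistent(winner: Scores, loser: Scores) -> bool:
    """True when the annotated winner is at least as good on every dimension."""
    return all(winner[dim] >= loser[dim] for dim in winner)


def split_pairs(pairs: List[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into a clean (consistent) set and a noisy (conflicting) set."""
    clean: List[Pair] = []
    noisy: List[Pair] = []
    for winner, loser in pairs:
        target = clean if is_consistent(winner, loser) else noisy
        target.append((winner, loser))
    return clean, noisy
```

In this toy setup, a winner that leads on aesthetics but trails on detail fidelity lands in the noisy set, which is the kind of pair whose binary label can pull the model in opposing directions.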

Our method first trains on a consensus-filtered clean subset, then uses the resulting model as an implicit classifier to pseudo-label the noisy set, and refines the model iteratively on these pseudo-labels. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training.
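
The pseudo-labeling step can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes each image in a noisy pair already has a scalar implicit reward from the model trained on the clean subset (for DPO-style training, typically a scaled log-ratio between the policy and the reference model), converts the reward gap into a Bradley-Terry preference probability, and keeps only confident pairs. The `threshold` parameter, the tuple layout, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that image A is preferred over image B."""
    return sigmoid(reward_a - reward_b)


def pseudo_label(
    noisy_pairs: List[Tuple[str, float, str, float]],
    threshold: float = 0.8,  # assumed confidence cutoff, not taken from the paper
) -> List[Tuple[str, str]]:
    """Return (winner_id, loser_id) for pairs the implicit classifier is confident about.

    Each input item is (id_a, implicit_reward_a, id_b, implicit_reward_b).
    Pairs with a preference probability between 1 - threshold and threshold are skipped.
    """
    relabeled: List[Tuple[str, str]] = []
    for id_a, reward_a, id_b, reward_b in noisy_pairs:
        p = preference_prob(reward_a, reward_b)
        if p >= threshold:
            relabeled.append((id_a, id_b))
        elif p <= 1.0 - threshold:
            relabeled.append((id_b, id_a))
    return relabeled
```

In an iterative scheme, the confident pairs returned here would be added to the training set, the model retrained, and the remaining noisy pairs scored again in the next round.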

We will release our code and models at: this https URL
