TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
Abstract: Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO is stable and RL-free, it treats preferences as flat winner vs. loser signals and is sensitive to noisy or brittle preferences arising from fragile chains of thought.
We propose TUR-DPO, a topology- and uncertainty-aware variant of DPO that rewards how answers are derived, not only what they say. It elicits lightweight reasoning topologies and combines semantic faithfulness, utility, and topology quality into a calibrated uncertainty signal. A small learnable reward is factorized over these signals and incorporated into an uncertainty-weighted DPO objective that remains RL-free and relies only on a fixed or moving reference policy.
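The uncertainty-weighted objective can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the product used to combine the three signals into a confidence weight, and the per-pair scalar inputs are all assumptions made for clarity. The core term is the standard DPO logistic loss on the implicit-reward margin, scaled by a calibrated confidence so that noisy preference pairs contribute smaller gradients.

```python
import math

def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO implicit-reward margin between the winning and
    losing responses, measured against a reference policy."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

def tur_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 faithfulness, utility, topology, beta=0.1):
    """Hypothetical uncertainty-weighted DPO loss for one preference pair.

    faithfulness, utility, topology: scores in [0, 1] for the pair.
    Combining them by a simple product is an illustrative choice;
    TUR-DPO's actual calibrated combination is not reproduced here.
    """
    confidence = faithfulness * utility * topology
    margin = dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    # -log sigmoid(margin), down-weighted when the pair looks unreliable.
    return -confidence * math.log(1.0 / (1.0 + math.exp(-margin)))
```

Under this sketch, a pair whose reasoning topology and faithfulness scores are low is down-weighted, so brittle chains of thought perturb training less than in vanilla DPO.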
Empirically, across open 7-8B models and benchmarks spanning mathematical reasoning, factual question answering, summarization, and helpful/harmless dialogue, TUR-DPO improves judge win-rates, faithfulness, and calibration relative to DPO while preserving training simplicity and avoiding online rollouts. We further observe consistent gains in multimodal and long-context settings, and show that TUR-DPO matches or exceeds PPO on reasoning-centric tasks while maintaining operational simplicity.