Direct Preference Optimization (DPO) Beyond Chatbots

Using Rejection Pairs From Your Model’s Own Failures

In April, we released DharmaOCR, our specialized structured OCR model (available on Hugging Face), along with a paper detailing the methodology behind it and a benchmark demonstrating its superior quality and cost efficiency.

The paper benchmarked leading vision-language model families - both open-source and commercial - on a structured document extraction task: OCR on Brazilian Portuguese text. Among the reported metrics was the text degeneration rate: the frequency with which a model produces a repetition loop instead of a transcription.
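
The paper's exact detection criterion is not reproduced here. As a minimal sketch, assuming an output counts as degenerated when it ends in one short chunk repeated many times in a row, the rate could be measured as below. The names `is_degenerate` and `degeneration_rate`, and the thresholds `max_period` and `min_repeats`, are hypothetical choices for illustration.

```python
def is_degenerate(text: str, max_period: int = 50, min_repeats: int = 8) -> bool:
    """Heuristic: True if `text` ends in one chunk repeated `min_repeats` times.

    Hypothetical criterion for illustration; not the definition used in the paper.
    """
    text = text.rstrip()
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(text) < needed:
            break  # longer periods cannot fit either
        unit = text[-period:]
        if text[-needed:] == unit * min_repeats:
            return True
    return False


def degeneration_rate(outputs: list[str]) -> float:
    """Fraction of outputs flagged as repetition loops."""
    if not outputs:
        return 0.0
    return sum(is_degenerate(o) for o in outputs) / len(outputs)


outputs = ["Nome: Ana Souza", "Total: 10 " + "0,00 " * 20]
print(degeneration_rate(outputs))
```

A suffix check like this also catches loops cut off mid-repetition by the token limit, but it will flag legitimate repeats such as long runs of leader dots, so the thresholds would need tuning on real outputs.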

Across the tested open-source families, vanilla degeneration rates ranged from below 1% to above 33%. Supervised fine-tuning (SFT) reduced those rates for most models - but rarely to production-acceptable levels. The pattern points to a structural limitation: SFT optimizes for correct outputs, but does not explicitly penalize degeneration. There appears to be a ceiling on how much task-focused fine-tuning alone can reduce this failure mode.

A second training stage - applied after supervised fine-tuning (SFT), on the same documents, using the same model - reduced text degeneration relative to the SFT result in every family tested. No exceptions. Average reduction: 59.4%. Best case: 87.6%.

Figure 1: DPO reduced degeneration relative to SFT in every family tested - average reduction of 59.4%, peak of 87.6% (Nanonets-OCR2-3B: 1.61% to 0.20%). The direction is invariant; only the magnitude varies.

That second stage was Direct Preference Optimization (DPO).
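
The reductions quoted above are relative, not absolute. A short check, using only the two rates given in Figure 1 (the function name `relative_reduction` is ours), shows how the 87.6% figure follows from them:

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Relative reduction, in percent, between two rates that are given in percent."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return (before_pct - after_pct) / before_pct * 100.0


# Nanonets-OCR2-3B, SFT -> DPO, using the rates quoted in Figure 1.
print(f"{relative_reduction(1.61, 0.20):.1f}%")  # 87.6%
```

In absolute terms the same change is 1.41 percentage points, which is why both the starting rate and the relative reduction are worth reporting together.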

Almost all published DPO applications target chat alignment - models trained on human judgments about helpfulness or harmlessness. OCR carries none of that subjectivity: the task is objective, and there is no conversational context. There is, however, a clear preference signal. A correct transcription is chosen; a degeneration loop is rejected.

DharmaOCR used that binary to construct a DPO training set, testing the technique not for alignment, but as a direct mitigation tool for a specific failure mode. The training signal came from the model itself - specifically from the outputs it produced when it failed. How a failure mode becomes a training signal is a structural question about the failure, not the model.
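
The article does not spell out the exact pipeline, so the following is only a sketch of how such a set could be assembled, assuming each document has a reference transcription and a stored model output. The `prompt`/`chosen`/`rejected` layout mirrors the convention common DPO tooling expects; `PreferencePair`, `build_pairs`, and the stand-in failure check `last_words_identical` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # identifier or prompt for the document image
    chosen: str    # reference transcription
    rejected: str  # the model's own degenerated output


def last_words_identical(text: str, n: int = 10) -> bool:
    """Stand-in failure check: the last `n` whitespace-separated words are all the same."""
    words = text.split()
    return len(words) >= n and len(set(words[-n:])) == 1


def build_pairs(samples, is_failure=last_words_identical):
    """Keep only samples whose model output failed, pairing it with the reference.

    `samples` is an iterable of (prompt, reference, model_output) tuples.
    """
    return [
        PreferencePair(prompt=p, chosen=ref, rejected=out)
        for p, ref, out in samples
        if is_failure(out)
    ]


samples = [
    ("doc1", "A B C", "A B C"),                      # correct output, no pair
    ("doc2", "Nome: Ana", "Nome " + "Ana " * 12),    # loop, becomes a rejected sample
]
pairs = build_pairs(samples)
print(len(pairs), pairs[0].prompt)
```

Only documents on which the model actually looped contribute pairs, so the size of the preference set depends on how often the SFT checkpoint fails.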

The Loop Survives Fine-Tuning

Why SFT has a ceiling on degeneration is still an open question - but the leading conjecture points to loss granularity. SFT trains token by token: each prediction is scored on its own, conditioned on the reference prefix, and a repetition loop is never penalized as a completion-level failure. DPO inverts that logic. The training signal is the full output - chosen or rejected - which means a degenerated completion can be explicitly labeled as the wrong outcome, not just a sequence of locally probable tokens.
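
The completion-level nature of the signal is visible in the standard DPO objective. For a prompt x, chosen output y_w, and rejected output y_l, it minimizes -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where every log-probability is summed over the whole output. The sketch below evaluates that loss for one pair in plain Python; the log-probability values and beta = 0.1 are made-up illustrations, not settings from the DharmaOCR paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from whole-sequence log-probabilities.

    Each argument is the sum of token log-probs for a complete output, so the
    comparison is made per completion rather than per token.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has raised the chosen output and lowered the rejected loop.
improved = dpo_loss(-5.0, -20.0, -10.0, -10.0)
print(baseline, improved)
```

The loss falls only when the policy shifts probability from the rejected completion toward the chosen one relative to the reference model, which is the explicit penalty that token-level likelihood lacks.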

When a training objective maximizes the likelihood of observed sequences, it concentrates probability mass in the regions of distribution space those sequences occupy. A model that enters one of those high-probability attractor regions during inference assigns elevated probability to the same token at the next step - which increases the probability further, which sustains the loop until the sequence hits the maximum token limit.

Text degeneration is the output of this geometry: a self-reinforcing repetition loop that an autoregressive model cannot exit without external intervention. It is not purely a decoding artifact. The attractor involves the training objective, the learned distribution, and how probability mass concentrates during inference - a systems-level failure rather than a failure localized to any single component.
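
The feedback described above can be illustrated with a deliberately simplified toy, not a model of any real network. It assumes that each consecutive repeat adds a fixed amount (`gain`, a made-up parameter) to the repeated token's logit while every other token stays at zero, and then reads off the softmax probability of repeating once more.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def repeat_probability(run_length, vocab_size=100, gain=1.5):
    """Probability of emitting the same token again after `run_length` repeats.

    Toy assumption: every consecutive repeat adds `gain` to that token's logit,
    while all other tokens stay at logit 0.
    """
    logits = [0.0] * vocab_size
    logits[0] = gain * run_length
    return softmax(logits)[0]


for run in range(7):
    print(run, round(repeat_probability(run), 3))
```

Under these assumptions the repeat probability starts at chance level and climbs toward 1 within a handful of steps, after which sampling almost never leaves the loop.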

The geometry of this failure is visible at the token level.

Figure 2: When a token dominates its own conditional distribution, every sampling step deepens the attractor.

The decoder samples from this geometry; it does not determine it. Inference-layer interventions - repetition penalties, temperature adjustments, early-abort logic - operate on the sampling step. They contain the symptom without touching the distribution that produces it. The attractor persists.
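
As an example of what operating on the sampling step means, here is a minimal sketch of a repetition penalty in the widely used form: positive logits of already-generated tokens are divided by the penalty and negative ones are multiplied by it. The function name and the value 1.2 are illustrative assumptions, not settings from the DharmaOCR benchmark.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens made less likely.

    Positive logits are divided by `penalty`, negative ones multiplied by it.
    The input list is left untouched.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


step_logits = [2.4, -1.0, 0.5]   # model output for one decoding step
history = [0, 1, 1]              # token ids generated so far
print(apply_repetition_penalty(step_logits, history))
```

The adjustment is applied to the logits of a single decoding step. The model's weights, and therefore the distribution that produced the dominant token, are the same before and after.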

Supervised fine-tuning moves the distribution closer to the task domain. For a structured generation pipeline, this means training on domain-specific documents, in the target language, with the required output format. The model gains fluency with longer sequences, constrained syntax, and domain vocabulary. What SFT does not do is attack degeneration directly. Its objective - maximizing the likelihood of observed sequences - has no term that penalizes repetition loops. The failure mode is simply outside the scope of what the training signal optimizes for.

One model family in the DharmaOCR benchmark showed an unexpected pattern: vanilla degeneration rate of 0.60%, rising to 3.23% after SFT, before a subsequent DPO stage brought it to 1.41%. It is a single data point - an exception, not a rule - and it would be overstating the evidence to treat it as proof of a mechanism. What it does illustrate is that SFT does not reliably reduce degeneration. Capability and degeneration resistance can move independently.

The distinction matters structurally. SFT and DPO are not interchangeable training stages performing the same operation at different intensities. SFT closes the distance between the model’s prior distribution and the task domain. What it does not do is target degeneration as an objective - its effect on the failure mode is incidental, and the benchmark results show it is not consistent. The attractor that produces degeneration is not a problem with the model’s proximity to the task - it is a problem with the shape of the distribution space the model now occupies. Addressing that geometry requires a training signal built…