The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

Abstract: The mechanisms behind LLMs' broad over-generalization beyond their training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on a narrow task induces broad misalignment on semantically unrelated test domains. In this work, we propose the Piggyback Hypothesis: chat-template tokens can piggyback the finetuned behavior onto out-of-domain queries.

We validate this hypothesis by showing that subtly perturbing the prefix (the tokens preceding every user query), or patching the prefix representations with those from the unfinetuned model, can restore alignment without changing the user query. Building on this finding, we propose Token-Regularized Finetuning (TReFT), which regularizes specific token representations during training to mitigate EM.
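
To make the two ideas above concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the abstract does not specify which layers are patched or what form the regularizer takes, so the squared-distance penalty, the `weight` parameter, and the function names `patch_prefix`, `token_regularizer`, and `regularized_loss` are all assumptions made for illustration. Hidden states are represented as one list of floats per token position.

```python
from typing import List, Sequence

Vector = List[float]


def patch_prefix(finetuned_states: Sequence[Vector],
                 base_states: Sequence[Vector],
                 prefix_len: int) -> List[Vector]:
    """Replace the first `prefix_len` positions with base-model states.

    Toy version of the substitution step in prefix patching; a real
    experiment would do this inside the model and continue the forward pass.
    """
    if len(finetuned_states) != len(base_states):
        raise ValueError("state sequences must have the same length")
    return [
        list(base_states[i]) if i < prefix_len else list(finetuned_states[i])
        for i in range(len(finetuned_states))
    ]


def token_regularizer(finetuned_states: Sequence[Vector],
                      base_states: Sequence[Vector],
                      positions: Sequence[int]) -> float:
    """Mean squared distance to the base model at the chosen positions.

    Assumed penalty form (hypothetical); the source only says that
    specific token representations are regularized.
    """
    total = 0.0
    for p in positions:
        total += sum((a - b) ** 2
                     for a, b in zip(finetuned_states[p], base_states[p]))
    return total / max(len(positions), 1)


def regularized_loss(task_loss: float,
                     finetuned_states: Sequence[Vector],
                     base_states: Sequence[Vector],
                     positions: Sequence[int],
                     weight: float = 1.0) -> float:
    """Task loss plus the weighted token-representation penalty."""
    penalty = token_regularizer(finetuned_states, base_states, positions)
    return task_loss + weight * penalty
```

In this sketch, `positions` would hold the indices of the chat-template tokens, so the penalty constrains only those representations and leaves the user-query tokens free to adapt to the finetuning task.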

Across different models and multiple EM-inducing datasets, TReFT reduces EM while preserving in-domain learning. On Llama-3.1-8B finetuned on the legal domain, TReFT achieves 33.5% more EM reduction than data interleaving with a retain set of aligned examples. We further show that TReFT extends to other narrow-finetuning settings, including abstention, tool use, and refusal (off-topic generalization is reduced by 54.3% on average), supporting the Piggyback Hypothesis.

Broadly, our work highlights that LLMs may learn and generalize in unintended ways and suggests a path toward more constrained finetuning. It also calls for further study of how shared input features can piggyback model behavior across domains.
