Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment

Abstract: Emergent misalignment (EM) has been linked to the activation of misaligned persona vectors and evil character traits, suggesting that EM operates through disruption of the model’s aligned character rather than direct learning of harmful content.

Motivated by this connection, we study self-generated text recognition (SGTR) finetuning as a character-targeted intervention that is distinct from existing in-training defenses.

We conduct two-stage finetuning experiments across three models (GPT-4.1, Qwen2.5-32B-Instruct, Seed-OSS-36B-Instruct) and multiple EM datasets, comparing SGTR finetuning against benign finetuning baselines (correct domain-specific data, general knowledge, and word counting), and find it to be an effective defense in both reversal and prevention settings.

We find that all interventions produce comparable EM reversal, but only when restoring capabilities that EM had degraded.

For prevention, only SGTR finetuning consistently reduces misalignment without exacerbating any individual metric, suggesting that character fortification specifically drives prevention.

We provide further evidence for EM’s relation to the LLM’s default character by showing that EM finetuning induces diversity in the LLM’s identity self-reports, that artificially corrupting self-recognition exacerbates the misalignment caused by EM finetuning, and that removing the model’s identity-bearing system prompt substantially reduces the effect of EM finetuning.

Together, these findings reframe EM not as the adoption of a coherent misaligned persona, but as the destabilization of aligned character.