RepSelect: Robust LLM Unlearning via Representation Selectivity

Abstract: Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. Current methods are easily reversed by fine-tuning or few-shot prompting, which suggests that their forgetting is only shallow.

We identify the root cause: existing methods target representations that are shared with both the retain set and the subspace recovered by a fine-tuning attacker. This makes unlearning both disruptive to general capabilities and easy to reverse.

We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover.
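
The gradient-filtering step described above can be illustrated with a minimal sketch. It assumes that "collapsing" the top principal components means zeroing the k largest singular directions of a 2-D weight gradient via SVD. The function name `collapse_top_components` and the parameter `k` are hypothetical and chosen for illustration only; the actual method may compute the principal components differently (for example, across per-example gradients).

```python
import numpy as np


def collapse_top_components(grad: np.ndarray, k: int) -> np.ndarray:
    """Zero the k largest singular directions of a 2-D weight gradient.

    Illustrative assumption: the dominant directions carry the shared
    (non-selective) part of the update, so removing them leaves the
    remaining, more forget-set-specific directions.
    """
    if grad.ndim != 2:
        raise ValueError("expected a 2-D weight gradient")
    if k < 0:
        raise ValueError("k must be non-negative")
    # Singular values come back sorted in descending order.
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s = s.copy()
    s[:k] = 0.0  # collapse the top-k components
    # Rebuild the gradient from the remaining components.
    return (u * s) @ vt


# Toy usage: filter a random "gradient" before it would be applied.
rng = np.random.default_rng(0)
toy_grad = rng.normal(size=(6, 4))
filtered_grad = collapse_top_components(toy_grad, k=2)
```

In a training loop, the filtered gradient would replace the raw gradient just before the optimizer step. How many components to collapse, and for which weight matrices, is left open here because the abstract does not specify it.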

We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite).

We compare against five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL). RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.
