RepSelect: Robust LLM Unlearning via Representation Selectivity
Abstract: Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. Current methods are easily reversed by fine-tuning or few-shot prompting, which suggests that their forgetting is only shallow.
We identify the root cause. Existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse.
We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover.
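As a rough illustration of the "collapsing top principal components" step, the sketch below removes the k largest singular directions from a single weight-gradient matrix before it would be handed to an optimizer. This is one possible reading, not the authors' implementation: the abstract does not say how the components are computed, so the use of an uncentered SVD on one gradient matrix, the function name `collapse_top_components`, the parameter `k`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def collapse_top_components(grad: np.ndarray, k: int = 1) -> np.ndarray:
    """Return `grad` with its k largest singular directions zeroed out.

    Assumption: "principal components" is read here as the top singular
    directions of the (uncentered) gradient matrix.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Thin SVD: grad = u @ diag(s) @ vt, with s sorted in descending order.
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s = s.copy()
    s[:k] = 0.0  # drop the dominant directions
    # Scale the columns of u by the remaining singular values and rebuild.
    return (u * s) @ vt


# Toy gradient: a strong rank-1 direction plus a small residual.
# Both terms are synthetic; they only stand in for "dominant" and
# "remaining" structure and carry no meaning about real models.
rng = np.random.default_rng(0)
dominant = 10.0 * np.outer(rng.standard_normal(8), rng.standard_normal(6))
residual = 0.1 * rng.standard_normal((8, 6))
grad = dominant + residual

filtered = collapse_top_components(grad, k=1)
```

In a training loop, a filter of this kind would be applied to each weight matrix's gradient between the backward pass and the optimizer step. How many components to collapse, and over which layers, are choices the abstract leaves open.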
We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite).
Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.