SafeGene: Reusable Adapters for Transferable Safety Alignment
Abstract: Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make models more vulnerable to malicious prompts, even when the training data is not intentionally harmful. This creates a recurring safety recovery problem as target models are repeatedly updated with new task data or user interactions.
We propose SafeGene, a reusable safety-adapter module designed for cross-task reuse within each architecture-compatible model family. Rather than treating safety recovery as a model-specific repair step, SafeGene treats safety capability as an independent, reusable adapter representation decoupled from task-specific updates.
This representation is obtained from the discrepancies between aligned and degraded models, refined into task-transferable safety vectors through data-aware layer selection, and expressed in each downstream task-adapted model via few-shot, layer-wise coefficient recalibration.
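The three stages named above (discrepancy extraction, layer selection, coefficient recalibration) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: models are plain dicts of NumPy arrays keyed by layer name, the layer relevance `scores` are supplied by the caller rather than computed from data, and the recalibration is a simple coordinate-wise grid search over a caller-supplied `loss_fn`. The function names are hypothetical.

```python
import numpy as np

def extract_safety_vectors(aligned, degraded):
    """Per-layer difference between an aligned and a degraded model."""
    return {name: aligned[name] - degraded[name] for name in aligned}

def select_layers(vectors, scores, k):
    """Keep the k layers with the highest (hypothetical) relevance score."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return {name: vectors[name] for name in top}

def apply_safety_vectors(task_model, vectors, coeffs):
    """Add each selected vector to the task model, scaled per layer."""
    patched = dict(task_model)
    for name, vec in vectors.items():
        patched[name] = task_model[name] + coeffs.get(name, 1.0) * vec
    return patched

def recalibrate(task_model, vectors, loss_fn, grid=(0.0, 0.5, 1.0, 1.5)):
    """Assumed stand-in for few-shot recalibration: pick, one layer at a
    time, the grid coefficient that minimizes a small calibration loss."""
    coeffs = {name: 1.0 for name in vectors}
    for name in vectors:
        def loss_at(c, name=name):
            trial = {**coeffs, name: c}
            return loss_fn(apply_safety_vectors(task_model, vectors, trial))
        coeffs[name] = min(grid, key=loss_at)
    return coeffs

# Toy data: only layer "l0" differs between the aligned and degraded models.
aligned = {"l0": np.array([1.0, 1.0]), "l1": np.array([2.0, 0.0])}
degraded = {"l0": np.array([0.0, 1.0]), "l1": np.array([2.0, 0.0])}
task_model = {"l0": np.array([0.5, 3.0]), "l1": np.array([4.0, 4.0])}

vectors = select_layers(
    extract_safety_vectors(aligned, degraded),
    scores={"l0": 1.0, "l1": 0.0},  # made-up relevance scores
    k=1,
)
# Made-up calibration loss: distance of one weight from a target value.
coeffs = recalibrate(task_model, vectors, lambda m: abs(m["l0"][0] - 1.0))
patched = apply_safety_vectors(task_model, vectors, coeffs)
```

In this sketch the safety vectors are computed once and reused, while only the small set of per-layer coefficients is refit for each task-adapted model, which mirrors the reuse-then-recalibrate idea in the abstract.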
Experiments across multiple model families, downstream tasks, and safety judges show that SafeGene-enhanced models reduce harmful response rates while maintaining downstream performance, and that they outperform representative safe-adaptation methods on the safety-utility trade-off.