Refusal Lives Downstream of Persona in Chat Models
Abstract: Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but the two have been studied as separate mechanisms. We show they interact: a compliant persona gates refusal.
In Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, we extract a compliant model-persona direction and a refusal direction and intervene on both. Steering toward the compliant persona suppresses refusal: in Llama, the refusal rate falls from 97% to 2%.
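As a rough illustration of what extracting a direction and steering along it can look like, here is a minimal sketch on synthetic data. It assumes a difference-of-means extraction, which is one common approach; the abstract does not state which method was used. The names `mean_difference_direction`, `steer`, and `alpha`, and all array shapes and values, are hypothetical.

```python
import numpy as np


def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts.

    Assumption: difference of means is only one way to obtain a linear
    direction; it is used here for illustration.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha):
    """Add alpha * direction to every row of hidden (tokens x d_model)."""
    return hidden + alpha * direction


rng = np.random.default_rng(0)
d_model = 16

# Synthetic stand-ins for residual-stream activations, not real model data.
offset = np.zeros(d_model)
offset[0] = 3.0
compliant_acts = rng.normal(size=(200, d_model)) + offset
baseline_acts = rng.normal(size=(200, d_model))

persona_dir = mean_difference_direction(compliant_acts, baseline_acts)

# Steering shifts each activation by alpha along the unit persona direction.
steered = steer(baseline_acts, persona_dir, alpha=4.0)
```

In a real model the same addition would be applied to hidden states at chosen layers during the forward pass, for example through a hook; the steering strength is a free parameter that would need tuning.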
Reintroducing the refusal direction partially restores refusal at late layers but not at early ones. Projecting out the persona direction in a late-layer window restores refusal to baseline; projecting out a random direction does not.
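The projection intervention and its random-direction control can be sketched as follows. This is an illustrative toy on synthetic arrays, not the authors' code: the helper names `project_out` and `ablate_in_window`, the eight-layer stack, and the window `[6, 8)` are all assumptions made for the example.

```python
import numpy as np


def project_out(hidden, direction):
    """Remove the component of each row of hidden along direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)


def ablate_in_window(layer_acts, direction, start, end):
    """Project out direction at layers start..end-1; leave other layers as-is."""
    return [
        project_out(h, direction) if start <= i < end else h
        for i, h in enumerate(layer_acts)
    ]


rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 8, 5, 16

# Synthetic per-layer activations standing in for a model's residual stream.
layers = [rng.normal(size=(n_tokens, d_model)) for _ in range(n_layers)]

# Hypothetical persona direction (unit norm) and a random control direction.
persona_dir = rng.normal(size=d_model)
persona_dir /= np.linalg.norm(persona_dir)
random_dir = rng.normal(size=d_model)

# Late-layer window [6, 8) is a made-up choice for this toy example.
ablated = ablate_in_window(layers, persona_dir, start=6, end=8)
control = ablate_in_window(layers, random_dir, start=6, end=8)
```

After ablation, activations inside the window have zero component along the persona direction, while the random-direction control leaves most of that component in place. That difference is what makes the random projection a meaningful control.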
Refusal is therefore gated at the late-layer expression stage, downstream of where it is computed. Treating refusal as a single isolated direction misses its dependence on persona.