Refusal Lives Downstream of Persona in Chat Models

Abstract: Linear directions in activation space have been identified for both refusal and persona traits in instruction-tuned chat models, but the two have been studied as separate mechanisms. We show they interact: a compliant persona gates refusal.

In Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, we extract a compliant model-persona direction and a refusal direction and intervene on both. Steering toward the compliant persona suppresses refusal: in Llama, the refusal rate falls from 97% to 2%.
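
To make the extraction and steering steps concrete, here is a minimal sketch on synthetic activations. It assumes a difference-of-means estimator and additive steering, both common choices in this line of work; the abstract does not state which estimator or steering rule was used. The names `mean_difference_direction`, `steer`, and `alpha` are hypothetical, as are the dimensions and data.

```python
import numpy as np

def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts.

    Assumption: difference of means stands in for whatever estimator
    the authors actually used.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Add alpha * direction to every hidden state (each row of `hidden`)."""
    return hidden + alpha * direction

# Synthetic stand-ins for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0

compliant_acts = rng.normal(size=(200, d_model)) + 3.0 * true_dir
neutral_acts = rng.normal(size=(200, d_model))

persona_dir = mean_difference_direction(compliant_acts, neutral_acts)

# Steer a small batch of hidden states toward the compliant persona.
hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, persona_dir, alpha=4.0)

# Each row moves by exactly alpha along the unit persona direction.
shift = (steered - hidden) @ persona_dir
```

In a real model the same addition would be applied to the residual stream at chosen layers during the forward pass; the sketch only shows the vector arithmetic.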

Reintroducing the refusal direction partially restores refusal at late layers but not at early ones. Projecting out the persona direction in a late-layer window restores refusal to baseline; projecting out a random direction does not.
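
The projection intervention can be sketched as removing each hidden state's component along a unit direction, applied only to layers inside a window. This is an illustrative sketch on synthetic data, not the authors' code: `project_out`, `ablate_window`, the window bounds, and the layer count are all hypothetical.

```python
import numpy as np

def project_out(hidden, direction):
    """Remove the component of each row of `hidden` along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)

def ablate_window(layer_states, direction, start, stop):
    """Project out `direction` only for layers with start <= index < stop."""
    return [
        project_out(h, direction) if start <= i < stop else h
        for i, h in enumerate(layer_states)
    ]

# Synthetic per-layer hidden states carrying a strong persona component
# (hypothetical sizes and data).
rng = np.random.default_rng(0)
d_model, n_layers = 16, 6
persona_dir = rng.normal(size=d_model)
random_dir = rng.normal(size=d_model)
unit_persona = persona_dir / np.linalg.norm(persona_dir)

layer_states = [
    rng.normal(size=(5, d_model)) + 4.0 * unit_persona
    for _ in range(n_layers)
]

# Ablate in a late-layer window (layers 4 and 5 here), with a random control.
ablated = ablate_window(layer_states, persona_dir, start=4, stop=6)
control = ablate_window(layer_states, random_dir, start=4, stop=6)

# Persona component remaining in the last layer after each intervention.
persona_left = np.abs(ablated[5] @ unit_persona).max()
control_left = np.abs(control[5] @ unit_persona).mean()
```

Projecting out the persona direction drives the remaining persona component to numerical zero, while the random-direction control leaves most of it in place, which is the contrast the control condition is meant to test.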

Refusal is therefore gated at the late-layer expression stage, downstream of where it is computed. Treating refusal as a single isolated direction misses its dependence on persona.
