Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

Abstract: Arditi et al. (2024) showed that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable as the difference in means (DiM) between activations on harmful and harmless prompts. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP), namely nullspace projection and counterfactual flipping, on five open-weight chat models. We ask whether INLP can match DiM at steering refusal and whether its richer parameterisation yields more tunable interventions.
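
As a rough illustration of the two DiM interventions named above, the following minimal sketch computes a difference-in-means direction and applies directional ablation and activation addition to synthetic vectors. The random data, the dimension `d`, the scale `alpha`, and all function names are assumptions made for illustration; no model activations from the study are involved, and the study's choice of layers and scaling may differ.

```python
import numpy as np

def dim_direction(harmful, harmless):
    """Unit difference-in-means direction between two (n, d) activation sets."""
    r = harmful.mean(axis=0) - harmless.mean(axis=0)
    return r / np.linalg.norm(r)

def directional_ablation(x, r_hat):
    """Remove the component of each row of x along the unit vector r_hat."""
    return x - np.outer(x @ r_hat, r_hat)

def activation_addition(x, r_hat, alpha):
    """Shift each row of x by alpha along r_hat (alpha is an assumed scale)."""
    return x + alpha * r_hat

# Synthetic stand-in for residual-stream activations (assumption):
# the "harmful" cluster is offset from the "harmless" one along axis 0.
rng = np.random.default_rng(0)
d = 16
offset = np.zeros(d)
offset[0] = 4.0
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + offset

r_hat = dim_direction(harmful, harmless)
ablated = directional_ablation(harmful, r_hat)              # suppress the direction
steered = activation_addition(harmless, r_hat, alpha=4.0)   # induce the direction
```

After ablation every row has zero component along `r_hat`, while addition shifts the mean projection of the harmless set by exactly `alpha`.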

INLP counterfactual flipping is competitive with DiM directional ablation on refusal suppression, while nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect at near-baseline perplexity, so the number of retained directions acts as a tuning knob.
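
The idea of extracting a subspace with INLP and then keeping only its leading directions can be sketched as follows. This is a simplified sketch under stated assumptions: a least-squares linear probe stands in for whatever classifier the study trained, "leading" is taken to mean the first `k` directions in extraction order, and the data are synthetic. The names `inlp_directions` and `nullspace_projector` are hypothetical.

```python
import numpy as np

def inlp_directions(X, y, n_iters):
    """Return up to n_iters orthonormal probe directions found iteratively.

    Each round fits a linear probe on the current data, records its unit
    weight vector, and projects the data onto that vector's nullspace.
    """
    Xp = X.astype(float).copy()
    dirs = []
    for _ in range(n_iters):
        Xc = Xp - Xp.mean(axis=0)
        # Least-squares probe (assumption: a stand-in for a trained classifier).
        w, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=1e-8)
        # Re-orthogonalise against earlier directions for numerical safety.
        for u in dirs:
            w = w - (w @ u) * u
        norm = np.linalg.norm(w)
        if norm < 1e-10:
            break
        w = w / norm
        dirs.append(w)
        Xp = Xp - np.outer(Xp @ w, w)
    return np.array(dirs)

def nullspace_projector(dirs, k):
    """Projection matrix that removes only the first k extracted directions."""
    W = dirs[:k]
    return np.eye(W.shape[1]) - W.T @ W

# Synthetic two-cluster data (assumption), labels -1 = harmless, +1 = harmful.
rng = np.random.default_rng(0)
d = 16
offset = np.zeros(d)
offset[0] = 4.0
X = np.vstack([rng.normal(size=(200, d)), rng.normal(size=(200, d)) + offset])
y = np.concatenate([-np.ones(200), np.ones(200)])

dirs = inlp_directions(X, y, n_iters=4)
P = nullspace_projector(dirs, k=2)   # k is the tuning knob
X_proj = X @ P                       # P is symmetric, so X @ P projects each row
```

Lowering `k` removes fewer directions and so perturbs the activations less; how that trades off against perplexity in a real model is an empirical question this sketch does not address.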

Geometrically, the two INLP interventions land in qualitatively different regions of activation space: nullspace projection collapses transformed activations into the region between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster. This suggests that the model encodes the absence of a concept differently from its opposite, a distinction we leave to future work.
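
The geometric contrast can be illustrated with a toy example. The sketch below assumes that nullspace projection zeroes the coordinates along the extracted directions relative to the midpoint of the two cluster means, and that counterfactual flipping reflects those coordinates about the same midpoint. Both formulations, the single-direction subspace `W`, and the synthetic data are assumptions for illustration and may differ from the interventions as implemented in the study.

```python
import numpy as np

def project_out(x, W, centre):
    """Zero the coordinates of x along the rows of W, measured from centre."""
    coords = (x - centre) @ W.T
    return x - coords @ W

def flip(x, W, centre):
    """Reflect the coordinates of x along the rows of W about centre."""
    coords = (x - centre) @ W.T
    return x - 2.0 * coords @ W

# Synthetic clusters (assumption): harmful is offset from harmless on axis 0.
rng = np.random.default_rng(0)
d = 16
offset = np.zeros(d)
offset[0] = 4.0
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + offset

# One-direction subspace along the mean difference, standing in for an
# INLP-extracted subspace with orthonormal rows (assumption).
delta = harmful.mean(axis=0) - harmless.mean(axis=0)
W = (delta / np.linalg.norm(delta))[None, :]
centre = 0.5 * (harmful.mean(axis=0) + harmless.mean(axis=0))

projected = project_out(harmful, W, centre)  # lands between the clusters
flipped = flip(harmful, W, centre)           # lands in the harmless cluster
```

In this toy setting the mean of `projected` coincides with the midpoint of the two clusters, while the mean of `flipped` coincides with the harmless mean, mirroring the qualitative picture described above.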
