Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model’s activations, we should be able to modify it. This rests on a hidden premise — that the direction which detects a behavior and the direction which controls it are the same, or close.

We test this geometrically: what is the angle between the direction that best detects a behavior and the direction that best causes it? If detection implies control, the cosine is near 1; otherwise it quantifies a detection-intervention gap. On Gemma 2-2B-it, output format (clean JSON vs. markdown fencing) collapses both roles onto one axis.
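
The geometric test can be illustrated with a minimal sketch. It assumes the detection and control directions are already available as plain vectors; the names `detect_dir`, `control_same`, and `control_orth` and their toy 3-dimensional values are hypothetical and are not taken from the paper.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def angle_degrees(cos_value):
    """Angle in degrees for a cosine, clamped against rounding error."""
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_value))))

# Toy stand-ins for directions in activation space (hypothetical values).
detect_dir = [1.0, 0.0, 0.0]
control_same = [2.0, 0.0, 0.0]   # same axis: detection would imply control
control_orth = [0.0, 1.0, 0.0]   # orthogonal axis: maximal gap

cos_same = cosine(detect_dir, control_same)
cos_orth = cosine(detect_dir, control_orth)
```

In this toy setup `cos_same` is 1 (both roles on one axis) and `cos_orth` is 0 (knowing a behavior's direction says nothing about steering it); real directions would fall somewhere in between.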

Hallucination does not: the model detects fake entities with perfect linear separability (AUC = 1.000 from layer 5), yet that direction sits at cos = 0.12 (about 83 degrees) from the direction producing a refusal — a small, reproducible alignment, far from the cos = 1 that “detection is control” would require. A detector built from activations, with no chosen tokens, likewise fails to align (cos = -0.06).
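
As a rough illustration of what "perfect linear separability" means here, the sketch below projects activations onto a single direction and computes the AUC as the fraction of (fake, real) pairs ranked correctly. The toy activations and the direction are hypothetical, and this pairwise AUC is a generic formulation, not necessarily the evaluation procedure used in the paper.

```python
def project(activations, direction):
    """Scalar score per example: dot product with the probe direction."""
    return [sum(a * d for a, d in zip(x, direction)) for x in activations]

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical 2-d activations for fake and real entities.
fake = [[2.0, 0.1], [1.5, -0.3], [1.8, 0.4]]
real = [[-1.0, 0.2], [-0.5, -0.1], [-1.2, 0.3]]
direction = [1.0, 0.0]  # assumed detection direction for the toy data

score = auc(project(fake, direction), project(real, direction))
```

An AUC of 1 means every fake-entity example scores above every real-entity example along the direction. This measures readability of the signal only and carries no information about whether adding that direction changes the model's output.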

The gap generalizes: across four models from three families and two scales (1B-9B), cos stays in [0.12, 0.20] and is nearly identical before and after instruction tuning (0.1197 vs. 0.1200), which places its origin in pretraining. A 15-degree rotation toward the refusal direction partially bridges it: 73% and 60% refusal on two held-out fake-entity categories at a 1.8% false-positive rate.
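
One way to picture such a rotation is the sketch below. It assumes the rotation happens inside the plane spanned by the detection direction `d` and the refusal direction `r`; the function name `rotate_toward` and the toy vectors are hypothetical, and the paper's actual procedure may differ.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def unit(u):
    n = norm(u)
    return [x / n for x in u]

def rotate_toward(d, r, degrees):
    """Rotate unit(d) by `degrees` toward r within the plane spanned by d and r."""
    d_hat, r_hat = unit(d), unit(r)
    # Gram-Schmidt step: the part of r that is orthogonal to d.
    overlap = dot(r_hat, d_hat)
    ortho = [ri - overlap * di for ri, di in zip(r_hat, d_hat)]
    ortho_norm = norm(ortho)
    if ortho_norm < 1e-12:
        raise ValueError("d and r are parallel; the rotation plane is undefined")
    u_hat = [o / ortho_norm for o in ortho]
    theta = math.radians(degrees)
    return [math.cos(theta) * di + math.sin(theta) * ui
            for di, ui in zip(d_hat, u_hat)]

# Toy directions whose cosine is 0.12 (hypothetical 3-d stand-ins).
d = [1.0, 0.0, 0.0]
r = [0.12, math.sqrt(1.0 - 0.12 ** 2), 0.0]

rotated = rotate_toward(d, r, 15.0)
cos_before = dot(unit(d), unit(r))
cos_after = dot(rotated, unit(r))
```

The rotated vector stays at unit length, sits exactly 15 degrees from `d`, and has a larger cosine with `r` than `d` does, so it trades a little detection alignment for more alignment with the refusal direction.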

We then ask whether this cosine predicts steerability, and it does not: detection is a high-dimensional class, not a single direction, and what separates the steerable case is functional, not readable from a static angle. The cosine is a weight-computable signature of the dissociation between knowing and steering, not a predictor of it.
