Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model’s activations, we should be able to modify it. This rests on a hidden premise: that the direction that detects a behavior and the direction that controls it are the same, or close.

We test this geometrically: what is the angle between the direction that best detects a behavior and the direction that best causes it? If detection implies control, the cosine is near 1; otherwise, it quantifies a detection-intervention gap. On Gemma 2-2B-it, output format (clean JSON vs. markdown fencing) collapses both roles onto one axis.

Hallucination does not: the model detects fake entities with perfect linear separability (AUC = 1.000 from layer 5), yet that direction sits at cos = 0.12 (about 83 degrees) from the direction that produces a refusal. This is a small, reproducible alignment, far from the cos = 1 that “detection is control” would require. A detector built from activations, with no chosen tokens, likewise fails to align (cos = -0.06).
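To make the reported numbers concrete, the following is a minimal sketch of how a cosine between two directions maps to an angle. The helper names (`cosine`, `angle_degrees`, `unit_with_cosine`) and the synthetic vectors are illustrative assumptions, not the paper's code or activations.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two direction vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def angle_degrees(cos_value):
    """Convert a cosine to an angle in degrees (clipped for numerical safety)."""
    return float(np.degrees(np.arccos(np.clip(cos_value, -1.0, 1.0))))

def unit_with_cosine(u, c, rng):
    """Hypothetical helper: build a unit vector whose cosine with u is exactly c."""
    u = u / np.linalg.norm(u)
    w = rng.normal(size=u.shape)
    w = w - np.dot(w, u) * u          # remove the component along u
    w = w / np.linalg.norm(w)         # unit vector orthogonal to u
    return c * u + np.sqrt(1.0 - c**2) * w

# Synthetic stand-ins for a "detection" and a "refusal" direction.
rng = np.random.default_rng(0)
detect_dir = rng.normal(size=128)
refuse_dir = unit_with_cosine(detect_dir, 0.12, rng)

cos_dr = cosine(detect_dir, refuse_dir)
angle_dr = angle_degrees(cos_dr)      # roughly 83 degrees for cos = 0.12
angle_neg = angle_degrees(-0.06)      # slightly past 90 degrees
print(f"cos = {cos_dr:.2f}, angle = {angle_dr:.1f} degrees")
```

Under this reading, cos = 0.12 leaves the two directions only about 7 degrees short of orthogonal, and cos = -0.06 places them just beyond a right angle.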

The gap generalizes: across four models from three families and two scales (1B-9B), cos stays in [0.12, 0.20] and is nearly identical before and after instruction tuning (0.1197 vs. 0.1200), which places its origin in pretraining. A 15-degree rotation toward the refusal direction partially bridges it: 73% and 60% refusal on two held-out fake-entity categories at a 1.8% false-positive rate.
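One plausible way to realize a rotation of a fixed angle toward another direction is to rotate within the plane the two vectors span. The sketch below assumes that construction; `rotate_toward` and the random vectors are hypothetical and may differ from the procedure actually used in the paper.

```python
import numpy as np

def rotate_toward(source, target, degrees):
    """Rotate the unit vector along `source` by `degrees` toward `target`,
    staying inside the 2D plane spanned by the two vectors."""
    s = source / np.linalg.norm(source)
    t = target / np.linalg.norm(target)
    # Gram-Schmidt step: the part of the target orthogonal to the source.
    ortho = t - np.dot(t, s) * s
    norm = np.linalg.norm(ortho)
    if norm < 1e-12:
        raise ValueError("source and target are (anti)parallel; plane undefined")
    ortho = ortho / norm
    theta = np.radians(degrees)
    return np.cos(theta) * s + np.sin(theta) * ortho

def angle_between(u, v):
    """Angle between two vectors in degrees."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))

# Synthetic stand-ins for the detection and refusal directions.
rng = np.random.default_rng(0)
detect_dir = rng.normal(size=256)
refuse_dir = rng.normal(size=256)

rotated = rotate_toward(detect_dir, refuse_dir, 15.0)
before = angle_between(detect_dir, refuse_dir)
after = angle_between(rotated, refuse_dir)
print(f"angle to refusal: {before:.1f} -> {after:.1f} degrees")
```

With this construction the rotated vector stays unit length and its angle to the target shrinks by exactly the requested amount, so a direction starting about 83 degrees from the refusal direction would end about 68 degrees from it.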

We then ask whether this cosine predicts steerability, and it does not: detection is a high-dimensional class, not a single direction, and what separates the steerable case is functional, not readable from a static angle. The cosine is a weight-computable signature of the dissociation between knowing and steering, not a predictor of it.
