Architecture Determines Observability in Transformers

Abstract: Autoregressive transformers make confident errors, and activation monitoring can catch them only if the model preserves an internal signal that output confidence does not expose. Whether that signal survives is determined by architecture and training recipe. We define observability as the linear readability of per-token decision quality from frozen mid-layer activations after controlling for max-softmax confidence and activation norm. The control is essential: confidence controls absorb 57.7% of the raw probe signal on average across 13 models in 6 families.
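
To make the definition concrete, the following is a minimal sketch of one way to compute a confidence-controlled probe correlation. It assumes a rank-based partial correlation obtained by regressing the controls out of both the probe score and the decision-quality target; the abstract does not spell out the exact estimator, so the functions `rank`, `residualize`, and `partial_corr`, the synthetic data, and the reading of "absorbed" as `1 - partial / raw` are all illustrative assumptions rather than the authors' procedure.

```python
import numpy as np

def rank(x):
    """Ordinal ranks (ties broken by position); fine for continuous scores."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x), dtype=float)
    return r

def residualize(y, controls):
    """Remove the part of y that is linear in the controls (plus an intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(controls))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def partial_corr(score, target, controls):
    """Rank-based partial correlation of score and target given the controls."""
    ranked_controls = [rank(c) for c in controls]
    s = residualize(rank(score), ranked_controls)
    t = residualize(rank(target), ranked_controls)
    return float(np.corrcoef(s, t)[0, 1])

# Synthetic illustration only; none of these numbers come from the paper.
rng = np.random.default_rng(0)
n = 5000
confidence = rng.normal(size=n)   # stand-in for max-softmax confidence
act_norm = rng.normal(size=n)     # stand-in for activation norm
hidden = rng.normal(size=n)       # signal that confidence does not expose
quality = confidence + 0.5 * hidden + 0.5 * rng.normal(size=n)
probe = confidence + 0.5 * hidden + 0.5 * rng.normal(size=n)

raw = partial_corr(probe, quality, controls=[])
partial = partial_corr(probe, quality, controls=[confidence, act_norm])
absorbed = 1.0 - partial / raw    # assumed definition of "absorbed"
print(f"raw={raw:.2f} partial={partial:.2f} absorbed={absorbed:.1%}")
```

In this toy setup the probe and the target share both the confidence term and the hidden term, so the raw correlation overstates what the probe adds beyond confidence. A probe that only re-encodes confidence would score near zero once the controls are applied.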

Observability is not a generic property of transformers. In Pythia’s controlled suite, every tested run with the 24-layer, 16-head configuration collapses to a partial correlation (rho_partial) of about 0.10, across a 3.5x parameter gap and two Pile variants, while six other configurations occupy a separate healthy band from 0.21 to 0.38. The output-controlled residual collapses at the same points, and neither the nonlinear probes nor the layer sweeps that were tested recover healthy-range signal.

Checkpoint dynamics show that the collapse emerges during training. Both configurations at matched hidden dimension form the signal at the earliest measured checkpoint, but training erases it in the (24L, 16H) class while predictive loss continues to improve. Across independent recipes the collapse map changes but the phenomenon persists: Qwen 2.5 and Llama differ by 2.9x at matched 3B scale, with probe seed distributions that do not overlap, while Mistral 7B preserves observability where Llama 3.1 8B collapses despite a broadly similar architecture.

A WikiText-trained observer transfers to downstream QA without training on those tasks and catches errors that confidence misses. At a 20% flag rate, its exclusive catch rate is 10.9-13.4% of all errors in seven of nine model-task cells. Architecture selection is a monitoring decision.
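
The exclusive catch rate can be sketched as follows, under a stated assumption about its meaning: each monitor flags the same fraction of items (here 20%), and an error counts as an exclusive catch when the observer flags it but the confidence baseline does not. The helper names `top_fraction_mask` and `exclusive_catch_rate` and the toy arrays are hypothetical; the abstract does not give the exact flagging rule.

```python
import numpy as np

def top_fraction_mask(risk, rate):
    """Flag the `rate` fraction of items with the highest risk score."""
    risk = np.asarray(risk, dtype=float)
    k = int(round(rate * len(risk)))
    mask = np.zeros(len(risk), dtype=bool)
    if k > 0:
        mask[np.argsort(-risk, kind="stable")[:k]] = True
    return mask

def exclusive_catch_rate(observer_risk, confidence, is_error, rate=0.20):
    """Share of all errors flagged by the observer but not by low confidence."""
    is_error = np.asarray(is_error, dtype=bool)
    obs_flag = top_fraction_mask(observer_risk, rate)
    # Low confidence is treated as high risk for the baseline monitor.
    conf_flag = top_fraction_mask(-np.asarray(confidence, dtype=float), rate)
    exclusive = is_error & obs_flag & ~conf_flag
    n_errors = int(is_error.sum())
    return float(exclusive.sum() / n_errors) if n_errors else 0.0

# Toy data: 10 items, 4 of them errors (indices 0 to 3).
observer_risk = np.array([0.9, 0.8, 0.1, 0.2, 0.3, 0.0, 0.05, 0.15, 0.25, 0.35])
confidence = np.array([0.95, 0.2, 0.1, 0.9, 0.8, 0.85, 0.7, 0.75, 0.6, 0.65])
is_error = np.array([True, True, True, True] + [False] * 6)

rate = exclusive_catch_rate(observer_risk, confidence, is_error, rate=0.20)
print(rate)  # 0.25: only item 0 is an error caught by the observer alone
```

Holding the flag rate equal for both monitors keeps the comparison budget-matched, so any exclusive catch reflects a difference in which items are flagged and not in how many.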
