Representation as a Bottleneck for Mechanistic Interpretability: The Manifestation Unit Protocol

Mechanistic interpretability has produced a rich inventory of component-level analyses that characterise what neural-network components encode and how they interact. Their outputs, however, are not easily reusable: selectivity tables, circuit diagrams, and feature lists remain locked in per-study notebooks, where they cannot be composed, cannot be queried in natural language, and are not directly actionable for downstream audit or intervention.

We treat the representation layer that sits between these analyses and their downstream use as a bottleneck that can be evaluated independently. We introduce Manifestation Units, a typed tuple protocol (E, S, R, D, G), extended with attention-head primitives (T) for transformer architectures, that organises per-component statistics into structured fields. The fields are populated automatically and queried through hybrid retrieval.
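As a rough illustration of what a typed tuple with hybrid retrieval could look like, the sketch below stores one record per component and ranks records by a weighted mix of a lexical match and a numeric statistic. Everything beyond the field letters is an assumption made for illustration: the field types, the `component_id` key, the `selectivity` statistic, the scoring rule, the `alpha` weight, and the example records are hypothetical and are not taken from the protocol itself. The sketch also assumes statistics are scaled to [0, 1] so the two score terms are comparable.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ManifestationUnit:
    """One record per model component; field contents are placeholders."""
    component_id: str                     # hypothetical identifier
    E: str                                # assumed free-text field
    S: Dict[str, float]                   # assumed numeric statistics in [0, 1]
    R: List[str]                          # assumed list of related component ids
    D: str                                # assumed free-text field
    G: str                                # assumed free-text field
    T: Optional[Dict[str, float]] = None  # attention-head primitives, if any


def lexical_score(query: str, unit: ManifestationUnit) -> float:
    """Fraction of query tokens found in the unit's text fields."""
    q = set(query.lower().split())
    text = set(" ".join([unit.E, unit.D, unit.G]).lower().split())
    return len(q & text) / len(q) if q else 0.0


def hybrid_retrieve(units, query, stat_key, k=2, alpha=0.5):
    """Rank units by a weighted mix of lexical match and one S statistic."""
    scored = []
    for u in units:
        structured = u.S.get(stat_key, 0.0)
        score = alpha * lexical_score(query, u) + (1 - alpha) * structured
        scored.append((score, u.component_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by id
    return [cid for _, cid in scored[:k]]


# Made-up records for illustration only.
units = [
    ManifestationUnit("filter_a", "vertical edge detector", {"selectivity": 0.9},
                      ["filter_b"], "responds to edges", "early layer"),
    ManifestationUnit("filter_b", "fur texture", {"selectivity": 0.4},
                      [], "responds to texture", "middle layer"),
    ManifestationUnit("head_c", "attends to previous token", {"selectivity": 0.7},
                      [], "copies position", "late layer", T={"entropy": 0.2}),
]

top = hybrid_retrieve(units, "edge detector", "selectivity", k=2)
```

Making `T` optional is one way to let the same record type cover components with and without attention-head primitives; other designs are equally plausible.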

Instantiated across generative vision (beta-VAE), discriminative vision (CNN), and language (GPT-2), the protocol supports two findings. First, typed structure substantially outperforms unstructured baselines on retrieval. Second, CNN filters retrieved by the schema satisfy causal sufficiency and necessity criteria under matched-budget controls.

The schema absorbs attention-head primitives without modification and recovers the set of known indirect object identification (IOI) circuit members under retrieval-budget-matched controls. It also reveals an irreducible two-field core (S+R); the remaining fields are either redundant or actively interfering. We present this work as schema infrastructure for mechanistic interpretability rather than as frontier-scale validation.
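One plausible reading of set recovery under a retrieval-budget-matched control is sketched below: recall of the known members among the retrieved components is compared with the recall of random selections of the same size. The function names, the pool of 50 components, the 5 known members, and the budget of 10 are all hypothetical, and the evaluation actually used may differ.

```python
import random


def set_recall(retrieved, known):
    """Fraction of known members that appear in the retrieved set."""
    known = set(known)
    return len(known & set(retrieved)) / len(known)


def random_control_recall(pool, known, budget, trials=2000, seed=0):
    """Mean recall of random draws that use the same retrieval budget."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += set_recall(rng.sample(pool, budget), known)
    return total / trials


# Made-up pool of component ids; the first five play the role of known members.
pool = [f"head_{i}" for i in range(50)]
known = pool[:5]

# A hypothetical retrieval result: 10 components, 4 of them known members.
retrieved = pool[:4] + pool[20:26]

schema_recall = set_recall(retrieved, known)
control_recall = random_control_recall(pool, known, budget=len(retrieved))
```

For a uniform random draw, the expected recall equals the budget divided by the pool size (here 10/50), so the simulated control should land close to that value.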
