Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

Abstract: Mechanistic interpretability has made substantial progress in automatically localizing circuits, but explaining what localized components do remains labor-intensive and difficult to standardize. In this work, we study whether language model (LM) agents can assist with this explanation problem once a circuit has already been identified.

We introduce AgenticInterpBench, a benchmark for circuit explanation built from 84 semi-synthetic transformer circuits with 163 component-level annotations. We propose HyVE (Hypothesize, Validate, Explain), an agentic explainer that analyzes each component through an iterative loop of observation, hypothesis generation, and causal validation, eventually producing a component-level explanation and a circuit-level task description.

Across four LM backbones, HyVE recovers useful component- and task-level explanations, but no backbone is uniformly best. Our analysis shows that strong backbones usually form observation-grounded hypotheses, while failures more often arise later in the validation loop, through incomplete validation plans, code execution errors, or unresolved hypotheses.

A case study on an arithmetic circuit in Llama-3-8B shows that the same formulation can extend beyond semi-synthetic benchmarks to naturally trained models. Overall, LM agents are promising circuit explainers, but reliable validation remains the key obstacle.
