Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution
Sparse autoencoders (SAEs) have become a useful tool for extracting interpretable features from language models. However, standard SAE architectures operate on individual token activations, so the number of active features scales linearly with context length, which makes long model transcripts difficult to study.
We introduce turn-averaged SAEs, which represent a single Human or Assistant turn with a fixed number of features by learning to reconstruct the average model activation across the turn.
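To make the idea concrete, here is a minimal sketch of the forward pass, not the authors' implementation. It assumes a top-k SAE (one plausible way to get a fixed number of active features per turn), uses random untrained weights, and every name in it (`turn_average`, `topk_sae_forward`, `d_model`, `n_features`, `k`) is hypothetical. It illustrates only that a 5-token turn and a 500-token turn are both summarized by at most `k` features.

```python
import numpy as np


def turn_average(token_acts: np.ndarray) -> np.ndarray:
    """Average activations over the tokens of one turn: (n_tokens, d_model) -> (d_model,)."""
    return token_acts.mean(axis=0)


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations (ReLU'd), and decode.

    Top-k sparsity is an assumption made for this sketch; the source only
    states that each turn is represented by a fixed number of features.
    """
    pre = (x - b_dec) @ W_enc + b_enc
    top_idx = np.argpartition(pre, -k)[-k:]  # indices of the k largest entries
    feats = np.zeros_like(pre)
    feats[top_idx] = np.maximum(pre[top_idx], 0.0)
    recon = feats @ W_dec + b_dec
    return feats, recon


rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4  # toy sizes, chosen arbitrarily

# Random, untrained weights: enough to show shapes and sparsity, nothing more.
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

# Two synthetic "turns" of very different lengths.
short_turn = rng.normal(size=(5, d_model))
long_turn = rng.normal(size=(500, d_model))

for name, turn in [("short", short_turn), ("long", long_turn)]:
    target = turn_average(turn)  # the reconstruction target is the turn mean
    feats, recon = topk_sae_forward(target, W_enc, b_enc, W_dec, b_dec, k)
    mse = float(np.mean((recon - target) ** 2))  # the loss a trainer would minimize
    print(name, "tokens:", turn.shape[0],
          "active features:", int(np.count_nonzero(feats)),
          "mse:", round(mse, 4))
```

In a per-token SAE the same forward pass would run once per token, so the total number of active features would grow with the turn length; here it is bounded by `k` regardless of how long the turn is.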
We find that, as judged by an LLM, turn-averaged features describe a single turn’s high-level characteristics more completely than per-token features do. We also demonstrate that turn-averaged SAEs greatly simplify common downstream uses of SAEs, such as attribution graphs. Broadly, turn-averaged SAEs make interpretability techniques practical at long context lengths.