MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

Large language models (LLMs) frequently encode factual and reasoning knowledge in their internal representations that is not faithfully reflected in their surface-level outputs, a phenomenon known as latent knowledge.

Existing approaches to eliciting latent knowledge, such as Contrast-Consistent Search (CCS), rely on contrastive activation patterns and struggle with complex multi-step reasoning tasks. Mechanistic interpretability tools, meanwhile, have primarily been used to understand model behavior rather than to extract hidden knowledge.
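For orientation, the contrastive objective that CCS optimizes can be sketched as follows. This is a minimal sketch of the loss as it is commonly described for CCS (a consistency term plus a confidence term over each contrast pair), not the baseline implementation evaluated in this work. The function names, the linear-probe form, and the toy activations are assumptions made for illustration, and the activation normalization usually applied before probing is omitted.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def probe(weights, bias, hidden):
    """Linear probe: probability that a statement is true given its activations."""
    return sigmoid(sum(w * h for w, h in zip(weights, hidden)) + bias)


def ccs_loss(weights, bias, pairs):
    """Mean CCS-style loss over contrast pairs (h_pos, h_neg).

    h_pos / h_neg are activations for the same statement completed with
    "true" and "false" respectively (toy lists of floats here).
    """
    total = 0.0
    for h_pos, h_neg in pairs:
        p_pos = probe(weights, bias, h_pos)
        p_neg = probe(weights, bias, h_neg)
        # Consistency: p(true) and p(false) should sum to one.
        consistency = (p_pos - (1.0 - p_neg)) ** 2
        # Confidence: penalize the degenerate answer p_pos = p_neg = 0.5.
        confidence = min(p_pos, p_neg) ** 2
        total += consistency + confidence
    return total / len(pairs)


# Hypothetical toy activations: the first coordinate carries the truth signal.
pairs = [([1.0, 0.0], [-1.0, 0.0]), ([2.0, 0.5], [-2.0, -0.5])]
uninformative = ccs_loss([0.0, 0.0], 0.0, pairs)  # probe ignores the input
separating = ccs_loss([10.0, 0.0], 0.0, pairs)    # probe reads the signal
```

In this toy setting the probe that ignores its input is penalized only by the confidence term, while the probe aligned with the truth-bearing coordinate drives both terms toward zero. The loss depends only on single-step contrast pairs, which is one way to see why such an objective may carry little signal about multi-step reasoning.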

We present MechELK, a unified three-stage framework that bridges mechanistic interpretability and latent knowledge elicitation.

MechELK operates in three stages: (1) Locate, which uses Sparse Autoencoder (SAE) feature analysis and activation patching to identify knowledge-bearing representations; (2) Verify, which employs causal probing to distinguish genuine latent knowledge from spurious correlations; and (3) Elicit, which applies representation engineering to surface hidden knowledge without modifying model weights.
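The three stages can be illustrated on a toy model. The sketch below is not the MechELK implementation: it replaces the LLM with a hand-written linear network, skips SAE feature analysis entirely, and uses zero-ablation as a simple stand-in for causal probing. All weights, function names, and thresholds are hypothetical and chosen only to make the Locate, Verify, Elicit flow concrete.

```python
# Toy stand-in for a network: 3 hidden units followed by a linear readout.
# All weights and names are hypothetical, chosen for illustration only.
W_IN = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_OUT = [0.0, 2.0, 0.1]


def hidden_states(x):
    """Compute hidden activations for a 2-dimensional input."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_IN]


def readout(hidden):
    """Map hidden activations to a scalar output logit."""
    return sum(w * h for w, h in zip(W_OUT, hidden))


def locate(clean_x, corrupt_x):
    """Stage 1 (activation patching): copy each clean unit into the corrupted
    run and report the fraction of the clean/corrupt output gap it restores.
    Assumes the two runs produce different outputs (non-zero gap)."""
    clean_h, corrupt_h = hidden_states(clean_x), hidden_states(corrupt_x)
    clean_out, corrupt_out = readout(clean_h), readout(corrupt_h)
    gap = clean_out - corrupt_out
    effects = []
    for i in range(len(clean_h)):
        patched = list(corrupt_h)
        patched[i] = clean_h[i]
        effects.append((readout(patched) - corrupt_out) / gap)
    return effects


def verify(unit, inputs, threshold=0.5):
    """Stage 2 (causal check): zero-ablate the unit and test whether the
    output moves; a merely correlated unit leaves the output unchanged."""
    shifts = []
    for x in inputs:
        h = hidden_states(x)
        ablated = list(h)
        ablated[unit] = 0.0
        shifts.append(abs(readout(h) - readout(ablated)))
    return sum(shifts) / len(shifts) > threshold


def elicit(x, unit, strength):
    """Stage 3 (steering): shift the located unit's activation at inference
    time; W_IN and W_OUT are never modified."""
    h = hidden_states(x)
    h[unit] += strength
    return readout(h)


probe_inputs = [[0.0, 1.0], [1.0, 1.0]]
effects = locate(clean_x=[0.0, 1.0], corrupt_x=[0.0, 0.0])
best_unit = max(range(len(effects)), key=lambda i: effects[i])
is_causal = verify(best_unit, probe_inputs)
unit0_is_causal = verify(0, probe_inputs)  # active on some inputs, never read out
steered = elicit([0.0, 0.0], best_unit, strength=1.0)
```

In this toy, patching singles out the hidden unit that the readout depends on most, the ablation check rejects a unit that is active but has no effect on the output, and steering changes the output while the weight lists stay untouched. A real pipeline would operate on transformer activations and SAE features rather than individual hand-set units.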

Evaluated on TruthfulQA, a curated Deceptive Alignment benchmark, and the Quirky LM dataset, MechELK achieves an average elicitation accuracy of 84.7%, outperforming CCS by 6.2% and direct linear probing by 9.1%.

Crucially, MechELK successfully identifies latent knowledge in 78.3% of cases where the model’s surface output is incorrect or evasive, demonstrating its utility for AI safety applications including deceptive alignment detection.