MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

Large language models (LLMs) frequently encode factual and reasoning knowledge in their internal representations that is not faithfully reflected in their surface-level outputs, a phenomenon known as latent knowledge.

Existing approaches to eliciting latent knowledge, such as Contrastive Consistency Search (CCS), rely on contrastive activation patterns and struggle with complex multi-step reasoning tasks. Mechanistic interpretability tools, meanwhile, have primarily been used to understand model behavior rather than to extract hidden knowledge.

We present MechELK, a unified three-stage framework that bridges mechanistic interpretability and latent knowledge elicitation.

MechELK operates through three stages: (1) Locate, which uses Sparse Autoencoder (SAE) feature analysis and activation patching to identify knowledge-bearing representations; (2) Verify, which employs causal probing to distinguish genuine latent knowledge from spurious correlations; and (3) Elicit, which applies representation engineering to surface hidden knowledge without modifying model weights.
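To make the three stages concrete, the following is a minimal toy sketch on synthetic NumPy activations. It is not the MechELK implementation: SAE feature analysis and activation patching are replaced by a difference-of-means direction, causal probing by directional ablation against an orthogonal control, and representation engineering by adding a fixed steering vector. All function names, the bias value, and the steering strength `alpha` are assumptions made for illustration.

```python
import numpy as np

DIM = 16
BIAS = -4.0  # hypothetical: pushes the toy model's surface answer towards "false"

def make_data(n=400, seed=0):
    """Synthetic hidden states whose truth label lies along one hidden direction."""
    rng = np.random.default_rng(seed)
    truth_dir = rng.normal(size=DIM)
    truth_dir /= np.linalg.norm(truth_dir)
    labels = rng.integers(0, 2, size=n)  # 1 = statement is true
    signs = 2.0 * labels - 1.0
    acts = rng.normal(size=(n, DIM)) + 3.0 * signs[:, None] * truth_dir
    return acts, labels, truth_dir

def surface_accuracy(acts, labels, readout, bias=BIAS):
    """Accuracy of a frozen linear readout (stand-in for the model's output head)."""
    preds = (acts @ readout + bias > 0).astype(int)
    return float((preds == labels).mean())

def locate(acts, labels):
    """Locate (stand-in): unit difference-of-means direction between the classes."""
    d = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(acts, direction):
    """Project a unit direction out of every activation vector."""
    return acts - np.outer(acts @ direction, direction)

def verify(acts, labels, readout, direction, seed=1):
    """Verify (stand-in): ablating the located direction should hurt the
    unbiased readout, while ablating an orthogonal control should not."""
    rng = np.random.default_rng(seed)
    control = rng.normal(size=DIM)
    control -= (control @ direction) * direction  # make control orthogonal
    control /= np.linalg.norm(control)
    acc_located = surface_accuracy(ablate(acts, direction), labels, readout, bias=0.0)
    acc_control = surface_accuracy(ablate(acts, control), labels, readout, bias=0.0)
    return acc_located, acc_control

def elicit(acts, direction, alpha=4.0):
    """Elicit (stand-in): add a steering vector; the readout weights are untouched."""
    return acts + alpha * direction

if __name__ == "__main__":
    acts, labels, truth_dir = make_data()
    direction = locate(acts, labels)
    print("ablation (located, control):", verify(acts, labels, truth_dir, direction))
    print("surface accuracy before:", surface_accuracy(acts, labels, truth_dir))
    print("surface accuracy after: ", surface_accuracy(elicit(acts, direction), labels, truth_dir))
```

In this toy setting the biased readout answers "false" for most true statements even though the activations separate the two classes cleanly, which mirrors the gap between latent knowledge and surface output described above. A real pipeline would operate on activations from an actual model, and the steering strength would need to be chosen without access to ground-truth labels.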

Evaluated on TruthfulQA, a curated Deceptive Alignment benchmark, and the Quirky LM dataset, MechELK achieves an average elicitation accuracy of 84.7%, outperforming CCS by 6.2% and direct linear probing by 9.1%.

Crucially, MechELK successfully identifies latent knowledge in 78.3% of cases where the model’s surface output is incorrect or evasive, demonstrating its utility for AI safety applications, including deceptive alignment detection.