What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs
Abstract: As the influence of LLMs expands, it is imperative to gain insight into their decisions. One way to do that is to develop probes that detect the presence or absence of a broad set of concepts within the embeddings computed in an LLM, which is what we might say a model is “thinking” about.
Such probes should be low-cost and easily applicable to any LLM, so that monitoring for many concepts is possible during normal operation. In this paper, we take the first steps towards creating many such probes by defining and carrying out examples of the key tasks needed. First, we carefully delineate a concept by creating a dataset in which the concept is present in some examples and absent in others.
Second, we train and test a set of linear probes to detect the concept at any layer of an LLM, and explore how complex a probe is needed. Finally, we show that such probes can track concepts across larger contexts. We do this for four separate concepts and three different LLMs. Scaled to many more concepts, this process will make it easy to monitor new models.
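The three steps in the abstract (delineate a concept with present/absent data, train a linear probe, track the concept across a context) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings here are synthetic stand-ins for real LLM layer activations, and every name and value (`DIM`, `concept_direction`, `make_embeddings`, the shift of 6.0) is a hypothetical chosen for illustration. The probe itself is plain logistic regression, one common form of linear probe.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # hypothetical hidden size; real LLM layers are much wider

# Assumption: the concept corresponds to a shift along one hidden direction.
concept_direction = rng.normal(size=DIM)
concept_direction /= np.linalg.norm(concept_direction)


def make_embeddings(n, present):
    """Synthetic stand-in for layer activations with the concept present or absent."""
    base = rng.normal(size=(n, DIM))
    shift = 6.0 * concept_direction if present else 0.0
    return base + shift


def probe_scores(X, w, b):
    """Probability that the concept is present, one score per embedding."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def train_linear_probe(X, y, lr=0.1, epochs=300):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        grad = probe_scores(X, w, b) - y  # gradient of log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


# Step 1: a dataset with the concept present (label 1) and absent (label 0).
X_train = np.vstack([make_embeddings(200, True), make_embeddings(200, False)])
y_train = np.concatenate([np.ones(200), np.zeros(200)])
X_test = np.vstack([make_embeddings(100, True), make_embeddings(100, False)])
y_test = np.concatenate([np.ones(100), np.zeros(100)])

# Step 2: train the probe and measure held-out accuracy.
w, b = train_linear_probe(X_train, y_train)
test_accuracy = ((probe_scores(X_test, w, b) > 0.5) == y_test).mean()

# Step 3: track the concept across a longer "context" of 10 positions,
# where the concept is absent in the first half and present in the second.
context = np.vstack([make_embeddings(5, False), make_embeddings(5, True)])
trajectory = probe_scores(context, w, b)

print("test accuracy:", test_accuracy)
print("per-position scores:", trajectory.round(2))
```

With a real model, `make_embeddings` would be replaced by hidden states extracted from a chosen layer, and one probe would be trained per layer and per concept. Because each probe amounts to a single dot product per position, scoring stays cheap enough to run alongside normal inference.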
Paper Details:
- Authors: Mohamed Abdelwahab, Michelle Yu Collins, Sihan Chen, Yi Cheng Zhao, Zafarullah Mahmood, Jiading Zhu, Soliman Ali, Jonathan Rose
- Submission Date: 7 Apr 2026
- Subject: Computation and Language (cs.CL)
- arXiv ID: 2605.28823