Can LLMs Introspect? A Reality Check

Abstract: Can large language models detect and report their own internal states? A number of studies have argued that they can. Drawing on lessons from human metacognition research, we argue that this conclusion may be premature: supporting it requires distinguishing genuine introspection from pattern matching on surface-level cues.

Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions from manipulations of the input. This suggests that their success in the original studies reflects a general ability to detect anomalies rather than sensitivity to interventions on their internal states in particular.
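
The distinction drawn in the first paradigm, between noticing that something is off and identifying where the manipulation occurred, can be made concrete with a small scoring sketch. Everything here is an illustrative assumption rather than the evaluation code of any study: the condition names ("clean", "internal", "input"), the report format, and the toy trials are hypothetical and are not reported results.

```python
def _rate(numerator, denominator):
    # Return NaN instead of raising when a condition has no trials.
    return numerator / denominator if denominator else float("nan")


def score_trials(trials):
    """Score (condition, report) pairs from a hypothetical detection task.

    condition: "clean", "internal" (hidden-state intervention), or "input"
               (manipulated prompt).
    report:    "none", "internal", or "input" (what the model says happened).
    """
    clean = [r for c, r in trials if c == "clean"]
    manipulated = [(c, r) for c, r in trials if c != "clean"]
    detected = [(c, r) for c, r in manipulated if r != "none"]
    return {
        # How often any anomaly is flagged when a manipulation is present.
        "detection_rate": _rate(len(detected), len(manipulated)),
        # How often an anomaly is flagged when nothing was manipulated.
        "false_alarm_rate": _rate(sum(r != "none" for r in clean), len(clean)),
        # Among detected trials, how often the reported source is correct.
        "source_accuracy": _rate(sum(c == r for c, r in detected), len(detected)),
    }


# Toy trials (made up): every manipulation is flagged, but the reported
# source is right only half of the time.
toy_trials = (
    [("clean", "none")] * 4
    + [("internal", "internal")] * 2
    + [("internal", "input")] * 2
    + [("input", "internal")] * 2
    + [("input", "input")] * 2
)
scores = score_trials(toy_trials)
print(scores)
```

In this toy case the detection rate is perfect while source accuracy sits at the two-way chance level, which is the pattern one would expect from general anomaly detection without specific sensitivity to internal-state interventions.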

In the second paradigm, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers with access only to the input match the performance of the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting in which models cannot solve the task by relying on its semantics and must instead rely on the internal representation; on this better-controlled version of the task, models perform closer to chance.
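
One way to frame the comparison between a model's in-context predictions and an input-only classifier is as a paired accuracy gap on the same items. The sketch below uses a paired bootstrap, which is a generic statistical choice and not necessarily the analysis used in this work; the function names, the resampling settings, and the toy predictions are all hypothetical.

```python
import random


def accuracy(preds, labels):
    # Fraction of items where the prediction equals the label.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def paired_bootstrap_gap(model_preds, baseline_preds, labels,
                         n_resamples=2000, seed=0):
    """Accuracy gap (model minus input-only baseline) with a 95% interval.

    Items are resampled with replacement; both systems are scored on the
    same resampled items, so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    # Per-item difference in correctness: +1, 0, or -1.
    diffs = [int(m == y) - int(b == y)
             for m, b, y in zip(model_preds, baseline_preds, labels)]
    observed = sum(diffs) / n
    gaps = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return observed, (lower, upper)


# Toy data (made up): labels derived from hidden states, the model's own
# in-context predictions, and predictions from an input-only classifier.
labels = [0, 1, 0, 1, 1, 0, 1, 0]
model_preds = [0, 1, 0, 1, 0, 0, 1, 1]
baseline_preds = [0, 1, 1, 1, 1, 0, 0, 0]

gap, (lower, upper) = paired_bootstrap_gap(model_preds, baseline_preds, labels)
print(accuracy(model_preds, labels), accuracy(baseline_preds, labels), gap)
```

An interval that contains zero is consistent with the input-only classifier doing as well as the model, though on its own it does not establish equivalence; that would call for larger samples or an explicit equivalence test.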

Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.
