Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models

Recent work suggests that Large Language Models (LLMs) are sensitive to the belief states of agents described in text, as measured by the false belief task (FBT), yet concerns about construct validity persist.

We adopt a developmental perspective, tracing the pattern of mental state reasoning behavior, and likely preconditions for that behavior, across multiple training stages in the Olmo2 and Pythia language model suites.

We find that above-chance FBT performance depends on both model size and sufficient training volume, emerges relatively late in pretraining, and is most improved by post-training interventions (SFT, DPO) in the condition most diagnostic of mentalizing (False Belief, Implicit).

However, FBT performance is fragile: consistent with past work, the use of non-factive verbs (e.g., "thinks") increases false belief attributions even in the True Belief condition.

To contextualize these findings, we track the emergence of situation modeling: the ability to report basic factual properties of a described scene.

Situation modeling accuracy generally precedes and exceeds FBT accuracy, yet situational representations also prove surprisingly incoherent in certain respects: when asked about the knowledge state of the Antagonist agent, who always knows the item’s true location, Olmo2 13b is consistently influenced by both the Target agent’s knowledge state and the presence of non-factive verbs.

Together, these results suggest that larger, sufficiently trained models build partially coherent situation models in a developmentally appropriate sequence, yet display surprising fragility, highlighting the value of developmental and stress-testing approaches for evaluating LLM capabilities.
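
As an illustration of what "above-chance" FBT performance could mean operationally, the following is a minimal sketch of a one-sided exact binomial test. It is not the analysis used in this work: the function name, the chance level of 0.5 (which assumes a two-location forced choice), and the example counts are all hypothetical.

```python
from math import comb


def binomial_p_above_chance(n_correct: int, n_trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= n_correct) if the model guessed.

    `chance` = 0.5 is an assumption (two candidate locations per FBT item).
    """
    if not 0 <= n_correct <= n_trials:
        raise ValueError("n_correct must lie between 0 and n_trials")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must lie strictly between 0 and 1")
    # Sum the binomial probability mass from n_correct up to n_trials.
    return sum(
        comb(n_trials, k) * chance**k * (1.0 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical counts for a single checkpoint and condition.
    p_value = binomial_p_above_chance(n_correct=70, n_trials=100)
    print(f"p = {p_value:.3g}")
```

Applying such a test separately to each training checkpoint and each condition (for example, False Belief versus True Belief) would give one way to locate the point in training at which accuracy first departs from guessing, although testing many checkpoints would call for a multiple-comparisons correction.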