Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning

Large language model (LLM) agents have demonstrated strong capability in sequential decision-making, yet they remain fundamentally reactive in long-horizon tasks. Unlike humans, who employ “what-if” reasoning to evaluate potential plans before committing to them, standard agents lack an internal world model to simulate future outcomes.

We therefore propose to internalize future-aware planning by training a single autoregressive model to verbalize both a prospective state rollout and a plan-conditioned success estimate, a textual analogue of the Q-value. Crucially, we identify a format-capability gap: simply fine-tuning agents on look-ahead traces during post-training leads to superficial mimicry of foresight without genuine predictive grounding.
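To make the idea of a verbalized rollout and success estimate concrete, the following minimal sketch parses one look-ahead trace into its three parts. The tag names (`<plan>`, `<rollout>`, `<success>`), the `Foresight` class, and the `parse_foresight` helper are all hypothetical, chosen only for illustration; the text above does not specify an output format.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Foresight:
    plan: str       # the candidate plan the agent is considering
    rollout: str    # verbalized prediction of future states under that plan
    success: float  # verbalized plan-conditioned success estimate in [0, 1]


# Hypothetical tag layout (an assumption, not the paper's format).
_PATTERN = re.compile(
    r"<plan>(?P<plan>.*?)</plan>\s*"
    r"<rollout>(?P<rollout>.*?)</rollout>\s*"
    r"<success>(?P<success>.*?)</success>",
    re.DOTALL,
)


def parse_foresight(text: str) -> Optional[Foresight]:
    """Return the parsed trace, or None if it is malformed."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    try:
        success = float(match.group("success").strip())
    except ValueError:
        return None
    # Reject estimates outside [0, 1]; NaN also fails this check.
    if not 0.0 <= success <= 1.0:
        return None
    return Foresight(
        plan=match.group("plan").strip(),
        rollout=match.group("rollout").strip(),
        success=success,
    )
```

A parser of this kind checks only that a trace is well formed. It cannot tell whether the rollout or the estimate is accurate, which is the distinction the format-capability gap draws.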

To bridge this gap, we introduce a three-stage training paradigm: (i) World Model Agentic Mid-Training (WM-AMT) to inject latent predictive capabilities into the policy; (ii) Format-Eliciting SFT (FE-SFT) to structure this injected capability; and (iii) Foresight-Conditioned Reinforcement Learning (FC-RL) to refine the calibration and utility of the generated simulations.
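One way to read "calibration" of a verbalized success estimate is through a proper scoring rule such as the Brier score. The sketch below is illustrative only: the text above does not state the FC-RL reward, so the `foresight_reward` function, its `calibration_weight` parameter, and the choice of the Brier score are assumptions made here for exposition.

```python
from typing import Sequence


def brier_score(predictions: Sequence[float], outcomes: Sequence[bool]) -> float:
    """Mean squared error between predicted success probabilities and
    observed binary outcomes; 0.0 is perfect, lower is better."""
    if len(predictions) != len(outcomes) or len(predictions) == 0:
        raise ValueError("predictions and outcomes must be non-empty and equal in length")
    total = sum((p - float(o)) ** 2 for p, o in zip(predictions, outcomes))
    return total / len(predictions)


def foresight_reward(
    success_estimate: float,
    task_succeeded: bool,
    calibration_weight: float = 0.5,  # hypothetical trade-off coefficient
) -> float:
    """Hypothetical per-episode reward: task outcome minus a weighted
    squared-error penalty on the verbalized success estimate."""
    outcome = 1.0 if task_succeeded else 0.0
    calibration_penalty = (success_estimate - outcome) ** 2
    return outcome - calibration_weight * calibration_penalty
```

Under this assumed shaping, an agent that succeeds while predicting failure is rewarded less than one that succeeds and predicted success, so the estimate is pushed toward the observed outcome rather than toward a fixed confident value.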

Evaluated on search and mathematical reasoning tasks, our approach consistently outperforms other training baselines. Our results demonstrate that effective internal world modeling in LLM agents requires a capability-first training pipeline to achieve grounded and calibrated foresight.