Temporal Backtracking Search for Test-time Generative Video Reasoning

Abstract: While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process.

Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis.

TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling.
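
The generate-verify-restart loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: frames are stood in for by integers, the names `generate`, `verify`, `horizon`, and `budget` are hypothetical, the verifier is assumed to return the number of leading frames it accepts, and only a single greedy branch is kept, whereas the actual prefix-based search may maintain several.

```python
from typing import Callable, List, Sequence, Tuple

Frame = int  # hypothetical stand-in for a video frame


def temporal_backtracking_search(
    generate: Callable[[Sequence[Frame], int], List[Frame]],
    verify: Callable[[Sequence[Frame]], int],
    horizon: int,
    budget: int,
) -> Tuple[List[Frame], int]:
    """Greedy single-branch sketch of a generate-verify-restart loop.

    generate(prefix, n) is assumed to return n new frames conditioned on prefix.
    verify(rollout) is assumed to return how many leading frames are valid.
    Returns the best rollout found and its number of verified frames.
    """
    prefix: List[Frame] = []
    best: List[Frame] = []
    best_valid = 0
    for _ in range(budget):
        # (1) Resume generation from a clean prefix of arbitrary length K.
        rollout = list(prefix) + generate(prefix, horizon - len(prefix))
        # (2) Localize the first failure along the temporal axis.
        valid = verify(rollout)
        if not best or valid > best_valid:
            best, best_valid = rollout, valid
        if valid == horizon:
            break  # the whole rollout passed verification
        # (3) Keep the verified frames as the restart anchor instead of
        #     resampling from the root.
        prefix = rollout[:valid]
    return best, best_valid


def toy_verify(rollout: Sequence[Frame]) -> int:
    """Toy verifier: frame t is valid only if it equals t."""
    for t, frame in enumerate(rollout):
        if frame != t:
            return t
    return len(rollout)


def toy_generate(prefix: Sequence[Frame], n: int) -> List[Frame]:
    """Toy generator: only the first two new frames of each call are correct."""
    start = len(prefix)
    return [start + i if i < 2 else -1 for i in range(n)]
```

In this toy setting every call to `toy_generate` breaks after two correct frames, so a rollout from the root never verifies beyond frame 2, while restarting from the verified prefix extends the valid region by two frames per round.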

Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where one-shot generation collapses (BoN succeeds on only 0.7% of episodes), TBS reaches a 22.7% success rate, and every solved episode stems from a restarted branch. Overall, TBS shows that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, and it provides a scalable test-time framework to unlock that competence.
