Temporal Backtracking Search for Test-time Generative Video Reasoning

Abstract: While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process.

Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis.

TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling.
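The generate-verify-restart loop can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: frames are plain integers, `toy_generate` is a hypothetical generator that resumes from any clean prefix but only stays correct for a few new frames, `toy_first_error` is a hypothetical process verifier that returns the index of the first invalid frame, and the search keeps a single greedy prefix instead of exploring several branches.

```python
from typing import Callable, List, Optional, Tuple

Frame = int  # toy stand-in for a video frame (assumption)
Generator = Callable[[List[Frame], int], List[Frame]]
Verifier = Callable[[List[Frame]], Optional[int]]


def toy_generate(prefix: List[Frame], horizon: int, span: int = 3) -> List[Frame]:
    """Hypothetical generator: resumes from a clean prefix of any length,
    but only the first `span` newly generated frames are correct."""
    frames = list(prefix)
    for t in range(len(prefix), horizon):
        frames.append(t if t < len(prefix) + span else -1)
    return frames


def toy_first_error(frames: List[Frame]) -> Optional[int]:
    """Hypothetical process verifier: index of the first wrong frame, or None.
    In this toy task the correct frame at time t is simply t."""
    for t, frame in enumerate(frames):
        if frame != t:
            return t
    return None


def best_of_n(generate: Generator, first_error: Verifier,
              horizon: int, budget: int) -> Tuple[Optional[List[Frame]], int]:
    """Root-level baseline: every attempt restarts from an empty prefix."""
    for calls in range(1, budget + 1):
        rollout = generate([], horizon)
        if first_error(rollout) is None:
            return rollout, calls
    return None, budget


def temporal_backtracking_search(generate: Generator, first_error: Verifier,
                                 horizon: int, budget: int
                                 ) -> Tuple[Optional[List[Frame]], int]:
    """Greedy single-branch sketch of a generate-verify-restart loop."""
    prefix: List[Frame] = []                 # verified restart anchor
    for calls in range(1, budget + 1):
        rollout = generate(prefix, horizon)  # resume from the clean prefix
        bad = first_error(rollout)           # localize the first failure
        if bad is None:
            return rollout, calls            # whole rollout verified
        prefix = rollout[:bad]               # keep everything before the failure
    return None, budget


if __name__ == "__main__":
    print(temporal_backtracking_search(toy_generate, toy_first_error, 7, 5))
    print(best_of_n(toy_generate, toy_first_error, 7, 5))
```

In this toy setting the generator can never complete a 7-frame rollout in one shot, so root resampling spends its whole budget without success, while restarting from the verified prefix accumulates progress across calls. A fuller search would keep several candidate prefixes and split the budget among them.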

Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where one-shot generation collapses (0.7% success rate for BoN), TBS achieves a 22.7% success rate, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, and it provides a scalable test-time framework to unlock that competence.
