CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Abstract: Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four.

We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval. All of them can therefore be expressed through one shared set of entity-centric conditioning primitives, augmented with reference images for visual entities.
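
To make the "entity acting over a temporal interval" idea concrete, the following is a minimal sketch of what such a conditioning primitive could look like as a data structure. It is not the paper's actual interface: the class name `EntityCondition`, its field names, the use of seconds as the time unit, and the helper `active_at` are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical entity kinds, mirroring the four axes named in the abstract.
KINDS = ("subject", "event", "camera", "transition")


@dataclass(frozen=True)
class EntityCondition:
    """One conditioning primitive: an entity active over [start, end] (assumed seconds)."""
    kind: str                               # one of KINDS
    text: str                               # free-form description of the entity
    start: float                            # interval start
    end: float                              # interval end
    reference_image: Optional[str] = None   # path for visual entities (assumed field)

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown kind: {self.kind}")
        if self.start > self.end:
            raise ValueError("start must not exceed end")


def active_at(conditions: List[EntityCondition], t: float) -> List[EntityCondition]:
    """Return the conditions whose interval contains time t."""
    return [c for c in conditions if c.start <= t <= c.end]


# Illustrative scene: a subject present throughout, one event, one camera move, one cut.
scene = [
    EntityCondition("subject", "a woman in a red coat", 0.0, 8.0, "woman.png"),
    EntityCondition("event", "she opens the door", 2.0, 3.5),
    EntityCondition("camera", "slow dolly in", 0.0, 4.0),
    EntityCondition("transition", "hard cut", 4.0, 4.0),
]
```

Under this reading, subjects, events, camera moves, and cuts differ only in their `kind` and in whether they carry a reference image, which is what lets a single conditioning pathway handle all four.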

This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region.
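
The abstract does not specify the sampling rule or how channels are divided between the entity and temporal axes, so the sketch below is only one plausible reading. It assumes that (a) the tokens of an event receive temporal positions spread evenly over that event's interval, so short and long events cover their intervals with the same number of positions, and (b) the 2D variant rotates the first half of the channels by an entity index and the second half by a temporal position. The function names `rope_rotate`, `interval_positions`, and `rope_2d` are hypothetical; only the pairwise rotation itself is standard RoPE.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Standard RoPE: rotate consecutive channel pairs by angles proportional to position."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def interval_positions(start, end, n_tokens):
    """Assumed sampling rule: spread n_tokens positions evenly over [start, end]."""
    if n_tokens == 1:
        return [(start + end) / 2]
    step = (end - start) / (n_tokens - 1)
    return [start + k * step for k in range(n_tokens)]


def rope_2d(vec, entity_index, time_position):
    """Assumed split: first half of channels encodes the entity, second half the time."""
    half = len(vec) // 2
    return rope_rotate(vec[:half], entity_index) + rope_rotate(vec[half:], time_position)


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))
```

Because every rotation is a fixed function of position, the scheme adds no learned parameters, which is consistent with the "parameter-free" description. In this sketch, the attention logit between a query and a key depends only on the difference of their positions along each axis, so a condition tagged with a given entity index and time aligns most naturally with video tokens carrying matching coordinates.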

On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.
