StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video
Abstract: Proactive streaming video understanding requires models to continuously process video streams and decide when to respond, rather than merely what to respond. This naturally introduces a decision-making problem under partial observations, where models must balance early prediction against sufficient evidence. However, existing benchmarks largely follow a “see-then-answer” paradigm, where responses are triggered only after explicit evidence appears, effectively reducing proactive reasoning to delayed perception. As a result, they fail to evaluate a model’s ability to make timely and reliable decisions under incomplete observations.
Moreover, training proactive models is inherently challenging due to the extreme imbalance between silence and response signals in streaming trajectories, as well as the need to jointly optimize response correctness and timing. To address these challenges, we introduce StreamPro-Bench, a new benchmark that evaluates streaming models from three complementary perspectives: Perception Understanding, Temporal Reasoning, and Proactive Agency, where the last measures a model’s ability to make early yet reliable decisions under partial observations.
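To make the silence/response imbalance concrete, the following minimal sketch counts the two kinds of turns in a toy trajectory and derives inverse-frequency class weights for a weighted cross-entropy. The trajectory, the labels `silence` and `respond`, and both helper functions are hypothetical illustrations. This is a generic reweighting technique chosen for the example, not the loss proposed in this work.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Return per-class weights n / (k * count_c), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true label); probs is a list of per-turn dicts."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm

# Hypothetical trajectory: 18 silent turns followed by 2 response turns.
labels = ["silence"] * 18 + ["respond"] * 2
weights = inverse_frequency_weights(labels)

# A degenerate model that leans towards silence on every turn.
probs = [{"silence": 0.9, "respond": 0.1}] * len(labels)
uniform = {"silence": 1.0, "respond": 1.0}

plain_loss = weighted_cross_entropy(probs, labels, uniform)
balanced_loss = weighted_cross_entropy(probs, labels, weights)
```

Under uniform weights, the always-silent model is penalized only lightly because the rare response turns contribute little to the average. With inverse-frequency weights, the two classes contribute equally, so the same model incurs a noticeably higher loss.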
We further propose StreamPro, a two-stage training framework for proactive learning. First, we introduce CB-Stream Loss to mitigate the severe supervision imbalance during supervised fine-tuning (SFT). Then, we apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that combines turn-level and trajectory-level rewards. Experiments show that StreamPro substantially improves proactive performance: on StreamPro-Bench it achieves 41.5, well above the previous best of 10.4, while maintaining strong performance on real-time streaming benchmarks, with 78.9 on StreamingBench-RTVU.
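The second stage can be illustrated with a minimal sketch of the group-relative advantage used in GRPO, where each rollout's reward is standardized against the other rollouts sampled for the same input. The way turn-level and trajectory-level rewards are blended here (a weighted sum with a hypothetical `turn_weight`), the function names, and the toy rollouts are all assumptions made for illustration; the abstract does not specify the actual reward design.

```python
import statistics

def combined_reward(turn_rewards, trajectory_reward, turn_weight=0.5):
    """Blend the mean turn-level reward with a trajectory-level reward.

    The weighted-sum form and turn_weight are hypothetical choices.
    """
    turn_part = sum(turn_rewards) / len(turn_rewards)
    return turn_weight * turn_part + (1.0 - turn_weight) * trajectory_reward

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage).

    Uses the population standard deviation; implementations may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 4 rollouts: (per-turn rewards, trajectory reward).
group = [
    ([1.0, 1.0, 0.0], 1.0),
    ([0.0, 1.0, 0.0], 0.0),
    ([1.0, 1.0, 1.0], 1.0),
    ([0.0, 0.0, 0.0], 0.0),
]
rewards = [combined_reward(turns, traj) for turns, traj in group]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value model is needed to form a baseline.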