TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation

Abstract: Unsupervised video object-centric learning aims to decompose dynamic scenes into temporally persistent entity representations. Existing recurrent video slot-attention methods propagate a fixed set of slots across frames, but they typically do so unconditionally: every slot is updated and decoded at every frame, regardless of whether its corresponding object is visible.

We show that this design violates a basic lifecycle requirement for persistent slots: when an object is absent or fully occluded, its slot should preserve its previous state and avoid explaining unrelated visible content. Instead, unconditional propagation creates two failure pathways: update-induced state drift, where current-frame evidence overwrites the absent object’s representation, and decoder-induced reconstruction interference, where the inactive slot remains coupled to reconstruction through decoder attention.

We propose Temporal Slot Activation (TSA), a mechanism that learns a per-slot, per-frame activation score $\alpha_{k,t} \in (0, 1)$ without visibility supervision. TSA uses this activation as a shared latent control variable for slot lifecycle modeling. When a slot is inactive, TSA anchors its state to the previous state via activation-gated updating and suppresses its decoder participation through an activation-dependent additive bias on attention logits before softmax normalization. This jointly reduces state drift and reconstruction-driven interference.
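
The abstract does not state the exact functional forms of the two controls, so the following is only a minimal NumPy sketch under stated assumptions: the gated update is taken to be a convex combination of the previous and candidate slot states, and the decoder bias is taken to be $\log \alpha_{k,t}$ added to the attention logits before a softmax over slots. The function names, the `eps` constant, and both functional forms are illustrative choices, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_slot_update(prev_slots, cand_slots, alpha):
    """Activation-gated update (assumed form: convex combination).

    prev_slots, cand_slots: arrays of shape (K, D)
    alpha: array of shape (K,), activation scores in (0, 1)

    With alpha near 0 the slot stays anchored to its previous state;
    with alpha near 1 it takes the candidate state from the current frame.
    """
    a = alpha[:, None]
    return a * cand_slots + (1.0 - a) * prev_slots


def biased_decoder_attention(logits, alpha, eps=1e-6):
    """Attention over slots with an activation-dependent additive bias.

    logits: array of shape (N, K), one score per position and slot
    alpha: array of shape (K,), activation scores in (0, 1)

    Assumed bias: log(alpha + eps), added before the softmax over slots,
    so an inactive slot receives close to zero attention weight.
    """
    bias = np.log(alpha + eps)
    return softmax(logits + bias[None, :], axis=1)
```

In this sketch the same `alpha` vector feeds both functions, which mirrors the role of the activation as a shared control variable: a slot with a low score is both held near its previous state and given little weight in the decoder's attention over slots.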

To improve decisions under partial occlusion and gradual reappearance, TSA further conditions activation prediction on a per-slot temporal memory produced by a Temporal Context Encoder. We evaluate TSA on the MOVi-C/E, YT-VIS, and OVIS benchmarks using both standard decomposition metrics (FG-ARI, mBO) and tracking-based metrics (IDF1, HOTA). TSA consistently improves object decomposition and temporal identity preservation, with particularly large gains on long, heavily occluded videos.
