Segmenting, Fast and Slow: Real-Time Open-Vocabulary Video Instance Segmentation with Dual-Path Processing
Abstract: Object-centric models inspired by DETR have become the dominant paradigm for open-vocabulary video instance segmentation (OV-VIS). While recent efforts have reduced the computational cost of pixel decoding, textual modality fusion, and object decoding to make these architectures more suitable for mobile devices, real-time on-device inference at high frame rates remains an open challenge.
In this paper, we introduce SegFS, a dual-stream fast-slow framework that significantly improves efficiency without sacrificing accuracy. On sparse keyframes, an open-vocabulary object-based model predicts instance-level representations. These representations are then projected back into the backbone feature space to condition a lightweight fast network, which efficiently relocalizes and segments the instances in subsequent frames.
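The keyframe/non-keyframe split described above can be illustrated with a minimal scheduling sketch. This is not the SegFS implementation: the function name `run_dual_path`, the callables `slow_model` and `fast_model`, and the fixed `keyframe_interval` are all hypothetical, and the abstract does not say how keyframes are actually selected. The sketch only shows the control flow in which a slow model refreshes an instance state on keyframes and a fast model reuses that state in between.

```python
from typing import Any, Callable, Iterable, List, Tuple


def run_dual_path(
    frames: Iterable[Any],
    slow_model: Callable[[Any], Tuple[Any, Any]],
    fast_model: Callable[[Any, Any], Any],
    keyframe_interval: int,
) -> List[Any]:
    """Run a slow model on keyframes and a fast model on all other frames.

    Assumptions (not taken from the paper):
      * keyframes occur at a fixed interval,
      * slow_model(frame) returns (instance_state, masks),
      * fast_model(frame, instance_state) returns masks.
    """
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be >= 1")

    outputs: List[Any] = []
    state: Any = None
    for index, frame in enumerate(frames):
        if index % keyframe_interval == 0:
            # Slow path: refresh the instance-level state on a keyframe.
            state, masks = slow_model(frame)
        else:
            # Fast path: reuse the most recent keyframe state as conditioning.
            masks = fast_model(frame, state)
        outputs.append(masks)
    return outputs


# Toy stand-ins so the control flow can be run without any real network.
def toy_slow(frame: int) -> Tuple[dict, str]:
    return {"keyframe": frame}, f"slow:{frame}"


def toy_fast(frame: int, state: dict) -> str:
    return f"fast:{frame}<-{state['keyframe']}"


result = run_dual_path(range(5), toy_slow, toy_fast, keyframe_interval=3)
print(result)
```

In a real system the state would be the instance-level representations projected into the backbone feature space, and the fast model would be the lightweight network conditioned on them. The toy callables here only tag each frame with the path that processed it.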
By shifting instance propagation from object decoding to feature-space conditioning, our approach decouples multimodal semantic understanding from dense mask prediction and enables efficient temporal propagation. The proposed fast branch achieves up to 14x lower latency than the mobile-oriented MOBIUS model, while maintaining competitive segmentation performance on standard OV-VIS benchmarks.
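To make the efficiency argument concrete, the following sketch computes the average per-frame latency of a keyframe-based pipeline. It assumes that the slow and fast paths run sequentially on one device and that keyframes occur at a fixed interval. The abstract states neither, and the function name `amortized_latency_ms` and the example timings are purely illustrative placeholders, not measurements from the paper.

```python
def amortized_latency_ms(slow_ms: float, fast_ms: float, keyframe_interval: int) -> float:
    """Average per-frame latency when one frame in every `keyframe_interval`
    uses the slow path and the remaining frames use the fast path.

    Assumes sequential execution; all inputs are hypothetical placeholders.
    """
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be >= 1")
    total_ms = slow_ms + (keyframe_interval - 1) * fast_ms
    return total_ms / keyframe_interval


# Made-up timings chosen only to illustrate the arithmetic.
example = amortized_latency_ms(slow_ms=100.0, fast_ms=10.0, keyframe_interval=5)
print(example)
```

Under these assumptions the average cost approaches the fast-path latency as the keyframe interval grows, which is why a cheap fast branch dominates the overall frame rate when keyframes are sparse.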