GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning

Abstract: Despite significant progress in agentic long-video understanding, existing methods still lack fine-grained motion comprehension and an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that, for the first time, integrates the video codec into the video understanding framework through a carefully designed motion agent trained on Groups of Pictures (GOPs) from the video codec.

We further develop a GOP tree reasoning algorithm that is naturally aligned with the video codec and enhances the model's ability to understand fine-grained local motion in videos. Additionally, we design a structural memory mechanism that integrates local motion information with detailed captions in structural pages, and we propose an efficient coarse-to-fine zoom-in algorithm to fully exploit this structural memory.
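
As a rough illustration of what a coarse-to-fine zoom-in over tree-structured pages could look like, the following Python sketch descends a small tree of captioned time spans and keeps the best-matching child at each level. It is not the GOPAgen implementation: the `Page` class, the word-overlap `score` function, the `beam` parameter, and the example captions are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical memory page: a time span with a caption and a motion note."""
    start: float
    end: float
    caption: str
    motion: str = ""
    children: list["Page"] = field(default_factory=list)


def score(page: Page, query: str) -> int:
    """Toy relevance (an assumption): count query words present in the page text."""
    words = f"{page.caption} {page.motion}".lower().split()
    return sum(word in words for word in query.lower().split())


def zoom_in(root: Page, query: str, beam: int = 1) -> list[Page]:
    """Descend coarse-to-fine, keeping the `beam` best-scoring pages per level."""
    frontier = [root]
    while any(p.children for p in frontier):
        candidates = []
        for p in frontier:
            # Expand inner nodes; carry leaves forward unchanged.
            candidates.extend(p.children if p.children else [p])
        candidates.sort(key=lambda p: score(p, query), reverse=True)
        frontier = candidates[:beam]
    return frontier


# Made-up example tree: one video, two coarse segments, four fine segments.
root = Page(0, 8, "a person in a kitchen", children=[
    Page(0, 4, "person chops vegetables", "hand moves down repeatedly", children=[
        Page(0, 2, "person picks up knife", "hand moves right"),
        Page(2, 4, "person chops carrot", "hand moves down"),
    ]),
    Page(4, 8, "person washes dishes", "hand moves in circles", children=[
        Page(4, 6, "person turns on tap", "hand moves up"),
        Page(6, 8, "person scrubs plate", "hand moves in circles"),
    ]),
])

hits = zoom_in(root, "scrubs plate circles")
```

In this toy setup only one branch is expanded per level, so the number of pages inspected grows with the tree depth rather than with the total number of leaves. A real system would presumably replace the word-overlap score with a learned or model-based relevance judgment.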

Furthermore, we incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and EgoSchema, demonstrating the effectiveness of the proposed framework.
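
To make the idea of retrieving motion vectors at different granularities concrete, here is a minimal Python sketch using an in-memory SQLite table. The schema (`gop_idx`, `frame_idx`, block coordinates, `dx`, `dy`), the `mean_motion` helper, and the sample rows are assumptions made for illustration only; the abstract does not specify how the actual database is organized.

```python
import sqlite3

# Hypothetical schema: one row per macroblock motion vector.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mv ("
    "gop_idx INTEGER, frame_idx INTEGER, "
    "block_x INTEGER, block_y INTEGER, dx REAL, dy REAL)"
)

# Made-up sample data: (gop_idx, frame_idx, block_x, block_y, dx, dy).
rows = [
    (0, 1, 0, 0, 2.0, 0.0),
    (0, 1, 1, 0, 4.0, 0.0),
    (0, 2, 0, 0, 0.0, 2.0),
    (1, 13, 0, 0, -1.0, 0.0),
]
conn.executemany("INSERT INTO mv VALUES (?, ?, ?, ?, ?, ?)", rows)


def mean_motion(conn, level, key):
    """Average (dx, dy) over all vectors of one frame or one GOP."""
    if level not in ("gop_idx", "frame_idx"):
        # Whitelist the column name, since it is interpolated into the SQL text.
        raise ValueError(f"unsupported granularity: {level}")
    query = f"SELECT AVG(dx), AVG(dy) FROM mv WHERE {level} = ?"
    return conn.execute(query, (key,)).fetchone()


frame_level = mean_motion(conn, "frame_idx", 1)  # fine granularity
gop_level = mean_motion(conn, "gop_idx", 0)      # coarse granularity
```

The same table answers both queries, so the granularity is selected at query time instead of being fixed when the vectors are stored. Indexing `gop_idx` and `frame_idx` would be the natural next step for longer videos.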
