Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs
Abstract: Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling.
We propose Fre-Res, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens.
Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens.
Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy-efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length.
Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.
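To make the temporal 1D-DCT step concrete, the following is a minimal sketch and not the authors' implementation. It assumes that latents arrive as an array of shape (T, N, D) (frames, tokens per frame, channels), that the first frame serves as the single spatial anchor, and that the inter-frame residual is the difference between consecutive frames; the abstract specifies none of these. The function names `dct_matrix`, `compress_residuals`, and `reconstruct` are hypothetical, and the Spatial-Guided Absorber is not modeled.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis of shape (n, n); row k is frequency k."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] *= 1.0 / np.sqrt(2.0)  # scale the DC row so the basis is orthonormal
    return basis * np.sqrt(2.0 / n)


def compress_residuals(latents: np.ndarray, keep: int):
    """Split (T, N, D) latents into one anchor frame and `keep` low-frequency
    temporal DCT coefficients of the inter-frame residuals.

    Assumption: residuals are consecutive-frame differences and the anchor is
    frame 0; the paper may define both differently.
    """
    anchor = latents[0]                               # (N, D)
    residuals = latents[1:] - latents[:-1]            # (T-1, N, D)
    basis = dct_matrix(residuals.shape[0])
    coeffs = np.tensordot(basis, residuals, axes=(1, 0))  # DCT along time
    return anchor, coeffs[:keep]                      # keep the lowest frequencies


def reconstruct(anchor: np.ndarray, coeffs: np.ndarray, num_frames: int) -> np.ndarray:
    """Invert the sketch above: zero-pad the dropped frequencies, apply the
    inverse DCT, and accumulate residuals starting from the anchor."""
    n = num_frames - 1
    padded = np.zeros((n,) + coeffs.shape[1:])
    padded[: coeffs.shape[0]] = coeffs
    residuals = np.tensordot(dct_matrix(n).T, padded, axes=(1, 0))
    later_frames = anchor[None] + np.cumsum(residuals, axis=0)
    return np.concatenate([anchor[None], later_frames], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = np.cumsum(rng.normal(size=(16, 4, 8)), axis=0)  # toy latent trajectory
    anchor, coeffs = compress_residuals(frames, keep=4)
    approx = reconstruct(anchor, coeffs, num_frames=16)
    print("kept coefficient shape:", coeffs.shape)
    print("mean abs reconstruction error:", float(np.abs(frames - approx).mean()))
```

When the residual trajectories are smooth in time, most of their energy falls in the first few DCT coefficients, so truncating to `keep` coefficients discards little. That is the low-frequency concentration the abstract reports observing in vision-latent space.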