OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

Abstract: Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet long-video inference is fundamentally limited because the number of video tokens, and with it the key-value (KV) cache, grows linearly with video length.
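To make the linear growth concrete, the standard KV-cache size formula can be sketched as below. The model configuration (28 layers, 4 KV heads, head dimension 128, 16-bit values) and the token rate are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical configuration, for illustration only.
CONFIG = dict(num_layers=28, num_kv_heads=4, head_dim=128)

per_token = kv_cache_bytes(1, **CONFIG)  # 57344 bytes, i.e. 56 KiB per token

# Assumed token rate: 100 audio-visual tokens per second of video.
for minutes in (1, 10, 60):
    tokens = minutes * 60 * 100
    gib = kv_cache_bytes(tokens, **CONFIG) / 2**30
    print(f"{minutes:>3} min -> {tokens:>7} tokens -> {gib:.2f} GiB")
```

Doubling the video length doubles the cache, which is why a fixed memory budget requires discarding or merging KV states as the stream grows.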

We present OmniMem, a memory-efficient streaming framework designed specifically for audio-visual LLMs. Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that manages visual and audio contexts separately, addressing the severe token imbalance between the two modalities.
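One way to picture modality-aware allocation is a budget split that reserves a share of the memory for audio instead of letting both modalities compete in one pool. The sketch below is an assumption made for illustration: the function `split_budget` and its `audio_share` parameter are hypothetical, and the abstract does not specify how OmniMem actually divides its budget.

```python
def split_budget(total_budget, n_visual, n_audio, audio_share=0.25):
    """Return (visual_budget, audio_budget) in tokens.

    `audio_share` is a hypothetical knob: the fraction of the total budget
    reserved for audio regardless of how many visual tokens exist.
    """
    if not 0.0 <= audio_share <= 1.0:
        raise ValueError("audio_share must lie in [0, 1]")
    # Reserve the audio share first, capped by the audio tokens available.
    audio_budget = min(n_audio, int(total_budget * audio_share))
    # Visual tokens take what remains, capped by the visual tokens available.
    visual_budget = min(n_visual, total_budget - audio_budget)
    # Hand any slots the visual stream cannot fill back to audio.
    audio_budget = min(n_audio, total_budget - visual_budget)
    return visual_budget, audio_budget


# 9000 visual tokens vs. 500 audio tokens, 1000 slots of memory.
print(split_budget(1000, n_visual=9000, n_audio=500))  # (750, 250)
```

With a single pooled budget shared in proportion to token counts, audio would receive only about 5% of the slots in this example (500 of 9500 tokens), which illustrates the imbalance that a separate per-modality budget avoids.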

OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, keeping memory compact without sacrificing long-range understanding. To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into the retained memory.
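The abstract does not define the perturbation criterion, so the following is only a generic illustration of the idea of scoring memory by perturbation: each cached entry is scored by how much a single query's attention output changes when that entry is dropped, and the highest-scoring entries are kept. The leave-one-out scoring, the single-query setting, and all function names are assumptions, not the paper's method.

```python
import math


def attention_output(query, keys, values):
    """Scaled dot-product attention for one query over lists of vectors."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def perturbation_scores(query, keys, values):
    """Leave-one-out score: output change when entry i is removed."""
    if len(keys) < 2:
        raise ValueError("need at least two cached entries")
    full = attention_output(query, keys, values)
    scores = []
    for i in range(len(keys)):
        reduced = attention_output(query, keys[:i] + keys[i + 1:], values[:i] + values[i + 1:])
        scores.append(math.dist(full, reduced))
    return scores


def select_memory(query, keys, values, budget):
    """Indices of the `budget` entries whose removal perturbs the output most."""
    if budget >= len(keys):
        return list(range(len(keys)))
    scores = perturbation_scores(query, keys, values)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # restore temporal order
```

This naive version recomputes attention once per entry, so it costs quadratic time in the cache size, and it does not by itself account for redundancy among the retained entries; a practical system would need a cheaper and redundancy-aware estimate.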

Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that, under the same memory budgets, OmniMem consistently outperforms strong training-free compression baselines by 2-4% absolute accuracy, with an additional 1-2% gain after fine-tuning.