How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

Abstract: Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget?

Existing training-free pruning methods typically answer this question with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales.

We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed visual token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions.

It requires no model training and no extra LLM forward pass, and it preserves the original multimodal prompting and decoding pipeline.
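
To make the four allocation stages concrete, the following is a minimal NumPy sketch of budgeted token selection on a visual grid. It is an illustration under stated assumptions, not the F^3A implementation: the function name `prune_visual_tokens`, the parameters `cell`, `coarse_frac`, and `recover_frac`, and the use of cosine similarity in place of the frozen sparse sensing heads are all hypothetical choices made for this example.

```python
import numpy as np


def prune_visual_tokens(cues, tokens, grid_hw, budget,
                        cell=2, coarse_frac=0.5, recover_frac=0.25):
    """Pick `budget` visual-token indices to keep (hypothetical sketch).

    cues:   (Q, D) question-conditioned cue embeddings
    tokens: (H*W, D) visual-grid token embeddings in row-major order
    cell, coarse_frac, recover_frac: illustrative knobs, not from the paper
    """
    H, W = grid_hw
    n = H * W
    assert tokens.shape[0] == n and 0 < budget <= n

    # Relevance: max cosine similarity to any cue. This is an assumed
    # stand-in for the frozen sparse sensing heads.
    c = cues / (np.linalg.norm(cues, axis=1, keepdims=True) + 1e-8)
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    score = (t @ c.T).max(axis=1)

    # Assign every token to a coarse cell of size cell x cell.
    rows, cols = np.divmod(np.arange(n), W)
    cells_per_row = -(-W // cell)  # ceiling division
    cell_id = (rows // cell) * cells_per_row + cols // cell
    n_cells = int(cell_id.max()) + 1

    # Coarse evidence localization: rank cells by mean relevance.
    cell_score = (np.bincount(cell_id, weights=score, minlength=n_cells)
                  / np.bincount(cell_id, minlength=n_cells))
    cell_rank = np.argsort(-cell_score, kind="stable")
    k_cells = max(1, int(n_cells * coarse_frac))
    is_top = np.zeros(n_cells, dtype=bool)
    is_top[cell_rank[:k_cells]] = True

    # Local refinement with coverage-preserving competition: take the
    # best tokens inside the top cells, capped by a per-cell quota so
    # that no single region absorbs the whole budget.
    main = budget - int(budget * recover_frac)
    quota = -(-main // k_cells)
    keep = np.zeros(n, dtype=bool)
    used = np.zeros(n_cells, dtype=int)
    token_rank = np.argsort(-score, kind="stable")
    for i in token_rank:
        if keep.sum() >= main:
            break
        if is_top[cell_id[i]] and used[cell_id[i]] < quota:
            keep[i] = True
            used[cell_id[i]] += 1

    # Recovery: give each still-uncovered cell its best token, visiting
    # cells in order of coarse score until the budget is spent.
    for cid in cell_rank:
        if keep.sum() >= budget:
            break
        if used[cid] == 0:
            members = np.flatnonzero(cell_id == cid)
            keep[members[np.argmax(score[members])]] = True
            used[cid] += 1

    # Spend any leftover budget on the best remaining tokens.
    for i in token_rank:
        if keep.sum() >= budget:
            break
        keep[i] = True

    # Ascending indices preserve the original token order.
    return np.flatnonzero(keep)


# Toy 6x6 grid: background tokens are orthogonal to the cue,
# while tokens 7 and 8 match it.
tokens = np.tile([1.0, 0.0, 0.0, 0.0], (36, 1))
tokens[[7, 8]] = [0.0, 1.0, 0.0, 0.0]
cues = np.array([[0.0, 1.0, 0.0, 0.0]])
kept = prune_visual_tokens(cues, tokens, (6, 6), budget=9)
pruned = tokens[kept]  # the shortened sequence the language model would receive
```

In this sketch the selection runs entirely on embeddings that exist before decoding, which mirrors the claim that no extra LLM forward pass is needed. The per-cell quota and the reserved recovery share are one simple way to trade evidence concentration against spatial coverage; the actual allocation rules in F^3A may differ.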