From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Abstract: Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood.

In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how they route, utilize, and integrate audio and visual information across two input configurations: audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for Vision-Language Models (VLMs) and Video Large Language Models (VideoLLMs), with audio and visual contributions flowing along this pathway in proportion to the task’s reliance on each modality.

In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information has been transferred to the LLM, with minimal impact on the model’s prediction or even a slight improvement. This result generalizes across multiple tasks and datasets and enables more efficient inference.
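
To make the efficiency argument concrete, here is a minimal sketch of discarding audio-visual tokens after a chosen layer. Everything in it is an illustrative assumption rather than the authors' method: the token-type labels, the `drop_after_layer` cutoff, the helper names, and the simplification that self-attention cost grows with the square of the sequence length.

```python
from typing import List, Sequence

# Hypothetical token-type labels (assumption: not taken from the paper).
AUDIO, VISUAL, TEXT = "audio", "visual", "text"


def kept_positions(
    token_types: Sequence[str],
    layer: int,
    drop_after_layer: int,
    droppable: Sequence[str] = (AUDIO, VISUAL),
) -> List[int]:
    """Return the indices of tokens still processed at `layer`.

    Assumption: droppable tokens are kept up to and including
    `drop_after_layer`, then removed from all later layers.
    """
    if layer <= drop_after_layer:
        return list(range(len(token_types)))
    return [i for i, t in enumerate(token_types) if t not in droppable]


def relative_attention_cost(
    token_types: Sequence[str],
    num_layers: int,
    drop_after_layer: int,
) -> float:
    """Ratio of pruned to unpruned attention cost, assuming cost ~ n^2 per layer."""
    full = num_layers * len(token_types) ** 2
    pruned = sum(
        len(kept_positions(token_types, layer, drop_after_layer)) ** 2
        for layer in range(num_layers)
    )
    return pruned / full


if __name__ == "__main__":
    # Toy sequence: 4 audio, 4 visual, and 2 text tokens in a 4-layer model.
    tokens = [AUDIO] * 4 + [VISUAL] * 4 + [TEXT] * 2
    print(kept_positions(tokens, layer=3, drop_after_layer=1))  # [8, 9]
    print(relative_attention_cost(tokens, num_layers=4, drop_after_layer=1))
```

In this toy setting the two text tokens are the only ones processed after layer 1, so the later layers attend over a much shorter sequence. Which layer is a safe cutoff for a real model is an empirical question that this sketch does not answer.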

These findings hold across multiple models and scales (Qwen2.5-Omni and Video-SALMONN2 Plus, each at 3B and 7B), and they lead to hypotheses about why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.
