Continuous Audio Thinking for Large Audio Language Models

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and becomes difficult to leverage in the response.

We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded through distillation from audio experts. When generating its response, the model can draw on the rich acoustic information that expert distillation provides within this thinking space. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT incurs no additional autoregressive decoding cost over the baseline.
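
To make the cost argument concrete, the following is a minimal, illustrative sketch rather than the CoAT implementation. All names (`PrefillLayout`, `build_layout`, `decode_steps`, `distillation_loss`) are hypothetical, as are the ordering of audio, thinking, and prompt positions and the use of mean squared error as the distillation objective. It only shows that continuous thinking slots placed in the prefill sequence lengthen the prefill, not the number of autoregressive decoding steps, and that states at those slots can be compared against expert features.

```python
# Illustrative sketch only: every name, shape, and design choice here is an
# assumption made for exposition, not the method described in the paper.
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class PrefillLayout:
    """Positions occupied by each segment of the single prefill sequence."""
    audio: range
    thinking: range
    prompt: range

    @property
    def length(self) -> int:
        # Total number of positions processed in the one prefill pass.
        return self.prompt.stop


def build_layout(n_audio: int, n_thinking: int, n_prompt: int) -> PrefillLayout:
    """Lay out audio, thinking slots, then the text prompt (assumed ordering)."""
    a_end = n_audio
    t_end = a_end + n_thinking
    return PrefillLayout(
        audio=range(0, a_end),
        thinking=range(a_end, t_end),
        prompt=range(t_end, t_end + n_prompt),
    )


def decode_steps(n_response_tokens: int) -> int:
    """Autoregressive steps depend only on the response length, because the
    thinking slots are consumed during prefill rather than generated."""
    return n_response_tokens


def distillation_loss(thinking_states: List[Vector],
                      expert_features: List[Vector]) -> float:
    """Mean squared error between thinking-position states and expert targets.
    MSE and the one-target-per-slot pairing are assumed choices."""
    if len(thinking_states) != len(expert_features):
        raise ValueError("expected one expert target per thinking position")
    total, count = 0.0, 0
    for state, target in zip(thinking_states, expert_features):
        if len(state) != len(target):
            raise ValueError("state and target dimensions must match")
        for s, t in zip(state, target):
            total += (s - t) ** 2
            count += 1
    return total / count if count else 0.0
```

Under these assumptions, `build_layout(4, 2, 3)` yields a nine-position prefill whose thinking slots sit at positions 4 and 5, while `decode_steps` is unchanged by the number of thinking slots.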

Across three LALMs (Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3), CoAT yields performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription, demonstrating its effectiveness. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model’s textual responses.
