EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

Abstract: Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations.

To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill uses a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These sampled trajectories are then optimized via group-relative policy optimization (GRPO), with token-level consistency with the teacher serving as a reward bonus.
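
The following is a minimal sketch of how a teacher-consistency bonus could enter GRPO's group-relative advantage. It rests on assumptions the abstract does not confirm: consistency is measured here as the mean probability that the clean-audio teacher assigns to the student's sampled tokens, and the bonus is added to a task reward with a hypothetical weight `bonus_weight`. All function names are illustrative and are not taken from the EchoDistill code.

```python
import math
from statistics import mean, pstdev


def consistency_bonus(teacher_token_logprobs):
    """Assumed consistency measure: mean probability the frozen clean-audio
    teacher assigns to each token of one student-sampled response."""
    if not teacher_token_logprobs:
        return 0.0
    return mean(math.exp(lp) for lp in teacher_token_logprobs)


def shaped_rewards(task_rewards, teacher_logprobs_per_response, bonus_weight=0.5):
    """Add the (hypothetical) weighted consistency bonus to each task reward."""
    return [
        reward + bonus_weight * consistency_bonus(logprobs)
        for reward, logprobs in zip(task_rewards, teacher_logprobs_per_response)
    ]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One group of four responses sampled by the student for the same noisy input.
    task_rewards = [1.0, 0.0, 1.0, 0.0]  # e.g. 1.0 if the final answer is correct
    teacher_logprobs = [
        [math.log(0.9), math.log(0.8)],
        [math.log(0.2), math.log(0.1)],
        [math.log(0.5), math.log(0.5)],
        [math.log(0.6), math.log(0.7)],
    ]
    rewards = shaped_rewards(task_rewards, teacher_logprobs)
    print(group_relative_advantages(rewards))
```

Under these assumptions, two responses with the same task reward receive different advantages when the teacher finds one of them more plausible given the clean audio, which is the role the abstract attributes to the reward bonus.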

By aligning the noisy student’s candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs.

Extensive experiments show that: (I) Compared with the strongest baseline, EchoDistill achieves an average improvement of 4.18%↑ in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by an average of 3.02%↑ in Acc, 3.89%↑ in Noisy, and 4.53%↑ in GSR. Our code is available at this https URL.
