Raon-Speech Technical Report

Abstract: We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, question answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural real-time conversation.

Raon-Speech converts a pre-trained LLM into a SpeechLM that both understands and generates speech while preserving strong text capabilities. It is trained on 1.38M hours of highly curated English and Korean speech and text data in three stages: (1) speech module alignment, (2) end-to-end SpeechLM pre-training with knowledge distillation, and (3) post-training based on multi-task preference optimization.

Across 42 English and Korean speech and text benchmarks, Raon-Speech establishes the strongest overall profile on speech-centric tasks in our comparison against eight similarly sized recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Chat, while preserving strong text question answering performance.

Building on Raon-Speech, Raon-SpeechChat enables natural full-duplex conversation through continued training on 119K hours of time-aligned real and synthetic dialogue data. Training proceeds through three complementary stages: (1) causal encoder adaptation, (2) full-duplex pre-training, and (3) full-duplex fine-tuning for voice and role control.

On multiple full-duplex benchmarks, Raon-SpeechChat shows its clearest strengths on the turn-taking and interruption-sensitive behaviors covered by FDB v1.0, and remains competitive across the broader full-duplex evaluation suite. We open-source all model checkpoints, the training and inference pipeline, and an interactive demo.
