Low-Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars

低资源环境下尼泊尔语口语到情感化手语虚拟形象的多模态翻译

Abstract: Sign language communication systems that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a proof-of-concept multimodal framework that demonstrates the feasibility of generating emotion-conditioned Nepali Sign Language avatars from spoken input.

摘要： 集成情感表达的手语交流系统目前研究尚不充分，尤其是在低资源语言领域。本项试点研究提出了 NEST-V1（尼泊尔语情感与语音 Transformer 第一版），这是一个概念验证型的多模态框架，展示了通过口语输入生成情感化尼泊尔手语虚拟形象的可行性。

As a preliminary investigation, we focus on four common Nepali words (“thank you”, “hello”, “house”, “me”) across three emotional states (happy, neutral, sad) to validate our core technical approach. Our lightweight architecture employs a shared acoustic encoder for simultaneous Automatic Speech Recognition and emotion classification, achieving 81.1% ASR accuracy and 79.21% emotion recognition accuracy on a dataset of 600 labeled audio samples from 50 speakers.

作为初步探索，我们选取了四个常见的尼泊尔语单词（“谢谢”、“你好”、“房子”、“我”）以及三种情感状态（快乐、中性、悲伤），以验证我们的核心技术方案。我们的轻量化架构采用共享声学编码器，可同时进行自动语音识别（ASR）和情感分类；在包含 50 名说话人、600 个标注音频样本的数据集上，该系统实现了 81.1% 的 ASR 准确率和 79.21% 的情感识别准确率。
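
The shared-encoder idea described above can be illustrated with a toy sketch: one encoder produces a single representation per utterance, and two small heads read it, one for the word and one for the emotion. Everything beyond the word and emotion labels is an assumption made for illustration (the class name `SharedEncoderModel`, the feature and hidden sizes, the mean-pooling step, the random untrained weights); the abstract does not specify NEST-V1's actual layers.

```python
import math
import random

# Label sets taken from the abstract.
WORDS = ["thank you", "hello", "house", "me"]
EMOTIONS = ["happy", "neutral", "sad"]


def init_matrix(rows, cols, rng):
    """Small random weight matrix as a list of rows (illustrative init only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class SharedEncoderModel:
    """Hypothetical toy model: one shared encoder feeding two classifier heads."""

    def __init__(self, feat_dim=8, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        self.encoder = init_matrix(hidden_dim, feat_dim, rng)        # shared weights
        self.word_head = init_matrix(len(WORDS), hidden_dim, rng)    # ASR head
        self.emotion_head = init_matrix(len(EMOTIONS), hidden_dim, rng)

    def encode(self, frames):
        # Assumed encoder: mean-pool frames over time, then one tanh layer.
        dim = len(frames[0])
        pooled = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
        return [math.tanh(v) for v in matvec(self.encoder, pooled)]

    def predict(self, frames):
        hidden = self.encode(frames)  # computed once, reused by both heads
        word_probs = softmax(matvec(self.word_head, hidden))
        emotion_probs = softmax(matvec(self.emotion_head, hidden))
        word = WORDS[word_probs.index(max(word_probs))]
        emotion = EMOTIONS[emotion_probs.index(max(emotion_probs))]
        return word, emotion, word_probs, emotion_probs


# Fake acoustic features: 20 frames of 8 values each (stand-in for real audio).
rng = random.Random(1)
frames = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]

model = SharedEncoderModel()
word, emotion, word_probs, emotion_probs = model.predict(frames)
print(word, emotion)
```

Because the weights are random, the predicted labels are meaningless; the point is only that the encoder runs once per utterance and both tasks share its output, which is where the parameter savings of a shared design come from.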

The system is 37% more parameter-efficient than separate model architectures while keeping a lightweight footprint of only 22.1M parameters, suitable for edge deployment. This pilot work establishes the technical foundation for emotion-aware sign language translation in low-resource settings and provides a scalable framework for future expansion to larger vocabularies and more diverse emotional expressions.

与独立模型架构相比，该系统在参数效率上提升了 37%，同时保持了仅 2210 万参数的轻量级规模，非常适合边缘设备部署。这项试点工作为低资源环境下的情感感知手语翻译奠定了技术基础，并为未来扩展至更大词汇量和更多样化的情感表达提供了一个可扩展的框架。
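
As a rough check on what the 37% figure could mean in absolute terms, the short calculation below assumes it denotes a reduction in total parameter count relative to two separate models. That reading is an assumption; the abstract does not define the metric, and the helper name `implied_separate_params` is hypothetical.

```python
def implied_separate_params(shared_params_m, reduction):
    """Total size (in millions) of the separate models implied by a given
    fractional parameter reduction. Assumes: shared = separate * (1 - reduction)."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return shared_params_m / (1 - reduction)


shared_m = 22.1    # reported NEST-V1 size, in millions of parameters
reduction = 0.37   # reported figure, read here as a parameter reduction

separate_m = implied_separate_params(shared_m, reduction)
saved_m = separate_m - shared_m

print(f"implied separate-model total: {separate_m:.1f}M parameters")
print(f"implied savings: {saved_m:.1f}M parameters")
```

Under this reading, the separate ASR and emotion models would total roughly 35.1M parameters, so sharing the encoder would save about 13.0M.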

Our preliminary results indicate the viability of real-time, emotionally expressive sign language communication systems for the hearing-impaired community, with clear pathways for enhancement in subsequent development phases.

我们的初步结果表明，为听障群体开发实时、具有情感表达能力的手语交流系统是切实可行的，并为后续开发阶段的改进指明了清晰的方向。