Soro: A Lightweight Foundation Model and Chatbot for Tajik
We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan.
Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples.
Because standard benchmarks offer only limited coverage of Tajik, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school and university entrance exam domains, and we release them openly on Hugging Face to enable rigorous evaluation.
Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets.
We further show that FP8 and INT4 quantization of Soro preserves most Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and a planned scale-out across schools in Tajikistan.
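
As a rough illustration of why FP8 and INT4 quantization matter for edge deployment, the sketch below estimates the memory needed for model weights alone at different precisions. The parameter counts are hypothetical placeholders, not the sizes reported for Soro, and the estimate ignores activations, the KV cache, and quantization metadata such as scales and zero points, so real footprints will be somewhat larger.

```python
# Back-of-the-envelope estimate of weight memory at different precisions.
# Assumption: weights dominate memory; activations, KV cache, and
# quantization metadata (scales, zero points) are ignored.

BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "int4": 4}


def weight_memory_gib(num_params: float, fmt: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    bytes_total = num_params * BITS_PER_WEIGHT[fmt] / 8
    return bytes_total / 2**30


if __name__ == "__main__":
    # Hypothetical parameter counts chosen for illustration only.
    for num_params in (1e9, 4e9):
        for fmt in ("bf16", "fp8", "int4"):
            gib = weight_memory_gib(num_params, fmt)
            print(f"{num_params / 1e9:.0f}B params, {fmt}: {gib:.2f} GiB")
```

Under these assumptions, moving from BF16 to FP8 halves the weight footprint and moving to INT4 quarters it, which is what makes deployment on memory-limited devices plausible.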