TajPersLexon: A Tajik-Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP

Abstract: This work introduces TajPersLexon, a curated Tajik-Persian parallel lexical resource of 40,112 word and short-phrase pairs for cross-script lexical retrieval, transliteration, and alignment in low-resource settings.

We conduct a comprehensive CPU-only benchmark comparing three methodological families: (i) a lightweight hybrid pipeline, (ii) neural sequence-to-sequence models, and (iii) retrieval methods.

Our evaluation establishes that the task is essentially solvable, with neural and retrieval baselines achieving 98-99% top-1 accuracy.

Crucially, we demonstrate that while large multilingual sentence transformers fail on this exact lexical matching task, our interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, achieving 96.4% accuracy in an OCR post-correction task.

All experiments use fixed random seeds for full reproducibility. The dataset, code, and models will be publicly released.
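
For readers unfamiliar with the top-1 accuracy metric reported above, the following is a minimal sketch of how such a score is typically computed for lexical retrieval. It is not the authors' evaluation code: the function name `top1_accuracy` and the three toy Tajik-Persian pairs are assumptions chosen for illustration and are not drawn from the TajPersLexon dataset.

```python
def top1_accuracy(predictions, references):
    """Fraction of queries whose top-ranked candidate equals the reference.

    predictions: list of ranked candidate lists, one per query (best first).
    references:  list of gold target strings, one per query.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(predictions, references)
        if ranked and ranked[0] == gold
    )
    return hits / len(references)


# Hypothetical toy example (illustrative only, not taken from the dataset):
# Tajik queries "китоб", "мактаб", "дӯст" with ranked Persian candidates.
ranked = [["کتاب", "کتب"], ["مکتب", "مدرسه"], ["دست", "دوست"]]
gold = ["کتاب", "مکتب", "دوست"]

# Two of the three top-ranked candidates match their reference.
print(top1_accuracy(ranked, gold))
```

In this toy case the third query ranks the correct form second, so it does not count toward top-1 accuracy even though the correct answer appears in the candidate list.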


Journal reference: Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family (SilkRoadNLP 2026), pages 29-37.