AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

Despite their linguistic diversity and global significance, African languages remain underrepresented in NLP research and resources. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa.

Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker-verified data that capture key typological features such as agglutination and tone.

We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing, including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap: models still show clear limitations across all nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.
