The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

Abstract: Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised.

We present a critical narrative survey of low-resource NLP evaluation (2014–present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them.

By examining extractive data pipelines, undercompensated “ghost work”, and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses (including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning) and assess their equity and validity trade-offs.

We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.
