Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

Abstract: Recent advances in large language models (LLMs) have prompted a growing body of work questioning prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP), a field with a long history of methodological reflection on evaluation.

We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation.

By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.


Paper Details:

  • Authors: Ruchira Dhar, Anders Søgaard
  • arXiv ID: 2604.25923
  • Subject: Computation and Language (cs.CL)
  • Submission Date: 1 Apr 2026
