Benchmarking Frontier LLMs on Arabic Cultural and Sociolinguistic Knowledge: A Cross-Evaluation Framework with Human SME Ground Truth

Abstract: The cost of human expert evaluation is a principal bottleneck to deploying language models in specialized, high-stakes domains. This is particularly acute for Arabic sociolinguistic knowledge: credible grading requires not only linguistic fluency but deep cultural familiarity that cannot be approximated by surface-level metrics.

We address this with a cross-evaluation framework instantiated on two underrepresented Arabic dialect communities: Egyptian and Iraqi Arabic. We contribute 103 validated prompt-rubric pairs (70 Egyptian, 33 Iraqi; 53 Cultural, 50 Linguistic), authored and graded by native-speaker SMEs using penalty-weighted rubrics distinguishing positive content requirements from answer-specific negative error criteria.
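
A minimal sketch of how a penalty-weighted rubric of this kind could be scored. The abstract does not give the scoring formula, so the one used here (earned positive weight minus triggered penalty weight, floored at zero and normalized by the maximum positive weight) is an assumption, and the names `Criterion` and `rubric_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    weight: float    # always positive; the sign comes from `negative`
    negative: bool   # True for an answer-specific error criterion


def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Return a percentage in [0, 100] under the ASSUMED formula:
    100 * max(0, earned - penalty) / max_positive.
    """
    if len(criteria) != len(met):
        raise ValueError("one met/not-met flag is needed per criterion")
    max_positive = sum(c.weight for c in criteria if not c.negative)
    if max_positive <= 0:
        raise ValueError("rubric needs at least one positive criterion")
    # Positive criteria add weight when satisfied by the response.
    earned = sum(c.weight for c, m in zip(criteria, met) if m and not c.negative)
    # Negative criteria subtract weight when the error is present.
    penalty = sum(c.weight for c, m in zip(criteria, met) if m and c.negative)
    return 100.0 * max(0.0, earned - penalty) / max_positive


# Hypothetical rubric: two content requirements and one error criterion.
example_rubric = [
    Criterion("required point A", weight=3.0, negative=False),
    Criterion("required point B", weight=2.0, negative=False),
    Criterion("specific error C", weight=2.0, negative=True),
]
# Response covers A, misses B, and commits error C: (3 - 2) / 5 = 20%.
example_score = rubric_score(example_rubric, [True, False, True])
```

Keeping penalties as separate criteria, rather than folding them into the positive items, lets a grader record that a response was both partly correct and specifically wrong.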

Three frontier LLMs serve as target models, graded by human SMEs across 302 unique prompt-response pairs. Five frontier LLMs serve as automated judges under a provider-level self-evaluation guard, so that no judge grades responses from a model of its own provider. A dual-metric scheme combining Mean Absolute Deviation (MAD) with Signed Mean Error separates directional grading bias from symmetric noise.
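
The two metrics and the guard can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names, the provider labels, and the judge-minus-human sign convention (positive means the judge is more lenient than the SME) are assumptions, and the guard is read as excluding any judge that shares a provider with the target model.

```python
from statistics import fmean


def eligible_judges(target_provider: str, judge_providers: dict[str, str]) -> list[str]:
    """Provider-level guard: keep only judges from a different provider."""
    return sorted(j for j, p in judge_providers.items() if p != target_provider)


def mad_and_signed_error(
    judge_scores: list[float], human_scores: list[float]
) -> tuple[float, float]:
    """Return (MAD, signed mean error) over paired scores in percentage points."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need equally long, non-empty score lists")
    # Assumed convention: judge minus human, so positive = lenient judge.
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    mad = fmean([abs(d) for d in diffs])   # size of the error, direction ignored
    signed = fmean(diffs)                  # direction of the error, noise cancels
    return mad, signed


# Hypothetical judges and providers, for illustration only.
judges = {"judge_a": "provider_x", "judge_b": "provider_y"}
allowed = eligible_judges("provider_x", judges)

# Hypothetical scores: every error is 10 pp, but one of the three is negative.
mad, signed = mad_and_signed_error([80.0, 60.0, 90.0], [70.0, 70.0, 80.0])
```

In this toy case the MAD is 10 pp while the signed error is only about +3.3 pp. A judge with large MAD and near-zero signed error is noisy but unbiased, whereas one whose signed error approaches its MAD is consistently lenient or harsh.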

Across 1,307 judge evaluations, four findings stand out: GPT-5.4 is the most reliable judge (MADj = 10.21 pp, Signed Error = -1.12%); four of five judges show systematic leniency (+2.01% to +6.56%); Cultural tasks are harder to grade than Linguistic tasks for all judges (MAD gap 1.83-4.78 pp); and target models score substantially higher on Egyptian prompts than on Iraqi prompts.

However, given leniency differences between Iraqi and Egyptian SMEs, we cannot attribute this gap solely to model knowledge. We therefore emphasize findings that do not assume identical leniency across human graders. Across all samples, implicit cultural reasoning, which requires a judge to simulate native-speaker judgment rather than rely on lexical verification, emerges as the primary failure mode of automated grading for every judge model.
