QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs. 🏆 Leaderboard · 🔧 GitHub · 📄 Paper
If you’ve been tracking Arabic LLM evaluation, you’ve probably noticed a growing tension: the number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we’re measuring? We built QIMMA قمّة (Arabic for “summit”) to answer that question systematically.
Instead of aggregating existing Arabic benchmarks as-is and running models on them, we applied a rigorous quality validation pipeline before any evaluation took place. What we found was sobering: even widely used, well-regarded Arabic benchmarks contain systematic quality issues that can quietly corrupt evaluation results. This post walks through what QIMMA is, how we built it, what problems we found, and what the model rankings look like once you clean things up.
🔍 The Problem: Arabic NLP Evaluation Is Fragmented and Unvalidated
Arabic is spoken by over 400 million people across diverse dialects and cultural contexts, yet the Arabic NLP evaluation landscape remains fragmented. A few key pain points have motivated this work:
- Translation issues. Many Arabic benchmarks are translated from English, which introduces distributional shift: questions that feel natural in English become awkward or culturally misaligned in Arabic, making the benchmark data less representative of how Arabic is actually used.
- Absent quality validation. Even native Arabic benchmarks are often released without rigorous quality checks. Annotation inconsistencies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels have all been documented in established resources.
- Reproducibility gaps. Evaluation scripts and per-sample outputs are rarely released publicly, making it hard to audit results or build on prior work.
- Coverage fragmentation. Existing leaderboards cover isolated tasks and narrow domains, making holistic model assessment difficult.
⛰ What’s in QIMMA?
QIMMA consolidates 109 subsets from 14 source benchmarks into a unified evaluation suite of over 52,000 samples, spanning 7 domains.
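To make 109 heterogeneous subsets evaluable under one harness, every sample has to be mapped into a common record. Below is a minimal sketch of what such a record could look like; the field names are illustrative assumptions on our part, not QIMMA’s published schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical unified record -- the field names are illustrative
# assumptions, not QIMMA's published schema.
@dataclass
class QimmaSample:
    sample_id: str         # unique ID within the unified suite
    source_benchmark: str  # one of the 14 source benchmarks
    subset: str            # one of the 109 subsets
    domain: str            # one of the 7 domains, e.g. "healthcare"
    prompt: str            # the Arabic task input
    choices: Optional[list[str]] = None  # set for multiple-choice tasks
    gold: Optional[int] = None           # gold choice index, if any
    reference: Optional[str] = None      # free-text reference answer, if any
```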
A few things stand out about this design:
- 99% native Arabic content. The only exception is code evaluation, which is inherently language-agnostic.
- First Arabic leaderboard with code evaluation. QIMMA integrates Arabic-adapted versions of HumanEval+ and MBPP+, making it possible to assess coding capability with Arabic-language problem statements (see the scoring sketch after this list).
- Diversity in domains and tasks. QIMMA evaluates real-world competency areas including education, governance, healthcare, creative expression, and software development.
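Only the problem statement changes language in the code subsets; scoring remains execution-based. Here is a minimal sketch of that kind of check, assuming completions are scored by running the benchmark’s test suite; `passes_tests` is our illustrative name, and the real HumanEval+/MBPP+ harness (EvalPlus) adds sandboxing and resource limits on top of this.

```python
import subprocess
import sys
import tempfile

def passes_tests(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Concatenate a model completion with the benchmark's tests and run
    them in a subprocess; the sample counts as solved only if every
    assertion passes (exit code 0)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # infinite loops and hangs count as failures
```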
🔬 The Quality Validation Pipeline
This is the methodological heart of QIMMA. Before running a single model, we applied a multi-stage validation pipeline to every sample in every benchmark.
Stage 1: Multi-Model Automated Assessment
Each sample was independently evaluated by two state-of-the-art LLMs: Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B. We chose two models with strong Arabic capability but different training data compositions, so that their combined judgment is more robust than either alone.
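How the two verdicts combine is not visible from the scores alone, so here is a sketch of one plausible aggregation rule; the verdict format and the union rule are our assumptions, not the pipeline’s exact logic.

```python
# Hypothetical aggregation of the two judges' verdicts -- the dict
# format and the union rule are illustrative assumptions.
def flag_for_human_review(verdicts: list[dict]) -> bool:
    """Escalate a sample to Stage 2 if either judge flags it. With two
    judges of different training-data composition, a union rule favors
    recall: better to over-flag than to let a bad sample through."""
    return any(v["has_issue"] for v in verdicts)

# Example: the judges disagree, so the sample still goes to humans.
verdicts = [
    {"judge": "Qwen3-235B-A22B-Instruct", "has_issue": True,
     "issue_type": "answer_quality"},
    {"judge": "DeepSeek-V3-671B", "has_issue": False, "issue_type": None},
]
assert flag_for_human_review(verdicts)
```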
Stage 2: Human Annotation and Review
Flagged samples were reviewed by native Arabic speakers with cultural and dialectal familiarity. Human annotators made the final calls on cultural context, regional variation, dialectal nuance, and subjective interpretation.
⚠️ What We Found: Systematic Quality Problems
The pipeline revealed recurring quality issues across benchmarks: not isolated errors, but systematic patterns reflecting gaps in how the benchmarks were originally constructed.
Taxonomy of Issues Found (a sketch of automated checks for the first two categories follows this list):
- ⚖️ Answer Quality: False or mismatched gold indices, factually wrong answers, missing or raw text answers.
- 📄 Text & Formatting Quality: Corrupt or illegible text, spelling and grammar errors, and duplicate samples. 📄 文本与格式质量: 损坏或难以辨认的文本、拼写和语法错误,以及重复样本。
- 💬 Cultural Sensitivity: Stereotype reinforcement and monolithic generalizations.
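The first two categories are partly detectable with cheap, deterministic checks before any LLM judge is involved; cultural sensitivity genuinely needs the model judges and human reviewers. Here is a minimal sketch of such checks, with heuristics that are our own examples rather than QIMMA’s actual detectors.

```python
import unicodedata

def check_sample(sample: dict) -> list[str]:
    """Deterministic checks for answer-quality and text-quality issues."""
    issues = []
    # Answer quality: a gold index must point at an existing choice.
    choices, gold = sample.get("choices"), sample.get("gold")
    if choices is not None and isinstance(gold, int) and not 0 <= gold < len(choices):
        issues.append("answer_quality: gold index out of range")
    # Text quality: U+FFFD replacement characters signal encoding damage.
    if "\ufffd" in sample["prompt"]:
        issues.append("text_quality: encoding corruption in prompt")
    return issues

def find_duplicates(samples: list[dict]) -> set[str]:
    """Text quality: exact duplicates after NFKC normalization, which
    folds Arabic presentation forms back to their base letters."""
    seen: dict[str, str] = {}
    dupes: set[str] = set()
    for s in samples:
        key = unicodedata.normalize("NFKC", s["prompt"]).strip()
        if key in seen:
            dupes.add(s["sample_id"])
        seen.setdefault(key, s["sample_id"])
    return dupes
```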