Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues
Abstract: There is a significant gap in evaluating the cultural reasoning of LLMs with conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogue.
To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country’s respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics.
We utilize the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation.
Our experiments indicate that a performance gap between MSA and Arabic dialects persists: models perform worse on all three tasks in the dialectal setup than in the MSA one.