Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

Abstract: Romanized Code Mixing (RCM), where bilingual speakers fluidly blend local languages with English in Roman script, has emerged as the dominant form of communication across multilingual communities. While Large Language Models (LLMs) perform strongly on monolingual and native-script benchmarks, their ability to follow instructions and reason over RCM-based content remains largely unexplored.

To this end, we introduce Indi-RomCoM, a benchmark for systematically evaluating LLMs on Indic Romanized Code-Mixed instructions. The benchmark spans seven instruction-following tasks, four widely spoken Indic languages, and three controlled code-mixing intensity levels.
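
As a rough illustration of the benchmark's scale, the three axes above can be enumerated as a grid. This is only a sketch: it assumes every task is paired with every language and every intensity level, which the abstract does not state, and all labels (`task_1`, `lang_1`, `level_1`, and so on) are hypothetical placeholders because the abstract gives only the counts.

```python
from itertools import product

# Placeholder labels (assumptions): the abstract names counts, not the items.
TASKS = [f"task_{i}" for i in range(1, 8)]        # seven instruction-following tasks
LANGUAGES = [f"lang_{i}" for i in range(1, 5)]    # four Indic languages
MIX_LEVELS = [f"level_{i}" for i in range(1, 4)]  # three code-mixing intensity levels


def evaluation_cells():
    """Return every (task, language, mix_level) combination.

    Assumes a fully crossed design, which is an illustrative assumption.
    """
    return list(product(TASKS, LANGUAGES, MIX_LEVELS))


cells = evaluation_cells()
print(len(cells))  # 7 * 4 * 3 = 84
```

Under that assumption, each model would be scored on 84 task-language-intensity cells per prompting setting.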

We extensively evaluate a suite of LLMs, covering proprietary, open-weight, and Indic-focused models, under zero- and few-shot settings. LLMs consistently underperform on RCM instructions, and performance degrades as code-mixing density increases. Furthermore, reasoning tasks suffer less degradation than detection tasks (e.g., toxicity detection) because the generated explanations offer necessary context. We believe Indi-RomCoM will help the community develop inclusive multilingual systems.
