IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are typically limited in scale, contextual diversity, or multilingual coverage, which restricts their utility for modern language models.

We introduce IdiomX, a large-scale multilingual benchmark for idiom understanding, retrieval, and interpretation, constructed through a reproducible multi-stage pipeline combining lexical resource extraction, large-scale normalization, controlled large language model enrichment, and structured validation. The resulting dataset contains over 190K contextualized examples spanning 12K+ idioms, with aligned English, Arabic, and French semantic representations, idiomatic and literal usage labels, and rich linguistic metadata.
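
To make the description of the dataset concrete, the following is a minimal sketch of what a single contextualized example might look like. The class name `IdiomExample`, all field names, and the sample record are assumptions made for illustration; the actual IdiomX schema is not specified here and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class IdiomExample:
    """Hypothetical record layout; field names are illustrative, not the real schema."""
    idiom: str                 # canonical idiom string
    context: str               # sentence in which the expression occurs
    is_idiomatic: bool         # True for idiomatic usage, False for literal usage
    meanings: Dict[str, str] = field(default_factory=dict)   # language code -> meaning
    metadata: Dict[str, str] = field(default_factory=dict)   # extra linguistic attributes

    def has_aligned_meanings(self, languages: Tuple[str, ...] = ("en", "ar", "fr")) -> bool:
        # A record counts as aligned when every requested language has a non-empty meaning.
        return all(self.meanings.get(lang, "").strip() for lang in languages)


# Invented sample record, for illustration only (not taken from the dataset).
example = IdiomExample(
    idiom="break the ice",
    context="He told a joke to break the ice at the meeting.",
    is_idiomatic=True,
    meanings={
        "en": "to ease initial social tension",
        "ar": "تخفيف التوتر في بداية اللقاء",
        "fr": "détendre l'atmosphère",
    },
)

print(example.has_aligned_meanings())
```

A check such as `has_aligned_meanings` is one simple form that a structured validation step could take, though the validation rules used in the actual pipeline are not described in this abstract.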

Building on this resource, we define a unified four-task benchmark covering idiom detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, and idiom interpretation, extending evaluation from figurative recognition to semantic grounding and explainable meaning retrieval.
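
The two retrieval tasks can be scored with standard ranking metrics. The sketch below computes Recall@k and mean reciprocal rank (MRR) for a context-to-idiom setting. The abstract does not state which metrics the benchmark reports, so the choice of metrics, the function names, and the toy rankings are all assumptions.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold idiom appears among the top-k candidates."""
    hits = sum(1 for qid, cands in ranked.items() if gold[qid] in cands[:k])
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Average of 1/rank of the gold idiom; a query contributes 0 if the idiom is missing."""
    total = 0.0
    for qid, cands in ranked.items():
        if gold[qid] in cands:
            total += 1.0 / (cands.index(gold[qid]) + 1)
    return total / len(ranked)


# Toy system output: query id -> idioms ordered from best to worst (invented data).
ranked = {
    "q1": ["break the ice", "spill the beans", "hit the sack"],
    "q2": ["hit the sack", "spill the beans", "break the ice"],
}
gold = {"q1": "break the ice", "q2": "spill the beans"}

print(recall_at_k(ranked, gold, k=1))      # 0.5
print(recall_at_k(ranked, gold, k=3))      # 1.0
print(mean_reciprocal_rank(ranked, gold))  # 0.75
```

The same functions apply unchanged to the Arabic-to-English task if the queries are Arabic contexts and the candidates are English idioms.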

Experiments show that contextual transformer models substantially improve idiom detection, while hybrid retrieval and reranking architectures significantly strengthen both monolingual and cross-lingual idiom retrieval. Results further demonstrate that idiom interpretation can be effectively modeled as a semantic retrieval task, introducing interpretability as a complementary benchmark dimension.
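
One common way to build a hybrid retriever is to merge a lexical ranking and a dense (embedding-based) ranking with reciprocal rank fusion (RRF). The sketch below illustrates that technique only; the abstract does not say how the paper's hybrid architecture combines its components, so RRF, the constant `k = 60`, and the toy rankings are assumptions, and the reranking stage is omitted.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each item scores the sum of 1 / (k + rank) over lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Invented candidate rankings for one query, e.g. from a BM25-style and a dense retriever.
lexical = ["spill the beans", "break the ice", "hit the sack"]
dense = ["break the ice", "hit the sack", "spill the beans"]

fused = reciprocal_rank_fusion([lexical, dense])
print(fused)  # ['break the ice', 'spill the beans', 'hit the sack']
```

In a full pipeline, the fused list would typically be truncated to a short candidate set and passed to a reranker that scores each query-candidate pair jointly.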

Overall, IdiomX provides a scalable benchmark for studying idiomatic language as a progression from detection to retrieval and semantic interpretation, and offers a modular framework extensible to additional languages and figurative reasoning tasks.
