ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

Abstract: Text embeddings are standard for semantic similarity tasks, yet their evaluation remains an open challenge. Current benchmarks are static and cover only a limited set of languages; they are also often domain-specific, susceptible to overfitting, and poorly representative of low-resource languages.

To address these limitations, we introduce ALEE, a framework that extends Sentence Smith (Li et al., 2025) to the cross-lingual and paragraph level. ALEE uses Abstract Meaning Representations (AMR) to generate English minimal pairs with controlled, fine-grained semantic shifts, which are paired with translations in target languages.
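
As a rough illustration of how a minimal pair and its translation could be scored, the sketch below counts a triple as correct when the target-language translation embeds closer to the original English text than to its semantically shifted counterpart. The scoring rule, the function names `cosine` and `minimal_pair_accuracy`, and the toy vectors are assumptions made for this example; they are not taken from the ALEE release and stand in for real model embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def minimal_pair_accuracy(triples):
    """Fraction of triples where the translation is closer to the original.

    Each triple is (translation_vec, original_en_vec, perturbed_en_vec).
    This scoring rule is an assumption for illustration only.
    """
    if not triples:
        raise ValueError("need at least one triple")
    hits = sum(
        1
        for translation, original, perturbed in triples
        if cosine(translation, original) > cosine(translation, perturbed)
    )
    return hits / len(triples)


# Hypothetical 2-d "embeddings"; a real run would use model outputs.
toy_triples = [
    # Translation sits near the original English text: counted as correct.
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),
    # Translation sits near the perturbed text: counted as an error.
    ([0.0, 1.0], [0.8, 0.2], [0.2, 0.8]),
]

accuracy = minimal_pair_accuracy(toy_triples)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Under this assumed rule, a model that ignores the controlled semantic shift scores near chance, so breaking the accuracy down by language or by type of shift would give the kind of targeted diagnostic the framework aims for.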

This approach enables targeted diagnostics of embedding models in any language for which English parallel data is available. We conduct a large-scale empirical study across a diverse set of embedding models and 275+ languages spanning three parallel datasets.

On ALEE, performance varies substantially across languages, text lengths, and linguistic phenomena, exposing persistent gaps in cross-lingual semantic representation that track language prevalence in training resources and subword tokenization. We release ALEE at this https URL.
