GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

Abstract: Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query-answer misalignment or culturally off-target context.

We propose a refined workflow for adapting English benchmarks into multiple languages. It combines automated checks with human review to enforce explicit functional alignment, cultural alignment, and difficulty calibration. Using this workflow, we introduce GAIA-v2-LILT, a re-audited multilingual extension of GAIA covering five non-English languages.

In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions. The closest audited setting comes within 3.1% of English performance, although substantial gaps remain in many other cases. This indicates that a substantial share of the multilingual performance gap is benchmark-induced measurement error, which motivates task-level alignment when adapting English benchmarks across languages.
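As a rough illustration of how such comparisons can be computed, the sketch below derives the gain of an audited version over a minimally translated one and the remaining gap to English from per-task outcomes. The function names, the toy outcome lists, and the choice to report differences in percentage points are all assumptions made for this example; the abstract does not state whether its figures are absolute or relative, and none of the numbers here come from the paper.

```python
def success_rate(outcomes):
    """Fraction of tasks solved, given one boolean per task."""
    return sum(outcomes) / len(outcomes)


def compare_settings(english, minimal, audited):
    """Compare three runs over the same task set (hypothetical helper).

    Differences are reported in percentage points, which is an assumption
    of this sketch and not a convention taken from the paper.
    """
    en = success_rate(english)
    mt = success_rate(minimal)
    au = success_rate(audited)
    total_gap = en - mt
    return {
        # Gain of the audited version over the minimally translated one.
        "gain_points": (au - mt) * 100,
        # Gap that still separates the audited version from English.
        "gap_to_english_points": (en - au) * 100,
        # Share of the original gap that the audit removed, if any gap existed.
        "gap_closed_fraction": (au - mt) / total_gap if total_gap != 0 else None,
    }


# Toy outcomes for 10 tasks; purely illustrative, not results from the paper.
english = [True] * 6 + [False] * 4
minimal = [True] * 3 + [False] * 7
audited = [True] * 5 + [False] * 5

report = compare_settings(english, minimal, audited)
for name, value in report.items():
    print(name, value)
```

Under this reading, the fraction of the gap closed by auditing serves as a simple proxy for the share of the multilingual gap attributable to the benchmark itself, as opposed to the agent.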

The data is available as part of the MAPS package at this [https URL]. We also release the code used in our experiments at this [https URL].
