Open-World Evaluations for Measuring Frontier AI Capabilities

Abstract: Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons.

We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation.

In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly.

As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.
