The Open Agent Leaderboard
How good are general-purpose AI agents? We built an open evaluation framework to find out. Most evaluations in AI report a simple result: the score each model achieved on each benchmark task. But when you deploy an agent, you’re not just choosing a model. You’re choosing a full system: what tools the agent can use, how it plans its steps, what it remembers between actions, and how it recovers when something goes wrong. Change any of those and the same model can produce very different results at very different costs. How well an AI agent works depends on how it’s built, not just on the model inside it.
Today we’re launching the Open Agent Leaderboard, an open benchmark for comparing full agent systems, not just the models inside them. It reports both quality and cost, so you can see not just what works, but what’s worth deploying. The leaderboard is paired with the Exgentic framework for running and reproducing evaluations, and a paper describing the full methodology and results. Everything is open from day one.
Can we measure generality?
AI agents are already very useful when carefully tailored to a specific job, like coding in a familiar repository or handling customer service with a known set of tools. The harder question is whether the same agent can handle many different jobs, each with its own tools, rules, and constraints, without being manually customized for each one. A more general agent is one you can drop into a new setting and have it just work. That’s what we mean by generality, and it’s best understood as a spectrum, not a binary label.
Of course, generality that only works in theory isn’t useful. What matters is whether an agent stays capable as the range of jobs and settings grows, and whether it does so at a reasonable cost. A system that handles everything but costs a fortune to run isn’t general in any way that matters. This leaderboard measures exactly that: how general your agent actually is. It evaluates agents across diverse, unfamiliar settings, each with different tools, rules, and constraints, and reports both quality and cost. So you can see not just how well a system performs, but whether it’s worth actually deploying.
It doesn’t cover every capability a general agent will eventually need. But it’s a much stronger test of how well agents work across different situations than anything previously available. And by treating the full agent system, not just the model, as the thing being measured, it makes visible what’s actually driving the results.
What we built
We assembled six benchmarks, each testing a different kind of realistic task (tau2-Bench Airline and Retail share one entry in the list below but count as two). Together they aim to capture a broad range of working settings: coding, customer service, technical support, personal assistance, and research.
- SWE-Bench Verified — fixing real bugs in real code repositories
- BrowseComp+ — researching complex questions across the web
- AppWorld — completing personal tasks across hundreds of apps and actions
- tau2-Bench Airline & Retail — customer service following company policies
- tau2-Bench Telecom — technical support following company policies
Each is an established benchmark, created and reviewed by the research community. They weren’t chosen because any single one captures general agency. They were chosen because together they test very different things: real code changes, open-ended research, broad action spaces, rule-bound conversations. That mix is what makes the evaluation meaningful.
These benchmarks were each designed to test one kind of task in one kind of way. Making them work together meant giving them a shared structure. We introduced a unified protocol that gives every benchmark the same shape: a task (what to do), a context (what to know), and a set of actions (what’s allowed). Instead of each agent speaking each benchmark’s language, they all speak one.
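The shared shape described above (a task, a context, and a set of actions) can be pictured with a small data structure. This is a minimal sketch and not the Exgentic API: the names `Episode`, `Action`, `act`, and `make_retail_episode`, and the toy order data, are all hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Action:
    """One allowed action: a name, a description, and a callable that runs it."""
    name: str
    description: str
    run: Callable[..., Any]


@dataclass
class Episode:
    """Hypothetical unified shape: task (what to do), context (what to know),
    actions (what's allowed). Not the real protocol, only an illustration."""
    task: str
    context: str
    actions: Dict[str, Action] = field(default_factory=dict)

    def act(self, name: str, **kwargs: Any) -> Any:
        # Anything outside the allowed action set is rejected.
        if name not in self.actions:
            raise KeyError(f"action not allowed: {name}")
        return self.actions[name].run(**kwargs)


def make_retail_episode() -> Episode:
    """Toy adapter: wraps a made-up customer-service task in the shared shape."""
    orders = {"A100": "shipped"}  # invented data for the example

    def lookup_order(order_id: str) -> str:
        return orders.get(order_id, "unknown")

    return Episode(
        task="Tell the customer the status of order A100.",
        context="Policy: only share the status of orders that exist.",
        actions={
            "lookup_order": Action(
                "lookup_order", "Return the status of an order.", lookup_order
            ),
        },
    )


episode = make_retail_episode()
print(episode.act("lookup_order", order_id="A100"))
```

Under this sketch, an agent only ever sees an `Episode`, so adding a benchmark would mean writing one adapter like `make_retail_episode` rather than teaching every agent a new interface.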
This standardization isn’t trivial. Each benchmark comes with its own assumptions, instructions, and interaction patterns. Making sure these don’t clash with how different agents work internally requires deep understanding of both sides. It’s one of the reasons this work took time, and one of the reasons results may differ from what you see on individual benchmark leaderboards. But the payoff is real: the benchmarks keep their original design, the agents keep their native tools and interfaces, and the protocol gives them a common way to connect.
How to read the leaderboard
Each row is a full agent system: a specific agent paired with a specific model, evaluated across all six benchmarks. For every configuration, you see the average success rate, the average cost per task, and per-benchmark breakdowns. Here’s what the current top five looks like:
Look at the top three. All use the same model. Yet they differ in both score and cost because the agent systems wrapped around that model are different. Same model, different agents, different results — the agent matters. The cost gap is just as striking. The most efficient configuration in the top five runs at a fraction of the price of the strongest one. The full picture becomes clear when you plot every configuration by quality and cost:
When the agent implementation is visible alongside the model, you can start to untangle what’s driving the results: which gains came from the model, which from the agent design, and which components generalize across settings. That’s what this leaderboard is built to show.
A note on results: agents here are tested as general-purpose systems without benchmark-specific tuning, and without the prompt and environment optimizations that model developers often apply to individual benchmarks. So scores may differ from those reported on individual benchmark leaderboards. See the paper for details.
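To make the row layout and the quality-versus-cost view concrete, here is a minimal sketch of how one row could be summarized and how the configurations that are not beaten on both axes could be picked out. It rests on assumptions: the averages are unweighted means across benchmarks (the exact aggregation is defined in the paper, not here), the names `Row` and `pareto_frontier` are hypothetical, and every number is invented for illustration rather than taken from the leaderboard.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Row:
    """One leaderboard row: an agent paired with a model (hypothetical layout)."""
    agent: str
    model: str
    success: Dict[str, float]  # benchmark -> success rate in [0, 1]
    cost: Dict[str, float]     # benchmark -> average cost per task

    @property
    def avg_success(self) -> float:
        # Assumption: unweighted mean across benchmarks.
        return mean(self.success.values())

    @property
    def avg_cost(self) -> float:
        # Assumption: unweighted mean across benchmarks.
        return mean(self.cost.values())


def pareto_frontier(rows: List[Row]) -> List[Row]:
    """Keep rows that no other row matches or beats on both quality and cost
    while being strictly better on at least one of them."""
    frontier = []
    for r in rows:
        dominated = any(
            o.avg_success >= r.avg_success
            and o.avg_cost <= r.avg_cost
            and (o.avg_success > r.avg_success or o.avg_cost < r.avg_cost)
            for o in rows
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.avg_cost)


# Invented numbers: two agents share model_x, a third uses model_y.
rows = [
    Row("agent_a", "model_x", {"b1": 0.6, "b2": 0.8}, {"b1": 1.0, "b2": 3.0}),
    Row("agent_b", "model_x", {"b1": 0.5, "b2": 0.7}, {"b1": 0.2, "b2": 0.4}),
    Row("agent_c", "model_y", {"b1": 0.4, "b2": 0.6}, {"b1": 0.5, "b2": 1.5}),
]

for row in pareto_frontier(rows):
    print(row.agent, row.model, round(row.avg_success, 2), round(row.avg_cost, 2))
```

In this toy data, `agent_c` drops out because `agent_b` scores higher at lower cost, while `agent_a` and `agent_b` both stay: one is stronger, the other cheaper, which is the kind of trade-off the quality-and-cost view is meant to expose.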
What we’re already learning
One finding surprised us: general-purpose agents are already competitive with specialized ones. In several cases, agents with no benchmark-specific tuning matched…