AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but the benchmarks used to evaluate them are fragmented: each emphasizes a different unit of measurement (final task success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness).

A line of 2024-2025 work has converged on the diagnosis that a single accuracy column is no longer the right unit of comparison for deployable agents.

AgentAtlas extends this line of work with four components: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a nine-category trajectory-failure taxonomy with two orthogonal hierarchical labels (primary_error_source, impact); (iii) a methodology that compares taxonomy-aware and taxonomy-blind prompting to measure how much of a model’s apparent capability comes from the supervision in the prompt; and (iv) a benchmark-coverage audit mapping fifteen agent benchmarks against six behavioral axes.
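
The first three components can be pictured as data structures plus a prompt switch. The following is a minimal sketch, not the authors' implementation: the six state names and the two label names (primary_error_source, impact) come from the text, while `TrajectoryLabel`, `failure_category`, `build_prompt`, the string-typed fields, and the prompt wording are hypothetical names and choices made for illustration. The nine failure categories are not enumerated here, so they are left as free-form strings.

```python
from dataclasses import dataclass
from enum import Enum


class ControlDecision(Enum):
    """The six control-decision states named in the text."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


@dataclass(frozen=True)
class TrajectoryLabel:
    """One trajectory-failure annotation (hypothetical container).

    failure_category stands for one of the nine categories, which are
    not listed in the text; plain strings are an assumption.
    """
    failure_category: str
    primary_error_source: str
    impact: str


def build_prompt(task: str, taxonomy_aware: bool) -> str:
    """Hypothetical prompt builder for the two prompt modes.

    Taxonomy-aware mode shows the explicit label menu;
    taxonomy-blind mode asks the same question without it.
    """
    if taxonomy_aware:
        menu = " / ".join(d.value for d in ControlDecision)
        return f"{task}\nChoose one control decision: {menu}"
    return f"{task}\nDescribe the control decision you would take."
```

Calling `build_prompt(task, True)` and `build_prompt(task, False)` on the same task yields the paired prompts whose scores are then compared.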

To demonstrate the methodology, we run a small, fixed set of eight models (four frontier closed and four open-weight) on 1,342 generated items under both prompt modes.

Removing the explicit label menu drops every model’s trajectory accuracy by 14-40 pp to a tight 0.54-0.62 floor, regardless of model family, and no single model wins on all three of control accuracy, trajectory diagnosis, and tool-context utility retention.
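
The percentage-point drop reported above is the difference between a model's accuracy with and without the label menu. A minimal sketch of that arithmetic follows, assuming exact-match scoring against gold labels; the function names and the scoring rule are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of items whose predicted label exactly matches the gold label."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def drop_in_pp(aware_acc: float, blind_acc: float) -> float:
    """Accuracy lost when the label menu is removed, in percentage points."""
    return (aware_acc - blind_acc) * 100
```

For instance, a model scoring 0.75 in taxonomy-aware mode and 0.50 in taxonomy-blind mode has a drop of 25 pp; these two numbers are illustrative only and are not results from the run described here.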

We treat the synthetic run as a measurement-protocol demonstration, not a benchmark release.