AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but the benchmarks used to evaluate them are fragmented: each emphasizes a different unit of measurement (final task success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness).

A line of 2024-2025 work has converged on the diagnosis that a single accuracy column is no longer the right unit of comparison for deployable agents.

AgentAtlas extends this line of work with four components: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a nine-category trajectory-failure taxonomy with two orthogonal hierarchical labels (primary_error_source, impact); (iii) a taxonomy-aware versus taxonomy-blind evaluation methodology that measures how much of a model's apparent capability comes from the supervision in the prompt; and (iv) a benchmark-coverage audit that maps fifteen agent benchmarks against six behavioral axes.
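The contrast between the two prompt modes in component (iii) can be sketched as follows. This is a minimal illustration, not the paper's implementation: only the six state names come from the text above, while `build_prompt`, `parse_decision`, and the prompt wording are hypothetical.

```python
from enum import Enum
from typing import Optional


class ControlDecision(Enum):
    """The six control states named in the abstract."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


def build_prompt(scenario: str, taxonomy_aware: bool) -> str:
    """Build an evaluation prompt (hypothetical wording, not the paper's).

    taxonomy_aware=True includes the explicit label menu;
    taxonomy_aware=False asks the same question without it.
    """
    question = "What should the agent do next?"
    if taxonomy_aware:
        menu = " / ".join(d.value for d in ControlDecision)
        return f"{scenario}\n{question}\nAnswer with exactly one of: {menu}"
    return f"{scenario}\n{question}\nAnswer briefly."


def parse_decision(response: str) -> Optional[ControlDecision]:
    """Map a response to a control state; return None if it matches no label."""
    text = response.strip().lower()
    for decision in ControlDecision:
        if text == decision.value.lower():
            return decision
    return None
```

Scoring the same items under both modes and comparing accuracies would then indicate how much of the measured capability depends on the menu being present in the prompt.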

To demonstrate the methodology, we run a small, fixed set of eight models (four frontier closed, four open-weight) on 1,342 generated items under both prompt modes.

Removing the explicit label menu lowers every model's trajectory accuracy by 14-40 pp, to a tight 0.54-0.62 floor regardless of model family. No single model wins on all three of control accuracy, trajectory diagnosis, and tool-context utility retention.
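The two quantities reported here, the percentage-point drop between prompt modes and the absence of a model that leads on every axis, can be computed as in the sketch below. The function names, metric keys, and all numbers in `toy_scores` are placeholders chosen for illustration and are not results from the paper.

```python
from typing import Dict, Optional


def pp_drop(aware_acc: float, blind_acc: float) -> float:
    """Accuracy lost when the label menu is removed, in percentage points."""
    return round((aware_acc - blind_acc) * 100, 1)


def overall_winner(scores: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Return a model that is best (ties allowed) on every metric, else None."""
    metrics = list(next(iter(scores.values())).keys())
    best = {m: max(row[m] for row in scores.values()) for m in metrics}
    for model, row in scores.items():
        if all(row[m] >= best[m] for m in metrics):
            return model
    return None


# Placeholder values only: these are NOT the paper's measurements.
toy_scores = {
    "model_a": {"control": 0.80, "trajectory": 0.60, "utility": 0.70},
    "model_b": {"control": 0.75, "trajectory": 0.62, "utility": 0.65},
}

drop = pp_drop(aware_acc=0.90, blind_acc=0.58)
winner = overall_winner(toy_scores)
```

With these placeholder values, `model_a` leads on two metrics and `model_b` on the third, so `overall_winner` returns `None`, which is the pattern the text describes.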

We treat the synthetic run as a measurement-protocol demonstration, not a benchmark release.