ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
Artificial Analysis and IBM Software Innovation Lab are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50%.
ITBench-AA’s SRE tasks benchmark model performance on Kubernetes incident response, where models and agents must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset has been developed by IBM, leveraging deep expertise in enterprise IT operations. Artificial Analysis has worked closely with IBM over the last 6 months to develop an implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) and expanding to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time.
Key findings: Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks in our suite. For context, frontier models score considerably higher on Terminal-Bench.
Turn counts vary nearly 3x across models, and longer trajectories do not translate to higher accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46%, while Gemini 3.1 Pro Preview averages 83 turns at 30%. Models that over-investigate tend to report upstream fault-injection mechanisms or co-occurring symptoms as root causes, which count as false positives.
GLM-5.1 (Reasoning) leads open-weights models at 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38%, with Gemma 4 31B (Reasoning) at 37%, ahead of Gemini 3.1 Pro Preview at 30%.
ITBench-AA SRE Overview
- 59 SRE tasks in total: 40 public tasks and 19 brand-new, held-out tasks.
- Task content: Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology.
- Goal: The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident.
- Fault types: Faults span typical SRE failure modes including infrastructure, service, application, and chaos-injected incidents, such as resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions.
Methodology Details
- Agentic harness: Each task is solved by the model running in our open-source Stirrup reference harness, with shell access to a sandboxed file system containing the relevant logs and snapshots. Each task has a 100-turn cap and is run 3 times.
- Scoring: Models submit a list of root-cause entities. Scoring uses average precision at full recall: if a model misses any ground-truth root cause, it scores 0.0. If it identifies all of them, its score equals its precision (true positives / (true positives + false positives)).
- Comparison: The harness (Stirrup) is held constant, allowing an apples-to-apples comparison between models.
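The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scorer: the function names, the entity identifiers, the treatment of duplicate submissions as a single entity, and the simple mean over the 3 repeats are all assumptions made for the example.

```python
from typing import Iterable, List


def precision_at_full_recall(predicted: Iterable[str], ground_truth: Iterable[str]) -> float:
    """Return 0.0 if any ground-truth entity is missing, otherwise the precision."""
    pred = set(predicted)   # assumption: duplicate submissions count once
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground_truth must contain at least one entity")
    if not truth <= pred:
        return 0.0          # a missed root cause zeroes the score
    true_positives = len(pred & truth)
    false_positives = len(pred - truth)
    return true_positives / (true_positives + false_positives)


def mean_over_repeats(runs: List[List[str]], ground_truth: Iterable[str]) -> float:
    """Average the per-run score over repeated attempts (assumed aggregation)."""
    truth = list(ground_truth)
    scores = [precision_at_full_recall(run, truth) for run in runs]
    return sum(scores) / len(scores)


# Hypothetical entity names, for illustration only.
truth = ["NetworkPolicy/frontend-block"]
runs = [
    ["NetworkPolicy/frontend-block"],                         # exact answer
    ["NetworkPolicy/frontend-block", "Deployment/frontend"],  # one false positive
    ["Deployment/frontend"],                                  # root cause missed
]
per_run = [precision_at_full_recall(r, truth) for r in runs]
task_score = mean_over_repeats(runs, truth)
```

In this example the three runs score 1.0, 0.5, and 0.0, which shows why padding a submission with extra plausible entities lowers the score even when the true root cause is included.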
Highlights
- Investigation process: Tasks require agents to investigate Kubernetes incident snapshots through shell commands and submit a structured JSON diagnosis. In one public SRE task, the agent uses shell commands to inspect the offline snapshot, then reviews alerts, traces/logs, and topology to identify a network policy blocking the frontend.
- Efficiency vs. Accuracy: More turns do not mean better answers. Models that submit additional contributing entities beyond the true root cause are penalized for false positives.
- Cost-effectiveness: Open-weights models sit on the cost frontier. Gemma 4 31B (Reasoning) scores 37% at $0.14 per task, outperforming Gemini 3.1 Pro Preview ($2.23 per task, 30%) on both score and cost.
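To make the cost-frontier claim concrete, the sketch below checks whether one result dominates another on score and cost, using only the two score and cost pairs quoted above. The `Result` type and the `dominates` and `cost_frontier` helpers are hypothetical names introduced for this illustration, not part of any published tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    model: str
    score_pct: float      # benchmark score in percent
    cost_per_task: float  # USD per task


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both axes and strictly better on at least one."""
    no_worse = a.score_pct >= b.score_pct and a.cost_per_task <= b.cost_per_task
    strictly_better = a.score_pct > b.score_pct or a.cost_per_task < b.cost_per_task
    return no_worse and strictly_better


def cost_frontier(results: List[Result]) -> List[Result]:
    """Keep only results that no other result dominates."""
    return [
        r for r in results
        if not any(dominates(other, r) for other in results if other is not r)
    ]


gemma = Result("Gemma 4 31B (Reasoning)", 37.0, 0.14)
gemini = Result("Gemini 3.1 Pro Preview", 30.0, 2.23)

frontier = cost_frontier([gemma, gemini])
cost_ratio = gemini.cost_per_task / gemma.cost_per_task  # roughly 16x
```

With these two data points, only Gemma 4 31B (Reasoning) remains on the frontier: it scores higher while costing roughly one sixteenth as much per task.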