AI evals are becoming the new compute bottleneck
AI evaluation has crossed a cost threshold that changes who can do it. The Holistic Agent Leaderboard (HAL) recently spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. Exgentic’s $22,000 sweep across agent configurations found a 33× cost spread on identical tasks, isolating scaffold choice as a first-order cost driver. And UK-AISI recently scaled agentic steps into the millions to study inference-time compute.
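Some back-of-envelope arithmetic puts those totals on a common footing. In the sketch below, the totals are the published figures; the derived per-rollout average and the $100 base cost are my own illustrative numbers, not anything HAL or Exgentic report.

```python
# Back-of-envelope arithmetic on the figures quoted above. The totals
# are as published; the derived average and the $100 base cost are
# illustrative assumptions, not numbers HAL or Exgentic report.

hal_total_usd = 40_000   # HAL's reported spend
hal_rollouts = 21_730    # rollouts across 9 models x 9 benchmarks
print(f"HAL average per rollout: ${hal_total_usd / hal_rollouts:.2f}")  # ~$1.84

# Exgentic's 33x spread: the same task under the cheapest and the
# costliest scaffold configuration, assuming a $100 cheapest run.
base_cost = 100.0
print(f"cheapest: ${base_cost:,.0f}, costliest: ${base_cost * 33:,.0f}")  # $100 vs $3,300
```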
In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. While compression techniques have been proposed for static benchmarks, new agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.
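The reliability multiplier compounds directly, as a minimal cost model shows. The 960 H100-hour figure is The Well’s reported number; the seed count and the dollars-per-H100-hour rate below are assumptions chosen for illustration, not published rates.

```python
# A minimal cost model for training-in-the-loop benchmarks like The Well.
# HOURS_PER_ARCH is the reported figure; N_SEEDS and USD_PER_H100_HOUR
# are illustrative assumptions, not published rates.

HOURS_PER_ARCH = 960       # H100-hours to evaluate one architecture
N_BASELINES = 4            # a full sweep: 4 x 960 = 3,840 H100-hours
N_SEEDS = 3                # repeated runs for reliability (assumed)
USD_PER_H100_HOUR = 2.50   # assumed cloud rate; varies widely by provider

sweep_hours = HOURS_PER_ARCH * N_BASELINES * N_SEEDS
print(f"{sweep_hours:,} H100-hours ~ ${sweep_hours * USD_PER_H100_HOUR:,.0f}")
# 11,520 H100-hours ~ $28,800 -- adding seeds triples an already large bill
```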
Making static LLM benchmarks cheaper
The cost problem started before agents. When Stanford’s CRFM released HELM in 2022, the paper’s own per-model accounting showed API costs ranging from $85 for OpenAI’s code-cushman-001 to $10,926 for AI21’s J1-Jumbo (178B), and 540 to 4,200 GPU-hours for the open models, with BLOOM (176B) and OPT (175B) at the top end. Perlitz et al. (2023) restate the larger HELM cost pattern, and IBM Research notes that putting Granite-13B through HELM “can consume as many as 1,000 GPU hours.” Across HELM’s 30 models and 42 scenarios, the aggregate of reported costs and GPU compute came to roughly $100,000.
Another sobering observation came from Perlitz et al.’s analysis of EleutherAI’s Pythia checkpoints: developers pay for evaluation repeatedly during model development. Pythia released 154 checkpoints for each of 16 models spanning 8 sizes, 2,464 checkpoints in all, so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training: Perlitz et al. (2024) noted that evaluation costs “may even surpass those of pretraining when evaluating checkpoints.” For small models, evaluation becomes the dominant compute line item across the whole development cycle.
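The multiplication is easy to see. The checkpoint counts below are Pythia’s published release figures; the per-checkpoint eval cost is an assumed placeholder, since the real number depends on the harness and task mix.

```python
# Why checkpoint evaluation multiplies cost. Checkpoint counts are
# Pythia's published figures; EVAL_GPU_HOURS_PER_CKPT is assumed for
# illustration and varies with model size and eval suite.

CHECKPOINTS_PER_MODEL = 154
N_MODELS = 16
total_checkpoints = CHECKPOINTS_PER_MODEL * N_MODELS   # 2,464

EVAL_GPU_HOURS_PER_CKPT = 2.0   # assumed placeholder
eval_hours = total_checkpoints * EVAL_GPU_HOURS_PER_CKPT
print(f"{total_checkpoints} checkpoints -> {eval_hours:,.0f} GPU-hours of eval")
# For the smallest Pythia models, a few thousand GPU-hours of evaluation
# can rival or exceed the pretraining budget itself.
```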
Scaling inference-time compute scales evaluation costs with it, so it is worth asking how much of that compute is actually informative. Perlitz et al. asked how much of HELM’s compute actually carried the rankings. The result was striking: a 100× to 200× reduction in compute preserved nearly the same ordering, with larger reductions still useful for coarse grouping under the paper’s tiered analysis. Flash-HELM turned that finding into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM’s compute went to confirming rankings the field could have inferred far more cheaply.
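Here is a minimal sketch of that coarse-to-fine loop. The `evaluate` function is a placeholder for any scoring routine, and the subset sizes and cutoff fraction are illustrative choices, not the paper’s exact tiering scheme.

```python
# Coarse-to-fine evaluation in the spirit of Flash-HELM: rank all
# candidates on a small, cheap subset first, then spend full-resolution
# compute only on the leaders. `evaluate`, the subset sizes, and the
# cutoff are illustrative placeholders, not the paper's exact scheme.
import random

def evaluate(model: str, n_examples: int) -> float:
    """Placeholder: score `model` on `n_examples` benchmark items."""
    random.seed(hash((model, n_examples)))
    return random.random()

def coarse_to_fine(models, cheap_n=50, full_n=10_000, keep_frac=0.2):
    # Pass 1: screen everything on a small, cheap sample of the benchmark.
    coarse = sorted(models, key=lambda m: evaluate(m, cheap_n), reverse=True)
    finalists = coarse[: max(1, int(len(coarse) * keep_frac))]
    # Pass 2: full-resolution evaluation only for the surviving candidates.
    return sorted(finalists, key=lambda m: evaluate(m, full_n), reverse=True)

print(coarse_to_fine([f"model-{i}" for i in range(30)]))
# With 30 models: 30 x 50 cheap evals plus 6 x 10,000 full evals,
# versus 30 x 10,000 for the naive full sweep -- an ~80% cut.
```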
Agent evals are messier
An unusually detailed public accounting of agent evaluation comes from the Holistic Agent Leaderboard (Kapoor et al., ICLR 2026). HAL runs standardized agent harnesses across nine benchmarks covering coding, web navigation, science tasks, and customer service, with shared scaffolds and centralized cost tracking. The headline cost: $40,000 for 21,730 rollouts across 9 models and 9 benchmarks. By April 2026, the leaderboard had grown to 26,597 rollouts. Ndzomga’s independent reproduction lands in the same range: $46,000 across 242 agent runs.
Behind these numbers is a blunt pricing fact. Claude Opus 4.1 charges $15 per million input tokens and $75 per million output. Gemini 2.0 Flash charges $0.10 and $0.40, a two-order-of-magnitude spread on input alone. Agent benchmarks rarely benchmark “the model” in isolation. They benchmark a model × scaffold × token-budget product, and small scaffold choices can multiply costs 10×. Worse, higher spend does not reliably buy better results.
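Plugging those two price points into the standard per-run cost formula makes the spread concrete. The prices are the ones quoted above; the token counts are assumed for illustration, since agent rollouts vary enormously in length.

```python
# Per-run cost under the two price points quoted above. Prices are USD
# per million tokens; the token counts are assumed for illustration.

PRICES = {                       # (input, output) USD per 1M tokens
    "claude-opus-4.1":  (15.00, 75.00),
    "gemini-2.0-flash": (0.10, 0.40),
}

def run_cost(model: str, in_tok: int, out_tok: int) -> float:
    price_in, price_out = PRICES[model]
    return (in_tok * price_in + out_tok * price_out) / 1e6

# A hypothetical agent rollout: 2M input tokens (context re-sent each
# step) and 100K output tokens.
for model in PRICES:
    print(model, f"${run_cost(model, 2_000_000, 100_000):.2f}")
# claude-opus-4.1  $37.50
# gemini-2.0-flash $0.24   -- a ~150x gap on the identical rollout
```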