Is it agentic enough? Benchmarking open models on your own tooling

Benchmarking transformers revisions across different metrics

This is a human-made, agent-focused blog post. Coding agents increasingly work with our software instead of us: describe a task, and the agent picks the library, writes the calls, runs them, and debugs its own mistakes. When the library gets in the way, the agent will happily bypass it and rewrite the logic from scratch.

This introduces a new requirement in library development: the code should not only be correct and fast, it should also be designed so that an agent can drive it effectively. A clunky API or stale docs annoy us developers, but they now also send the agent down a longer, more expensive path.

Most benchmarks just look at the final answer. We wanted the whole process instead: not just whether the agent got it right, but how much work it took to get there, and how that shifts across models, library revisions, and tasks. We measured exactly that, using transformers as our case study.

Here, we introduce a tool-specific benchmark that focuses on how the answer was found, and provide a simple implementation of one such harness. It runs entirely on open models driven by the pi coding agent, and the full sweep of models × revisions × tasks is fanned out across Hugging Face Jobs so that every run sees identical hardware.

But how do you optimize software for agents? We’re strong believers in the following two software principles:

  • If it isn’t tested, then it doesn’t work
  • If it isn’t documented, then it doesn’t exist

Both principles still hold for agent-optimized tooling, and, for once, the two are directly tied to each other. If you want your tool to exist for an agent, it needs to be discoverable: the API needs to be clear, the docs need to be extensive, and both need to be structured so that the agent has rapid access to the useful files and examples. If you want your tool to work for an agent, then you should test it for agentic use.

Testing software for agentic use

We’ll use transformers as an example throughout this blog post. The scenario is agents using it to solve ML tasks (classifying text, captioning images, transcribing audio), not contributing code to it. That said, the harness was designed to work with any tool that can be operated from the command line.

Our intuition on transformers was that usage could be dramatically simplified with a few changes: a CLI, a Skill, and self-contained, task-specific examples. This is the same recipe recently applied to the hf CLI, which was redesigned to be agent-optimized and with which agents used 1.3–1.8× (and up to 6×) fewer tokens. We wanted to know whether that kind of win generalizes, and whether it could be useful for transformers as well.

Intuition is a powerful tool, but we wanted more evidence before opening PRs that add several thousand lines of code to a codebase as widely used as transformers. We set out to measure what success looks like.

Not all successes are equal

Two agents can both produce the correct label for a sentiment-classification task. One writes a 40-line Python script, imports transformers, debugs a shape error, re-runs twice, and finally prints the answer. The other types transformers classify --model ... --text "..." and is done in one call.

Both methods reach the same result, but they have very different profiles in cost, latency, token usage, and failures. If your evaluation only checks the final string, you’re blind to these differences. You also can’t tell whether a change you shipped to the library (a CLI improvement, better error messages, a Skill) actually helped agents. Our goal with this harness is to evaluate how much work an agent has to do to perform a given task, and whether changes to the library reduce that work.
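To make the difference concrete, here is a minimal sketch of what a per-run record could look like. Everything in it is an assumption made for illustration: the RunRecord name, its fields, and the numbers are invented, and this is not the harness’s actual schema or a measured result.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Process-level measurements for one agent run (hypothetical field names)."""
    correct: bool        # did the final answer match the expected one?
    tool_calls: int      # commands or scripts the agent executed
    failed_calls: int    # how many of those calls ended in an error
    total_tokens: int    # input plus output tokens consumed by the model
    wall_seconds: float  # end-to-end latency of the run


def tokens_ratio(a: RunRecord, b: RunRecord) -> float:
    """How many times more tokens run `a` used than run `b`."""
    return a.total_tokens / b.total_tokens


# Made-up numbers for illustration only, not measured results.
script_run = RunRecord(correct=True, tool_calls=4, failed_calls=1,
                       total_tokens=12000, wall_seconds=90.0)
cli_run = RunRecord(correct=True, tool_calls=1, failed_calls=0,
                    total_tokens=3000, wall_seconds=15.0)

# An answer-only check cannot tell these two runs apart...
same_answer = script_run.correct == cli_run.correct
# ...while the process metrics can.
ratio = tokens_ratio(script_run, cli_run)
print(f"same answer: {same_answer}, token ratio: {ratio:.1f}x")
```

Recording the failed calls and the latency next to the correctness flag is what lets two "correct" runs be ranked against each other.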

How do we run evaluations?

A few words on how we’ll evaluate agents here. We run every task under three variants (or “tiers”), which are three different ways an agent can come at transformers:

  • bare: pip install transformers, and nothing else
  • clone: the full transformers source, checked out in the working directory
  • skill: a packaged Skill: the CLI’s docs + task examples, loaded in context

These aren’t nested: skill doesn’t contain clone (it ships curated docs, not the source tree), and clone doesn’t contain skill. Each gives the agent a different kind of help. As we’ll see, a model can sometimes do better on clone than on skill.
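One way to picture the three tiers is as sets of resources handed to the agent. The sketch below is only a mental model: the Tier class and the resource labels are hypothetical, and it assumes that every tier has the installed package available, which the tier descriptions suggest but do not state outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Hypothetical description of what one tier gives the agent."""
    name: str
    source_tree: bool  # full transformers checkout in the working directory
    skill: bool        # packaged Skill (CLI docs + task examples) in context

    def resources(self) -> set:
        # Assumption: every tier has the pip-installed package available.
        found = {"installed package"}
        if self.source_tree:
            found.add("source tree")
        if self.skill:
            found.add("skill")
        return found


TIERS = {
    "bare": Tier("bare", source_tree=False, skill=False),
    "clone": Tier("clone", source_tree=True, skill=False),
    "skill": Tier("skill", source_tree=False, skill=True),
}

clone_res = TIERS["clone"].resources()
skill_res = TIERS["skill"].resources()

# Neither tier's resources are a subset of the other's: they are not nested.
nested = clone_res <= skill_res or skill_res <= clone_res
print(f"clone and skill nested: {nested}")
```

Under this model there is no reason to expect skill to dominate clone on every task, since each tier holds something the other lacks.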

Two more choices are worth stating. First, for now we only focus on deterministic tasks that can be scored by exact match, as they provide a very nice ground for experimentation. Model-as-a-judge and other schemes are the obvious next steps for other tasks. Second, every run is its own Hugging Face Job, one per (model × revision × task) combination, so the whole sweep runs in parallel on identical hardware.
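The shape of such a sweep can be sketched in a few lines. This is a hedged illustration rather than the harness itself: the model, revision, and task names are placeholders, run_specs and exact_match are hypothetical helpers, and stripping surrounding whitespace before comparing is our assumption about what "exact match" tolerates. Submitting each spec as a Hugging Face Job is left out on purpose.

```python
from itertools import product

# Placeholder names, not the models, revisions, or tasks actually benchmarked.
MODELS = ["model-a", "model-b"]
REVISIONS = ["rev-old", "rev-new"]
TASKS = ["classify-text", "caption-image", "transcribe-audio"]


def run_specs(models, revisions, tasks):
    """One spec per (model x revision x task); each would become its own job."""
    return [
        {"model": m, "revision": r, "task": t}
        for m, r, t in product(models, revisions, tasks)
    ]


def exact_match(answer: str, expected: str) -> bool:
    """Deterministic scoring. Assumption: surrounding whitespace is ignored."""
    return answer.strip() == expected.strip()


specs = run_specs(MODELS, REVISIONS, TASKS)
print(f"{len(specs)} independent runs to schedule")
print(exact_match(" POSITIVE\n", "POSITIVE"))
```

Because each spec is independent of the others, the runs can be scheduled in any order, which is what makes a fully parallel sweep possible.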