olmo-eval: An evaluation workbench for the model development loop

While you’re building an LLM, you evaluate it over and over across many interventions. Every adjustment to its data, architecture, or hyperparameters — and every step up in scale — sends you back through the same loop: adding or reconfiguring benchmarks, re-running them on each new model checkpoint, noting the results, and checking whether something that helped in a small experiment still holds up on the full training run.

Most evaluation tools aren’t designed for this. They’re built either to run established benchmarks across finished models or to run a model through multi-step, tool-using problems in a sandbox. They don’t keep up with a model that’s constantly changing, nor do they reflect how a model might behave under specific real-world conditions.

Our last project to address this evaluation challenge was OLMES, the Open Language Model Evaluation Standard. Introduced in 2024, it was meant to make LLM benchmark scores easier to compare across releases. The same models were being scored on the same benchmarks in different ways — aspects like prompt formatting and task formulation often varied from paper to paper — so claims about which models performed best often weren’t reproducible. OLMES pinned benchmarking choices down in an open, documented standard, and it became the basis for evaluating our open models from Olmo to Tulu.

But a model’s final score is only part of the evaluation process, which is why we’re releasing olmo-eval, a new workbench that builds on OLMES and extends it across the rest of LLM development. Compared to OLMES, olmo-eval cuts down the work of implementing new evaluations, offers more flexibility in defining where and how they run, and makes it easier to compose individual components into larger workflows. Agentic and multi-turn evaluation is supported as a first-class use case, and stronger analysis tools help you judge whether an intervention actually improved on the baseline or whether the difference amounts to noise.

How olmo-eval differs from existing tools

olmo-eval overlaps in some ways with Harbor, an open framework for evaluating AI agents inside containerized, sandboxed environments. But the two tools differ in their scope. Harbor is aimed mainly at running and publishing agent benchmarks; olmo-eval was built for the everyday work of developing a model—adding and configuring benchmarks, running them across checkpoints, and analyzing the results prompt by prompt instead of as a single overall score.

Harbor runs everything the same way—inside sealed, reproducible containers. Because containers can be resource-intensive, olmo-eval lets you choose how each benchmark runs instead. A benchmark that just needs a model to answer questions can run directly, which is faster and cheaper; a benchmark that needs a locked-down environment — say, one that runs code the model wrote — gets an isolated container setup. The lightweight path is the default, and olmo-eval only opts for the heavy setup when a benchmark actually requires it.
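To make the default-light, heavy-only-when-needed idea concrete, here is a minimal sketch in Python. It is not olmo-eval’s actual API: `ExecutionMode`, `BenchmarkSpec`, `executes_model_code`, and `choose_mode` are hypothetical names invented for illustration, and the real tool may decide this in a different way.

```python
from dataclasses import dataclass
from enum import Enum


class ExecutionMode(Enum):
    DIRECT = "direct"        # call the model and score its answers directly
    CONTAINER = "container"  # run inside an isolated, locked-down sandbox


@dataclass(frozen=True)
class BenchmarkSpec:
    """Hypothetical benchmark description; not olmo-eval's real schema."""
    name: str
    executes_model_code: bool = False  # e.g. the benchmark runs code the model wrote


def choose_mode(spec: BenchmarkSpec) -> ExecutionMode:
    # The lightweight path is the default; the container setup is chosen
    # only when the benchmark declares that it needs isolation.
    if spec.executes_model_code:
        return ExecutionMode.CONTAINER
    return ExecutionMode.DIRECT


specs = [
    BenchmarkSpec("multiple_choice_qa"),
    BenchmarkSpec("code_generation", executes_model_code=True),
]
plan = {spec.name: choose_mode(spec).value for spec in specs}
print(plan)
```

The point of the sketch is that the isolation requirement lives with the benchmark definition, so a question-answering benchmark never pays the container cost unless it asks for it.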

Harbor’s process for adding a benchmark is built for evals you plan to publish and share publicly, with the extra verification steps that entails. olmo-eval is built for moving quickly while you develop, and how you add a benchmark depends on what the benchmark needs. A basic eval takes a short definition, with options to let a model use tools as it works through the benchmark. A benchmark that already has its own code and procedure takes a thin wrapper, so olmo-eval can run it as is and report the results alongside other benchmark scores in the same format.
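The two ways of adding a benchmark can be sketched roughly as follows. Everything here is an assumption made for illustration: `Result`, `short_definition_eval`, `wrap_existing`, and the toy model and legacy benchmark are hypothetical stand-ins, not olmo-eval’s interfaces. The sketch only shows the shape of the idea: a short definition for a basic eval, and a thin adapter that leaves an existing benchmark’s procedure untouched while reporting in a shared format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Result:
    """Hypothetical shared result format: one row per benchmark."""
    benchmark: str
    metric: str
    score: float
    n_examples: int


def short_definition_eval(name: str, examples: List[dict],
                          answer_fn: Callable[[str], str]) -> Result:
    # Basic eval: exact-match accuracy over (prompt, target) pairs.
    correct = sum(answer_fn(ex["prompt"]).strip() == ex["target"] for ex in examples)
    return Result(name, "accuracy", correct / len(examples), len(examples))


def wrap_existing(name: str, run_fn: Callable[[], Dict[str, float]],
                  metric: str) -> Result:
    # Thin wrapper: call the benchmark's own procedure as is, then
    # translate its native output into the shared result format.
    raw = run_fn()
    return Result(name, metric, raw[metric], int(raw["n"]))


# Stand-ins for a model and for a benchmark that ships its own code.
def toy_model(prompt: str) -> str:
    return {"2+2=": "4", "3+3=": "6"}.get(prompt, "?")


def legacy_benchmark() -> Dict[str, float]:
    return {"pass_rate": 0.5, "n": 20}


results = [
    short_definition_eval(
        "toy_arithmetic",
        [{"prompt": "2+2=", "target": "4"}, {"prompt": "5+5=", "target": "10"}],
        toy_model,
    ),
    wrap_existing("legacy_suite", legacy_benchmark, "pass_rate"),
]
for r in results:
    print(f"{r.benchmark:16s} {r.metric:10s} {r.score:.2f} (n={r.n_examples})")
```

Because both paths return the same `Result` type, the rows can sit side by side in one report regardless of how each benchmark was implemented.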

Both Harbor and olmo-eval keep benchmarks separate from the runtime policy (how the model is run to produce its answers) so you can change one without rewriting the other, but olmo-eval is designed for greater modularity. In olmo-eval, the model being evaluated, the tools it can use, the containerized environment, and any helper models, such as an LLM-as-a-judge, are all swappable components. You can reuse a tool across many harnesses, plug a grading model into one benchmark without perturbing the others, and adjust small settings (e.g., the exact wording of the prompt) without extensive effort.
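One way to picture swappable components is a configuration object where each part can be replaced on its own. The sketch below is an assumption, not olmo-eval’s real configuration format: `EvalConfig`, its fields, the model name `checkpoint-1000`, and the two scoring functions are all hypothetical, and `lenient_judge` is a plain string check standing in for a real LLM-as-a-judge.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical bundle of swappable components for one benchmark."""
    benchmark: str
    model: str
    prompt_template: str = "Question: {question}\nAnswer:"
    tools: Tuple[str, ...] = ()
    environment: str = "direct"
    judge: Optional[Callable[[str, str], float]] = None  # helper/grading model


def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())


def lenient_judge(prediction: str, reference: str) -> float:
    # Stand-in for an LLM-as-a-judge: credits answers that contain the reference.
    return float(reference.strip().lower() in prediction.lower())


base = EvalConfig(benchmark="short_answer_qa", model="checkpoint-1000", judge=exact_match)

# Swap one component at a time; every other field is carried over unchanged.
with_judge = replace(base, judge=lenient_judge)
with_prompt = replace(base, prompt_template="Q: {question}\nA:")

print(base.judge("The answer is Paris.", "Paris"))
print(with_judge.judge("The answer is Paris.", "Paris"))
print(with_prompt.prompt_template.format(question="Capital of France?"))
```

Keeping the config immutable and deriving variants with `replace` means a change to the judge or the prompt wording is explicit and cannot leak into other benchmarks that share the base config.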

Harbor reports an overall score for each model. olmo-eval reports those scores too, each with a standard error and a minimum detectable effect (the smallest difference that can be reliably distinguished from noise). But the more useful view lines the same questions up across two model checkpoints and compares them one by one, with all else held fixed. This helps you see whether a tiny change in an overall average reflects a real improvement or simply noise.
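The statistics in this comparison can be illustrated with a small worked example. This is a sketch under stated assumptions, not olmo-eval’s implementation: the per-question scores are toy data, the function names are hypothetical, and the minimum detectable effect uses a common textbook approximation (two-sided test at significance level 0.05 with 80% power), which may differ from the definition olmo-eval uses.

```python
import math
from statistics import NormalDist, mean, stdev


def standard_error(values):
    # Standard error of the mean: sample standard deviation / sqrt(n).
    return stdev(values) / math.sqrt(len(values))


def minimum_detectable_effect(se_diff, alpha=0.05, power=0.80):
    # Textbook approximation: (z_{1 - alpha/2} + z_{power}) * SE of the difference.
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se_diff


# Toy per-question correctness (1 = correct) on the same 20 questions.
checkpoint_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
checkpoint_b = [1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# Unpaired view: treat the two overall scores as independent averages.
se_a = standard_error(checkpoint_a)
se_b = standard_error(checkpoint_b)
unpaired_se = math.sqrt(se_a ** 2 + se_b ** 2)

# Paired view: line the same questions up and compare them one by one.
diffs = [b - a for a, b in zip(checkpoint_a, checkpoint_b)]
mean_diff = mean(diffs)
paired_se = standard_error(diffs)
mde = minimum_detectable_effect(paired_se)

print(f"accuracy A = {mean(checkpoint_a):.2f} (SE {se_a:.3f})")
print(f"accuracy B = {mean(checkpoint_b):.2f} (SE {se_b:.3f})")
print(f"difference = {mean_diff:.2f}")
print(f"unpaired SE = {unpaired_se:.3f}, paired SE = {paired_se:.3f}")
print(f"paired MDE = {mde:.3f}, detectable = {abs(mean_diff) >= mde}")
```

On this toy data the paired standard error is smaller than the unpaired one, because most questions get the same outcome from both checkpoints and that shared variation cancels in the per-question differences. The same average gap is therefore easier to tell apart from noise when questions are compared one by one.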
