Life After Benchmark Saturation: A Case Study of CORE-Bench

Abstract: When a benchmark’s accuracy saturates, it is often retired and replaced with a more challenging version. We show that this approach privileges accuracy and misses the opportunity to study six other key dimensions of agent performance: construct validity issues such as shortcuts, out-of-distribution generalizability, efficiency, reliability, the relative importance of the model versus the scaffold, and uplift from human-agent collaboration.

We use CORE-Bench Hard, a benchmark for computational reproducibility of scientific code, as a case study to demonstrate that measuring agents along these dimensions yields meaningful insights into agent performance even after accuracy saturates. First, we surface threats to construct validity in CORE-Bench Hard that are difficult to anticipate with less capable agents. We introduce an improved benchmark, CORE-Bench v1.1, and an out-of-distribution task suite, CORE-Bench OOD.

Second, we find that despite accuracy saturation, CORE-Bench v1.1 remains useful for measuring efficiency, reliability, model performance, and scaffold performance. Finally, we conduct a small-scale randomized experiment to measure uplift from human-agent collaboration on real-world computational reproducibility tasks. We find a statistically significant speedup of about a factor of two. This is likely an underestimate, because one-fifth of human-only reproductions reached the time limit before completing. We also describe several other findings.
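The reason a time limit leads to an underestimated speedup can be illustrated with a minimal sketch. The times, the 180-minute limit, and the `speedup` helper below are hypothetical and chosen only for illustration; they are not data or methods from the study. The sketch assumes speedup is computed as a ratio of mean completion times, which may differ from the estimator actually used.

```python
def mean(xs):
    return sum(xs) / len(xs)


def speedup(human_only, with_agent, time_limit=None):
    """Ratio of mean human-only time to mean human-plus-agent time.

    If time_limit is given, every time is capped at the limit, which
    mimics runs that were stopped before completing (right-censoring).
    """
    if time_limit is not None:
        human_only = [min(t, time_limit) for t in human_only]
        with_agent = [min(t, time_limit) for t in with_agent]
    return mean(human_only) / mean(with_agent)


# Hypothetical completion times in minutes (NOT data from the study).
# One of the five human-only runs would need longer than the limit.
human_only = [60, 90, 120, 150, 300]
with_agent = [30, 45, 60, 75, 90]

true_speedup = speedup(human_only, with_agent)
observed_speedup = speedup(human_only, with_agent, time_limit=180)

print(f"speedup without a time limit: {true_speedup:.2f}")
print(f"speedup with a 180-minute limit: {observed_speedup:.2f}")
```

Because the capped human-only run is recorded as taking exactly the limit rather than its full duration, the human-only mean shrinks while the human-plus-agent mean is unchanged, so the observed ratio is lower than the uncapped one.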

Together, our contributions present a more rigorous alternative to the dominant accuracy-centric evaluation paradigm.
