WorkBench Revisited: Workplace Agents Two Years On

Abstract: The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We revisit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% of tasks and takes an unintended harmful action on 2.5% of them.

Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage.

Second, while several classes of error have been eliminated entirely, frontier models still make basic mistakes that occasionally cause irreversible harm, such as sending an email to the wrong person.

Third, the rise of open-weight models has drastically lowered the cost of reaching a performance level that was previously available only from proprietary models, while costs at the frontier have stayed relatively stable. We release an updated version of the benchmark with data and code quality improvements, new model scores, and an analysis of agent progress on WorkBench since 2024.
