CEO-Bench: Can Agents Play the Long Game?

Abstract: Language model agents are becoming proficient at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; and (4) orchestrating multiple moving parts toward a coherent goal.

We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO.

Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions through code. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences.

Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.
