Patronus AI lands $50M to build ‘digital worlds’ that stress-test AI agents

AI agents are becoming more sophisticated, evolving from answering questions to autonomously executing complex, multi-step tasks. But before these agents can be trusted to book trips or conduct financial analysis on behalf of users, model providers and the startups building such agents want to ensure that they perform reliably across a vast range of scenarios.

AI labs often use benchmarks to show off their models’ prowess, but a high score, even on an agent-oriented benchmark, doesn’t actually prove that an AI can accomplish various complex, real-world jobs correctly. Patronus AI, a startup founded in 2023 by former Meta AI researchers Anand Kannappan and Rebecca Qian, is helping model makers and companies fine-tune models to do just that by building simulated digital environments in which to evaluate the agents’ performance.

The San Francisco-based startup must be solving an important problem. Virtually every frontier AI lab and many emerging startups are now customers, according to Glenn Solomon, a managing director at Notable Capital, who describes demand for the company’s simulated environments as nearly insatiable. Patronus’ revenue has grown 15-fold over the past year, fueling significant investor interest. On Thursday, the company announced a $50 million Series B round led by Greenfield Partners, with participation from Notable Capital, Lightspeed, Datadog, and Samsung. The round brings the company’s total funding to $70 million.

Patronus uses what it calls “digital world models” to create replicas of websites and internal systems. In these environments, agents are stress-tested after training using reinforcement learning, which iteratively rewards successful task completion and penalizes errors. AI labs see great value in these digital simulations because they give agents a chance to work through different, sometimes unpredictable, scenarios.

The company compares its approach to how Waymo trained autonomous cars by first building synthetic worlds to test vehicles against rare hazards, such as severe weather or a child running after a ball. The difference with AI agents is that they tend to take shortcuts, which means they fail to complete the task correctly. “Patronus is really good at spotting the hacks and making sure they are holding the models accountable,” Solomon said.

Patronus currently offers its simulated digital worlds for software engineering and finance, but these are just the start, according to Kannappan. “Today we’re very focused on the problems that are verifiable, so the problems that you can immediately check and verify, but there are a ton more areas that are very non-verifiable or very hard to verify,” he said. Just because these processes are verifiable doesn’t mean they are simple. “We want to be able to actually create the environment in which you can operate an agent that can run for 10 hours or 10 days or 10 weeks,” Kannappan said.

As for rivals, Patronus believes it is primarily competing against the internal teams AI labs have already built to evaluate agent behavior. While human-data firms like Mercor and Surge help model makers with reinforcement learning, Patronus operates differently by evaluating how agents behave without any human involvement.