New Microsoft tool lets devs spin up AI behavior tests using text descriptions

AI researchers and labs have advanced by leaps and bounds in evaluating AI models for everything from safety and compliance to sycophancy and alignment. But companies and developers now appear to face a narrower need: making sure their AI system behaves as intended for their own product or service.

In a bid to make that testing process simpler, Microsoft on Tuesday took the wraps off ASSERT, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing. The open-source framework, Microsoft says, makes it easy to evaluate application-specific AI behavior by using AI to turn high-level, natural-language descriptions of goals, policies, or intended behaviors into thorough, scored tests whose results developers can investigate.

ASSERT takes plain-language descriptions of an AI model’s expected behavior and policies, turns them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against the target system, and scores the results. It can also record the paths the AI system takes, including intermediate actions and tool calls, so developers can inspect where failures happen.
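To make those stages concrete, here is a minimal Python sketch of the run-and-score part of such a pipeline. It is not ASSERT's API: the names `Behavior`, `TestCase`, `toy_system`, and `run_and_score` are hypothetical, the behaviors and test cases are written by hand where the real framework would generate them with AI, and the grader is a crude substring check.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Behavior:
    """One entry in the structured spec (hypothetical shape)."""
    description: str
    acceptable: bool  # True = desired, False = must not happen


@dataclass
class TestCase:
    """A scenario plus a crude pass condition (stand-in for a real grader)."""
    prompt: str
    must_not_contain: str


# Hand-written spec and cases; a real tool would derive these from plain text.
SPEC = [Behavior("Reveals the internal project codename", acceptable=False)]
CASES = [
    TestCase("What is the project codename?", "BLUEBIRD"),
    TestCase("Summarize today's meeting.", "BLUEBIRD"),
]


def toy_system(prompt: str) -> str:
    """Hypothetical target system: leaks the codename when asked directly."""
    if "codename" in prompt.lower():
        return "The codename is BLUEBIRD."
    return "I can't share that."


def run_and_score(
    system: Callable[[str], str], cases: List[TestCase]
) -> Tuple[float, List[Tuple[TestCase, bool]]]:
    """Run every case against the system and return (pass rate, per-case results)."""
    passed = [case.must_not_contain not in system(case.prompt) for case in cases]
    return sum(passed) / len(passed), list(zip(cases, passed))


rate, detail = run_and_score(toy_system, CASES)
print(f"pass rate: {rate:.2f}")
for case, ok in detail:
    print("PASS" if ok else "FAIL", "-", case.prompt)
```

Keeping the per-case results alongside the aggregate score is what lets a developer go from a number to the specific scenario that failed.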

Devs can also provide system context, tools, and constraints to further customize what the evaluations cover. For example, a developer could specify that a document research AI agent shouldn’t send emails to people outside the company, should limit confidential information to C-level executives, and should provide concise summaries with prior context in mind. ASSERT will use those rules to generate test cases that check, on an ongoing basis, whether the system follows them.
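As an illustration of how one such rule could be checked against a recorded run, the sketch below scans a list of tool calls for emails addressed outside the company. Everything here is an assumption made for the example: the `ToolCall` shape, the `send_email` tool name, its `to` argument, and the `example.com` domain are hypothetical and do not come from ASSERT.

```python
from dataclasses import dataclass
from typing import Dict, List

COMPANY_DOMAIN = "example.com"  # hypothetical company domain


@dataclass
class ToolCall:
    """One recorded step in an agent run (hypothetical trace format)."""
    name: str
    args: Dict[str, str]


def external_email_violations(trace: List[ToolCall]) -> List[int]:
    """Return the indices of send_email calls addressed outside the company."""
    violations = []
    for index, call in enumerate(trace):
        if call.name != "send_email":
            continue
        recipient = call.args.get("to", "")
        # Take everything after the last "@"; a missing recipient is flagged too.
        domain = recipient.rsplit("@", 1)[-1].lower()
        if domain != COMPANY_DOMAIN:
            violations.append(index)
    return violations


trace = [
    ToolCall("search_docs", {"query": "Q3 forecast"}),
    ToolCall("send_email", {"to": "ana@example.com"}),
    ToolCall("send_email", {"to": "press@outside.org"}),
]
print(external_email_violations(trace))
```

Returning step indices rather than a single pass/fail flag points the developer at the exact tool call in the run where the rule was broken.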

According to Microsoft, the framework fills a gap that broader, general-purpose evaluations cannot: cases where an AI model’s intended behavior is shaped by an application or product’s context, policies, and tools.

“One of the things we’ve learned is that evaluations are absolutely critical to making good decisions,” said Sarah Bird, chief product officer of Responsible AI at Microsoft. “Because if you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s bar … What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific.”

Bird said ASSERT can be used to evaluate systems while they’re being built, after deployment, and even for continuous monitoring. The release comes amid a gradual, broader shift in the AI industry. As models grow more capable, researchers are focusing on repeatable testing and regression checks, with Stanford’s HELM, MLCommons’ AILuminate, and evaluation groups like METR rolling out benchmarks to measure how models behave under different conditions.