EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios
Introduction
Voice agent failures are often highly domain-specific. A system that flawlessly processes alphanumeric confirmation codes in flight re-booking transactions might stumble when handling complex policies in HR systems. Different domains test an agent’s ability to adapt to different vocabulary, workflow complexities and user expectations.
With this release, EVA-Bench expands from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). Together they span 213 evaluation scenarios across 121 tools, a roughly 4x increase in scenario coverage over our original release. Every scenario was validated for solvability against three frontier models (OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6), ensuring the benchmark is both challenging and fair.
All three datasets are open-source and available for download:
from datasets import load_dataset
# Airline Customer Service Management (CSM) — 50 scenarios
airline = load_dataset("ServiceNow-AI/eva-bench", "airline", split="test")
# Enterprise IT Service Management (ITSM) — 80 scenarios
itsm = load_dataset("ServiceNow-AI/eva-bench", "itsm", split="test")
# Healthcare HR Service Delivery (HRSD) — 83 scenarios
hrsd = load_dataset("ServiceNow-AI/eva-bench", "medical", split="test")
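As a quick sanity check after loading, the per-domain sizes can be compared against the counts quoted in the comments above. The sketch below assumes those counts are accurate for the `test` split; `EXPECTED_SCENARIOS` and `check_scenario_counts` are hypothetical names used for illustration, not part of the EVA-Bench release.

```python
# Counts copied from the comments in the loading snippet (assumed accurate).
EXPECTED_SCENARIOS = {"airline": 50, "itsm": 80, "medical": 83}


def check_scenario_counts(splits):
    """Return {config: (actual, expected)} for every config whose size differs."""
    mismatches = {}
    for config, expected in EXPECTED_SCENARIOS.items():
        actual = len(splits[config])
        if actual != expected:
            mismatches[config] = (actual, expected)
    return mismatches


# The three domains should add up to the 213 scenarios quoted in the post.
assert sum(EXPECTED_SCENARIOS.values()) == 213
```

With the variables from the loading snippet, `check_scenario_counts({"airline": airline, "itsm": itsm, "medical": hrsd})` should return an empty dict if the published splits match those counts.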
EVA-Bench is built for multiple audiences. If you’re evaluating a voice agent, you can run it against a diverse set of realistic enterprise scenarios spanning 35+ distinct workflows. If you’re building your own evaluation dataset, this post describes our end-to-end generation and validation process in enough detail to serve as a practical reference. We walk through how each domain was designed and generated and take a deep dive into the two new additions. We also preview our upcoming multilingual extension, which widens the benchmark’s reach beyond English-only enterprise deployments.
Data Design Principles
Five principles guided the design of the EVA-Bench datasets across all three domains.
Voice-first scope. Not every enterprise workflow belongs in a voice benchmark. We started by identifying which tasks within each domain are handled over the phone in practice, then selected the most common flows from that subset. This kept the scenarios grounded in realistic call patterns.
Realism. Tool schemas were modeled after the kinds of APIs a production platform uses. Scenario policies were drawn from actual enterprise constraints. For the Healthcare HRSD domain, this meant grounding scenarios in real US healthcare policy and administration systems, including NPI numbers, FMLA, and insurance coverage, so that the benchmark reflects the domain as practitioners encounter it.
Variety. Scaling a dataset by simply repeating identical tasks offers limited evaluation signal. To avoid this, we defined specific workflows for each domain and sampled across three scenario types: single-intent calls, multi-intent calls with up to four intents in a single conversation, and adversarial calls where callers attempt to bypass troubleshooting steps, misclassify urgency, or access records they are not authorized to view. Within single and multi-intent scenarios, we also included cases where the user’s goal is not satisfiable, because real call volume is not all happy-path, and in our experience models tend to struggle more with unsatisfiable goals than with successful interactions.
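The sampling described above can be pictured as a small stratified sampler. This is a minimal sketch, not the pipeline used to build EVA-Bench: the `ScenarioSpec` class, the `sample_scenario_specs` helper, the `unsatisfiable_rate` parameter, and the choice to give adversarial calls a single workflow are all assumptions made for illustration. Only the three scenario types, the four-intent cap, and the presence of unsatisfiable goals in single- and multi-intent calls come from the text.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_INTENTS = 4  # the post caps multi-intent calls at four intents


@dataclass(frozen=True)
class ScenarioSpec:
    scenario_type: str
    workflows: Tuple[str, ...]
    satisfiable: Optional[bool]  # None for adversarial calls (not modeled here)


def sample_scenario_specs(workflows, per_type, unsatisfiable_rate=0.25, seed=0):
    """Sample `per_type` specs for each scenario type from a list of workflows."""
    if len(workflows) < 2:
        raise ValueError("need at least two workflows for multi-intent calls")
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    specs = []
    for scenario_type in ("single_intent", "multi_intent", "adversarial"):
        for _ in range(per_type):
            if scenario_type == "multi_intent":
                n_intents = rng.randint(2, min(MAX_INTENTS, len(workflows)))
            else:
                n_intents = 1  # assumption: adversarial calls carry one workflow
            chosen = tuple(rng.sample(workflows, n_intents))
            if scenario_type == "adversarial":
                satisfiable = None
            else:
                # A share of ordinary calls gets a goal that cannot be satisfied.
                satisfiable = rng.random() >= unsatisfiable_rate
            specs.append(ScenarioSpec(scenario_type, chosen, satisfiable))
    return specs
```

Calling `sample_scenario_specs(["rebooking", "refund", "baggage_claim"], per_type=5)` with these made-up workflow names yields fifteen specs, five per scenario type.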
Authentication. Prior work (EVA-Bench and τ-Voice) has identified authentication as one of the most consistent failure points for voice agents. Every domain in EVA-Bench includes authentication flows, and the specific mechanisms are calibrated to the task. For example, OTP-based elevation appears where a production system would actually require it, not uniformly across all scenarios.
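One way to express "calibrated to the task" is a per-tool authentication requirement, so that OTP elevation is demanded only by the tools that need it. The sketch below is purely illustrative: the tool names, the three-level `AuthLevel` ladder, and the `missing_auth_steps` helper are hypothetical and do not come from the EVA-Bench tool schemas.

```python
from enum import IntEnum


class AuthLevel(IntEnum):
    NONE = 0
    IDENTITY_VERIFIED = 1
    OTP_ELEVATED = 2


# Hypothetical tools: only the sensitive one requires OTP elevation.
REQUIRED_LEVEL = {
    "get_office_hours": AuthLevel.NONE,
    "get_ticket_status": AuthLevel.IDENTITY_VERIFIED,
    "reset_password": AuthLevel.OTP_ELEVATED,
}


def missing_auth_steps(tool_name, current_level):
    """List the auth levels the caller still has to reach before the tool may run."""
    required = REQUIRED_LEVEL[tool_name]
    return [level for level in AuthLevel if current_level < level <= required]
```

Under this model, an unauthenticated caller asking for `reset_password` must first verify identity and then complete OTP elevation, while `get_office_hours` needs no authentication at all.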
Reproducibility. Without reproducible scenarios, it is difficult to know whether a score difference reflects a genuine capability gap or an artifact of how the scenario played out. We designed the dataset so that every scenario has exactly one correct resolution path. User goal construction ensures the simulator always has the information and instructions it needs to behave consistently, and scenario generation explicitly checks for and eliminates any cases where multiple valid action sequences could achieve the same outcome.
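The single-path requirement can be illustrated with a brute-force check over a toy model. This is a sketch under stated assumptions, not the check used during EVA-Bench scenario generation: it assumes state can be modeled as a set of facts, that success means reaching an exact expected end state, and that tools are simple precondition/effect rules. All tool and helper names are hypothetical.

```python
def requires_then_adds(required, added):
    """Build a toy tool: applicable only if `required` facts hold, then adds facts."""
    required, added = frozenset(required), frozenset(added)

    def apply(state):
        if not required <= state:
            return None  # precondition not met
        return state | added

    return apply


def find_resolution_paths(initial, goal, actions, max_len=4):
    """Enumerate action sequences (no repeats) that end exactly in `goal`."""
    paths = []

    def dfs(state, path):
        if state == goal:
            paths.append(tuple(path))
            return
        if len(path) == max_len:
            return
        for name, apply in actions.items():
            if name in path:
                continue
            nxt = apply(state)
            if nxt is None or nxt == state:
                continue  # skip inapplicable tools and no-ops
            dfs(nxt, path + [name])

    dfs(frozenset(initial), [])
    return paths


actions = {
    "verify_identity": requires_then_adds([], ["verified"]),
    "rebook_flight": requires_then_adds(["verified"], ["rebooked"]),
    "waive_fee": requires_then_adds(["verified"], ["fee_waived"]),
}
goal = frozenset({"verified", "rebooked"})

unique = find_resolution_paths([], goal, actions)
# A second tool with the same effect makes the scenario ambiguous.
ambiguous = find_resolution_paths(
    [], goal, {**actions, "override_rebook": requires_then_adds(["verified"], ["rebooked"])}
)
```

In this toy setup a scenario would be kept only when `len(paths) == 1`; the variant with `override_rebook` has two valid sequences and would be rejected or revised.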
Scenario Generation
Joint generation. Scenarios are generated using SyGra, a graph-based synthetic data generation pipeline, with GPT-5.4 as the backbone. Each scenario requires three jointly consistent components, which are generated together to prevent the inconsistencies that arise when components are produced independently:
User goal. Reproducibility requires that the user simulator behaves the same way every time a scenario is run. A vague statement of intent does not achieve this: the simulator will make different judgment calls across runs, producing inconsistent evaluation signals. To eliminate this, the user goal is structured as a decision tree that covers every situation the simulator is likely to encounter. The user goal specifies exactly what the user should ask for, along with a negotiation sequence that defines when to push back, when to ask for alternatives, and when to accept. Common edge cases…