Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments. Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.

VAKRA provides an executable environment where agents interact with more than 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require 3-7 step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints. As can be seen below, models perform poorly on VAKRA. In this blog, we provide additional details about the tasks in VAKRA and analyze the failure modes we observed across them.

Task Description

As shown below, the VAKRA benchmark comprises four tasks, each testing a different set of capabilities.

Capability 1: API Chaining using Business Intelligence APIs

This capability includes 2,077 test instances across 54 domains, requiring the use of tools from the SLOT-BIRD and SEL-BIRD collections (Elder et al., 2026). Compared to the setup in Elder et al., the tool universe in SLOT-BIRD and SEL-BIRD is expanded through the inclusion of a larger number of domains. Each domain is restricted to one tool collection, and tasks involve chaining 1–12 tool calls to arrive at the final answer.

As shown above, each instance has an associated JSON data source from which the answer must be derived. The MCP servers supporting this task include a special tool, get_data(tool_universe_id=id), which must be called at the beginning of each instance. This tool initializes the data source, returns a lightweight preview of the data, and stores the full dataset server-side, which avoids inefficient transfers of large data over MCP.
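The initialize-then-preview pattern described above can be illustrated with a small in-memory sketch. This is not VAKRA's server code: the class `ToyToolServer`, the `preview_rows` parameter, the follow-up tool `filter_rows`, and the shape of the returned dictionaries are all assumptions made for illustration. Only the name `get_data` and its `tool_universe_id` argument come from the text.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class ToyToolServer:
    """Toy stand-in for an MCP server (hypothetical, for illustration only)."""

    def __init__(self, sources: Dict[str, List[Row]]) -> None:
        self._sources = sources                   # every data source, keyed by id
        self._active: Optional[List[Row]] = None  # full dataset stays server-side

    def get_data(self, tool_universe_id: str, preview_rows: int = 2) -> Dict[str, Any]:
        """Initialize the data source and return only a lightweight preview."""
        self._active = list(self._sources[tool_universe_id])
        columns = sorted(self._active[0].keys()) if self._active else []
        return {
            "num_rows": len(self._active),
            "columns": columns,
            "preview": self._active[:preview_rows],
        }

    def filter_rows(self, column: str, value: Any) -> Dict[str, Any]:
        """Hypothetical follow-up tool that operates on the stored dataset."""
        if self._active is None:
            raise RuntimeError("get_data must be called before any other tool")
        self._active = [row for row in self._active if row.get(column) == value]
        return {"num_rows": len(self._active), "preview": self._active[:2]}


# Example session: the agent only ever sees previews, never the full table.
server = ToyToolServer({
    "sales": [
        {"region": "EU", "amount": 10},
        {"region": "US", "amount": 7},
        {"region": "EU", "amount": 5},
    ]
})
preview = server.get_data(tool_universe_id="sales")
filtered = server.filter_rows("region", "EU")
```

In this sketch each tool call returns a row count and a short preview, so the agent can chain further calls without the full dataset ever crossing the protocol boundary.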

Capability 2: Tool Selection using Dashboard APIs

This capability includes 1,597 instances across 17 domains, requiring tools from an expanded REST-BIRD collection (Elder et al.). These tools use endpoint-style interfaces: highly specific, query-aligned endpoints that encapsulate most of the computation. They are served as REST APIs by a FastAPI server, which is wrapped by the MCP server. This task requires selecting the correct APIs from the domain-specific tool set. Each domain contains between 6 and 328 tools, with an average of 116.

The OpenAI API Specification restricts the tool list input to a maximum of 128 tools. An agent builder using this API must therefore manage the length of the tool list directly via a shortlisting mechanism. The baseline agents in our repository handle this with a simple shortlisting capability.
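To make the shortlisting idea concrete, here is a minimal sketch that ranks tools by word overlap with the user query and keeps the top 128. The text does not describe how the repository's baseline shortlists tools, so the lexical-overlap scoring, the function name `shortlist_tools`, and the `name`/`description` fields are assumptions rather than the actual implementation.

```python
import re
from typing import Dict, List, Set

MAX_TOOLS = 128  # tool-list limit cited in the text


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(
    query: str,
    tools: List[Dict[str, str]],
    limit: int = MAX_TOOLS,
) -> List[Dict[str, str]]:
    """Keep the `limit` tools whose name and description share the most words with the query."""
    query_tokens = _tokens(query)

    def score(tool: Dict[str, str]) -> int:
        tool_text = tool["name"] + " " + tool.get("description", "")
        return len(query_tokens & _tokens(tool_text))

    # sorted() is stable, so tools with equal scores keep their original order.
    ranked = sorted(tools, key=score, reverse=True)
    return ranked[:limit]


# Example: a domain with more tools than the limit allows.
catalog = [{"name": f"tool_{i}", "description": "generic endpoint"} for i in range(328)]
catalog.append({"name": "get_player_height", "description": "height of a player"})
shortlisted = shortlist_tools("What is the height of player X?", catalog)
```

Lexical overlap is only one option. An embedding-based retriever could be swapped in behind the same function signature without changing the calling code.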

Capability 3: Multi-Hop Reasoning using Dashboard APIs

The Capability 3 segment of the benchmark has 869 test instances drawn from 38 subject domains. These instances again rely on the REST-BIRD API collection.