Anchor: Mitigating Artifact Drift in Agent Benchmark Generation


AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale.

Environment and task creation often suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent.

We introduce Anchor, a task-generation pipeline that formalizes domain experts’ specifications of business workflows into constraint optimization programs.

From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier.
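The idea of deriving every artifact from one specification can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ProcurementSpec` fields, the toy purchasing problem, and the exhaustive search that stands in for a real constraint solver are hypothetical and are not taken from Anchor itself.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class ProcurementSpec:
    """Hypothetical parametric specification (not the paper's schema)."""
    demand: int        # units that must be purchased
    unit_price: dict   # supplier name -> price per unit
    capacity: dict     # supplier name -> maximum units available


def solve(spec):
    """Exhaustive search standing in for a real constraint solver."""
    suppliers = sorted(spec.unit_price)
    best = None  # (cost, plan) of the cheapest feasible plan found so far
    for qtys in product(*(range(spec.capacity[s] + 1) for s in suppliers)):
        if sum(qtys) != spec.demand:
            continue
        cost = sum(q * spec.unit_price[s] for s, q in zip(suppliers, qtys))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(suppliers, qtys)))
    return best


def generate_task(spec):
    """Derive all four artifacts from the same spec so they cannot disagree."""
    solved = solve(spec)
    if solved is None:
        raise ValueError("infeasible spec: no task is emitted")
    optimal_cost, plan = solved

    instruction = (
        f"Purchase exactly {spec.demand} units at the lowest total cost. "
        + " ".join(
            f"Supplier {s} charges {spec.unit_price[s]} per unit "
            f"and can supply at most {spec.capacity[s]} units."
            for s in sorted(spec.unit_price)
        )
    )
    environment = {
        s: {"price": spec.unit_price[s], "capacity": spec.capacity[s]}
        for s in spec.unit_price
    }

    def verifier(end_state):
        """Check only the final orders (supplier -> quantity), not the steps."""
        if any(s not in spec.capacity for s in end_state):
            return False
        if any(q < 0 or q > spec.capacity[s] for s, q in end_state.items()):
            return False
        if sum(end_state.values()) != spec.demand:
            return False
        cost = sum(q * spec.unit_price[s] for s, q in end_state.items())
        return cost == optimal_cost

    return {
        "instruction": instruction,
        "environment": environment,
        "solution": plan,
        "optimal_cost": optimal_cost,
        "verifier": verifier,
    }
```

Because the instruction text, the environment data, the reference solution, and the verifier all read the same `spec` object, changing one parameter (for example `demand`) updates all four together, which is the property the pipeline description relies on.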

With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness.

We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system.

We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials.
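The gap between the two rates reflects two nested checks on a trial's end state: whether it satisfies every explicit constraint, and whether it also matches the solver's optimal objective value. The sketch below shows one way such rates could be computed. The function names, the toy constraints, and the four example end states are hypothetical and do not reproduce ERP-Bench's grading code or data.

```python
def grade(end_state, constraints, cost, optimal_cost):
    """Return (feasible, optimal) for one trial's final state."""
    feasible = all(check(end_state) for check in constraints)
    # A state counts as optimal only if it is feasible in the first place.
    optimal = feasible and cost(end_state) == optimal_cost
    return feasible, optimal


def rates(grades):
    """Fraction of trials that are feasible, and fraction that are optimal."""
    n = len(grades)
    if n == 0:
        raise ValueError("no trials to aggregate")
    feasible_rate = sum(f for f, _ in grades) / n
    optimal_rate = sum(o for _, o in grades) / n
    return feasible_rate, optimal_rate


# Made-up toy task: buy exactly 5 units; A costs 3 (max 4), B costs 2 (max 3).
PRICES = {"A": 3, "B": 2}
CONSTRAINTS = [
    lambda s: s["A"] + s["B"] == 5,
    lambda s: 0 <= s["A"] <= 4,
    lambda s: 0 <= s["B"] <= 3,
]


def total_cost(state):
    return sum(PRICES[name] * qty for name, qty in state.items())


# Made-up end states from four imaginary trials.
TRIALS = [
    {"A": 2, "B": 3},  # feasible and optimal (cost 12)
    {"A": 4, "B": 1},  # feasible but more expensive (cost 14)
    {"A": 1, "B": 3},  # infeasible: only 4 units purchased
    {"A": 0, "B": 5},  # infeasible: exceeds B's capacity
]

GRADES = [grade(t, CONSTRAINTS, total_cost, 12) for t in TRIALS]
FEASIBLE_RATE, OPTIMAL_RATE = rates(GRADES)
```

By construction the optimal rate can never exceed the feasible rate, since every optimal end state is also a feasible one.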

Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work.