Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Abstract: Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that small-corpus distillation handles this failure mode poorly: a few hundred teacher traces can teach workflow format, but they rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs.
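To make the failure stages concrete, the sketch below statically checks a small workflow for three of them: tool resolution, parameter validation, and dependency tracking. It is an illustration only, not the Evoflux implementation. The `Step` type, the `check_workflow` function, the `$step_id` reference syntax, and the two-tool catalog are all assumptions invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    id: str
    tool: str
    # Argument values are literals or "$step_id" references (hypothetical syntax).
    args: dict = field(default_factory=dict)


def check_workflow(steps, catalog):
    """Return a list of static errors; an empty list means these checks passed."""
    errors = []
    seen = set()
    for step in steps:
        schema = catalog.get(step.tool)
        if schema is None:
            # Tool resolution failure: the planner named a tool the catalog lacks.
            errors.append(f"{step.id}: unknown tool '{step.tool}'")
        else:
            # Parameter validation failure: a required argument is absent.
            for name in schema["required"]:
                if name not in step.args:
                    errors.append(f"{step.id}: missing parameter '{name}'")
        # Dependency tracking failure: a reference to a step not yet defined.
        for value in step.args.values():
            if isinstance(value, str) and value.startswith("$") and value[1:] not in seen:
                errors.append(f"{step.id}: unresolved dependency '{value}'")
        seen.add(step.id)
    return errors


# Hypothetical catalog: tool name -> minimal schema.
CATALOG = {
    "search_papers": {"required": ["query"]},
    "summarize": {"required": ["text"]},
}

plan = [
    Step("s1", "search_paper", {"query": "MCP agents"}),  # misspelled tool name
    Step("s2", "summarize", {"text": "$s0"}),             # refers to a nonexistent step
]
errors = check_workflow(plan, CATALOG)

fixed = [
    Step("s1", "search_papers", {"query": "MCP agents"}),
    Step("s2", "summarize", {"text": "$s1"}),
]
fixed_errors = check_workflow(fixed, CATALOG)
```

The first plan looks plausible but fails two checks before anything runs; the second passes them. Passing static checks does not imply the workflow executes successfully against live servers, which is the fourth failure stage the abstract names.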

We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17–24% across small planners. In contrast, SFT and SFT+DPO trained on the same search-mined data match zero-shot performance, underperform it, or collapse below it; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable when teacher-trace budgets are scarce.
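As a rough illustration of the search loop described above, the following toy sketch evolves a workflow through structured edits, scores candidates with a stand-in for execution feedback, raises edit intensity when progress stalls, and prunes duplicate plans. It is a minimal sketch under stated assumptions, not the authors' algorithm: the catalog, the `failures` score, the edit operators, and every parameter value are hypothetical, and meta-guided redesign is omitted entirely.

```python
import random

# Hypothetical catalog: tool name -> required parameter names.
CATALOG = {"geocode": ["address"], "weather": ["lat", "lon"]}


def failures(plan):
    """Stand-in for execution feedback: count unresolved tools and missing parameters."""
    count = 0
    for tool, args in plan:
        if tool not in CATALOG:
            count += 1
        else:
            count += sum(1 for p in CATALOG[tool] if p not in args)
    return count


def edit(plan, rng):
    """Apply one structured edit to a copy: swap an unknown tool or fill a missing parameter."""
    plan = [(tool, dict(args)) for tool, args in plan]
    i = rng.randrange(len(plan))
    tool, args = plan[i]
    if tool not in CATALOG:
        plan[i] = (rng.choice(sorted(CATALOG)), args)
    else:
        missing = [p for p in CATALOG[tool] if p not in args]
        if missing:
            args[rng.choice(missing)] = "<placeholder>"
    return plan


def evolve(seed_plan, generations=20, population=6, seed=0):
    rng = random.Random(seed)
    pool = [seed_plan]
    intensity = 1  # number of edits applied per child
    best = failures(seed_plan)
    for _ in range(generations):
        children = []
        for parent in pool:
            child = parent
            for _ in range(intensity):
                child = edit(child, rng)
            children.append(child)
        # Diversity pruning: keep one copy of each distinct plan, then the fittest few.
        unique = {repr(p): p for p in pool + children}
        pool = sorted(unique.values(), key=failures)[:population]
        new_best = failures(pool[0])
        # Adaptive intensity: reset after an improvement, edit more when stalled.
        intensity = 1 if new_best < best else min(intensity + 1, 3)
        best = min(best, new_best)
        if best == 0:
            break
    return pool[0], best


seed_plan = [("geocod", {"address": "Berlin"}), ("weather", {})]
repaired, score = evolve(seed_plan)
```

Because parents compete with their children for survival, the best score never gets worse from one generation to the next. A real system would replace `failures` with feedback from actually resolving and executing the workflow against the tool servers.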
