Recursive Self-Evolving Agents via Held-Out Selection

Abstract: LLM agents are increasingly improved without weight updates by evolving a natural-language artifact (reflections, workflows, playbooks, cheatsheets, or optimized prompts) that conditions a frozen policy. Such methods are typically reported as wins on the single benchmark where they help. We compare them like-for-like, under a shared evaluation setup, and surface a sharper picture.

We introduce RSEA, a Recursive Self-Evolving Agent that carries a compact three-layer natural-language state: an imperative strategy, reusable skills, and a procedural playbook. Across generations, RSEA rewrites all three layers from its own trajectories and commits a candidate only if it does not regress on a disjoint held-out split, using a strict keep-better gate.
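The commit rule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `AgentState`, `evolve`, `propose`, and `score` are hypothetical, the rewriting step and the held-out evaluation are passed in as plain callables, and the `strict` flag is an assumption that covers both readings of the gate (accept only strict improvement, or accept anything that does not regress).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class AgentState:
    """Hypothetical container for the three natural-language layers."""
    strategy: str = ""
    skills: tuple[str, ...] = ()
    playbook: str = ""


def evolve(
    state: AgentState,
    propose: Callable[[AgentState], AgentState],
    score: Callable[[AgentState, Sequence[str]], float],
    held_out: Sequence[str],
    generations: int,
    strict: bool = True,
) -> tuple[AgentState, float]:
    """Keep a candidate only if it passes the held-out gate.

    `propose` stands in for rewriting the layers from trajectories and
    `score` for evaluation on the disjoint held-out split; both are
    assumptions of this sketch.
    """
    best_score = score(state, held_out)
    for _ in range(generations):
        candidate = propose(state)
        cand_score = score(candidate, held_out)
        # strict=True: commit only on improvement; strict=False: commit on ties too.
        accepted = cand_score > best_score if strict else cand_score >= best_score
        if accepted:
            state, best_score = candidate, cand_score
    return state, best_score
```

Because a rejected candidate leaves the state untouched, the held-out score of the committed state never decreases across generations. If every proposal fails the gate, the initial state is returned unchanged, which in this sketch plays the role of the unmodified base agent.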

We evaluate on four diverse benchmarks (ALFWorld, GAIA, $\tau$-bench, and WebShop) against six faithful baselines (ReAct, Reflexion, GEPA, AWM, ACE, and Dynamic Cheatsheet), all on one shared local backbone, and find three main results. First, no artifact wins universally. RSEA is the strongest single-pass method on ALFWorld, reaching 69.3% versus 64.6% for ReAct (McNemar's test, $p = 0.015$), and reaches 79.4% with retry, the best overall result. However, concrete-workflow induction, represented by AWM, is best on the strong-backbone tool-use tasks.
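For readers unfamiliar with the paired comparison cited above, the following is a minimal sketch of a two-sided exact McNemar test. The abstract does not say which variant of the test was used, so the exact binomial form here is an assumption, and `mcnemar_exact` is a hypothetical helper name. The inputs are the two discordant counts: `b` tasks solved only by the first method and `c` tasks solved only by the second. No counts from the paper are reproduced here.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis, each discordant task is equally likely to
    favor either method, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only tasks on which the two methods disagree enter the statistic; tasks that both solve or both fail carry no information about which method is better.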

Second, unguarded context evolution is high-variance and unsafe. Dynamic Cheatsheet, which curates context online without a held-out gate, is near-best on ALFWorld at 70.7%, yet collapses on WebShop, with a score of 0.14 compared with 0.43 for ReAct.

Third, RSEA’s strict held-out selection is what makes recursive self-evolution monotone-safe: it never significantly underperforms the base agent on any benchmark and falls back to vanilla ReAct when evolved context would hurt.