SPEAR: Code-Augmented Agentic Prompt Optimization

Abstract: Automatic prompt engineering (APE) rewrites prompts to improve downstream task performance, but existing APE loops treat the optimizer itself as a fixed pipeline. We port the code-as-action paradigm of CodeAct (Wang et al., 2024a) to APE and propose SPEAR (Sandboxed Prompt Engineer with Active Roll-back), a free-form agentic optimizer that has four tools (evaluate, python, set_prompt, finish) and decides autonomously how and when to use them.

The distinctive tool is the Python sandbox: the optimizer writes and executes arbitrary Python against the current evaluation DataFrame, performing structural error analysis (confusion matrices, error clustering, per-group metrics) that the agent itself authors. Two guardrails turn the long-horizon agent into a monotone-improving optimizer: auto-rollback on metric regression, and an optional floor on a guard metric.
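
The two guardrails can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class name `GuardedPromptState`, its fields, and the toy evaluator are hypothetical and are not taken from the SPEAR implementation. The sketch only shows how rejecting any candidate that regresses the primary metric, or that falls below a guard floor, yields a sequence of accepted prompts whose primary score never decreases.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class GuardedPromptState:
    """Hypothetical state holder: keeps the best accepted prompt, rejects regressions."""

    evaluate: Callable[[str], Dict[str, float]]  # prompt -> dict of metric values
    primary: str                                 # name of the metric to improve
    guard: Optional[str] = None                  # optional guard metric name
    guard_floor: float = 0.0                     # minimum allowed guard value
    best_prompt: str = ""
    best_score: float = float("-inf")
    history: list = field(default_factory=list)

    def set_prompt(self, candidate: str) -> bool:
        """Evaluate a candidate and keep it only if it passes both guardrails."""
        metrics = self.evaluate(candidate)
        not_regressed = metrics[self.primary] >= self.best_score
        guard_ok = self.guard is None or metrics[self.guard] >= self.guard_floor
        accepted = not_regressed and guard_ok
        if accepted:
            self.best_prompt = candidate
            self.best_score = metrics[self.primary]
        # On rejection the best prompt is left untouched: an automatic rollback.
        self.history.append((candidate, metrics, accepted))
        return accepted


def toy_evaluate(prompt: str) -> Dict[str, float]:
    """Stand-in for a real evaluation run; the scores are made up."""
    scores = {
        "v1": {"kappa": 0.40, "recall": 0.90},
        "v2": {"kappa": 0.35, "recall": 0.95},  # regresses the primary metric
        "v3": {"kappa": 0.60, "recall": 0.70},  # violates the guard floor
        "v4": {"kappa": 0.55, "recall": 0.85},
    }
    return scores[prompt]


state = GuardedPromptState(
    evaluate=toy_evaluate, primary="kappa", guard="recall", guard_floor=0.80
)
for candidate in ["v1", "v2", "v3", "v4"]:
    state.set_prompt(candidate)

print(state.best_prompt, state.best_score)
```

In this toy run `v2` is rolled back for lowering the primary metric and `v3` is rejected by the guard floor even though its primary score is the highest, so the accepted sequence is `v1` then `v4`.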

We evaluate on three industrial LLM-as-judge suites (13 judge tasks across recruiter-intake, conversational-memory, and query-refinement systems), plus seven BBH tasks and GSM8K. SPEAR wins every industrial task on the primary metric ($\kappa$ 0.857 vs 0.359 on tool-selection; F1-macro 0.815 vs 0.763 on filter-relevance; $\kappa$ 0.254 vs 0.218 on the hardest extraction dimension). On BBH-7, SPEAR averages 0.938 accuracy, compared with 0.628 for GEPA and 0.484 for TextGrad.
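
For readers unfamiliar with the agreement metric, the following is a small sketch of how $\kappa$ could be computed between human labels and judge labels. It assumes that $\kappa$ here denotes Cohen's kappa, which the abstract does not state explicitly; the function name `cohen_kappa` and the example labels are illustrative only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up example: 8 items, raters agree on 6 of them.
human = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
judge = ["yes", "yes", "no", "yes", "yes", "no", "no", "no"]
kappa = cohen_kappa(human, judge)
print(kappa)  # 0.5
```

Here raw agreement is 0.75, but half of that is expected by chance with balanced labels, so $\kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5$.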

Ablations show that the Python tool is the largest single lever on complex judge tasks: relative to the variant with the tool removed, it contributes $\Delta \approx +0.79\,\kappa$ on the 5-class tool-selection judge and $\Delta \approx +0.35\,\kappa$ on the hardest extraction dimension. Its irreplaceable contribution is class-pair confusion aggregation, which a long-context LLM cannot extract reliably from the raw eval DataFrame.
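
As an illustration of what class-pair confusion aggregation might look like inside the sandbox, the sketch below counts (gold, predicted) pairs over misclassified rows and returns the most frequent ones. It is a dependency-free stand-in: rows are plain dictionaries rather than a DataFrame, and the column names `gold` and `pred`, the function name `top_confusions`, and the tool labels are all assumptions rather than details from the paper. With pandas, `pd.crosstab(df["gold"], df["pred"])` would give the full confusion table.

```python
from collections import Counter


def top_confusions(rows, gold_key="gold", pred_key="pred", k=3):
    """Return the k most frequent (gold, predicted) pairs among errors."""
    pairs = Counter(
        (row[gold_key], row[pred_key])
        for row in rows
        if row[gold_key] != row[pred_key]  # keep misclassified rows only
    )
    return pairs.most_common(k)


# Made-up evaluation rows for a tool-selection judge.
rows = (
    [{"gold": "search", "pred": "browse"}] * 3
    + [{"gold": "email", "pred": "calendar"}] * 2
    + [{"gold": "search", "pred": "search"}] * 4
    + [{"gold": "browse", "pred": "search"}]
)

confusions = top_confusions(rows, k=2)
print(confusions)
# [(('search', 'browse'), 3), (('email', 'calendar'), 2)]
```

A summary like this compresses many rows into a handful of dominant confusions, which gives the optimizer a concrete target for the next prompt revision.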
