Reflective Prompt Tuning through Language Model Function-Calling

Abstract: Large language models (LLMs) have become increasingly capable of following instructions and performing complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order. This motivates automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility.

Existing methods, however, often search over prompt candidates or apply fixed critique-refine pipelines driven by individual examples or small batches, which limits their ability to capture systematic error patterns and to make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers.

An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection.
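
The loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name in it (`DiagnosticReport`, `diagnose`, `reflective_tuning`, `calibration_gap`, `calibration_weight`, and the toy helpers) is hypothetical. In RPT the optimizer is an LLM that invokes the diagnostic function through function calling; here a plain callable `revise` stands in for it. The abstract does not say which calibration signal is used, so the gap between mean confidence and accuracy is assumed purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types: an example is (question, gold answer); a target model
# maps (prompt, question) to (answer, confidence in [0, 1]).
Example = Tuple[str, str]
TargetModel = Callable[[str, str], Tuple[str, float]]


@dataclass
class DiagnosticReport:
    prompt: str
    accuracy: float
    calibration_gap: float                # |mean confidence - accuracy| (assumed signal)
    failure_modes: List[Tuple[str, int]]  # (label, count), most frequent first


def diagnose(prompt: str, target: TargetModel, examples: List[Example],
             label_failure: Callable[[str, str, str], str]) -> DiagnosticReport:
    """Evaluate the prompt on the whole optimization set and summarize failures."""
    correct, confidences, failures = 0, [], Counter()
    for question, gold in examples:
        answer, confidence = target(prompt, question)
        confidences.append(confidence)
        if answer == gold:
            correct += 1
        else:
            failures[label_failure(question, gold, answer)] += 1
    accuracy = correct / len(examples)
    gap = abs(sum(confidences) / len(confidences) - accuracy)
    return DiagnosticReport(prompt, accuracy, gap, failures.most_common())


def reflective_tuning(initial_prompt: str, target: TargetModel,
                      examples: List[Example], label_failure, revise,
                      iterations: int = 3, calibration_weight: float = 0.5):
    """Diagnose, remember, revise; then pick the best prompt seen so far."""
    memory: List[DiagnosticReport] = []   # accumulated reports from prior iterations
    prompt = initial_prompt
    for step in range(iterations):
        report = diagnose(prompt, target, examples, label_failure)
        memory.append(report)
        if step < iterations - 1:
            # Stand-in for the LLM optimizer: sees the new report plus the memory.
            prompt = revise(prompt, report, memory)
    # Confidence-aware selection: reward accuracy, penalize miscalibration.
    best = max(memory, key=lambda r: r.accuracy - calibration_weight * r.calibration_gap)
    return best.prompt, memory


# Toy stand-ins so the sketch runs without any model.
def toy_target(prompt: str, question: str) -> Tuple[str, float]:
    a, b = map(int, question.split("+"))
    if "carefully" in prompt:
        return str(a + b), 0.9
    return str(a), 0.9                    # systematic error: drops the second operand


def toy_label(question: str, gold: str, answer: str) -> str:
    return "ignored second operand"


def toy_revise(prompt: str, report: DiagnosticReport, memory) -> str:
    if report.failure_modes:              # target the most frequent failure mode
        return prompt + " Add both operands carefully."
    return prompt


examples = [("1+2", "3"), ("2+0", "2"), ("3+4", "7")]
best_prompt, history = reflective_tuning("Solve.", toy_target, examples,
                                         toy_label, toy_revise)
print(best_prompt)
```

The report is built from the full optimization set rather than a single example or small batch, so a recurring failure mode surfaces as the top entry in `failure_modes`. The same calibration signal appears both in the report and in the final selection score, mirroring the two roles the abstract assigns to it.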

Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with the state of the art, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, where it produces targeted prompt revisions that align with diagnosed failure patterns and yield gains in both task performance and calibration.