What Drives Interactive Improvement from Feedback?

Abstract: We study when natural-language feedback produces improvement beyond the gains obtainable from repeated attempts alone. In multi-turn language-agent settings, higher final accuracy can reflect useful feedback, but it can also arise from resampling, format correction, or additional test-time computation. To separate these effects, we introduce a controlled student-teacher protocol across Omni-MATH, Codeforces, BBEH Linguini, and ARC-AGI1, evaluating thirteen open-weight models in both student and teacher roles.

We compare external feedback, self-feedback, and unguided self-refinement, while varying interaction history, task difficulty, and teacher access to privileged task information. Across settings, we find that multi-turn improvement is often not evidence of feedback use: self-generated feedback adds little beyond unguided self-refinement, whereas the strongest external teachers produce substantially larger feedback-specific gains, suggesting that useful feedback must provide guidance beyond generic retry.
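One way to read this comparison is as a decomposition of the total multi-turn gain into a part that a generic retry already delivers and a part attributable to the feedback itself. The sketch below is only an illustration under assumed conventions, not the authors' evaluation code: the function names, the per-problem boolean layout, and the toy outcomes are all hypothetical.

```python
from typing import Dict, List


def accuracy(solved: List[bool]) -> float:
    """Fraction of problems solved at the final turn."""
    return sum(solved) / len(solved)


def decompose_gain(
    single_turn: List[bool],      # first-attempt correctness (assumed layout)
    unguided_refine: List[bool],  # final correctness after retries without feedback
    with_feedback: List[bool],    # final correctness after retries with feedback
) -> Dict[str, float]:
    """Split the total multi-turn gain into retry and feedback-specific parts.

    Assumption: the unguided self-refinement condition uses the same number
    of turns as the feedback condition, so it serves as the repeated-attempt
    baseline.
    """
    base = accuracy(single_turn)
    retry = accuracy(unguided_refine)
    feedback = accuracy(with_feedback)
    return {
        "retry_gain": retry - base,
        "feedback_specific_gain": feedback - retry,
        "total_gain": feedback - base,
    }


# Toy, made-up outcomes for four problems (illustration only).
result = decompose_gain(
    single_turn=[True, False, False, False],
    unguided_refine=[True, True, False, False],
    with_feedback=[True, True, True, False],
)
print(result)
```

Under this reading, a feedback condition whose `feedback_specific_gain` is near zero has not shown that the feedback was used, even if its `total_gain` is large.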

Dense student-teacher interaction matrices further show that interactive gains are driven more by the student’s ability to use feedback than by the teacher’s identity, although teacher choice remains important for a fixed student. These results suggest that feedback-based agents should be evaluated against repeated-attempt baselines, and that the ability to act on feedback, not merely feedback availability, is a central bottleneck for interactive improvement. We release our controlled student-teacher evaluation framework.
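A simple way to probe a claim of this kind is to compare how much the gains vary across students (row means) with how much they vary across teachers (column means). The sketch below assumes a hypothetical matrix `gains[i][j]` holding the feedback-specific gain of student `i` paired with teacher `j`; the numbers are invented for illustration, and the abstract does not say that this is the analysis the authors ran.

```python
from statistics import mean, pvariance
from typing import Dict, List


def marginal_summary(gains: List[List[float]]) -> Dict[str, object]:
    """Summarize a student-by-teacher gain matrix.

    gains[i][j] is the (hypothetical) gain of student i with teacher j.
    """
    student_means = [mean(row) for row in gains]        # average over teachers
    teacher_means = [mean(col) for col in zip(*gains)]  # average over students
    return {
        # Spread of the row means: how much the student matters on average.
        "student_variance": pvariance(student_means),
        # Spread of the column means: how much the teacher matters on average.
        "teacher_variance": pvariance(teacher_means),
        # For each fixed student, the gap between its best and worst teacher.
        "per_student_teacher_range": [max(row) - min(row) for row in gains],
    }


# Made-up 2x2 example: rows are students, columns are teachers.
toy_gains = [
    [0.00, 0.02],
    [0.10, 0.12],
]
summary = marginal_summary(toy_gains)
print(summary)
```

In this toy matrix the student variance exceeds the teacher variance, while each student still gains more with its better teacher, which mirrors the qualitative pattern described above.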