ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Abstract: Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique’s guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement.

To address this, we propose learning to internalize self-critique with reinforcement learning (ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver’s subsequent performance gain, incentivizing actionable feedback.
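
The gain-based critic reward described above can be illustrated with a small sketch. The abstract does not give the exact formula, so the form used here (mean solver success after critique minus mean solver success before it) is an assumption, and the name `critic_reward` and its arguments are hypothetical.

```python
from statistics import mean


def critic_reward(base_outcomes, revised_outcomes):
    """Hypothetical gain-based critic reward (an assumption, not the paper's formula).

    base_outcomes:    0/1 success flags for solver attempts made without critique.
    revised_outcomes: 0/1 success flags for solver attempts made after reading the critique.
    Returns the change in mean success rate, so a critique that does not help
    earns zero and one that misleads the solver earns a negative reward.
    """
    if not base_outcomes or not revised_outcomes:
        raise ValueError("both outcome lists must be non-empty")
    return mean(revised_outcomes) - mean(base_outcomes)


# Example: the solver succeeds on 1 of 4 attempts alone and 3 of 4 after critique.
gain = critic_reward([0, 0, 1, 0], [1, 1, 1, 0])
print(gain)
```

Under this assumed form, the critic is paid only for feedback that changes the solver's outcome, which is one way to read "incentivizing actionable feedback".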

To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver’s own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior.
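
The two mechanisms above can be sketched as follows. Neither formula appears in the abstract, so both functions are assumptions chosen for illustration: `calibration_ratio` treats the re-weighting ratio as a capped likelihood ratio between the critique-free and critique-conditioned prompts, and `role_wise_advantages` applies GRPO-style mean and standard-deviation normalization separately within each role. All names and parameters are hypothetical.

```python
import math
from collections import defaultdict


def calibration_ratio(logp_free, logp_conditioned, clip=1.0):
    """Assumed re-weighting ratio: p(response | prompt) / p(response | prompt + critique).

    logp_free:        per-token log-probs of the response under the critique-free prompt.
    logp_conditioned: per-token log-probs of the same response under the critique-conditioned prompt.
    The ratio is capped at `clip` (must be positive) so that no sample is up-weighted without bound.
    """
    log_ratio = sum(logp_free) - sum(logp_conditioned)
    if log_ratio >= math.log(clip):
        return clip  # also avoids overflow in exp for very large log-ratios
    return math.exp(log_ratio)


def role_wise_advantages(samples, eps=1e-6):
    """Assumed role-wise group advantage: normalize rewards within each role.

    samples: list of (role, reward) pairs, e.g. ("solver", 1.0) or ("critic", 0.5).
    Returns one advantage per sample, in the input order.
    """
    by_role = defaultdict(list)
    for role, reward in samples:
        by_role[role].append(reward)

    stats = {}
    for role, rewards in by_role.items():
        mu = sum(rewards) / len(rewards)
        var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
        stats[role] = (mu, math.sqrt(var))

    return [(reward - stats[role][0]) / (stats[role][1] + eps)
            for role, reward in samples]


# A response the critique-free solver finds unlikely is down-weighted.
print(calibration_ratio([-1.0, -1.0], [-0.5, -0.5]))

# Solver and critic rewards live on different scales; each is normalized on its own.
print(role_wise_advantages([("solver", 1.0), ("solver", 0.0),
                            ("critic", 0.5), ("critic", 0.1)]))
```

In this sketch, a critique-guided response that the solver would rarely produce from its own prompt receives a small weight, which is one plausible reading of "selectively transfers" improvements. Normalizing per role keeps the critic's gain-based rewards from distorting the solver's advantage scale, and vice versa.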

We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at this https URL.
