Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Abstract: Reinforcement learning for multi-step reasoning with large language models (LLMs) often relies on sparse terminal rewards. This makes credit assignment difficult, because the final feedback is spread evenly across all intermediate decisions regardless of how much each one contributed. The result is high gradient variance, unstable training, and many ineffective updates, which can ultimately cause training to fail and prevent sustained improvement.

We introduce a credit assignment framework based on counterfactual comparison. It samples multiple reasoning trajectories for the same input and treats the differences among them as an implicit approximation of alternative decisions. From these comparisons we construct an implicit process-level advantage estimator that transforms sparse terminal rewards into step-sensitive learning signals.
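The idea of comparing trajectories sampled for the same input can be illustrated with a minimal Python sketch. This is not the paper's IBPO estimator, which the abstract does not specify. It is a simple prefix-comparison scheme assumed here for illustration: each step is credited with the change in mean terminal reward between the sampled trajectories that share the prefix before the step and those that share the prefix after it. The names `prefix_values` and `step_advantages`, and the representation of a step as a string, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Step = str  # hypothetical: one reasoning step, represented as a string


def prefix_values(
    trajectories: Sequence[Sequence[Step]], rewards: Sequence[float]
) -> Dict[Tuple[Step, ...], float]:
    """Mean terminal reward of all sampled trajectories sharing each step prefix."""
    totals: Dict[Tuple[Step, ...], float] = defaultdict(float)
    counts: Dict[Tuple[Step, ...], int] = defaultdict(int)
    for steps, reward in zip(trajectories, rewards):
        # Every prefix of the trajectory, including the empty one, sees its reward.
        for t in range(len(steps) + 1):
            prefix = tuple(steps[:t])
            totals[prefix] += reward
            counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}


def step_advantages(
    trajectories: Sequence[Sequence[Step]], rewards: Sequence[float]
) -> List[List[float]]:
    """Illustrative step-level credit (an assumption, not the paper's method).

    The advantage of step t is the prefix value after taking the step minus
    the prefix value before it. Sibling trajectories that diverge at step t
    act as the implicit counterfactual alternatives.
    """
    values = prefix_values(trajectories, rewards)
    advantages: List[List[float]] = []
    for steps in trajectories:
        per_step = []
        for t in range(len(steps)):
            before = values[tuple(steps[:t])]
            after = values[tuple(steps[: t + 1])]
            per_step.append(after - before)
        advantages.append(per_step)
    return advantages


if __name__ == "__main__":
    # Four hypothetical trajectories for one prompt, with 0/1 terminal rewards.
    sampled = [
        ["expand", "factor", "answer=4"],
        ["expand", "factor", "answer=5"],
        ["expand", "guess", "answer=7"],
        ["substitute", "solve", "answer=4"],
    ]
    terminal_rewards = [1.0, 0.0, 0.0, 1.0]
    for steps, adv in zip(sampled, step_advantages(sampled, terminal_rewards)):
        print(steps, [round(a, 3) for a in adv])
```

In this sketch, when the sampled trajectories are distinct, the per-step values of a trajectory sum to its terminal reward minus the group mean reward. The total signal therefore matches a group-relative baseline, but it is distributed unevenly across steps according to where the trajectories diverge, rather than spread evenly.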

Based on this, we propose Implicit Behavior Policy Optimization (IBPO), which significantly improves training stability and raises the performance ceiling on mathematical and code reasoning benchmarks, pointing to a promising direction for unlocking the performance potential of LLMs.

Paper Details:

Authors: Fei Ding, Yongkang Zhang, Yeling Peng, Youwei Wang, Guoxiong Zhou, Zijian Zeng
arXiv ID: 2605.16302
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
