Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an essential paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically rely on static policy optimization schemes that are misaligned with the model's evolving reasoning ability.
To address this issue, we propose Adaptive Power-Mean Policy Optimization (APMPO), which comprises two main innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). Specifically, PMPO introduces a generalized power-mean objective, which lets the model adaptively transition from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean.
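For reference, the general power mean offers one way to see this transition (this is the standard power-mean identity, not the paper's exact objective): the exponent $p$ interpolates between the arithmetic mean at $p = 1$ and the geometric mean in the limit $p \to 0$:
\[
M_p(x_1,\dots,x_n) \;=\; \Bigl(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} x_i^{\,p}\Bigr)^{1/p},
\qquad
M_1 \;=\; \tfrac{1}{n}\textstyle\sum_{i=1}^{n} x_i,
\qquad
\lim_{p\to 0} M_p \;=\; \Bigl(\textstyle\prod_{i=1}^{n} x_i\Bigr)^{1/n}.
\]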
FAC adaptively adjusts the clipping bounds based on real-time reward statistics, overcoming the limitations of static clipping mechanisms. Building on these innovations, APMPO improves learning dynamics and reasoning performance.
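As an illustrative sketch only, feedback-adaptive clipping can be pictured as deriving a PPO-style clip range from batch reward statistics; the function names and the linear feedback rule below are assumptions for illustration, not APMPO's actual FAC formulation.

```python
import numpy as np

def adaptive_clip_range(rewards, base_eps=0.2, min_eps=0.1, max_eps=0.3):
    """Hypothetical feedback rule (not the paper's exact FAC definition):
    widen the clip range when batch rewards are low, allowing larger policy
    updates, and tighten it as rewards improve."""
    mean_reward = float(np.mean(rewards))          # verifiable rewards assumed in [0, 1]
    eps = base_eps * (1.0 + (0.5 - mean_reward))   # simple linear feedback on the batch mean
    return float(np.clip(eps, min_eps, max_eps))

def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO-style clipped surrogate using the adaptive clip range."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
```

In this picture, the clip range is recomputed per batch from the reward feedback rather than fixed in advance, which is the behavior the abstract attributes to FAC.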
Extensive experiments on nine datasets across three reasoning tasks demonstrate that APMPO outperforms state-of-the-art RLVR-based baselines. For instance, with Qwen2.5-3B-Instruct, APMPO improves the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points over GRPO.