Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
Abstract: Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL methods often fail to adapt to a model's evolving reasoning capabilities during training, and as a result can misdirect policy optimization in the absence of ground-truth supervision.
To address this issue, we introduce FREIA, a novel RL algorithm built on two key innovations: (1) a Free Energy-Driven Reward (FER) that, following the Free Energy Principle, adapts rewards to balance consensus and exploration; and (2) Adaptive Advantage Shaping (AAS), which adjusts learning signals according to the statistical characteristics of the sampled rewards.
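To make the two components concrete, the following is a minimal, self-contained Python sketch of one plausible instantiation. The consensus-frequency "energy" term, the entropy weight `beta`, and the variance-based damping rule with `min_std` are illustrative assumptions, not FREIA's exact formulation.

```python
# Illustrative sketch of FER and AAS under stated assumptions;
# not the paper's exact formulation.
import numpy as np
from collections import Counter

def free_energy_reward(answers, beta=0.1):
    """Free Energy-Driven Reward (FER), sketched.

    Scores each sampled answer by its agreement with the group
    consensus (an "energy" term) plus a temperature-weighted entropy
    bonus that preserves exploration, i.e., a negative variational
    free energy -E + beta * H of the empirical answer distribution.
    """
    counts = Counter(answers)
    n = len(answers)
    probs = {a: c / n for a, c in counts.items()}
    # Shannon entropy of the empirical answer distribution (exploration term).
    entropy = -sum(p * np.log(p) for p in probs.values())
    # Consensus ("energy") term: frequency of each answer in the group.
    return np.array([probs[a] + beta * entropy for a in answers])

def adaptive_advantage(rewards, eps=1e-6, min_std=0.05):
    """Adaptive Advantage Shaping (AAS), sketched.

    Group-normalizes rewards (as in GRPO-style baselines) and rescales
    the learning signal from the statistics of the sampled rewards:
    when the group's reward spread is small, advantages are damped so
    that noisy near-unanimous rollouts do not dominate policy updates.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    adv = (rewards - rewards.mean()) / (std + eps)
    # Damping factor grows with reward spread, capped at 1 (an assumption).
    scale = min(1.0, std / min_std)
    return scale * adv

# Toy usage: six sampled answers to one prompt, no ground truth needed.
answers = ["42", "42", "42", "41", "42", "17"]
rewards = free_energy_reward(answers)
advantages = adaptive_advantage(rewards)
print(rewards.round(3), advantages.round(3))
```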
Empirical evaluations on nine datasets spanning three reasoning tasks show that FREIA outperforms existing unsupervised RL baselines. Notably, on mathematical reasoning tasks, FREIA surpasses competing methods by an average of 0.5 to 3.5 Pass@1 points with the DeepSeek-R1-Distill-Qwen-1.5B model.