Verifiable Rewards for Calibrated Probabilistic Forecasting
Verifiable Rewards for Calibrated Probabilistic Forecasting
用于校准概率预测的可验证奖励机制
Reinforcement learning with verifiable rewards can in principle train calibrated probabilistic forecasters, since a proper scoring rule such as the Brier score is computed from outcomes alone and is minimized in expectation by the true probability. In practice it degrades calibration, and existing remedies address epistemic uncertainty, where a model’s confidence accompanies a verifiably correct or incorrect answer.
原则上,带有可验证奖励的强化学习可以训练出经过校准的概率预测模型,因为像 Brier 分数这类适当的评分规则仅根据结果计算,且其期望值在真实概率下达到最小。但在实践中,这种方法会降低校准度;现有的补救措施主要针对认知不确定性(epistemic uncertainty),即模型的置信度伴随着可验证的正确或错误答案。
We study aleatoric forecasting, where the forecast itself is the output and the label is one stochastic outcome, taking NFL in-game win probability as a testbed with the betting market as a reference. Rewarding the realized per-play outcome fails, because the single outcome is a noisy target and the policy gradient corrupts the chain of thought.
我们研究了偶然性预测(aleatoric forecasting),即预测本身就是输出,而标签是一个随机结果。我们以 NFL 比赛中的胜率预测作为测试平台,并以博彩市场作为参考。直接奖励单次比赛结果是无效的,因为单一结果是一个带有噪声的目标,且策略梯度会破坏思维链(chain of thought)。
We introduce a verifiable, label-free reward, a state-conditioned empirical win rate estimated from past outcomes, that removes the label noise, and we keep the gradient off the reasoning, by direct prediction or a gradient mask, so it cannot be corrupted. Trained with this reward alone, without human labels or supervised fine-tuning, a 7B model reaches the calibration of the betting market by direct prediction and is better calibrated than a zero-shot frontier model.
我们引入了一种可验证的、无需标签的奖励机制——即根据过去结果估算的条件状态经验胜率。该方法消除了标签噪声,并通过直接预测或梯度掩码(gradient mask)将梯度从推理过程中剥离,从而防止推理过程被破坏。仅使用该奖励进行训练,在无需人工标注或监督微调的情况下,一个 7B 参数的模型通过直接预测达到了博彩市场的校准水平,且校准效果优于零样本(zero-shot)的前沿模型。
That frontier model and a tabular estimator reach the same Brier score as this model, identifying the market’s small remaining edge as live in-game information beyond their shared inputs. Masking the gradient, rather than dropping the chain of thought, preserves reasoning from which the forecast follows, which ordinary chain-of-thought training corrupts.
该前沿模型和一个表格估计器与本模型达到了相同的 Brier 分数,这表明市场剩余的微小优势源于其共享输入之外的实时比赛信息。通过掩盖梯度而非丢弃思维链,模型保留了预测所依赖的推理过程,而普通的思维链训练往往会破坏这一过程。