RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

Abstract: Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again.

Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be evaluated directly by the retriever, while latent reasoning steps are not directly observable and affect the outcome only through future executable actions.

This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success.
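To make the problem concrete, the following is a minimal sketch of outcome-level credit in a generic group-based setup. It is an illustration under stated assumptions, not the formulation of any specific baseline in this work: the function names, the standard-deviation normalization, and the toy trajectory are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """One scalar advantage per trajectory, normalized within the sampled group.

    Assumption: a generic group-relative scheme (reward minus group mean,
    divided by group standard deviation); real baselines may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_steps(steps: List[str], advantage: float) -> List[Tuple[str, float]]:
    """Outcome-level credit: every step in the trajectory gets the same value."""
    return [(step, advantage) for step in steps]


# Hypothetical trajectory: only the final query actually drove retrieval success.
trajectory = [
    "reason: unrelated tangent",
    "reason: identify the key entity",
    "query: key entity + constraint",
]
advantages = group_normalized_advantages([1.0, 0.0])  # two sampled trajectories
credited = broadcast_to_steps(trajectory, advantages[0])
```

In this sketch the unrelated reasoning step receives exactly the same credit as the query that was responsible for the result, which is the failure mode described above.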

We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals.

RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable.
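The three stages described above can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the abstract does not specify the uncertainty measure, the retrieval metric, or the gating rule, so the use of per-step entropy, nDCG@k, a mean-centered branch advantage, and the thresholds `min_influence` and `max_residual_var` are all assumptions, as are the names `Step`, `select_anchors`, `local_advantages`, and `propagate_credit`.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass
class Step:
    kind: str             # "reason" (latent) or "action" (executable)
    entropy: float = 0.0  # assumed uncertainty score; only used for actions


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG@k, used here as an example retrieval metric."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0


def select_anchors(steps: List[Step], top_n: int = 1) -> List[int]:
    """Pick the indices of the top_n most uncertain executable actions."""
    actions = [i for i, s in enumerate(steps) if s.kind == "action"]
    ranked = sorted(actions, key=lambda i: steps[i].entropy, reverse=True)
    return sorted(ranked[:top_n])


def local_advantages(branch_scores: List[float]) -> List[float]:
    """Critic-free local signal: each branch score minus the branch mean."""
    baseline = sum(branch_scores) / len(branch_scores)
    return [s - baseline for s in branch_scores]


def propagate_credit(steps: List[Step], anchor: int, advantage: float,
                     influence: List[float], residual_var: float,
                     min_influence: float = 0.5,
                     max_residual_var: float = 0.05) -> List[float]:
    """Credit the anchor action; credit earlier reasoning steps only if gated in."""
    credit = [0.0] * len(steps)
    credit[anchor] = advantage
    if residual_var > max_residual_var:  # unstable future effects: stop here
        return credit
    for i in range(anchor):
        if steps[i].kind == "reason" and influence[i] >= min_influence:
            credit[i] = advantage * influence[i]
    return credit


# Toy rollout: two reasoning steps, two actions; the second action is the anchor.
steps = [Step("reason"), Step("action", 0.2), Step("reason"), Step("action", 0.9)]
anchors = select_anchors(steps)

# Two counterfactual branches at the anchor, scored by the retriever.
relevant = {"d1", "d2"}
scores = [ndcg_at_k(["d1", "d2", "d3"], relevant), ndcg_at_k(["d3", "d4"], relevant)]
advs = local_advantages(scores)

influence = [0.8, 0.0, 0.2, 0.0]  # assumed reasoning-to-action influence scores
credit = propagate_credit(steps, anchors[0], advs[0], influence, residual_var=0.01)
```

In this toy example only the first reasoning step passes the influence gate, so it receives scaled credit while the weakly connected second reasoning step receives none. If `residual_var` exceeded the threshold, credit would stay on the anchor action alone.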

On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting.

These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.