CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying rationale is plausible yet incomplete or poorly supported. We study confidence-rationale alignment: whether a model’s confidence in its committed answer is justified by its generated rationale.

We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge.
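
As a rough illustration of how the three reward signals could be combined and fed into GRPO, the sketch below uses a plain weighted sum followed by within-group normalization. The `Rollout` fields, the weights `w_correct`, `w_conf`, and `w_rubric`, and the weighted-sum form itself are assumptions made for this example and are not taken from the method described here; only the group-relative normalization (reward minus group mean, divided by group standard deviation) reflects how GRPO is generally formulated.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Rollout:
    """One sampled response for a prompt (hypothetical container)."""
    correct: bool        # committed answer matches the gold answer
    answer_prob: float   # model probability of the committed answer, in [0, 1]
    rubric_score: float  # judge's rationale-support score, in [0, 1]


def composite_reward(r: Rollout,
                     w_correct: float = 1.0,
                     w_conf: float = 0.5,
                     w_rubric: float = 0.5) -> float:
    """Weighted sum of the three signals (weights are illustrative assumptions)."""
    return (w_correct * float(r.correct)
            + w_conf * r.answer_prob
            + w_rubric * r.rubric_score)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one prompt's group of samples, GRPO-style."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally valid choice
    return [(x - mu) / (sigma + eps) for x in rewards]


if __name__ == "__main__":
    # Four sampled responses to the same prompt (made-up values).
    group = [
        Rollout(correct=True, answer_prob=0.9, rubric_score=0.8),
        Rollout(correct=False, answer_prob=0.9, rubric_score=0.2),
        Rollout(correct=True, answer_prob=0.6, rubric_score=0.9),
        Rollout(correct=False, answer_prob=0.4, rubric_score=0.3),
    ]
    rewards = [composite_reward(r) for r in group]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.2f} advantage={a:+.2f}")
```

In this sketch a confident but wrong and weakly supported response (the second rollout) receives a lower advantage than a confident, correct, and well supported one (the first), which is the qualitative behavior the joint reward is meant to encourage. The rubric score is assumed to come from a judge that never sees the gold answer, as stated above.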

Across MedQA, MathQA, and OpenBookQA, using three open-weight LLMs, our method reduces the confidence-rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration.

These results show that reliable CoT reasoning requires not only confident answers but also rationales that substantively support them.
