CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning
Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence can be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence-rationale alignment: whether a model's confidence in its committed answer is justified by the rationale it generates.
We introduce a reinforcement learning framework based on GRPO (Group Relative Policy Optimization) that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support. The rubric assesses grounding, coherence, task match, and connection to the selected answer, and the gold answer is never revealed to the judge.
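As a rough illustration of how such a joint reward could feed a GRPO-style update, the sketch below combines a correctness term with a term that is high when the committed-answer probability and the rubric score agree, then normalizes rewards within a group of sampled responses. The function names, the weights, the `1 - |p - s|` agreement term, and the assumption that the rubric score lies in [0, 1] are all hypothetical choices made for this example; the abstract does not specify the actual reward formula.

```python
from statistics import mean, pstdev


def combined_reward(correct, answer_prob, rubric_score,
                    w_correct=1.0, w_align=1.0):
    """Hypothetical joint reward (not the paper's formula).

    correct      -- whether the committed answer matches the gold answer
    answer_prob  -- model probability of the committed answer, in [0, 1]
    rubric_score -- judge's rationale-support score, assumed to be in [0, 1]
    """
    # Agreement is 1 when confidence equals rationale support, and
    # shrinks as the two diverge (an assumed form of "alignment").
    agreement = 1.0 - abs(answer_prob - rubric_score)
    return w_correct * float(correct) + w_align * agreement


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses,
    as in GRPO-style group-relative advantage estimation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is also common
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to the same question (made-up numbers).
group = [
    combined_reward(True, 0.9, 0.9),   # correct, confidence matches rationale
    combined_reward(True, 0.9, 0.3),   # correct, but rationale is weak
    combined_reward(False, 0.9, 0.3),  # wrong and overconfident
]
advantages = group_relative_advantages(group)
```

Under these assumed weights, a correct answer with a well-supported rationale receives the largest advantage in its group, and a confident but weakly supported answer is ranked below it even when it is correct.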
Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence-rationale alignment error by up to 26.51% compared with untuned checkpoints, supervised fine-tuning (SFT), and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration.
These results show that reliable CoT reasoning requires not only confident answers but also rationales that substantively support them.