Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO
Abstract: Large Language Models (LLMs) have shown strong promise in healthcare applications. Yet deploying general-purpose models in real-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on-device use. These challenges motivate the development of smaller, more efficient models, which in turn require robust post-training strategies to ensure reliable medical reasoning.
In this work, we investigate Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering with rubric-based supervision derived from RaR-Medicine. We propose a Variance-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes.
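For orientation, the weighted binary criterion aggregation that the proposed framework replaces can be sketched as follows. This is a minimal illustration of that baseline only, not the variance-aware reward itself, whose functional form the abstract does not state. The `Criterion` class, the field names, and the example weights are hypothetical, and the sketch assumes non-negative weights with each criterion judged as simply satisfied or not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, for illustration only)."""
    description: str
    weight: float      # assumed non-negative importance weight
    satisfied: bool    # binary verdict from a judge


def explicit_aggregation(criteria: List[Criterion]) -> float:
    """Weighted fraction of satisfied criteria, in [0, 1]."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        return 0.0
    satisfied_weight = sum(c.weight for c in criteria if c.satisfied)
    return satisfied_weight / total_weight


# Hypothetical rubric for a single answer.
example_rubric = [
    Criterion("Names the most likely diagnosis", weight=2.0, satisfied=True),
    Criterion("Recommends an appropriate next test", weight=1.0, satisfied=True),
    Criterion("Mentions relevant contraindications", weight=1.0, satisfied=False),
]
example_reward = explicit_aggregation(example_rubric)
```

Because each verdict is binary, the reward can only take a small set of discrete values for a given rubric, which is the coarseness that continuous reward functions are meant to address.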
This formulation provides richer optimization signals for feedback that is sparse, multi-criteria, and difficult to verify automatically, and enables more stable on-policy reinforcement learning. On a held-out heart-related subset of HealthBench, our best GRPO variant improves accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 relative to the Qwen3-14B base model, while remaining competitive with GPT-OSS-120B (0.508 accuracy, 0.674 F1).
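One way to see why reward variance matters for GRPO is the group-relative advantage, which standardizes each sampled response's reward against the other responses to the same prompt. The sketch below assumes the commonly used mean and standard deviation normalization; the function name, the use of the population standard deviation, and the `eps` value are illustrative choices, not details taken from this work.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps),
    with the population standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group in which a coarse reward scores every response identically:
# all advantages are zero, so the group contributes no learning signal.
flat_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A group with graded rewards: responses are ranked relative to each other.
graded_group = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

Under this assumption, a group whose rewards are all equal yields zero advantages, while graded rewards separate better responses from worse ones. This is offered as a plausible reading of the motivation for continuous rubric rewards, not as the authors' stated derivation.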
Our findings show that carefully designed rubric-based rewards provide a practical strategy for improving heart-focused medical question answering in LLMs, with potential to extend to other rubric-based tasks.