CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards
Chinese Grammatical Error Correction (CGEC) systems built on Large Language Models (LLMs) face two critical challenges: general-purpose models lack the specialized linguistic priors needed for subtle grammatical distinctions, and Supervised Fine-Tuning (SFT) with Maximum Likelihood Estimation (MLE) does not optimize for precision-focused metrics, leading to systematic over-correction.
We propose CSRP, a three-stage framework that progressively builds correction capability: (1) Continual Pre-training (CPT) on 5.9M balanced samples to internalize domain knowledge; (2) Chain-of-Thought SFT with explicit error reasoning for diagnostic transparency; and (3) Group Relative Policy Optimization (GRPO) with a novel Efficiency-Aware Reward that explicitly penalizes unnecessary edits.
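To make the idea of a reward that penalizes unnecessary edits concrete, the following is a minimal sketch and not the paper's actual reward. The abstract does not give the reward formula, so the functions `edit_ops` and `efficiency_aware_reward`, the `penalty` weight, and the character-level edit extraction via `difflib` are all illustrative assumptions. The group-relative step follows the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation); the choice of population standard deviation here is also an assumption.

```python
import difflib
from statistics import mean, pstdev


def edit_ops(src: str, tgt: str) -> set:
    """Character-level edits turning src into tgt (hypothetical edit extractor)."""
    sm = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    return {
        (tag, i1, i2, tgt[j1:j2])
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    }


def efficiency_aware_reward(src: str, hyp: str, ref: str, penalty: float = 0.5) -> float:
    """Assumed reward: share of reference edits recovered, minus a cost per extra edit."""
    hyp_edits = edit_ops(src, hyp)
    ref_edits = edit_ops(src, ref)
    unnecessary = len(hyp_edits - ref_edits)  # edits the reference does not contain
    if ref_edits:
        base = len(hyp_edits & ref_edits) / len(ref_edits)
    else:
        # Source was already correct: only an untouched output earns the base reward.
        base = 1.0 if not hyp_edits else 0.0
    return base - penalty * unnecessary


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize each sampled output's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One source sentence, one reference, and a group of three sampled corrections.
src = "他昨天去学校了了。"
ref = "他昨天去学校了。"
group = [
    "他昨天去学校了。",    # exact correction
    "他昨天去学校了了。",  # no correction at all
    "他昨日去学校了。",    # correct fix plus one unnecessary rewrite
]
rewards = [efficiency_aware_reward(src, hyp, ref) for hyp in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the over-edited output recovers the reference edit but is ranked below the exact correction, which is the behaviour an edit-efficiency penalty is meant to encourage.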
On the NACGEC benchmark, CSRP achieves state-of-the-art performance with 50.99 $F_{0.5}$ and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over-correction bias inherent in MLE-trained models.
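For readers unfamiliar with the metric, $F_{0.5}$ is the standard $F_\beta$ score with $\beta = 0.5$, which weights precision more heavily than recall. The short sketch below uses made-up precision and recall values purely for illustration; they are not results from the paper, and the helper name `f_beta` is an assumption.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: same pair of numbers, swapped between P and R.
high_precision = f_beta(precision=0.6, recall=0.3)  # precise but conservative system
high_recall = f_beta(precision=0.3, recall=0.6)     # aggressive, over-correcting system
balanced_f1 = f_beta(precision=0.6, recall=0.3, beta=1.0)
```

With $\beta = 0.5$ the precise system scores higher than the aggressive one even though both have the same $F_1$, which is why over-correction is costly under this metric.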
Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT-4 by 5.20 points. Comprehensive ablation studies demonstrate that the RL alignment stage contributes an 8% relative gain over the SFT baseline, and that this gain is orthogonal to the contribution of large-scale CPT, validating that explicit optimization for edit efficiency is essential for high-quality grammatical error correction.
Our code is available at this https URL.