CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

Chinese Grammatical Error Correction (CGEC) systems based on Large Language Models (LLMs) face two critical challenges. First, general-purpose models lack the specialized linguistic priors needed for subtle grammatical distinctions. Second, Supervised Fine-Tuning (SFT) with Maximum Likelihood Estimation (MLE) does not optimize for precision-focused metrics, which leads to systematic over-correction.

We propose CSRP, a three-stage framework that builds correction capability progressively: (1) Continual Pre-training (CPT) on 5.9M balanced samples to internalize domain knowledge; (2) Chain-of-Thought SFT with explicit error reasoning for diagnostic transparency; and (3) Group Relative Policy Optimization (GRPO) with a novel Efficiency-Aware Reward that explicitly penalizes unnecessary edits.
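
To make the third stage more concrete, the following is a minimal sketch of how a reward that penalizes unnecessary edits could feed group-relative advantages. The abstract does not give the exact form of the Efficiency-Aware Reward, so `efficiency_aware_reward`, its `penalty` weight, and the character-level edit counting are hypothetical stand-ins rather than the authors' definition. The advantage step uses the usual group-relative normalization (reward minus group mean, divided by group standard deviation).

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def count_edits(source: str, hypothesis: str) -> int:
    """Count non-equal edit spans needed to turn source into hypothesis."""
    ops = SequenceMatcher(None, source, hypothesis, autojunk=False).get_opcodes()
    return sum(1 for tag, *_ in ops if tag != "equal")


def efficiency_aware_reward(source: str, hypothesis: str, reference: str,
                            penalty: float = 0.1) -> float:
    """Hypothetical reward (an assumption, not the paper's formula):
    similarity to the reference minus a penalty for each edit span
    beyond the number the reference itself needed."""
    quality = SequenceMatcher(None, hypothesis, reference, autojunk=False).ratio()
    surplus = max(0, count_edits(source, hypothesis) - count_edits(source, reference))
    return quality - penalty * surplus


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and (population) standard
    deviation of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative sentence with one duplicated character; not from the paper.
source = "我今天去了学校校。"
reference = "我今天去了学校。"
candidates = [
    "我今天去了学校。",      # minimal fix, matches the reference
    "我今天去了学校校。",    # no correction at all
    "今天我已经去过学校了。",  # fixes the error but rewrites much more
]
rewards = [efficiency_aware_reward(source, c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
for cand, r, a in zip(candidates, rewards, advantages):
    print(f"{cand}\treward={r:.3f}\tadvantage={a:+.3f}")
```

Under these assumptions the minimal fix receives the highest reward in its group and therefore the largest advantage, while the heavily rewritten candidate is pushed down even though it also removes the original error.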

On the NACGEC benchmark, CSRP achieves state-of-the-art performance with 50.99 $F_{0.5}$ and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over-correction bias inherent in MLE-trained models.
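
For readers less familiar with the metric, here is a short sketch of why $F_{0.5}$ rewards conservative correction. It assumes the standard $F_\beta$ definition with $\beta = 0.5$ over edit-level precision and recall; the two systems below are made-up illustrations, not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0.0:
        return 0.0
    return (1.0 + b2) * precision * recall / denom


# Two hypothetical systems with mirrored precision and recall.
conservative = (0.8, 0.4)  # few edits, mostly correct
aggressive = (0.4, 0.8)    # many edits, many of them unnecessary

for name, (p, r) in [("conservative", conservative), ("aggressive", aggressive)]:
    print(f"{name}: F1={f_beta(p, r, beta=1.0):.4f}  F0.5={f_beta(p, r, beta=0.5):.4f}")
```

Both systems have the same F1, but the conservative one scores noticeably higher on $F_{0.5}$, which is why over-correction is costly under this metric.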

Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT-4 by 5.20 points. Comprehensive ablation studies demonstrate that the RL alignment stage contributes an 8% relative gain over the SFT baseline and that this gain is orthogonal to the contribution of large-scale CPT, validating that explicit optimization for edit efficiency is essential for high-quality grammatical error correction.

Our code is available at this https URL.
