Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

Abstract: Can post-trained large language models (LLMs) further improve themselves using only unlabeled prompts, without external teachers or feedback from tools? We study this setting across three reasoning domains (math, science, and coding), starting from unlabeled seed questions that have no ground-truth solutions.

We propose Self-Verified Distillation, a simple post-training refinement algorithm in which the model generates candidate solutions to these seed questions, filters them using prompt-based self-verification, and trains on the resulting self-curated dataset. Inspired by the UQ benchmark’s use of multiple validators to screen candidate answers to hard unsolved questions, we adapt this validation-based filtering idea to self-training: the model filters its own generated solutions through a three-stage cascade of cycle-consistency, factuality, and correctness checks, accepting a solution only if it passes all stages with unanimous judge votes.
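The generate-then-filter loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Judge` and `Generator` callables, the function names, and the default budgets are hypothetical stand-ins for prompted calls to the same model, and keeping every accepted candidate (rather than, say, one per question) is also an assumption.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interfaces: in practice both would wrap prompted calls to the
# same model that is being refined. Neither name comes from the paper.
Judge = Callable[[str, str], bool]      # (question, solution) -> accept?
Generator = Callable[[str, int], str]   # (question, sample index) -> solution

# The three stages named in the abstract, applied in order.
STAGES: Tuple[str, ...] = ("cycle_consistency", "factuality", "correctness")


def passes_cascade(
    question: str,
    solution: str,
    judges: Dict[str, Judge],
    votes_per_stage: int = 3,  # assumed budget; the abstract gives no value
) -> bool:
    """Accept only if every vote at every stage accepts (unanimity)."""
    if votes_per_stage < 1:
        raise ValueError("votes_per_stage must be at least 1")
    for stage in STAGES:
        judge = judges[stage]
        for _ in range(votes_per_stage):
            # A single rejecting vote ends the cascade early.
            if not judge(question, solution):
                return False
    return True


def build_self_curated_dataset(
    questions: Sequence[str],
    generate: Generator,
    judges: Dict[str, Judge],
    num_candidates: int = 4,   # assumed sampling budget
    votes_per_stage: int = 3,
) -> List[Tuple[str, str]]:
    """Sample candidates per seed question and keep those passing all stages."""
    kept: List[Tuple[str, str]] = []
    for question in questions:
        for index in range(num_candidates):
            solution = generate(question, index)
            if passes_cascade(question, solution, judges, votes_per_stage):
                kept.append((question, solution))
    return kept
```

The returned `(question, solution)` pairs would then serve as the fine-tuning set. Stopping at the first rejecting vote is a design choice of this sketch: it gives the same accept/reject decision as collecting all votes, while spending fewer judge calls on solutions that fail early.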

We find that sampling more candidate generations and using a larger verification budget during training data construction produce higher-quality self-curated data and, in turn, better reasoning models. We then train Qwen3 models at multiple scales with Self-Verified Distillation and obtain gains across all three domains.
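One way to see why both budgets could matter is a simplified probabilistic model. Everything in it is an assumption made for illustration and none of it is taken from the paper: each judge vote is independent, accepts a correct solution with probability $\rho$ and an incorrect one with probability $\varepsilon < \rho$ (the same at all three stages), a candidate is correct with probability $c$, and each stage requires $k$ unanimous votes.

```latex
% Acceptance requires all 3k votes (3 stages, k votes each) to accept.
\Pr[\text{accept} \mid \text{correct}] = \rho^{3k},
\qquad
\Pr[\text{accept} \mid \text{incorrect}] = \varepsilon^{3k}

% Fraction of accepted solutions that are correct (data purity):
\pi(k) = \frac{c\,\rho^{3k}}{c\,\rho^{3k} + (1-c)\,\varepsilon^{3k}}
       = \left[\, 1 + \frac{1-c}{c}\left(\frac{\varepsilon}{\rho}\right)^{3k} \right]^{-1}

% Expected number of accepted solutions per question with n candidates:
\mathbb{E}[\text{accepted}] = n \left( c\,\rho^{3k} + (1-c)\,\varepsilon^{3k} \right)
```

Under these assumptions, purity $\pi(k)$ increases with the verification budget $k$ because $(\varepsilon/\rho)^{3k}$ shrinks, while the yield per candidate falls; sampling more candidates $n$ offsets the lower yield. For example, with $c = 0.5$, $\rho = 0.95$, and $\varepsilon = 0.5$, purity rises from roughly 0.87 at $k = 1$ to roughly 0.98 at $k = 2$.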

For Qwen3-4B, our method improves aggregate held-out pass@1 by +16.7 points in math (AIME26 and HMMT), +11.1 points in science (GPQA Diamond and HLE), and +8.3 points in coding (LCBv5 and LCBv6), with gains also extending to 0.6B and 8B models. Compared to our test-time-only baseline (UQ-TTC), which improves performance by spending extra compute at inference time, Self-Verified Distillation achieves better performance in most settings while requiring only a single inference call at test time.
