SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision

Abstract: On-policy distillation (OPD) has a property absent from offline distillation and reinforcement learning (RL): the quality of teacher supervision depends on the student's competence. Incoherent rollouts yield noisy gradients, while already-mastered tokens yield redundant ones. This creates waste at three scales (tokens, training phases, and prompts), yet existing methods supervise uniformly.

We introduce SEAD, which uses entropy as a unified probe of this competence-dependent degradation at all three scales: (1) joint teacher-student entropy partitions tokens into zones that receive tailored divergences or zero gradient (roughly 50% of tokens are skipped); (2) a cosine schedule anneals the objective from forward to reverse KL as competence grows; (3) a competence-gated curriculum introduces prompts from easy to hard.
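
As a rough illustration of components (1) and (2), the sketch below gates a per-token loss on joint teacher-student entropy and blends forward KL into reverse KL with a cosine schedule. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the two entropy thresholds `low` and `high`, and the specific zone rule (skip a token when both models are confident or both are highly uncertain) are hypothetical choices made for illustration. Forward KL is taken here as KL(teacher || student) and reverse KL as KL(student || teacher).

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def forward_weight(step, total_steps):
    """Cosine schedule: 1.0 at step 0 (pure forward KL), 0.0 at the end (pure reverse KL)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def token_loss(teacher, student, step, total_steps, low=0.1, high=1.5):
    """Per-token loss with a hypothetical entropy-zone gate.

    Assumed zone rule (not taken from the paper):
      - both entropies below `low`: treated as mastered, zero loss
      - both entropies above `high`: treated as too noisy, zero loss
      - otherwise: cosine-weighted blend of forward and reverse KL
    """
    h_teacher, h_student = entropy(teacher), entropy(student)
    if h_teacher < low and h_student < low:
        return 0.0
    if h_teacher > high and h_student > high:
        return 0.0
    w = forward_weight(step, total_steps)
    return w * kl(teacher, student) + (1.0 - w) * kl(student, teacher)
```

For example, `token_loss([0.9, 0.1], [0.5, 0.5], step=0, total_steps=100)` reduces to the forward KL term alone, while the same call at `step=100` reduces to the reverse KL term. In a real training loop the thresholds would need to be tuned so that the skipped fraction matches the behavior the method targets.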

These components depend on one another: token selection requires coherent rollouts, which the curriculum provides, and annealing requires monotonic improvement, which the curriculum also provides. On OLMo-3 (7B to 32B), SEAD improves average accuracy by 4.8 points over vanilla OPD across six math benchmarks, and ablations confirm super-additive interactions among the components.
