SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision

Abstract: On-policy distillation (OPD) has a property absent from both offline distillation and RL: the quality of teacher supervision depends on student competence. Incoherent rollouts yield noisy gradients, while already-mastered tokens yield redundant ones. This creates waste at three scales (tokens, training phases, and prompts), yet existing methods supervise uniformly.

We introduce SEAD, which uses entropy as a unified probe of this competence-dependent degradation at three scales: (1) joint teacher-student entropy partitions tokens into zones that receive either a tailored divergence or zero gradient (roughly 50% of tokens are skipped); (2) a cosine schedule anneals the objective from forward KL to reverse KL as competence grows; (3) a competence-gated curriculum introduces prompts in easy-to-hard order.
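
To make components (1) and (2) concrete, here is a minimal sketch of how entropy-based token skipping and a cosine forward-to-reverse KL anneal might fit together. It is an illustration under stated assumptions, not the SEAD implementation: the function names, the single "both confident" skip zone, the `low` threshold, and the linear mixing of the two divergences are all hypothetical choices that the abstract does not specify.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors of equal length."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)


def reverse_kl_weight(step, total_steps):
    """Cosine ramp from 0 (pure forward KL) to 1 (pure reverse KL)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * progress))


def token_loss(teacher, student, step, total_steps, low=0.1):
    """Per-token loss with an entropy-based skip zone (assumed rule).

    Assumption: a token counts as already mastered, and gets zero loss,
    when both teacher and student entropy fall below `low`.
    """
    if entropy(teacher) < low and entropy(student) < low:
        return 0.0  # skipped token: contributes no gradient
    w = reverse_kl_weight(step, total_steps)
    forward = kl(teacher, student)  # forward KL: KL(teacher || student)
    reverse = kl(student, teacher)  # reverse KL: KL(student || teacher)
    # Assumption: the anneal is a linear mix weighted by the cosine ramp.
    return (1.0 - w) * forward + w * reverse


if __name__ == "__main__":
    teacher = [0.5, 0.5]
    student = [0.9, 0.1]
    for step in (0, 50, 100):
        print(step, token_loss(teacher, student, step, total_steps=100))
```

With these choices, the loss at step 0 is the forward KL alone and the loss at the final step is the reverse KL alone, while a token on which both models are near-deterministic is dropped entirely. The competence-gated curriculum of component (3) is not modeled here.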

These components depend on one another: token selection requires coherent rollouts, which the curriculum provides, and annealing requires monotonic improvement, which the curriculum also provides. On OLMo-3 (7B to 32B), SEAD improves average accuracy by 4.8 percentage points over vanilla OPD across six math benchmarks, and ablations confirm super-additive interactions among the components.
