Data and Evaluation Closed-Loop for Model Capability Enhancement
Abstract: Model capability is the central variable in LLM pre-training, yet it is never observed directly. Data shapes it prospectively, while evaluation reveals it only retrospectively, compressing samples, prompts, decoding, and scoring rules into one noisy score. Practical optimization runs in the reverse direction: a failure is observed first, and the engineer must infer the corpus fix from it. The two sides speak incompatible vocabularies (benchmark names and per-sample correctness on the evaluation side; data sources, domains, and quality labels on the data side), so this inference usually rests on intuition rather than method.
We close this gap with the capability slice: a group of evaluation samples that share a background condition, task type, solving operation, and output constraint. A slice is precise enough to localize a single weakness yet stable enough to survive aggregation, whereas a benchmark name is too coarse and a single sample is too noisy. Built around this unit, an evaluation taxonomy, a non-instruction data taxonomy, and a set of mapping rules form a closed loop that turns a benchmark-level failure into a targeted, testable data intervention.
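As a minimal sketch of the capability slice as a grouping unit, the code below buckets evaluation samples by the four shared attributes and reports per-slice accuracy. The record layout, the field names, and the `min_size` stability threshold are assumptions made for illustration, not the taxonomy itself.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


# Hypothetical record layout: the four slice attributes follow the definition
# in the text, but the field names are assumptions.
@dataclass(frozen=True)
class EvalSample:
    benchmark: str
    background: str          # background condition
    task_type: str
    operation: str           # solving operation
    output_constraint: str
    correct: bool


SliceKey = Tuple[str, str, str, str]


def slice_key(sample: EvalSample) -> SliceKey:
    """Samples with the same key belong to the same capability slice."""
    return (sample.background, sample.task_type,
            sample.operation, sample.output_constraint)


def slice_accuracy(samples: List[EvalSample],
                   min_size: int = 2) -> Dict[SliceKey, float]:
    """Accuracy per slice, skipping slices too small to aggregate stably."""
    groups: Dict[SliceKey, List[bool]] = defaultdict(list)
    for sample in samples:
        groups[slice_key(sample)].append(sample.correct)
    return {key: sum(flags) / len(flags)
            for key, flags in groups.items()
            if len(flags) >= min_size}
```

Because the key ignores the benchmark name, samples from different benchmarks can land in the same slice, which is what lets a slice-level weakness be read independently of where it was measured.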
We test this loop on two case studies that pull in opposite directions. In the first, the loop rules the data out: continued pre-training drives BBH down by $46.82\%$, but diagnosis traces the drop to a single masked \texttt{\textless EOS\textgreater} loss rather than to weakened reasoning; restoring it recovers BBH to $66.44$, above the original checkpoint, without changing the data.
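The kind of check that separates a training-configuration fault from a data fault can be sketched as a loss-mask audit. The sketch below assumes a per-token 0/1 loss mask and hypothetical token ids; the text states only that the \texttt{\textless EOS\textgreater} loss was masked, so how the mask is actually built in the training stack is an assumption.

```python
from typing import List

EOS_ID = 2  # hypothetical end-of-sequence token id
PAD_ID = 0  # hypothetical padding token id


def loss_mask(token_ids: List[int], mask_eos: bool) -> List[int]:
    """Return 1 where a target token contributes to the LM loss, else 0."""
    mask = []
    for tok in token_ids:
        if tok == PAD_ID:
            mask.append(0)          # padding never contributes
        elif tok == EOS_ID and mask_eos:
            mask.append(0)          # suspected fault: EOS gets no gradient
        else:
            mask.append(1)
    return mask


def trained_eos_fraction(batch: List[List[int]], mask_eos: bool) -> float:
    """Fraction of EOS targets that actually receive loss; a quick audit."""
    eos_total = 0
    eos_trained = 0
    for seq in batch:
        for tok, keep in zip(seq, loss_mask(seq, mask_eos)):
            if tok == EOS_ID:
                eos_total += 1
                eos_trained += keep
    return eos_trained / eos_total if eos_total else 0.0
```

A fraction of zero on a training batch would point at the configuration rather than the corpus, which matches the direction of the first case study's verdict.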
In the second, the loop rules the data in: a persistent math-reasoning weakness is decomposed by solving operation into specific failing combinations, and a weakness-targeted sampling procedure built from that decomposition lifts AIME2025/AIME2026 Pass@128 from $6.67$/$0.00$ to $26.67$ on each. The same unmodified loop reaches opposite, correct verdicts in the two cases, showing that the evaluation-to-data inference can be routine, auditable, and experimentally validated rather than intuitive.
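One plausible form of weakness-targeted sampling is sketched below: training documents tagged with a solving operation are drawn with probability proportional to the failure rate of the matching slice. The weighting rule, the `floor` parameter, and the function names are assumptions for illustration; the procedure used in the case study is not specified here.

```python
import random
from typing import Dict, List


def sampling_weights(slice_acc: Dict[str, float],
                     floor: float = 0.05) -> Dict[str, float]:
    """Normalized weights proportional to failure rate (1 - accuracy).

    The floor keeps strong slices from being dropped from the mix entirely.
    """
    raw = {key: max(1.0 - acc, floor) for key, acc in slice_acc.items()}
    total = sum(raw.values())
    return {key: value / total for key, value in raw.items()}


def sample_documents(pool: Dict[str, List[str]],
                     weights: Dict[str, float],
                     n: int,
                     seed: int = 0) -> List[str]:
    """Draw n documents: pick a slice by weight, then a document uniformly."""
    rng = random.Random(seed)
    keys = [key for key in weights if pool.get(key)]  # skip empty slices
    picked = rng.choices(keys, weights=[weights[k] for k in keys], k=n)
    return [rng.choice(pool[key]) for key in picked]
```

The weakest slice receives the largest share of the sampled mix, so the intervention stays tied to a measured failure and can be re-evaluated on the same slice afterwards.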