Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum
We investigate the hypothesis that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum, rather than by token-frequency tails alone.
We represent each text corpus as a suffix automaton and define a data-intrinsic global-KL predictive contribution spectrum, in which each state contributes its empirical mass times the KL divergence of its next-token distribution from a global next-token baseline.
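The definition above can be illustrated with a minimal sketch. It makes several assumptions that are not stated in the text: the states are supplied directly as a mapping from a state identifier to its next-token counts (no suffix automaton is built), the global baseline is taken to be the pooled next-token distribution over all states, and logarithms are natural. The name `global_kl_spectrum` is hypothetical.

```python
import math
from collections import Counter


def global_kl_spectrum(state_counts):
    """Return per-state contributions mass * KL, sorted in descending order.

    state_counts: dict mapping a state id to a Counter of next-token counts.
    Assumption: the global baseline is the pooled next-token distribution.
    """
    total = sum(sum(c.values()) for c in state_counts.values())

    # Pool all next-token counts to form the global baseline distribution.
    pooled = Counter()
    for counts in state_counts.values():
        pooled.update(counts)
    baseline = {tok: n / total for tok, n in pooled.items()}

    spectrum = []
    for counts in state_counts.values():
        n_state = sum(counts.values())
        mass = n_state / total  # empirical mass of the state
        # KL(state next-token distribution || global baseline), in nats.
        kl = sum(
            (n / n_state) * math.log((n / n_state) / baseline[tok])
            for tok, n in counts.items()
        )
        spectrum.append(mass * kl)

    # Rank states by contribution, largest first.
    return sorted(spectrum, reverse=True)
```

A state whose next-token distribution equals the baseline contributes zero, so only states that sharpen prediction relative to the baseline appear in the upper part of the spectrum.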
Across 12 real corpora, the tail slope of this spectrum is already strongly correlated with the empirical data-scaling exponent of a fixed small GPT learner.
We then go beyond slope correlation and define, for each training size N, an effective truncation rank K(N) by matching the observed excess loss to the residual tail mass of the prepared 1000k global-KL spectrum.
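One way to make the matching step concrete is sketched below. The matching rule is an assumption: K is taken as the smallest rank whose residual tail mass (the sum of all contributions ranked after K) does not exceed the excess loss. The sketch also assumes that the excess loss has already been computed and is expressed in the same units as the spectrum. The name `effective_truncation_rank` is hypothetical.

```python
def effective_truncation_rank(spectrum, excess_loss):
    """Smallest K such that the tail mass beyond rank K is <= excess_loss.

    spectrum: iterable of non-negative contributions (any order).
    excess_loss: observed loss minus an irreducible-loss estimate (assumed given).
    """
    if excess_loss < 0:
        raise ValueError("excess_loss must be non-negative")

    contributions = sorted(spectrum, reverse=True)
    tail = sum(contributions)  # tail mass when nothing is covered (K = 0)

    for k, c in enumerate(contributions):
        if tail <= excess_loss:
            return k
        tail -= c  # covering rank k + 1 removes its contribution from the tail

    # Even the full spectrum leaves a tail at or above the excess loss.
    return len(contributions)
```

For example, with the toy spectrum `[4, 3, 2, 1]` the tail masses at K = 0, 1, 2, 3, 4 are 10, 6, 3, 1, 0, so an excess loss of 3 maps to K = 2.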
Empirically, log K is close to linear in log N, with a pooled R^2 of about 0.96 for the raw spectrum and about 0.90 for the smoothed spectrum.
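A log-log linearity check of this kind can be sketched as an ordinary least-squares fit of log K against log N. The sketch fits a single series of (N, K) pairs; how the pooling across corpora is performed is not specified here, so that step is left out. The name `loglog_fit` is hypothetical, and the numbers in the usage lines are synthetic and only exercise the function.

```python
import math


def loglog_fit(ns, ks):
    """Least-squares fit of log K = intercept + slope * log N.

    Returns (slope, intercept, r_squared).
    """
    x = [math.log(n) for n in ns]
    y = [math.log(k) for k in ks]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Coefficient of determination of the fitted line.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic series following K = 2 * N^0.5 exactly (not data from this work).
slope, intercept, r2 = loglog_fit([100, 400, 1600, 6400], [20, 40, 80, 160])
```

On an exact power law the fitted slope recovers the exponent and R^2 is 1 up to rounding; values below 1 quantify the departure from a pure power-law relation between K and N.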
These findings provide strong empirical support for a simple mechanistic picture: increasing the training scale advances an effective frontier through a predictive state spectrum, and the residual tail mass of that spectrum tracks the remaining excess loss.