The Power of Power Law: Asymmetry Enables Compositional Reasoning

Abstract: Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions.

To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power-law sampling induces a beneficial asymmetry that improves an otherwise pathological loss landscape. This asymmetry lets models first acquire high-frequency skill compositions with low data complexity; these compositions in turn serve as stepping stones for efficiently learning rare, long-tail skills.
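
To make the contrast between the two sampling regimes concrete, the following is a minimal sketch that is not taken from the paper. It assumes a hypothetical pool of discrete skills indexed by rank, a Zipf-style weighting p_k proportional to k^(-alpha), and an illustrative exponent alpha = 1. The function names, the number of skills, and the sample budget are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def power_law_weights(num_skills, alpha=1.0):
    """Normalized weights p_k proportional to k^(-alpha) for ranks k = 1..num_skills.

    alpha is an assumed exponent; the abstract does not specify one.
    """
    raw = [k ** -alpha for k in range(1, num_skills + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def uniform_weights(num_skills):
    """Equal probability for every skill."""
    return [1.0 / num_skills] * num_skills


def sample_skill_counts(weights, num_samples, seed=0):
    """Draw skill indices i.i.d. from `weights` and count how often each appears."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(weights)), weights=weights, k=num_samples)
    return Counter(draws)


if __name__ == "__main__":
    num_skills, num_samples = 100, 10_000  # hypothetical sizes
    for name, weights in [
        ("power-law", power_law_weights(num_skills)),
        ("uniform", uniform_weights(num_skills)),
    ]:
        counts = sample_skill_counts(weights, num_samples)
        # Index 0 is the most frequent (head) skill, the last index the rarest.
        print(name, "head:", counts[0], "tail:", counts[num_skills - 1])
```

With the same sample budget, the power-law sampler shows the head skills far more often than the uniform sampler does, at the cost of fewer tail examples. This is the asymmetry the analysis refers to; the sketch only illustrates the sampling imbalance and makes no claim about the resulting learning dynamics.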

Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
