The Power of Power Law: Asymmetry Enables Compositional Reasoning
Abstract: Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions.
To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power-law sampling induces a beneficial asymmetry that improves an otherwise pathological loss landscape. This asymmetry lets models first acquire high-frequency skill compositions with low data complexity, and these in turn serve as stepping stones for efficiently learning rare, long-tail skills.
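To make the contrast between the two sampling regimes concrete, the following is a minimal sketch of drawing skill indices under a power-law (Zipf-like) distribution versus a uniform one. It is illustrative only: the function names, the number of skills, the exponent `alpha`, and the sample size are assumptions made for this example, not the setup used in the paper.

```python
import random
from collections import Counter


def power_law_weights(num_skills, alpha=1.0):
    """Zipf-like weights: the skill of rank k gets probability proportional to k**(-alpha).

    `alpha` is an assumed exponent chosen for illustration.
    """
    raw = [rank ** (-alpha) for rank in range(1, num_skills + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def uniform_weights(num_skills):
    """Every skill is equally likely."""
    return [1.0 / num_skills] * num_skills


def sample_skills(weights, num_samples, seed=0):
    """Draw skill indices (0 = most frequent rank) according to `weights`."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


# Hypothetical sizes, chosen only to make the asymmetry visible.
num_skills = 100
num_samples = 10_000

pl = power_law_weights(num_skills, alpha=1.0)
un = uniform_weights(num_skills)

pl_counts = Counter(sample_skills(pl, num_samples))
un_counts = Counter(sample_skills(un, num_samples))

# Under the power law, the head skill is seen far more often than the tail skill;
# under the uniform distribution, every skill gets roughly the same exposure.
print("power law: head =", pl_counts[0], "tail =", pl_counts[num_skills - 1])
print("uniform:   head =", un_counts[0], "tail =", un_counts[num_skills - 1])
```

Under the power-law weights, a few head skills receive most of the samples while tail skills are seen only rarely. This is the kind of asymmetry the analysis refers to, although the sketch itself says nothing about learning dynamics.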
Our results offer an alternative perspective on what constitutes an effective data distribution for training models.