Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

Abstract: Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether.

We argue that data mixing is fundamentally an online decision-making problem, one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters trained directly on the current model, which eliminates separate proxy models and ensures the search is always grounded in the model’s actual learning dynamics.
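To make the adapter-interpolation idea concrete, the following is a minimal sketch and not the OP-Mix implementation. It assumes that one low-rank adapter (a pair of factors `A`, `B`) has been trained per data domain on the current weights, and that a candidate mixture is simulated by adding the mixture-weighted sum of the adapter updates `B @ A` to those weights. The abstract does not specify the interpolation rule or the search procedure, so the linear combination, the grid search, and the names `merge_adapters`, `search_mixture`, and `eval_loss` are all illustrative assumptions.

```python
import numpy as np


def merge_adapters(base_weight, adapters, mixture):
    """Return base_weight plus the mixture-weighted sum of low-rank updates B @ A.

    Assumption: a linear combination of adapter updates stands in for
    training on the corresponding data mixture.
    """
    mixture = np.asarray(mixture, dtype=float)
    if mixture.shape != (len(adapters),):
        raise ValueError("need exactly one mixture weight per adapter")
    if np.any(mixture < 0) or not np.isclose(mixture.sum(), 1.0):
        raise ValueError("mixture must lie on the probability simplex")
    delta = sum(w * (B @ A) for w, (A, B) in zip(mixture, adapters))
    return base_weight + delta


def search_mixture(base_weight, adapters, candidates, eval_loss):
    """Score every candidate mixture on the merged weights and return the best one."""
    scored = [
        (float(eval_loss(merge_adapters(base_weight, adapters, m))), tuple(m))
        for m in candidates
    ]
    best_loss, best_mix = min(scored)
    return np.array(best_mix), best_loss


# Toy setup (hypothetical): one 4x4 weight matrix and two rank-1 adapters.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))
adapters = [
    (rng.normal(size=(1, 4)), rng.normal(size=(4, 1))),  # (A, B) for domain 0
    (rng.normal(size=(1, 4)), rng.normal(size=(4, 1))),  # (A, B) for domain 1
]

# Stand-in for a validation loss: squared distance to a known target,
# constructed here so that the mixture (0.25, 0.75) is optimal.
target = merge_adapters(base, adapters, [0.25, 0.75])


def eval_loss(weight):
    return np.sum((weight - target) ** 2)


# Candidate mixtures on a coarse grid over the 2-domain simplex.
candidates = [(t, 1.0 - t) for t in np.linspace(0.0, 1.0, 5)]
best_mix, best_loss = search_mixture(base, adapters, candidates, eval_loss)
print(best_mix, best_loss)
```

In this sketch the adapters are trained once and each candidate mixture costs only a weight merge and an evaluation, which is the kind of saving the abstract attributes to avoiding separate proxy models. In practice `eval_loss` would be a held-out loss of the merged model rather than a synthetic target.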

Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures while using a fraction of the baselines' compute. In pretraining, OP-Mix improves average perplexity by 6.3% over training without mixing. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively. OP-Mix suggests a different view of language model training: not a sequence of distinct phases, but a single continuous process of learning from data.
