GEM: Geometric Entropy Mixing for Optimal LLM Data Curation
Abstract: LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy.
We introduce GEM (Geometric Entropy Mixing), a framework that reformulates data curation as a variational problem on the hypersphere, augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective with a provable MM (Minorize-Maximize) algorithm, GEM counteracts cluster collapse and discovers balanced semantic structures invisible to Euclidean heuristics.
We employ teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation.
Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state of the art when integrated into mixing strategies such as DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and offering a robust coordinate system for predictable data mixing.
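To make the two geometric ideas in the abstract concrete, the following is a minimal sketch and not the GEM algorithm itself. It projects embeddings onto the unit hypersphere, assigns them to clusters by cosine similarity, and measures how balanced the resulting mixture is with the Shannon entropy of the cluster proportions. The function names (`to_hypersphere`, `cosine_assign`, `mixing_entropy`) are hypothetical. Using entropy as the balance measure is an assumption made for illustration, since the abstract does not state the form of the regularizer.

```python
import numpy as np

def to_hypersphere(x):
    """L2-normalize each row so every embedding lies on the unit hypersphere."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero vectors

def cosine_assign(unit_x, unit_centers):
    """Assign each unit vector to the center with the highest cosine similarity."""
    return (unit_x @ unit_centers.T).argmax(axis=1)

def mixing_entropy(labels, k):
    """Shannon entropy (in nats) of the cluster proportions.

    Assumption for illustration: log(k) means perfectly balanced clusters,
    0 means all points collapsed into a single cluster.
    """
    counts = np.bincount(labels, minlength=k).astype(float)
    p = counts / counts.sum()
    nz = p[p > 0]  # skip empty clusters to avoid log(0)
    return float(-(nz * np.log(nz)).sum())

# Toy usage with random data standing in for document embeddings.
rng = np.random.default_rng(0)
emb = to_hypersphere(rng.normal(size=(100, 8)))
centers = to_hypersphere(rng.normal(size=(4, 8)))
labels = cosine_assign(emb, centers)
balance = mixing_entropy(labels, k=4)  # value between 0 and log(4)
```

A clustering objective that penalizes low `mixing_entropy` would discourage the cluster collapse the abstract mentions. How GEM actually combines such a term with its variational objective is not specified here.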