GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

Abstract: LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet optimal mixing is hindered by flaws in how data is categorized: human-defined taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy.

We introduce GEM (Geometric Entropy Mixing), a framework that reformulates data curation as a variational problem on the hypersphere, augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective with a provable MM (Minorize-Maximize) algorithm, GEM counteracts cluster collapse and discovers balanced semantic structures invisible to Euclidean heuristics.
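
To make the two ingredients named above concrete (clustering on the hypersphere and a notion of mixing balance), the following is a minimal sketch only. It is not GEM's objective or its MM updates, which the abstract does not specify. It assumes a plain soft spherical k-means with a fixed concentration `kappa` and a uniform prior over clusters, and it reports the entropy of the cluster proportions as a simple balance diagnostic. All function names and parameters are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project rows onto the unit hypersphere."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def soft_assign(x, centers, kappa=10.0):
    """Soft responsibilities from cosine similarity (uniform cluster prior assumed)."""
    logits = kappa * (x @ centers.T)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)


def mixing_entropy(r):
    """Entropy of the average cluster proportions; higher means more balanced."""
    pi = r.mean(axis=0)
    return float(-(pi * np.log(pi + 1e-12)).sum())


def spherical_soft_kmeans(x, k, kappa=10.0, iters=20, seed=0):
    """Illustrative soft spherical k-means, not the paper's algorithm."""
    rng = np.random.default_rng(seed)
    x = l2_normalize(x)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        r = soft_assign(x, centers, kappa)
        # Weighted mean direction of each cluster, re-projected onto the sphere.
        centers = l2_normalize(r.T @ x)
    return centers, soft_assign(x, centers, kappa)


if __name__ == "__main__":
    data = np.random.default_rng(1).normal(size=(200, 8))
    centers, resp = spherical_soft_kmeans(data, k=4)
    print("balance entropy:", mixing_entropy(resp), "max:", np.log(4))
```

In this sketch a collapsed solution, where most points fall into one cluster, shows up as an entropy well below `log(k)`. A balance regularizer of the kind the abstract describes would presumably penalize that outcome inside the objective rather than only measuring it afterwards.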

To carry this geometric fidelity over to web-scale corpora, we employ teacher-student distillation. We also introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation.

Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state of the art when integrated into mixing strategies such as DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and offering a robust coordinate system for predictable data mixing.
