EMO: Pretraining mixture of experts for emergent modularity

Today we’re releasing EMO, a new mixture-of-experts (MoE) model pretrained end-to-end so that modular structure emerges directly from the data without relying on human-defined priors. EMO lets you use a small subset of its experts - just 12.5% of the total - for a given task while keeping near full-model performance, and still works as a strong general-purpose model when all experts are used together.

Large language models are typically trained and deployed as monolithic systems: a single model is initialized, pretrained, fine-tuned, and served as one unified entity. But applications often need only a subset of capabilities, such as code generation, mathematical reasoning, or domain-specific knowledge. As frontier language models routinely reach trillions of parameters, using and adapting the full model becomes impractical for most users and incurs unnecessary compute and memory overhead to host parameters that may never be needed.

Mixture-of-experts (MoE) models seem like a natural way to relax this constraint. Instead of using one large feedforward network at each layer, MoEs contain many smaller ones, called experts, and activate only a small subset for each input token. In principle, a task that only needs one capability could load only the relevant experts. In practice, however, existing MoEs still need the full model to work well. Even within a single input, different tokens often activate different experts, so a task can end up using all the experts during its generation.
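
To make this concrete, here is a minimal sketch of the standard per-token top-k routing described above, written in PyTorch. The router, sizes, and tensor names are illustrative assumptions, not EMO's released code:

```python
import torch
import torch.nn.functional as F

# Standard per-token top-k MoE routing (illustrative sketch).
num_experts, top_k, d_model = 128, 8, 512

router = torch.nn.Linear(d_model, num_experts)  # hypothetical router network
tokens = torch.randn(16, d_model)               # 16 token representations

logits = router(tokens)                           # (16, 128) expert scores
weights, expert_ids = logits.topk(top_k, dim=-1)  # each token's own top-8
weights = F.softmax(weights, dim=-1)              # mixing weights per token

# Across tokens, the set of distinct experts touched is typically much
# larger than any single token's top-k.
print(expert_ids.unique().numel(), "distinct experts used by 16 tokens")
```

Because every token picks its own top-k independently, the union of experts used by even a short input tends to grow toward the full expert set, which is why a standard MoE cannot be shrunk by simply dropping experts.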

As we show in our paper, this happens partly because experts in standard MoEs often specialize in low-level lexical patterns like prepositions or punctuation rather than higher-level domains or capabilities. As a result, small subsets of experts are not reliably usable on their own. We instead want MoE models whose experts organize into coherent groups that can be selectively used and composed.

One way to encourage this during pretraining is to route tokens to experts based on predefined semantic domains, such as math, biology, or code. Prior work like BTX and our FlexOlmo project has tried this. However, predefined domains come with important limitations. They require domain labels across the pretraining corpus, which can be ambiguous and expensive to obtain, and they may inject too much human bias into how the model is allowed to organize itself. More importantly, fixing the domains upfront also fixes the model’s modular structure: if a new domain or capability emerges at inference time, it isn’t obvious which experts should be used.

That’s where EMO comes in. We show that EMO - a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens - supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance. At the same time, when all experts are used together, EMO remains a strong general-purpose model. In contrast, a standard MoE of equal architecture trained on the same data shows severe degradation when selectively using its expert subsets.
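
As a back-of-envelope check on those numbers (our reading, assuming the 12.5% budget applies per MoE layer):

```python
# Illustrative arithmetic only; the per-layer reading is our assumption.
total_experts = 128
active_per_token = 8
subset_size = int(0.125 * total_experts)  # 12.5% of 128 = 16 experts

# The retained subset still covers the per-token routing budget:
assert subset_size >= active_per_token  # 16 >= 8

print(f"keep {subset_size} of {total_experts} experts per layer; "
      f"the other {total_experts - subset_size} need not be loaded")
```

Since expert parameters account for most of the 14B total in a 1B-active model, dropping the unused experts removes the bulk of the memory footprint while leaving the active compute path unchanged.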

EMO is an MoE trained with modularity as a first-class objective. For a given domain (e.g., math, code, biomedical), users can select a subset of experts at any size budget and retain near full-model performance. This turns a single model into a composable architecture, enabling flexible deployment with improved memory-accuracy tradeoffs for large, sparse MoEs.

How do we get modularity to emerge? In an MoE, a small network called the router decides which experts each token activates. We want the router to learn that tokens from similar domains should activate similar subsets of experts. Our key observation is that tokens from the same document usually come from the same domain. We therefore use document boundaries as a weak supervisory signal: during training, all tokens in a document are restricted to choose their active experts from a shared expert pool.

(Left) In a standard MoE, each token independently selects its top-k experts. Across tokens, all experts are used. (Right) In EMO, the router first selects a subset of experts for each document, and all tokens are constrained to route within this subset. This enforces consistent expert usage across the document, encouraging groups of experts to specialize by domain.

For example, in an MoE with 10 total experts and 2 active experts per token, all tokens in a document are restricted to route within the same pool of 4 experts, as shown in the figure above. This pool is chosen by the router itself: we average the router’s expert preferences across all tokens in the document, then select the most-used experts as the document’s shared pool. Different documents can use different pools, allowing recurring expert groups to emerge directly from the training data.
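
A minimal sketch of this document-level routing, using the 10-expert, top-2, pool-of-4 configuration from the example above. This is an illustrative PyTorch rendering of the described procedure, not the actual EMO implementation:

```python
import torch
import torch.nn.functional as F

num_experts, top_k, pool_size, d_model = 10, 2, 4, 512

router = torch.nn.Linear(d_model, num_experts)  # hypothetical router
doc_tokens = torch.randn(32, d_model)           # all tokens of one document

# 1) Average the router's expert preferences across the document and take
#    the most-preferred experts as the document's shared pool.
logits = router(doc_tokens)                        # (32, 10)
avg_prefs = F.softmax(logits, dim=-1).mean(dim=0)  # (10,) per-expert average
pool = avg_prefs.topk(pool_size).indices           # 4 experts for this doc

# 2) Constrain every token's top-k selection to that pool.
pool_mask = torch.full((num_experts,), float("-inf"))
pool_mask[pool] = 0.0
weights, expert_ids = (logits + pool_mask).topk(top_k, dim=-1)  # top-2 in pool
weights = F.softmax(weights, dim=-1)

# The whole document now touches at most pool_size experts.
assert set(expert_ids.unique().tolist()) <= set(pool.tolist())
```

Note that the pool is derived from the router's own preferences, so no domain labels are needed; document boundaries alone supply the grouping signal.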

There are a few considerations when implementing the system.

Load balancing. In standard MoE training, the load-balancing objective prevents the model from collapsing onto only a small number of experts. At first glance, this seems to conflict with EMO's training objective: we explicitly restrict each document to use only a subset of experts. The conflict comes from the scale at which load balancing is usually applied. In many MoE implementations, load balancing is computed locally, often within a micro-batch containing only a small number of documents.
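
For reference, here is a sketch of the widely used auxiliary load-balancing loss (in the style of Switch Transformer), computed over one micro-batch. This is the generic formulation we take the text to be referring to, not EMO's exact objective:

```python
import torch

def load_balance_loss(probs: torch.Tensor, expert_ids: torch.Tensor,
                      num_experts: int) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss over one micro-batch.

    probs:      (tokens, num_experts) router probabilities
    expert_ids: (tokens, top_k) experts actually selected per token
    """
    importance = probs.mean(dim=0)                  # avg router prob per expert
    counts = torch.zeros(num_experts).scatter_add_(
        0, expert_ids.flatten(), torch.ones(expert_ids.numel()))
    load = counts / counts.sum()                    # fraction of assignments
    # Minimized when both distributions are uniform across experts.
    return num_experts * torch.sum(importance * load)
```

Computed over a micro-batch holding only a handful of documents, this statistic will look badly imbalanced under EMO's per-document pools even when expert usage is balanced across the whole corpus, which is exactly the tension described above.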
