Accelerating Transformers Fine-Tuning with NVIDIA NeMo AutoModel
HuggingFace Transformers has become the foundation of the open-source AI ecosystem, and the recent Transformers v5 release strengthened it with first-class support for Mixture-of-Experts (MoE) models, now the dominant architecture for frontier models. v5 ships the MoE foundations (expert backends, dynamic weight loading, and distributed execution) that make MoE extensible and easy to build on.
NVIDIA NeMo AutoModel is an open library, part of the NVIDIA NeMo framework, for building custom generative AI models at scale. NeMo AutoModel builds cleanly on top of v5, adding Expert Parallelism, DeepEP fused all-to-all dispatch, and TransformerEngine kernels, and it leans on v5’s dynamic weight loading to bring those optimizations to a broad and growing set of model families. The payoff when fine-tuning MoE models is 3.4-3.7x higher training throughput and 29-32% less GPU memory than native Transformers v5, through the same from_pretrained() API: one import line changes, and no other code does. This post details how the combination works and how users can fine-tune MoE models faster without changing their APIs.
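To put the headline ratios in wall-clock terms, the sketch below converts a throughput speedup into the fraction of training time saved and applies the memory reduction to a baseline. The 80 GB baseline peak and the function names are hypothetical, chosen only for illustration; the 3.4-3.7x and 29-32% ranges are the only figures taken from the text.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time saved for a fixed token budget.

    If throughput rises by `speedup`, the same work takes 1/speedup of the time.
    """
    return 1.0 - 1.0 / speedup


def memory_after(baseline_gb: float, reduction_pct: float) -> float:
    """Peak memory after applying a percentage reduction to a baseline."""
    return baseline_gb * (1.0 - reduction_pct / 100.0)


# Reported ranges from the post.
for speedup in (3.4, 3.7):
    print(f"{speedup}x throughput -> {time_saved_fraction(speedup):.1%} less training time")

# Hypothetical baseline: 80 GB peak per GPU (an assumption, not a measured value).
for reduction in (29, 32):
    print(f"{reduction}% less memory -> {memory_after(80.0, reduction):.1f} GB peak")
```

Under these assumptions, a 3.4-3.7x speedup corresponds to roughly 71-73% less training time for the same number of tokens.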
Background
The rise of MoE models has introduced new challenges for efficient training: routing tokens across hundreds of experts, fusing expert matmuls into a single kernel, sharding weights across GPUs, and overlapping communication with computation all require infrastructure beyond what a general-purpose library provides out of the box. Transformers v5 (“v5”) introduced first-class MoE support, including expert backends, dynamic weight loading, and tensor parallel plans for distributed execution. In addition, v5 made distributed training first-class by integrating PyTorch’s DeviceMesh directly into from_pretrained().
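To make "routing tokens across experts" concrete, here is a minimal pure-Python sketch of top-k routing. It assumes one common variant (softmax over router logits, keep the k largest, renormalize their weights); real models differ in the details, and none of these function names come from Transformers or NeMo AutoModel.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """For each token, return [(expert_id, weight), ...] for its top-k experts."""
    assignments = []
    for logits in router_logits:
        probs = softmax(logits)
        top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
        norm = sum(probs[e] for e in top)  # renormalize over the chosen experts
        assignments.append([(e, probs[e] / norm) for e in top])
    return assignments


def group_by_expert(assignments, num_experts):
    """Invert the routing: list the token ids each expert has to process."""
    buckets = [[] for _ in range(num_experts)]
    for token_id, picks in enumerate(assignments):
        for expert_id, _weight in picks:
            buckets[expert_id].append(token_id)
    return buckets


# Three tokens, four experts, top-2 routing (toy logits).
router_logits = [
    [2.0, 0.5, 0.1, 1.0],
    [0.1, 0.2, 3.0, 2.5],
    [1.5, 1.4, 0.0, 0.2],
]
assignments = route_top_k(router_logits, k=2)
buckets = group_by_expert(assignments, num_experts=4)
print(buckets)
```

The per-expert buckets have uneven sizes, which is why fusing the expert matmuls and balancing work across GPUs takes dedicated kernels rather than a plain batched matmul.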
NeMo AutoModel builds on top of v5 by subclassing AutoModelForCausalLM and adding Expert Parallelism (EP), DeepEP fused all-to-all dispatch, and TransformerEngine kernels. DeepEP is the piece v5 doesn’t have yet: it overlaps communication with expert compute. And because NeMo AutoModel relies on v5’s reversible weight conversion to load each model, it can focus its engineering on these reusable core ops instead of per-model checkpoint plumbing, while save_pretrained() still emits standard HF checkpoints that tools like vLLM and SGLang can load.
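The data movement behind Expert Parallelism can be sketched without any GPU code. The example below assumes experts are placed on ranks in contiguous blocks (a hypothetical layout) and builds, for one source rank, the per-destination send lists that an all-to-all dispatch would exchange. It illustrates the idea only; it is not DeepEP's implementation, which fuses this step into communication kernels.

```python
def expert_owner(expert_id, num_experts, ep_size):
    """Rank holding `expert_id` when experts are split into contiguous blocks."""
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def plan_dispatch(token_experts, num_experts, ep_size):
    """Build send[dst_rank] = [(token_id, expert_id), ...] for one source rank.

    token_experts[i] lists the expert ids the router chose for local token i.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    send = [[] for _ in range(ep_size)]
    for token_id, experts in enumerate(token_experts):
        for expert_id in experts:
            dst = expert_owner(expert_id, num_experts, ep_size)
            send[dst].append((token_id, expert_id))
    return send


# Three local tokens with top-2 expert choices; 4 experts over 2 EP ranks.
token_experts = [[0, 3], [2, 3], [0, 1]]
send = plan_dispatch(token_experts, num_experts=4, ep_size=2)
for rank, items in enumerate(send):
    print(f"to rank {rank}: {items}")
```

Each rank sends and receives such lists on every MoE layer, so hiding this exchange behind expert compute is where overlapping communication pays off.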
NeMo AutoModel: Same API, More Performance
One of NeMo AutoModel’s goals is API compatibility with HuggingFace Transformers, to serve the open-source community. NeMoAutoModelForCausalLM subclasses AutoModelForCausalLM, so any code that works with HF models works with NeMo AutoModel too. Here’s what loading a model looks like in both. Only the import changes:
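A minimal sketch of that comparison follows. The class names come from the text; the `nemo_automodel` import path is an assumption and should be checked against the NeMo AutoModel documentation. Imports are placed inside the functions so the snippet can be read and run without either library installed until a loader is actually called.

```python
def load_with_transformers(model_id, **kwargs):
    """Load a causal LM with stock HuggingFace Transformers."""
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)


def load_with_nemo_automodel(model_id, **kwargs):
    """Load the same checkpoint through NeMo AutoModel.

    Assumption: the class is importable from the top-level `nemo_automodel`
    package; only this import line differs from the function above.
    """
    from nemo_automodel import NeMoAutoModelForCausalLM

    return NeMoAutoModelForCausalLM.from_pretrained(model_id, **kwargs)
```

Usage would then be `model = load_with_nemo_automodel("Qwen/Qwen3-30B-A3B")`, where the model ID is illustrative rather than taken from the post.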
That single import does a lot of work. For popular MoE architectures like Qwen3, NVIDIA Nemotron, GPT-OSS, and DeepSeek V3, NeMo AutoModel ships hand-tuned implementations with TransformerEngine attention, fused linear layers, and custom expert kernels. For everything else, it falls back to vanilla HF while still applying optimizations such as Liger kernel patching. And whichever path it takes, the resulting model is ready to scale: pass a device_mesh and you have multi-GPU training without further rewrites.
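As a rough sketch of the "pass a device_mesh" step: PyTorch's `init_device_mesh` builds the mesh, and the result is handed to `from_pretrained()`. Several details here are assumptions rather than facts from the post: the two-dimensional layout, the dimension names `"dp"` and `"ep"`, the `nemo_automodel` import path, and `device_mesh` as the keyword name. The script is also assumed to run under a distributed launcher such as torchrun.

```python
def plan_mesh_shape(world_size, ep_size):
    """Split `world_size` GPUs into (data-parallel, expert-parallel) groups."""
    if ep_size <= 0 or world_size % ep_size != 0:
        raise ValueError("world_size must be a positive multiple of ep_size")
    return (world_size // ep_size, ep_size)


def build_device_mesh(world_size, ep_size, device_type="cuda"):
    """Create a 2D DeviceMesh; requires an initialized distributed launch."""
    from torch.distributed.device_mesh import init_device_mesh

    shape = plan_mesh_shape(world_size, ep_size)
    # Dimension names are hypothetical; use whatever the library expects.
    return init_device_mesh(device_type, shape, mesh_dim_names=("dp", "ep"))


def load_sharded(model_id, device_mesh):
    """Load a model onto the mesh (import path and keyword are assumptions)."""
    from nemo_automodel import NeMoAutoModelForCausalLM

    return NeMoAutoModelForCausalLM.from_pretrained(model_id, device_mesh=device_mesh)


# Eight GPUs on one node, experts sharded four ways.
print(plan_mesh_shape(world_size=8, ep_size=4))
```

Keeping the shape calculation separate from mesh creation makes the layout easy to check on a machine without GPUs before launching a real job.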
Performance Comparison
We evaluated NeMo AutoModel in two regimes: full fine-tuning of a frontier-scale 550B model across 16 nodes, and training two 30B MoE models on a single node. The 550B result shows why Expert Parallelism is essential at scale; the 30B results quantify the per-GPU speedup over Transformers v5.
(Note: The original article continues with detailed tables and technical specifications regarding the 550B model fine-tuning.)