Building Blocks for Foundation Model Training and Inference on AWS
For a long time, “scaling” in foundation models mostly meant one thing: spend more compute on pre-training and capabilities rise. That intuition was supported by empirical work such as Kaplan et al. (2020), which reported predictable power-law trends in loss as you scale model parameters, dataset size, and training compute. In practice, these trends justified sustained investment in large-scale accelerator capacity and the surrounding distributed infrastructure needed to keep it efficiently utilized.
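As a concrete reference point for that pre-training regime, the Kaplan et al. (2020) results are commonly summarized as independent power laws in parameter count N, dataset size D, and training compute C; the form below is a simplified restatement, with N_c, D_c, C_c as fitted constants and exponents of approximately α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050 as reported in that paper.

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Because each exponent is small, every constant-factor reduction in loss demands a multiplicative increase in N, D, or C, which is precisely what motivated the sustained infrastructure investment described above.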
But the frontier has evolved—and scaling is no longer a single curve. NVIDIA’s “from one to three scaling laws” framing usefully emphasizes that, beyond pre-training, performance increasingly scales through post-training (e.g., supervised fine-tuning (SFT) and reinforcement learning (RL)-based methods) and through test-time compute (“long thinking,” search/verification, multi-sample strategies).
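To make the test-time-compute axis concrete, the sketch below implements one of the simplest multi-sample strategies, best-of-N sampling: draw n candidate responses and keep the one a verifier scores highest. The generate and score callables are hypothetical placeholders for any sampling endpoint and any reward/verifier model; this is an illustrative sketch rather than a specific product API.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Best-of-N sampling: spend more inference compute (n samples)
    to trade throughput for answer quality."""
    # Draw n independent candidate responses from the sampler.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # A verifier or reward model ranks candidates; keep the best one.
    return max(candidates, key=lambda c: score(prompt, c))
```

Holding the model fixed, quality typically improves as n grows, at roughly linear inference cost; that trade-off is the test-time scaling curve.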
Taken together, these scaling regimes push the foundation-model lifecycle—pre-training, post-training, and inference—toward convergent infrastructure requirements: tightly coupled accelerator compute, a high-bandwidth low-latency network, and a distributed storage backend. They also raise the importance of orchestration for resource management, and of application- and hardware-level observability to maintain cluster health and diagnose performance pathologies at scale.
Another key trend is the increasing reliance of the foundation-model lifecycle on an open-source software (OSS) ecosystem spanning model development frameworks, cluster resource management, and operational tooling. At the cluster layer, resource management is typically provided by systems such as Slurm and Kubernetes. Model development and distributed training are commonly implemented in frameworks such as PyTorch and JAX. Observability is often built from Prometheus for metrics collection and Grafana for dashboards and alerting, positioned as an operational layer atop infrastructure and resource management.
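As a minimal sketch of how these OSS layers compose in practice, the PyTorch snippet below initializes a NCCL process group from the environment variables that a launcher such as torchrun (itself typically invoked by a Slurm or Kubernetes job) exports, then wraps a model in DistributedDataParallel. The toy linear model and single training step are placeholders, not a recommended training loop.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and
    # MASTER_PORT; env-based init_process_group reads them.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # toy stand-in model
    ddp_model = DDP(model, device_ids=[local_rank])  # gradients sync via NCCL all-reduce

    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    x = torch.randn(32, 1024, device=local_rank)
    loss = ddp_model(x).square().mean()  # placeholder loss
    loss.backward()
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with torchrun --nproc_per_node=8 on each node, the same script runs unchanged whether Slurm or Kubernetes does the scheduling, which is one reason these layers compose so cleanly.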
This post is intended for machine learning engineers and researchers involved in foundation model training and inference, with particular attention to workflows built atop OSS frameworks. It analyzes how AWS infrastructure—including multi-node accelerator compute, high-bandwidth low-latency networking, distributed shared storage, and associated managed services—interacts with common OSS stacks across the foundation model lifecycle.
The AWS Building Blocks: The remainder of this series examines how this layered architecture is realized on AWS, progressing through infrastructure, resource orchestration, the ML software stack, and observability. The following sections preview each layer.
Infrastructure: Compute, Network, and Storage: As illustrated in Figure 1, infrastructure is anchored by three coupled building blocks—accelerated compute with large device memory, wide-bandwidth interconnect for collective communication, and scalable distributed storage for data and checkpoints.
Accelerated compute forms the foundation of large-scale foundation model pre-training, post-training, and inference. AWS offers several generations of NVIDIA GPUs through its Amazon EC2 accelerated computing instances, including the Amazon EC2 P instance family. The P5 family includes p5.48xlarge with eight NVIDIA H100 GPUs, p5.4xlarge with a single H100 GPU for smaller-scale workloads, and the p5e.48xlarge/p5en.48xlarge variants with NVIDIA H200 GPUs. The P6 family introduces the NVIDIA Blackwell architecture, with p6-b200.48xlarge (B200 GPUs) and p6-b300.48xlarge (Blackwell Ultra B300 GPUs).
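As a hedged illustration of how such capacity is requested programmatically, the boto3 sketch below launches two P5 instances into a cluster placement group, which keeps instances physically close to reduce inter-node latency. The AMI, subnet, security group, and placement group names are hypothetical placeholders; real deployments typically also attach EFA network interfaces, omitted here for brevity.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical placeholder IDs; substitute your own AMI (e.g., an AWS
# Deep Learning AMI), subnet, security group, and placement group.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",             # placeholder
    InstanceType="p5.48xlarge",                  # 8x NVIDIA H100
    MinCount=2,
    MaxCount=2,
    SubnetId="subnet-0123456789abcdef0",         # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder
    Placement={"GroupName": "my-cluster-pg"},    # cluster placement group
)
for inst in response["Instances"]:
    print(inst["InstanceId"])
```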
Across these generations, the dominant scaling axes are peak Tensor Core throughput, HBM capacity and bandwidth, and interconnect bandwidth (within and across nodes). As a first-order approximation, peak Tensor Core throughput, measured in floating point operations per second (FLOPS), helps situate these accelerators on a common axis.
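As a back-of-envelope example of using peak throughput as that common axis, the snippet below applies the standard approximation that dense-transformer pre-training costs roughly 6·N·D FLOPs for N parameters and D tokens, then divides by an assumed sustained cluster throughput. The per-GPU peak figure and the 40% model FLOPS utilization (MFU) are illustrative assumptions, not quoted hardware specifications.

```python
# Back-of-envelope pre-training time: FLOPs ≈ 6 * N * D (a standard
# approximation for dense transformers), divided by sustained cluster FLOPS.
N = 70e9             # parameters (example: a 70B-parameter model)
D = 2e12             # training tokens (example)
flops_needed = 6 * N * D

num_gpus = 2048                  # example cluster size
peak_flops_per_gpu = 1.0e15      # ~1 PFLOPS BF16: illustrative figure, not a spec
mfu = 0.40                       # assumed model FLOPS utilization

seconds = flops_needed / (num_gpus * peak_flops_per_gpu * mfu)
print(f"~{seconds / 86400:.1f} days")   # rough order-of-magnitude estimate
```

Under these assumptions the run lands at roughly twelve days; the point is not the specific number but that peak FLOPS, discounted by a realistic MFU, gives a first-order way to compare accelerator generations and cluster sizes.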