Meta’s AI Storage Blueprint at Scale

By Sidharth Bajaj, Venkatraghavan Srinivasan

Over the past several years, model capabilities and training dataset sizes have grown exponentially. Over the past year or so, the time between new frontier-model releases has shrunk from months to weeks. Reliable, fast access to storage matters to both the speed and the computational cost of this AI innovation. If AI is the brain, storage is the memory: capability and speed depend heavily on the size of that memory and the speed of retrieval.

Yet while AI compute performance has roughly tripled every two years, storage and interconnect performance have grown more modestly. As a result, storage bottlenecks remain one of the primary contributors to GPU stalls for AI workloads, directly impacting expenditures and time to market. Beyond GPU utilization, storage architecture also directly affects the speed of iteration in AI research: with GPUs increasingly geo-distributed and datasets increasingly massive, researchers spend a significant amount of time ingesting and moving data across regions, which slows research velocity. In this blog post, we discuss how Meta’s BLOB-storage architecture evolved to address two primary challenges: maximizing GPU utilization and maximizing research velocity.

Storage Architecture Overview

Meta operates hundreds of exabyte-scale storage clusters that serve all of Meta’s external and internal products, including Facebook, Instagram, Reality Labs, Meta AI, Ads, Data Warehouse, and internal databases. Our storage service exposes object-storage, file-system, and block-device APIs, and these API abstractions are built on top of a horizontally scalable foundational block layer called Tectonic. The Tectonic layer is a regional, multi-tenant storage fabric that provides high durability and availability by leveraging erasure-coding techniques, supports tiering across media types (e.g., HDD and flash), and manages smart placement of hot, warm, and cold data for efficient utilization of I/O across tenants.
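To make the erasure-coding tradeoff concrete, here is a minimal sketch comparing the raw-storage overhead of a Reed-Solomon style code against plain replication. The (10, 4) layout and the three-way replication factor are illustrative assumptions, not Tectonic's actual configuration, and the function names are hypothetical.

```python
def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte for a (data, parity) erasure code."""
    return (data_blocks + parity_blocks) / data_blocks


def replication_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte when keeping full copies."""
    return float(copies)


def tolerated_losses_erasure(parity_blocks: int) -> int:
    """An MDS code such as Reed-Solomon survives the loss of any `parity_blocks` blocks."""
    return parity_blocks


def tolerated_losses_replication(copies: int) -> int:
    """Replication survives the loss of all but one copy."""
    return copies - 1


# Assumed, illustrative parameters (not taken from the post).
rs = (10, 4)
print("RS(10,4) overhead:", erasure_overhead(*rs),
      "tolerates", tolerated_losses_erasure(rs[1]), "lost blocks")
print("3x replication overhead:", replication_overhead(3),
      "tolerates", tolerated_losses_replication(3), "lost copies")
```

Under these assumed parameters, the erasure code tolerates more simultaneous losses than three-way replication while storing less than half as many raw bytes, which is the general reason erasure coding is attractive at exabyte scale.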

The BLOB-storage layers that operate on top of Tectonic expose a global, infinitely scalable storage fabric, along with policies that let users make tradeoffs between durability and availability. In a previous @Scale talk titled “Training Llama: A Storage Perspective,” we discussed how Meta trained Llama directly over the Tectonic block layer by exposing an NFS-like FileSystem interface on top of it. While this architecture continues to be used widely within Meta, our modern training stack has been gradually migrating to the BLOB-storage interface, as is the case across the industry. This transition is motivated by the need for unified storage access to massive data lakes in the BLOB-storage layer as well as the need for high performance.

Maximizing GPU Utilization

Modern AI workloads are “data hungry” and have very different characteristics than traditional web applications: they generate bursty and sustained high throughput, require predictable and bounded pMax latencies, and exhibit variable I/O patterns. In recent years, the focus for BLOB storage has largely shifted to maximizing GPU utilization.

Why Latency Matters

To see why bounded and low pMax latencies are important, let’s consider model training. During training, hundreds of thousands of GPUs iterate over vast amounts of data in storage multiple times (i.e., over multiple epochs), consuming the dataset in batches. Periodically, after a certain number of steps or batches, the GPUs synchronize their state among themselves. If one GPU is slow, this synchronization step slows down every other GPU, and with them the entire training run.
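The straggler effect described above can be sketched with a toy model: a synchronous step finishes only when the slowest GPU finishes, so step time is the maximum over all GPUs. The latencies, the slow-fetch probability, and the function names below are all assumptions chosen for illustration, not measurements from Meta's fleet.

```python
import random


def step_time(per_gpu_ms):
    """A synchronized step completes only when the slowest GPU is done."""
    return max(per_gpu_ms)


def p_step_delayed(num_gpus: int, p_slow: float) -> float:
    """Probability that at least one of `num_gpus` independent GPUs is slow in a step."""
    return 1.0 - (1.0 - p_slow) ** num_gpus


def mean_step_time(num_gpus, steps, base_ms=100.0, slow_ms=500.0,
                   p_slow=0.001, seed=0):
    """Average step time when each GPU independently hits a slow fetch
    with probability `p_slow` (all values are hypothetical)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        times = [slow_ms if rng.random() < p_slow else base_ms
                 for _ in range(num_gpus)]
        total += step_time(times)
    return total / steps


if __name__ == "__main__":
    for n in (1, 100, 1000):
        print(n, "GPUs -> P(step delayed) =", round(p_step_delayed(n, 0.001), 3),
              "| mean step ms =", round(mean_step_time(n, 200), 1))
```

In this model, a one-in-a-thousand slow fetch per GPU is almost invisible on a single GPU, but with 1,000 GPUs roughly 63% of steps contain at least one straggler. That is why the tail of the latency distribution, rather than the average, determines training throughput.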

Figure 1 shows a data-loading pipeline across two GPUs. The dataloader on every GPU host prefetches the next dataset batch while the GPU is processing the current batch, maximizing the overlap between compute and I/O. In the case of GPU1, the storage-fetch latency is well within bounds, so the GPU is never stalled waiting on I/O. In the case of GPU2, there are two instances where the storage fetch exhibits high latency, stalling the GPU. As a result of these stalls, the overall step-completion time is delayed.

(Figure 1: Dataloading across two GPUs.)
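The pipeline in Figure 1 can be approximated with a small timeline model. This is a sketch under stated assumptions: the dataloader prefetches exactly one batch ahead, compute time per batch is constant, and all latency numbers are invented. The function `simulate_prefetch` is hypothetical and is not part of any real dataloader API.

```python
def simulate_prefetch(fetch_ms, compute_ms):
    """Model a one-batch-ahead prefetch pipeline.

    fetch_ms[i] is the storage latency for batch i; compute_ms is the GPU
    time per batch. Returns (total_time_ms, stalls_ms), where stalls_ms[i]
    is how long the GPU sat idle waiting for batch i to arrive.
    """
    stalls = []
    fetch_done = fetch_ms[0]  # the first fetch cannot overlap with compute
    gpu_free = 0.0            # time at which the GPU finishes its current batch
    for i in range(len(fetch_ms)):
        start = max(fetch_done, gpu_free)   # need both the data and a free GPU
        stalls.append(start - gpu_free)
        gpu_free = start + compute_ms
        if i + 1 < len(fetch_ms):
            # Prefetch of the next batch starts as soon as compute starts.
            fetch_done = start + fetch_ms[i + 1]
    return gpu_free, stalls


if __name__ == "__main__":
    compute = 50
    gpu1 = simulate_prefetch([20, 20, 20, 20], compute)  # fetches within bounds
    gpu2 = simulate_prefetch([20, 80, 20, 90], compute)  # two slow fetches
    print("GPU1 total/stalls:", gpu1)
    print("GPU2 total/stalls:", gpu2)
```

Apart from the unavoidable wait for the very first batch, a fetch causes a stall only when its latency exceeds the compute time of the batch it overlaps with. That is the sense in which latency needs to be bounded rather than merely low on average.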

Legacy BLOB-Storage Architecture Wasn’t AI-Ready

Over the years, BLOB storage evolved organically, adding layers on top of layers in a true service-oriented fashion. Many of these layers were stateful and maintained their own metadata stores. While these metadata-access latencies typically weren’t the bottleneck for the traditional use cases served by global HDDs, they were showstoppers for AI workloads that need millisecond access to data in flash.

Figure 2 shows the request flow for a typical getObject(“/bucket/path”) API call. After the request arrives at the API server, the server performs many metadata lookups across the namelayer, volumeslayer, and containerlayer before resolving the path to a set of (blockId, offset, size) tuples. Some of these lookups can cross regions, and it’s not uncommon for latencies to add up to hundreds of milliseconds; one slow response from any of the lookups is enough to delay the whole request. After the lookups, the API server proxies the data from the Tectonic layer to the client.

(Figure 2: Old request flow for getObject API.)
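Because the lookups in Figure 2 happen one after another, their latencies add up before the first data byte can be read. The sketch below is a rough model of that effect. The layer names come from the text, but the `Lookup` type, the per-layer latencies, and the cross-region penalty are hypothetical values chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Lookup:
    layer: str
    latency_ms: float
    cross_region: bool = False


def metadata_latency_ms(lookups, cross_region_penalty_ms=0.0):
    """Total time spent on sequential metadata lookups before data can be read."""
    return sum(
        l.latency_ms + (cross_region_penalty_ms if l.cross_region else 0.0)
        for l in lookups
    )


def slowest_layer(lookups, cross_region_penalty_ms=0.0):
    """Name of the layer contributing the most latency to the request."""
    return max(
        lookups,
        key=lambda l: l.latency_ms
        + (cross_region_penalty_ms if l.cross_region else 0.0),
    ).layer


if __name__ == "__main__":
    # Assumed, illustrative numbers: all lookups local vs. one crossing regions.
    local = [Lookup("namelayer", 5.0), Lookup("volumeslayer", 5.0),
             Lookup("containerlayer", 5.0)]
    remote = [Lookup("namelayer", 5.0, cross_region=True),
              Lookup("volumeslayer", 5.0), Lookup("containerlayer", 5.0)]
    print("all local:", metadata_latency_ms(local, 80.0), "ms")
    print("one cross-region:", metadata_latency_ms(remote, 80.0), "ms",
          "| dominated by", slowest_layer(remote, 80.0))
```

With these made-up numbers, a single cross-region hop turns a 15 ms resolution into a 95 ms one, far more than a millisecond-scale flash read takes, so the metadata path ends up dominating the latency of the whole request.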

While this architecture served conventional workloads well, the foundational assumptions that dictated its design tradeoffs have since shifted. Some of these are:

  • Performance and latency: As discussed, while latency needs for conventional workloads were modest, AI workloads demand predictable and bounded latencies all the way up to pMax.
  • Reliability and durability: The legacy architecture was designed to be highly durable and available, even in the face of region outages; data and metadata were globally replicated by default. While AI workloads demand very high availability, the global-by-default design choice no longer holds.
  • Cost efficiency: The legacy stack was built on top of HDDs and highly optimized for cost per byte. The IOPS demands for AI workloads necessit…