Migrating Data Ingestion Systems at Meta Scale
By Zihao Tao, Mohan Perumal Swamy, Grace Gong, Ailyn Tong, Peishan Wang, Nilay Kapadia, Md Mustafijur Rahman Faysal, Saurav Sen, Jameel Mohamed
Meta’s data ingestion system, which our engineering teams leverage for up-to-date snapshots of the social graph, has recently undergone a significant revamp to enhance its reliability at scale. Moving from our legacy system to our new architecture required a large-scale migration of our entire data ingestion system. We’re sharing the solutions and strategies that enabled a successful large-scale system migration, as well as the key factors that influenced our architectural decisions.
At Meta, our social graph is powered by one of the largest MySQL deployments in the world. Every day, our data ingestion system incrementally scrapes several petabytes of social graph data from MySQL into the data warehouse to power the analytics, reporting, and downstream data products that teams across the company utilize for tasks ranging from day-to-day decision-making to machine learning model training and product development.
We’ve recently revamped our data ingestion system’s architecture to significantly enhance its efficiency and reliability. The new architecture moves away from customer-owned pipelines, which functioned effectively at a small scale, to a simpler self-managed data warehouse service that still operates efficiently at hyperscale. We’ve successfully transitioned 100% of the workload and fully deprecated the legacy system. But migrating a data ingestion system of this scale was a major challenge. Several important solutions and strategies helped make a migration of this scope successful.
The Migration Challenge
As our operations grew in scale, our legacy data ingestion system began to show signs of instability under increasingly strict data landing time requirements. We knew we needed to migrate to a new system. But we also knew that meant facing challenges around not only how to make sure each job would be migrated seamlessly, but also how to perform the large-scale migration itself.
Ensuring a Seamless Transition
Ensuring a seamless migration meant we had to effectively track the migration lifecycle for thousands of jobs and put robust rollout and rollback controls in place to handle issues that might arise during the migration process.
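To keep thousands of jobs organized, it helps to think of each job as moving through an explicit state machine. Below is a minimal sketch of what such tracking could look like; the phase names mirror the steps described in the rest of this section, while the `MigrationJob` class and its fields are hypothetical illustrations, not Meta's internal tooling.

```python
from dataclasses import dataclass
from enum import Enum, auto

class MigrationPhase(Enum):
    """Lifecycle phases described in this post (names are illustrative)."""
    SHADOW = auto()          # new-system job writes to a shadow table
    REVERSE_SHADOW = auto()  # new-system job writes to the production table
    CLEANUP = auto()         # old-system job is decommissioned
    COMPLETE = auto()

_ORDER = [MigrationPhase.SHADOW, MigrationPhase.REVERSE_SHADOW,
          MigrationPhase.CLEANUP, MigrationPhase.COMPLETE]

@dataclass
class MigrationJob:
    """Hypothetical tracking record for one of thousands of migrating jobs."""
    job_id: str
    phase: MigrationPhase = MigrationPhase.SHADOW

    def advance(self) -> None:
        # Move forward only after the current phase's success criteria
        # (data quality, landing latency, resource usage) are met.
        if self.phase is not MigrationPhase.COMPLETE:
            self.phase = _ORDER[_ORDER.index(self.phase) + 1]
```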
The Migration Lifecycle
Our first step was to establish a clear migration job lifecycle to ensure data integrity and operational reliability throughout the process. Each job needed to be verified for correctness and had to meet defined success criteria before moving to the next step of the migration lifecycle:
- No data quality issues. There is no difference between the data delivered by the old system and the new system. We verify this by comparing both the row count and the checksum of the data, ensuring complete consistency between the two systems (see the sketch after this list).
- No landing latency regression is observed. The data delivered by the new system should exhibit improved landing latency, or at minimum, match the performance of the old system.
- No resource utilization regression is observed. The compute and storage usage of the job running in the new system should be improved, or at minimum, be comparable to that of the old system.
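As a rough illustration of the first criterion, the sketch below compares the row count and a checksum of one landed partition between the two tables. The `run_query` helper, the `ds` partition column, and the `id`/`payload` columns are all assumptions for the example; Meta's production tooling is internal and works at a very different scale.

```python
def verify_partition(run_query, prod_table: str, shadow_table: str, ds: str) -> bool:
    """Return True when the shadow output matches production exactly.

    `run_query` is a hypothetical callable that executes SQL and returns a
    single row as a dict. The checksum aggregates a per-row hash, so any
    differing value in any row changes the aggregate.
    """
    sql = """
        SELECT COUNT(*) AS row_count,
               SUM(CRC32(CONCAT_WS('|', id, payload))) AS checksum
        FROM {table}
        WHERE ds = '{ds}'
    """
    prod = run_query(sql.format(table=prod_table, ds=ds))
    shadow = run_query(sql.format(table=shadow_table, ds=ds))
    return (prod["row_count"] == shadow["row_count"]
            and prod["checksum"] == shadow["checksum"])
```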
For critical table migrations, we defined and agreed on additional migration criteria with the teams that relied on the service.
Phase 1: The Shadow Phase
In the first step of the lifecycle, we set up shadow jobs in the pre-production environment that delivered data via the new system. This was essentially a production-realistic test: each shadow job consumed the same source as its production counterpart but delivered data to a separate table called the shadow table. This setup helped reveal issues because it exposed the new system to real production data and behavior, while still providing an isolated place to inspect outcomes and deploy fixes quickly.
We continuously monitored row count and checksum mismatches between the production jobs and the shadow jobs. When mismatches occurred, we quickly investigated the root cause, deployed fixes to the pre-production environment, and then verified that the mismatch was resolved. During this step, we also measured the compute and storage quotas for the shadow jobs to ensure that the production environment had sufficient resources before proceeding. If a shadow job met the above criteria, we moved it to the production environment and confirmed it could still run reliably there before moving to the next step.
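One way to picture the shadow setup: a shadow job is the production job's configuration with only the destination redirected. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class IngestionJobConfig:
    """Hypothetical ingestion job config: same source, switchable sink."""
    source: str             # e.g., the MySQL replicas backing the social graph
    destination_table: str  # warehouse table the job lands data into
    environment: str        # "production" or "pre-production"

def make_shadow(prod: IngestionJobConfig) -> IngestionJobConfig:
    # The shadow job consumes the same source as production but lands its
    # output in an isolated shadow table in the pre-production environment.
    return replace(
        prod,
        destination_table=prod.destination_table + "__shadow",
        environment="pre-production",
    )
```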
Phase 2: The Reverse Shadow Phase
Once the production job and the shadow job were running reliably in the production environment, we began the reverse shadow phase. In this phase, the shadow job’s data was written to the production table, effectively making the shadow job the new production job. Meanwhile, the production job’s data was written to the shadow table, so the original production job then acted as the shadow job. This approach provided two key benefits. First, we could still get ongoing data-quality signals after rollout by continuing to compare outputs from the two systems. Second, we could roll back fast if discrepancies were detected, without needing to recreate or reconfigure the old system job.
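Continuing the hypothetical config sketch above, entering the reverse shadow phase amounts to swapping the two jobs' destination tables, which is also why rollback is cheap: it is the same swap applied again.

```python
from dataclasses import replace
# IngestionJobConfig is the hypothetical config from the earlier sketch.

def enter_reverse_shadow(old_system_job: IngestionJobConfig,
                         new_system_job: IngestionJobConfig):
    """Swap destinations: the new-system job takes over the production
    table, while the old-system job keeps running and writes to the
    shadow table, where its output is still compared against production."""
    return (
        replace(new_system_job,
                destination_table=old_system_job.destination_table),
        replace(old_system_job,
                destination_table=new_system_job.destination_table),
    )
```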
Phase 3: Migration Cleanup
We continued to monitor and compare the data delivered by both jobs. If no discrepancies were detected, the shadow job, now running on the old system, was removed. The new system then took over and continued delivering data through the production job, marking the completion of the migration.
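The cleanup decision itself can be reduced to a simple gate: only after comparisons stay clean for an agreed observation window is the old-system job decommissioned. The window length below is an arbitrary placeholder, and the sketch reuses the hypothetical `verify_partition` and `run_query` from the earlier data-quality example.

```python
from datetime import date, timedelta

CLEAN_WINDOW_DAYS = 14  # hypothetical observation window, not Meta's actual value

def ready_for_cleanup(run_query, prod_table: str, shadow_table: str) -> bool:
    """True once every partition in the recent window matched, meaning the
    old-system job (now acting as the shadow) can be safely removed."""
    today = date.today()
    return all(
        verify_partition(run_query, prod_table, shadow_table,
                         ds=(today - timedelta(days=d)).isoformat())
        for d in range(1, CLEAN_WINDOW_DAYS + 1)
    )
```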
Custom Data Quality Analysis Tooling
We also built a comprehensive set of debugging tools to help team members efficiently identify and resolve issues that might arise during the migration. We developed a data quality analysis tool to ensure that edge cases across jobs are effectively captured and addressed. For each landed shadow table partition, the system would read the corresponding product…