KernelEvolve: How Meta’s Ranking Engineer Agent Optimizes AI Infrastructure

By Gang Liao, Yavuz Yetim, Ruichao Xiao, Zewei Jiang, Raghav Boinepalli, Sheela Yadawad, Liyuan Li, Nathan Yan, Ajit Mathews, Chunqiang (CQ) Tang, Carole-Jean Wu, Gaoxiang Liu

This is the second post in the Ranking Engineer Agent blog series exploring the autonomous AI capabilities accelerating Meta’s Ads Ranking innovation. The previous post introduced Ranking Engineer Agent’s ML exploration capability, which autonomously designs, executes, and analyzes ranking model experiments. This post covers how to optimize the low-level infrastructure that makes those models run efficiently at scale. We introduce KernelEvolve, an agentic kernel authoring system used by Ranking Engineer Agent and generally applicable to a range of AI models beyond Ads Ranking.

Summary

Meta operates a large fleet of heterogeneous hardware — NVIDIA GPUs, AMD GPUs, Meta’s custom MTIA silicon chips, and CPUs. Using this hardware effectively and efficiently requires developing software that translates high-level model operations into efficient, chip-specific instructions called optimized kernels. Kernels must be authored and optimized for each new chip generation and ML model architecture. Beyond standard kernel operators like general matrix multiplications (GEMMs) and convolutions covered by vendor libraries, production workloads require many custom operators across ranking models. As the number of models multiplies against the number of hardware types and generations, hand-tuning by kernel experts doesn’t scale.

To address the volume of performance-optimization work, which grows with the product of the number of models and the number of hardware types and generations, we built KernelEvolve, a performance-optimization agent used by Meta’s Ranking Engineer Agent. It enables:

  • Faster development: Compresses weeks of expert engineering time spent optimizing kernels, including profiling, tuning, and cross-hardware debugging, into hours of automated search and evaluation, freeing engineers for other work.

  • Better performance: Over 60% inference throughput improvement for the Andromeda Ads model on NVIDIA GPUs and over 25% training throughput improvement for an ads model on Meta’s custom MTIA silicon chips.

  • Broad applicability: Optimizes across public and proprietary hardware including NVIDIA GPUs, AMD GPUs, MTIA chips, and CPUs, generating kernels in high-level DSLs like Triton, Cute DSL, and FlyDSL, as well as low-level languages including CUDA, HIP, and MTIA C++.

KernelEvolve treats kernel optimization as a search problem: a purpose-built job harness evaluates each candidate kernel, feeds diagnostics back to the LLM, and drives a continuous search over hundreds of alternatives, exceeding the performance of kernels written by human experts. More details are available in the paper, “KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta,” which will appear at the 53rd International Symposium on Computer Architecture (ISCA) 2026.

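The evaluate-diagnose-propose loop can be sketched in a few lines of Python. Everything below is illustrative, not KernelEvolve’s actual interface: `propose` stands in for the LLM generating a new kernel variant (modeled here as a latency drawn near the best result seen so far), and `evaluate` stands in for the job harness that compiles, runs, and profiles a candidate.

```python
import random

def propose(diagnostics, history):
    """Hypothetical stand-in for the LLM proposing a kernel variant.
    In the real system, diagnostics would steer the next proposal."""
    best = min(history, default=10.0)  # assumed baseline latency in ms
    return max(0.1, best - random.uniform(-0.5, 1.0))

def evaluate(candidate):
    """Hypothetical stand-in for the job harness: compile, run, and
    profile a candidate, returning (ok, latency, diagnostics)."""
    return True, candidate, {"latency_ms": candidate}

def search(budget=200):
    """Drive a continuous search over `budget` candidate kernels,
    feeding each round's diagnostics back into the next proposal."""
    history, diagnostics = [], None
    best = float("inf")
    for _ in range(budget):
        ok, latency, diagnostics = evaluate(propose(diagnostics, history))
        if ok:  # discard candidates that fail to compile or run
            history.append(latency)
            best = min(best, latency)
    return best
```

In the real system the evaluation step runs on the target hardware, and the diagnostics include compiler errors and profiler output; that feedback is what lets the loop recover from failing candidates instead of stopping at the first error.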

Every day, Meta serves billions of AI-powered experiences, from personalized recommendations to generative AI assistants, on a global infrastructure that includes diverse hardware from NVIDIA, AMD, and Meta’s custom MTIA silicon chips. Behind every training or inference request lies a layer of highly optimized low-level hardware kernels: small programs that translate high-level model operations into instructions a specific chip can execute efficiently. As AI models grow more complex and the hardware landscape diversifies, the number of kernels scales across hardware platforms, model architectures, and operator types, resulting in thousands of configurations that can no longer realistically be tuned by human experts. This bottleneck delays hardware enablement and performance tuning, and it slows the model-iteration cycles that drive critical advances in ML technology and its applications.

Today, we are sharing KernelEvolve, an agentic AI system that improved ads model inference throughput by 60% in hours of experimentation, a task that would take human experts weeks. KernelEvolve autonomously generates and optimizes production-grade kernels for heterogeneous hardware used in training and inference, including NVIDIA GPUs, AMD GPUs, Meta’s custom MTIA silicon, and CPUs. Unlike typical large language model (LLM)-based agents that perform one-shot code generation, KernelEvolve treats kernel optimization as a search problem. It explores hundreds of alternative kernel implementations to identify a solution that often matches or exceeds human expert performance, and does so in hours instead of weeks.

In Meta’s production environment, KernelEvolve is optimizing code that serves trillions of daily inference requests. KernelEvolve represents a fundamental shift in how we think about the relationship between AI software and hardware. Where kernel development was once a manual, expert-driven process that struggled to keep pace with hardware and model evolution, KernelEvolve makes it continuous and automated — adapting as each changes. As Meta continues to diversify its AI hardware portfolio, the ability to rapidly generate optimized kernels for new chips substantially reduces the engineering effort required to integrate heterogeneous hardware for training and inference.

The Challenge: The Bottleneck of Explosive Kernel Growth

We’re seeing explosive kernel growth because the total number of kernels scales with the product of three factors: hardware types and generations × model architectures × number of operators. This product results in thousands of unique kernel configurations that must be written, tested, and maintained. Hand-tuning each kernel doesn’t scale, and kernel experts alone can’t keep up with the pace.

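To make the scaling concrete, here is a back-of-the-envelope count. The numbers are purely illustrative assumptions, not Meta’s actual fleet or model counts:

```python
# Illustrative, hypothetical scale (not Meta's actual numbers).
hardware_platforms = 4        # e.g., NVIDIA GPU, AMD GPU, MTIA, CPU
generations_per_platform = 3  # chip generations in service at once
model_architectures = 20      # distinct ranking-model architectures
operators_per_model = 40      # standard plus custom operators

# Total kernel configurations = the product of the three factors:
# (hardware types and generations) x (architectures) x (operators)
kernel_configs = (hardware_platforms * generations_per_platform
                  * model_architectures * operators_per_model)
print(kernel_configs)  # 9600
```

Even with these modest assumptions the count lands in the thousands, which is the regime where per-kernel hand-tuning stops being viable.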

Hardware Heterogeneity

Meta’s accelerator fleet now spans NVIDIA GPUs, AMD GPUs, and Meta’s custom MTIA silicon, each with fundamentally different memory architectures and hierarchies, instruction sets, and execution models. A kernel that runs optimally on one platform may perform poorly, or fail outright, on another.
