Unlocking asynchronicity in continuous batching

TL;DR: we explain how to separate CPU and GPU workloads to get a massive performance boost for inference. This is the second post in a series on efficient LLM inference. The first post covered continuous batching from first principles, and introduced concepts we build on here: the KV cache, FlashAttention, attention masks, etc.

An H200 costs around $5 an hour on Inference Endpoints. That’s cheap for an hour, but use it for a day and you are already paying $120. At that price, you want your GPU used to its fullest. We have seen that continuous batching improves GPU utilization by scheduling tightly packed batches, so no compute is wasted on padding. But there is a second source of waste that continuous batching does not address: by default, it is synchronous.

This means the CPU and GPU take turns: while the GPU computes, the CPU waits. And while the CPU prepares the next batch, the GPU waits. In a loop running hundreds of steps per second, those idle gaps add up, and as we will show, they can account for nearly a quarter of total runtime. To ensure the GPU is busy computing 100% of the time, we need to get rid of those gaps. To achieve this, we can use asynchronous batching: we are going to disentangle CPU batch preparation from GPU batch compute, so both can run in parallel and we always have a productive GPU 🔥

Synchronous batching


This is how naive synchronous batching works: when the CPU prepares a new batch, it selects which requests to include, updates the KV cache table, evicts requests that finished in the previous step, and admits new ones to fill the freed space. Once that is done, it transfers the prepared inputs to the GPU. The GPU runs its forward pass and samples (i.e. chooses) a new token for each request. The results come back to the CPU, so it knows which token each request just produced, and then the whole cycle repeats.
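
To make the turn-taking concrete, here is a minimal sketch of such a loop in PyTorch-style Python. The `scheduler` and `model` objects and their methods (`prepare_next_batch`, `update_with_tokens`) are hypothetical stand-ins, not the API of any particular library:

```python
import torch

def synchronous_generation_loop(model, scheduler, max_steps: int):
    """One step = CPU prep -> H2D copy -> GPU forward -> D2H copy -> CPU update."""
    for _ in range(max_steps):
        # CPU: select requests, evict finished ones, admit new ones,
        # and update the KV cache table (hypothetical scheduler method).
        batch = scheduler.prepare_next_batch()
        if batch is None:
            break  # no pending requests left

        # Host-to-device: move the prepared inputs to the GPU.
        input_ids = batch.input_ids.to("cuda")
        position_ids = batch.position_ids.to("cuda")

        # GPU: forward pass, then sample one new token per request
        # (greedy sampling, for simplicity).
        logits = model(input_ids, position_ids)
        next_tokens = torch.argmax(logits[:, -1, :], dim=-1)

        # Device-to-host: .tolist() blocks the CPU until the GPU result is
        # ready; the GPU then idles while the scheduler digests the tokens.
        scheduler.update_with_tokens(next_tokens.tolist())
```

The final `.tolist()` is where the synchronization bites: the CPU blocks until the GPU finishes, and the GPU has nothing queued while the CPU updates request states.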

Notice the red annotation on the right: after the GPU finishes computing, it goes idle. The next batch cannot start until the CPU has gone through its update step: sampling the output tokens, updating request states, re-scheduling the batch. This is the core inefficiency of synchronous batching: the CPU and GPU take turns. While the GPU is computing, the CPU is idle. While the CPU is updating, the GPU is idle. At no point are both doing useful work at the same time.

For a single forward pass this might seem like a small price to pay, but in a continuous batching loop running hundreds of steps per second, these idle gaps accumulate into real throughput loss. To showcase this, we profile the time spent on CPU and GPU when generating 8K tokens with a batch size of 32 using an 8B model.
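
One way to take such a measurement (a sketch, not necessarily the exact setup used here) is `torch.profiler`, which attributes wall time to CPU ops and CUDA kernels separately. The `generation_step` below is a toy stand-in for one iteration of the synchronous loop:

```python
import torch
from torch.profiler import ProfilerActivity, profile

def generation_step():
    # Toy stand-in: a GPU matmul followed by a blocking device-to-host read,
    # mimicking "forward pass, then sample readback" from the loop above.
    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x
    _ = y.sum().item()  # .item() forces a CPU <-> GPU synchronization

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(50):
        generation_step()

# The table shows CPU time vs. CUDA time per op; gaps between CUDA kernels
# correspond to the idle periods we want to eliminate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```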

Creating concurrency


Our end goal is to have concurrent execution of CPU and GPU operations. We need a way to categorize our operations, so we can let the machine know which operations can run concurrently. We can achieve this using CUDA streams.

What is a CUDA stream? To understand how CUDA orders its operations, we need to talk about CUDA streams. A stream is an ordered queue of GPU operations (kernel launches, memory copies, synchronization barriers) that execute in the order they were submitted. Every GPU operation is scheduled inside a stream. Operations within the same stream are sequential: the GPU will not start the next one until the previous one has completed. Operations in different streams are independent of each other and can run concurrently.
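
A minimal PyTorch illustration of these ordering rules (the tensors and ops are arbitrary placeholders):

```python
import torch

# Two independent streams: work queued on one is ordered internally,
# but has no ordering relative to the other.
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with torch.cuda.stream(s1):
    out1 = a @ a  # queued on s1

with torch.cuda.stream(s2):
    out2 = b @ b  # queued on s2; may run concurrently with the matmul on s1

# Across streams there is no implicit ordering, so we synchronize
# explicitly before the CPU reads the results.
torch.cuda.synchronize()
print(out1.sum().item(), out2.sum().item())
```

By default, PyTorch submits all work to a single default stream per device, which is why everything in the naive loop runs strictly in order.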