GPU Time-Slicing for Concurrent LLM Agents on Kubernetes
Stop trusting the pod status. An end-to-end framework for measuring the hidden microarchitectural costs and memory contention of GPU sharing among AI Agents in Kubernetes.
Anubhab Banerjee | Jun 14, 2026 | 22 min read
Architectural Overview
Toy agents run one at a time in Python. Production agents fight over the same GPU, and on one shared card a latency-sensitive agent’s p99 latency quietly got 66% worse while every pod still reported healthy. Here is what that fight actually costs, measured to the p99, not hand-waved.
This is Part 2 of the “Production-Grade Agentic Inference” series. Each part removes one kind of redundant work from an agentic LLM pipeline. Part 1 kills redundant prefill. Part 2 (this part) tackles redundant waiting: how multiple micro-agents share one GPU through time-slicing. Part 3 keeps RAG retrieval on the GPU with a custom CUDA Top-K kernel. Part 4 persists agent state across hand-offs so the next agent never hits the cold-start problem.
Key Takeaways
Sharing a GPU is not free, and your scheduler will not tell you. When two agents share one time-sliced GPU, Kubernetes happily reports both pods as Running. The damage hides in the latency tail. The median lies; the tail tells the truth.
In my run (with only 2 agents), both kept an almost-unchanged p50. But the small, latency-sensitive one’s p99 jumped from 3.68 ms to 6.10 ms (≈1.66×) and its jitter (p99/p50) went from 1.02 to 1.70. The latency-sensitive agent degrades first. The small, twitchy workload suffered far more than the heavy, steady one, even though both “got a GPU.”
Throughput barely moved, which is the whole trap. A mean-rate throughput proxy dropped only a few percent, so a dashboard watching averages would call this a success while your tail-sensitive agent quietly misses one deadline in fifty.
It runs on a $150 GPU. Everything below is measured on a single five-year-old GTX 1080 with the stock NVIDIA Kubernetes Device Plugin and CUDA time-slicing. No H100, no MIG, no magic. This was intentional: not everyone can afford an H100, and some people still run their old hardware. And honestly, running agentic AI in production on an H100 does not require any magic; on a $150 GPU, it surely does.
TL;DR: I put two very different agent workloads — a small, latency-sensitive FFT worker and a heavy, transformer-style GEMM worker — into separate Kubernetes pods, each politely asking for nvidia.com/gpu: "1", and let the NVIDIA device plugin’s CUDA time-slicing drop them both onto one physical GTX 1080. Then I timed every iteration with CUDA events, rolled it up into p50/p95/p99, computed a degradation factor (shared tail / solo tail), and cross-checked it against DCGM GPU-utilization counters.
Result: medians and throughput barely flinched, but tail latency and jitter blew up, worst for the small, latency-critical agent. Kubernetes reports “two healthy pods.” The silicon reports a memory-bus street fight in which one of you is starving in the queue, and the p99 tail tells you who paid the price.
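To make the metrics in the TL;DR concrete, here is a minimal Python sketch of the roll-up: percentiles over per-iteration latencies, jitter as p99/p50, and the degradation factor as shared tail over solo tail. The function names (`percentile`, `summarize`, `degradation_factor`) and the nearest-rank percentile method are assumptions made for illustration and may differ from what the repo actually does.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # q is expected as an integer such as 50, 95, or 99.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Roll per-iteration latencies (in ms) up into p50/p95/p99 plus jitter."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    p99 = percentile(samples_ms, 99)
    return {"p50": p50, "p95": p95, "p99": p99, "jitter": p99 / p50}


def degradation_factor(shared_p99, solo_p99):
    """Shared tail divided by solo tail; 1.0 means sharing cost nothing at the tail."""
    return shared_p99 / solo_p99
```

Plugging in the p99 values reported above, `degradation_factor(6.10, 3.68)` comes out at about 1.66, which matches the ≈1.66× figure. The jitter ratio is kept separate on purpose: it compares a workload's tail against its own median, so it flags tail inflation even when no solo baseline is available.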
GitHub repo: https://github.com/AnubhabBanerjee/Kube-Timeslice-Profiler
(Quick confession before we start: I came at this from a 5G/6G RAN engineering background, and this is exactly the kind of problem AI RAN is dealing with right now. On edge servers, operators are trying to co-locate latency-critical baseband processing with heavy LLM inference on the same GPUs. It becomes a scheduling nightmare the second the AI workload starts starving the latency-critical applications of memory bandwidth, and that is exactly why I wrote this post.)
Architecture mental model — keep this open while you read. Two pods → each asks for nvidia.com/gpu: 1 → the device plugin cheerfully says “sure, here are 4 GPUs” (there is exactly 1) → CUDA time-slices the one real GPU → everybody takes turns → the tail pays the bill. Everything below is just commentary on one part of that line.
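The “1 GPU advertised as 4” step can be sketched in a few lines of Python. The dictionary below mirrors the shape of the NVIDIA device plugin's time-slicing configuration (normally written as YAML), with `replicas: 4` inferred from the “4 GPUs” in the line above. The helper functions and pod names are hypothetical; they only model the scheduler's bookkeeping and are not part of any real API.

```python
# Assumed time-slicing config, expressed as a Python dict instead of YAML.
# replicas=4 is inferred from "here are 4 GPUs" in the mental model.
TIME_SLICING_CONFIG = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [{"name": "nvidia.com/gpu", "replicas": 4}]
        }
    },
}


def advertised_gpus(physical_gpus, config, resource="nvidia.com/gpu"):
    """Schedulable units of `resource` the node would advertise."""
    for entry in config["sharing"]["timeSlicing"]["resources"]:
        if entry["name"] == resource:
            return physical_gpus * entry["replicas"]
    return physical_gpus  # no time-slicing entry: one unit per physical GPU


def place_pods(requests, physical_gpus, config):
    """Toy scheduler view: admit pods while advertised units remain."""
    free = advertised_gpus(physical_gpus, config)
    admitted = []
    for name, wanted in requests:
        if wanted <= free:
            free -= wanted
            admitted.append(name)
    return admitted, free


# Hypothetical pod names for the two workloads, each requesting one unit.
PODS = [("fft-agent", 1), ("gemm-agent", 1)]
```

With one physical GPU, `advertised_gpus(1, TIME_SLICING_CONFIG)` returns 4, and `place_pods(PODS, 1, TIME_SLICING_CONFIG)` admits both pods with two units to spare. Nothing in this accounting knows that all four units are the same piece of silicon, which is why the pod status stays green.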
1. A confession: “Running” is the most expensive illusion in Kubernetes
Just like the previous post in this series, let us start with a dramatic conversation before we slowly dive into more boring, technical stuff.
You: “Kubernetes, please run my two agents.”
Kubernetes: “Done. Both pods are Running. ✅”
You: “On the same GPU?”
Kubernetes: “Yep. Each one asked for nvidia.com/gpu: 1, so I gave each one a GPU.”
You: “But I only own one GPU.”
Kubernetes: “Correct. And I gave each of them a GPU.” 🫡
You: “Wait, what!? How?? They can’t both have…”
Kubernetes: “Shhh. Don’t worry about it. Look how green they are.”
Your Grafana dashboard: “Everything looks good, bro. 🟢”
Meanwhile…
Your physical GPU: (screaming in context switches)
Your p99 latency: (quietly doubling in the corner)
Well, maybe it was not that dramatic after all, but you get my point, right? The scheduler’s idea of “healthy” is that the pod is alive and a process is running. It has no opinion about whether your latency-critical agent is getting elbowed off the GPU forty times a second. Pod phase says Running. The agent says nothing because, well, nobody asked it.
This follows directly from where Part 1 left off. In the SwarmKV post I had two agents reading one document, and I bragged about prefilling once and fanning the KV cache out. Then, in the caveats, I admitted the embarrassing part: every branch’s actual GPU work still ran behind one global mutex. The orchestration fanned out; the compute lined up single file. Two agents, two turns. Fifty agents, fifty turns. I hand-rolled a lock and called it a day.
That is fine for a demo. It is a disaster for production, where “an agent swarm” means a dozen small specialized models (a router, a summarizer, a safety checker, a retriever, a pile of tool-callers), all awake at once, all wanting the same accelerator. You cannot buy each of them an H100 (unless your name is Jensen Huang). You pack them onto one shared GPU and hope the sched…