Your Container Is Not a Sandbox
I’m fresh from KubeCon EU 2026 with a lot of impressions. AI is everywhere, the ecosystem is visibly maturing, and yet I counted at least five different companies on the expo floor solving remarkably similar isolation problems. That might be a sign that convergence is about to happen (or is already underway). Anyway… one thing stuck with me more than anything else. In her security keynote, Marina Moore from Edera put it plainly: containers are not a security boundary. They are a mechanism to control resource usage.
I’ve been running Linux since I was a teenager, yet only recently did I fully internalize what that sentence means. After spending the last few months going deep on microVMs, building infrastructure on them and learning every VMM, every isolation approach, every trade-off in this space… I wanted to write the guide I wish I’d had when I started. Along the way I dug up a lot of high-quality material: a dozen references, blog posts, presentations and READMEs. This post covers the VMMs, the shared Rust ecosystem powering them, the AI sandbox explosion, and where I believe it’s all heading.
The microVM ecosystem didn’t need to be invented for AI. It needed to be discovered.

In this post:
- Why containers were never a security boundary (8 escape CVEs in 18 months)
- MicroVMs boot in ~125ms with <5 MiB overhead: the “VMs are slow” objection is dead
- The rust-vmm shared crate ecosystem: the real revolution, not any single VMM
- Firecracker vs. Cloud Hypervisor: how to choose
- A dozen AI sandbox platforms compared, from E2B to SlicerVM to Vercel
- gVisor, Kata Containers, Edera, KubeVirt: microVMs meet Kubernetes
- The full isolation timeline: chroot (1979) to AI agent sandboxes (2026)
Linux containers are a packaging and resource control mechanism. Namespaces and cgroups restrict what a process can see and how much CPU and memory it can use. But every container on a host shares the same kernel. A kernel exploit, a rogue capability, a mis-mounted socket, and you’re root on the host, with access to every other tenant’s data. The kernel has ~40 million lines of C and exposes 450+ syscalls. That is the attack surface.
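To make the "same kernel" point concrete, here is a minimal Python sketch (my own illustration, not taken from any particular tool) that prints the namespace identifiers of the current process next to the kernel release it is running on. It assumes a Linux host with `/proc` mounted; the helper names `namespace_ids` and `kernel_release` are invented for this example.

```python
import os
from pathlib import Path


def namespace_ids(pid="self"):
    """Map namespace type -> identifier, e.g. {'pid': 'pid:[4026531836]'}.

    Returns an empty dict when /proc/<pid>/ns is not available
    (non-Linux host, /proc not mounted, or no such process).
    """
    ns_dir = Path("/proc") / str(pid) / "ns"
    ids = {}
    if not ns_dir.is_dir():
        return ids
    for entry in sorted(ns_dir.iterdir()):
        try:
            # Each entry is a symlink whose target names the namespace.
            ids[entry.name] = os.readlink(entry)
        except OSError:
            # Some entries may be unreadable without extra privileges.
            continue
    return ids


def kernel_release():
    """Kernel release string as reported by uname(2)."""
    return os.uname().release


if __name__ == "__main__":
    print("kernel:", kernel_release())
    for name, ident in namespace_ids().items():
        print(f"{name:>20}  {ident}")
```

Run it once on the host and once inside a container started by a standard runtime such as runc: most of the namespace identifiers should differ, while the kernel release should be identical, because both processes are talking to the same kernel.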
To be fair: containers are a boundary. You do need a vulnerability to escape one. There’s no “just let me out” syscall. But the kernel attack surface is enormous, and escape CVEs ship regularly. An analogy that has stuck with me: NAT is not a firewall. Namespaces are not a security layer. Both restrict access as a side effect of their design, and both are commonly mistaken for security boundaries.
For a long time, this was an accepted trade-off. If you trusted the code inside the container (your own code, your own team), the convenience was worth it. Then two things happened, several years apart. First, AWS built Firecracker (2018), a tiny virtual machine monitor written in Rust, and used it to run every Lambda function in its own hardware-isolated VM. Boot time: ~125ms. Memory overhead: <5 MiB. “Just spin up a VM” stopped being a punchline.
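For a feel of how little configuration a Firecracker microVM needs, here is a rough Python sketch that builds (but does not send) the sequence of REST calls for a minimal boot. Firecracker serves this API over a Unix socket. The endpoint and field names follow my reading of the Firecracker API, so check them against the version you run; the file paths, the default sizes, and the function name `firecracker_boot_requests` are assumptions made up for illustration.

```python
import json


def firecracker_boot_requests(kernel_path, rootfs_path, vcpus=1, mem_mib=128):
    """Return the ordered (method, path, body) calls for a minimal boot.

    Nothing is sent here; the caller would issue these against the
    Firecracker API socket, in this order.
    """
    return [
        # Size the guest: vCPU count and memory in MiB.
        ("PUT", "/machine-config",
         {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        # Point the VMM at an uncompressed guest kernel image.
        ("PUT", "/boot-source", {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        }),
        # Attach a root filesystem image as a block device.
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": False,
        }),
        # Start the guest.
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    calls = firecracker_boot_requests("/tmp/vmlinux", "/tmp/rootfs.ext4")
    for method, path, body in calls:
        print(method, path, json.dumps(body))
```

Four small JSON documents are the whole configuration in this sketch: no emulated BIOS, no PCI bus, no device zoo, which is a large part of why boot times can be this short.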
Second, AI agents started writing and executing arbitrary code. Millions of times a day. Code that nobody has reviewed, generated by models that cannot be audited, running on infrastructure where a container escape means game over. The question stopped being “should we isolate untrusted workloads in VMs?” and became “why aren’t we already?”